GitHub - microsoft/LLMLingua: Speeds up LLM inference and enhances the LLM's perception of key information by compressing the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.

github.com

Related

You can now optimize and make any open-source LLM faster: 1. pip install llmcompressor 2. apply quantization with 1 line of code. Two benefits: 1. Your LLM will run faster during inference time. 2. You will save a ton of money on...

Santiago

x.com

Fastest inference engine for LLMs! LMCache is an LLM serving engine that reduces Time to First Token (TTFT) and increases throughput, especially in long-context scenarios. 100% Open Source https://t.co/xHAFz7d9v8

Sumanth

x.com

We're excited to introduce LLM Compressor, a library to compress LLMs for faster inference with vLLM. Our team used it to create fully quantized models like Llama 3.1 405B, recovering full accuracy and cutting costs 4x. Now, we're contributing it to the vLLM...

Red Hat AI

x.com