GitHub - microsoft/LLMLingua: To speed up LLM inference and help LLMs perceive key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
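For context, a minimal sketch of how prompt compression with LLMLingua is typically invoked, assuming the `llmlingua` Python package and its `PromptCompressor.compress_prompt` interface; the exact parameter names (e.g. `rate`) and result keys may differ between releases:

```python
# pip install llmlingua  (assumed package name)
from llmlingua import PromptCompressor

# Loads a small scoring model used to rank tokens by informativeness.
compressor = PromptCompressor()

long_prompt = (
    "Context: ... several retrieved documents would normally go here ...\n"
    "Question: Which option minimizes total cost?"
)

# Keep roughly 1/20th of the tokens, in line with the advertised up-to-20x compression.
result = compressor.compress_prompt(long_prompt, rate=0.05)

print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])
```

The compressed prompt is then passed to the downstream LLM in place of the original, trading a small scoring pass for a much shorter (and cheaper) generation context.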

We're excited to introduce LLM Compressor, a library to compress LLMs for faster inference with vLLM.
Our team used it to create fully quantized versions of models like Llama 3.1 405B, recovering full accuracy while cutting costs by 4x.
Now, we're contributing it to the vLLM community!
(1/6)
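A minimal sketch of what such a quantization run looks like, assuming LLM Compressor's `QuantizationModifier`/`oneshot` interface and an FP8 dynamic scheme; exact import paths and keyword names have moved between releases, and a small Llama 3.1 8B checkpoint stands in here for the 405B model:

```python
# pip install llmcompressor
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # stand-in for the 405B model
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to FP8 with dynamic activation scales;
# keep the lm_head in full precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply the recipe in one shot (no calibration data needed for dynamic FP8).
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The saved checkpoint can then be served directly with vLLM, e.g. `vllm serve <SAVE_DIR>`, which is where the inference-cost savings are realized.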