DeepSpeed-FastGen
- TL;DR: LLMLingua uses a compact, well-trained language model (e.g., GPT-2-small, LLaMA-7B) to identify and remove non-essential tokens from prompts. This enables efficient inference with large language models (LLMs), achieving up to 20x prompt compression with minimal performance loss (see the sketch below).
from GitHub - microsoft/LLMLingua: To speed up LLMs' inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss. by microsoft
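A minimal sketch of how this prompt compression might be invoked, assuming the `PromptCompressor` class and `compress_prompt` method described in the LLMLingua README; the model name, prompt text, and `target_token` value are illustrative, not prescribed by the repo.

```python
# Minimal sketch, assuming the PromptCompressor API from the LLMLingua README.
# pip install llmlingua
from llmlingua import PromptCompressor

# The small LM that scores token importance; the README defaults to a
# LLaMA-2-7B checkpoint, swapped here for GPT-2 purely as an illustration.
compressor = PromptCompressor(model_name="gpt2", device_map="cpu")

# The original, verbose prompt (demonstrations, retrieved context, etc.).
long_prompt = "..."

# Ask for roughly 200 tokens; tokens the small model deems non-essential
# are dropped, while instruction and question are preserved.
result = compressor.compress_prompt(
    long_prompt,
    instruction="Answer the question using the context.",
    question="What does LLMLingua do?",
    target_token=200,
)

print(result["compressed_prompt"])  # shortened prompt to send to the target LLM
print(result["ratio"])              # reported compression ratio, e.g. "11.2x"
```

The compressed prompt is then passed to the downstream LLM in place of the original, trading a small scoring pass by the compact model for a much shorter (and cheaper) main inference call.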
Nicolay Gerold added
Shortwave [Gmail alternative]
74 highlights
Nicolay Gerold added
People need to be more thoughtful building products on top of LLMs. The fact that they generate text is not the point.
by Linus Lee
2 highlights