DeepSpeed-FastGen
Prompt Engineering for LLMs
oreilly.com
TL;DR
LLMLingua utilizes a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens in prompts. This approach enables efficient inference with large language models (LLMs), achieving up to 20x compression with minimal performance loss.
microsoft • GitHub - microsoft/LLMLingua: To speed up LLM inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
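A minimal sketch of what this kind of prompt compression looks like in code, assuming the llmlingua Python package's PromptCompressor interface; the model choice, placeholder context, and token budget below are illustrative, not a recommendation:

```python
# Sketch: compress a long prompt with a small LM before sending it to a large one.
# Assumes the llmlingua package; model name and target_token are illustrative.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="NousResearch/Llama-2-7b-hf",  # compact model that scores token importance
)

result = compressor.compress_prompt(
    ["<long retrieved document goes here>"],          # bulk context to shrink
    instruction="Answer the question using the context.",
    question="What does LLMLingua do?",
    target_token=200,                                 # rough budget for the compressed prompt
)

print(result["compressed_prompt"])  # shortened prompt to pass to the target LLM
```

The key idea is that the small model's token-level perplexity stands in for "importance," so low-information tokens can be dropped before the expensive model ever sees them.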
For tasks that demand low latency, GPT‐4.1 nano is our fastest and cheapest model available. It delivers exceptional performance at a small size with its 1 million token context window, and scores 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider polyglot coding—even higher than GPT‐4o mini. It’s ideal for tasks like classification or autocompletion.
Introducing GPT-4.1 in the API
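A minimal sketch of the kind of low-latency classification call the nano model is pitched for, using the OpenAI Python SDK; the label set, prompt, and settings here are illustrative assumptions:

```python
# Sketch: single-label ticket classification with a small, fast model.
# Labels and prompt text are made up for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "system", "content": "Classify the ticket as one of: billing, bug, feature_request."},
        {"role": "user", "content": "The app crashes whenever I open the settings page."},
    ],
    max_tokens=5,   # a single label keeps latency and cost low
    temperature=0,  # deterministic output for classification
)

print(resp.choices[0].message.content)  # e.g. "bug"
```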
Key Innovations in Model Architecture
Two innovations have driven a lot of the recent improvements in Generative AI:
- Encoder-decoder system: two neural networks that compress and then expand the data
- Attention mechanism: ensures that important information isn't lost in the compression and expansion process, by "paying attention" to the important pieces of the input (a minimal sketch of this mechanism follows after this clip)
Introduction to Generative AI
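To make the second bullet concrete, here is a minimal sketch of scaled dot-product attention with toy NumPy values; real models add learned projections and multiple heads, so treat the shapes and numbers as illustrative only:

```python
# Sketch of scaled dot-product attention: each query "pays attention" to the keys
# and returns a weighted mix of the values. Toy sizes, no learned projections.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity between queries and keys
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 per query
    return weights @ V                   # weighted combination of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))              # 3 tokens, 4-dimensional representations
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(attention(Q, K, V).shape)          # (3, 4): one context-aware vector per token
```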
As we explore new applications for large language models and consider how well they can optimize our communication, AI challenges us to reflect on the qualities we truly value in our prose. How do we measure the caliber of writing, and how well does AI perform?
Laura Hartenberger • What AI Teaches Us About Good Writing
R1’s leap in capability and efficiency wouldn’t be possible without its foundation model, DeepSeek-V3, which was released in December 2024. V3 itself is big—671 billion parameters (by comparison, GPT-4o is rumored to be 1.8 trillion, or roughly three times as big)—yet it’s surprisingly cost-effective to run. That’s because V3 uses a mixture of experts (MoE) architecture.
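A minimal sketch of the MoE idea, where a gate routes each token to only a few experts so most parameters stay idle per input; the sizes, gating scheme, and k value here are toy assumptions, not DeepSeek-V3's actual configuration:

```python
# Sketch of top-k mixture-of-experts routing with toy linear "experts".
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(x, experts, gate_w, k=2):
    """Route a single token vector x to its top-k experts and mix their outputs."""
    logits = gate_w @ x                          # one routing score per expert
    topk = np.argsort(logits)[-k:]               # indices of the k highest-scoring experts
    weights = softmax(logits[topk])              # normalize scores over the chosen experts
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

d, n_experts = 8, 4
rng = np.random.default_rng(0)
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: W @ x for W in expert_mats]   # real experts are feed-forward blocks
gate_w = rng.normal(size=(n_experts, d))

x = rng.normal(size=d)
print(moe_layer(x, experts, gate_w).shape)       # (8,): same size, but only 2 of 4 experts ran
```

Because only k of the experts execute per token, the active compute per forward pass is a fraction of the total parameter count, which is why a 671B-parameter MoE model can be comparatively cheap to run.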