DeepSpeed-FastGen
Prompt Engineering for LLMs
oreilly.com
TL;DR
LLMLingua utilizes a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens in prompts. This approach enables efficient inference with large language models (LLMs), achieving up to 20x compression with minimal performance loss.
microsoft • GitHub - microsoft/LLMLingua: To speed up LLM inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
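A minimal sketch of what this kind of prompt compression looks like in code, assuming the llmlingua Python package's PromptCompressor interface; the model choice, placeholder context, and token budget below are illustrative, not a recommendation:

```python
# Sketch: compress a long prompt with a small LM before sending it to a large one.
# Assumes the llmlingua package; model name and target_token are illustrative.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="NousResearch/Llama-2-7b-hf",  # compact model that scores token importance
)

result = compressor.compress_prompt(
    ["<long retrieved document goes here>"],          # bulk context to shrink
    instruction="Answer the question using the context.",
    question="What does LLMLingua do?",
    target_token=200,                                 # rough budget for the compressed prompt
)

print(result["compressed_prompt"])  # shortened prompt to pass to the target LLM
```

The key idea is that the small model's token-level perplexity stands in for "importance," so low-information tokens can be dropped before the expensive model ever sees them.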
For tasks that demand low latency, GPT‐4.1 nano is our fastest and cheapest model available. It delivers exceptional performance at a small size with its 1 million token context window, and scores 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider polyglot coding—even higher than GPT‐4o mini. It’s ideal for tasks like classification or autocompletion.
Introducing GPT-4.1 in the API
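A minimal sketch of the kind of low-latency classification call the nano model is pitched for, using the OpenAI Python SDK; the label set, prompt, and settings here are illustrative assumptions:

```python
# Sketch: single-label ticket classification with a small, fast model.
# Labels and prompt text are made up for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "system", "content": "Classify the ticket as one of: billing, bug, feature_request."},
        {"role": "user", "content": "The app crashes whenever I open the settings page."},
    ],
    max_tokens=5,   # a single label keeps latency and cost low
    temperature=0,  # deterministic output for classification
)

print(resp.choices[0].message.content)  # e.g. "bug"
```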
Key Innovations in Model Architecture
Two innovations have driven a lot of the recent improvements in Generative AI:
- Encoder-decoder system: two neural networks that compress and then expand the data
- Attention mechanism: ensures that important information isn't lost in the compression and expansion process, by "paying attention" to the important pieces of the input (a minimal sketch of this mechanism follows after this clip)
Introduction to Generative AI
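To make the second bullet concrete, here is a minimal sketch of scaled dot-product attention with toy NumPy values; real models add learned projections and multiple heads, so treat the shapes and numbers as illustrative only:

```python
# Sketch of scaled dot-product attention: each query "pays attention" to the keys
# and returns a weighted mix of the values. Toy sizes, no learned projections.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity between queries and keys
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 per query
    return weights @ V                   # weighted combination of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))              # 3 tokens, 4-dimensional representations
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(attention(Q, K, V).shape)          # (3, 4): one context-aware vector per token
```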
As we explore new applications for large language models and consider how well they can optimize our communication, AI challenges us to reflect on the qualities we truly value in our prose. How do we measure the caliber of writing, and how well does AI perform?
Laura Hartenberger • What AI Teaches Us About Good Writing
R1’s leap in capability and efficiency wouldn’t be possible without its foundation model, DeepSeek-V3, which was released in December 2024. V3 itself is big—671 billion parameters (by comparison, GPT-4o is rumored to be 1.8 trillion, or roughly three times as big)—yet it’s surprisingly cost-effective to run. That’s because V3 uses a mixture of experts (MoE) architecture.
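A minimal sketch of the MoE idea, where a gate routes each token to only a few experts so most parameters stay idle per input; the sizes, gating scheme, and k value here are toy assumptions, not DeepSeek-V3's actual configuration:

```python
# Sketch of top-k mixture-of-experts routing with toy linear "experts".
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(x, experts, gate_w, k=2):
    """Route a single token vector x to its top-k experts and mix their outputs."""
    logits = gate_w @ x                          # one routing score per expert
    topk = np.argsort(logits)[-k:]               # indices of the k highest-scoring experts
    weights = softmax(logits[topk])              # normalize scores over the chosen experts
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

d, n_experts = 8, 4
rng = np.random.default_rng(0)
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: W @ x for W in expert_mats]   # real experts are feed-forward blocks
gate_w = rng.normal(size=(n_experts, d))

x = rng.normal(size=d)
print(moe_layer(x, experts, gate_w).shape)       # (8,): same size, but only 2 of 4 experts ran
```

Because only k of the experts execute per token, the active compute per forward pass is a fraction of the total parameter count, which is why a 671B-parameter MoE model can be comparatively cheap to run.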