DeepSpeed-FastGen
GitHub - mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks
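The core idea in streaming-llm is that a few initial "attention sink" tokens must stay in the KV cache forever, while the rest of the cache is a sliding window over recent tokens. A minimal sketch of that eviction policy (function and parameter names are illustrative, not the repo's API):

```python
# Sketch of attention-sink KV-cache eviction: always keep the first
# n_sink positions plus a sliding window of the most recent positions.
# Names here are illustrative, not streaming-llm's actual API.

def evict(cache, n_sink=4, window=8):
    """Return the cache entries kept after eviction."""
    if len(cache) <= n_sink + window:
        return list(cache)          # still fits: keep everything
    # Keep the sink tokens at the front plus the recent window.
    return list(cache[:n_sink]) + list(cache[-window:])

positions = list(range(20))
kept = evict(positions)  # sinks 0..3 plus the last 8 positions
```

Everything between the sinks and the window is dropped, which keeps memory constant regardless of stream length.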
Great read - "Understanding LLMs: A Comprehensive Overview from Training to Inference"
The journey from the self-attention mechanism to today's LLMs.
This paper reviews the evolution of large language model training techniques and inference deployment...
One of the best tutorial-style repos since @karpathy's minGPT! GPT-Fast: a minimalistic, PyTorch-only decoding implementation loaded with best practices: int8/int4 quantization, speculative decoding, Tensor parallelism, etc. Boosts the "clock speed" of LLM OS by 10x with no model change!
We need more minGPTs and...
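One of the techniques GPT-Fast leans on is int8 weight quantization. A minimal sketch of the symmetric per-tensor variant (illustrative names, not GPT-Fast's actual code):

```python
# Sketch of symmetric int8 weight quantization: scale floats so the
# largest magnitude maps to 127, then round to integers. Names are
# illustrative, not GPT-Fast's API.

def quantize_int8(weights):
    """Map float weights to int8 values with a shared per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [x * scale for x in q]

w = [0.5, -1.27, 0.031, 1.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # close to w, within half a scale step
```

The win is memory bandwidth: decoding is typically bandwidth-bound, so reading 1 byte per weight instead of 2-4 directly speeds up token generation.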
Jim Fan (x.com)

New blog post: how to make LLMs go fast! Want to understand how people are making LLMs go brrrrr? This post is a survey of lots of different LLM inference optimizations, ranging from "everyone uses this in prod" to "I cooked this up last week (but it seems to work)" https://t.co/mZwpqJghNq