Machine Learners Guide to Real World - 2️⃣ Concepts from Operating Systems That Found Their Way in LLMs
Google DeepMind used a similar idea to make LLMs faster in "Accelerating Large Language Model Decoding with Speculative Sampling". Their algorithm uses a smaller draft model to make initial guesses and a larger target model to verify them in a single pass. If the draft model guesses right often enough, decoding needs fewer slow passes through the large model, reducing latency.
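To make the mechanism concrete, here is a minimal sketch of the accept/reject scheme from speculative sampling. The toy `draft_dist` and `target_dist` functions are stand-ins for real models (assumptions for illustration, not the paper's code): the draft proposes a few tokens, the target scores them, and each token is accepted with probability min(1, p/q), with a resample from the residual distribution on the first rejection.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # toy vocabulary size


def _toy_dist(prefix, seed):
    # Deterministic toy next-token distribution given a prefix
    # (stand-in for a real language model's softmax output).
    local = np.random.default_rng(hash((tuple(prefix), seed)) % (2**32))
    logits = local.standard_normal(VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()


def draft_dist(prefix):   # small, fast draft model (hypothetical)
    return _toy_dist(prefix, seed=1)


def target_dist(prefix):  # large, accurate target model (hypothetical)
    return _toy_dist(prefix, seed=2)


def speculative_step(prefix, k=4):
    """One round of speculative sampling: draft k tokens, then verify."""
    # 1. Draft model proposes k tokens autoregressively.
    proposed, q_probs = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft_dist(ctx)
        tok = rng.choice(VOCAB, p=q)
        proposed.append(tok)
        q_probs.append(q)
        ctx.append(tok)

    # 2. Target model scores every proposed position (a single batched
    #    forward pass in a real implementation; a loop suffices here).
    p_probs = [target_dist(list(prefix) + proposed[:i]) for i in range(k + 1)]

    # 3. Accept or reject each drafted token so the final output still
    #    follows the target model's distribution.
    accepted = []
    for i, tok in enumerate(proposed):
        p, q = p_probs[i], q_probs[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # Rejected: resample from the residual distribution max(0, p - q).
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB, p=residual))
            return list(prefix) + accepted  # stop at the first rejection

    # 4. All k drafts accepted: take one bonus token from the target model.
    accepted.append(rng.choice(VOCAB, p=p_probs[k]))
    return list(prefix) + accepted


print(speculative_step([3, 1, 4], k=4))
```

The payoff is that up to k + 1 tokens can be emitted for roughly the cost of one large-model pass, while the accept/reject rule keeps the output distribution identical to sampling from the target model alone.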
There are some people speculating ...