What Actually Matters (And What Doesn’t) for DeepSeek
Welcome to the post-training era for startups
Training LLMs can be divided into two major phases: pre-training and post-training. Pre-training is an extremely expensive process that involves training a general model on a large corpus of data. Even in DeepSeek's case, a single training run cost $6 million …
Evan Armstrong • What Actually Matters (And What Doesn't) for DeepSeek
A key advantage of these RL advancements is that they apply to any open-source model. This flexibility lets organizations future-proof their AI investments: use the best current model, then reuse the same data and workflow to retrain when a better one comes along. For instance, a customer support AI could adopt newer foundation models …
Traditional LLM fine-tuning requires extensive labeled datasets, creating a barrier for smaller teams. DeepSeek R1's RL techniques address this by letting models fine-tune on smaller, specialized datasets, which are easier for small teams to collect. This is especially valuable in domains like math, where outcomes can be automatically verified …
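The "automatically verified" part is the crux: when correctness can be checked by a program, the reward signal costs nothing per example. A minimal sketch of such a verifier, assuming a hypothetical convention where the model ends each completion with "Answer: <value>" (the function name and format are illustrative, not DeepSeek's actual interface):

```python
import re

def math_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the completion's final answer matches the
    reference, else 0.0. No human labeler needed: the check is
    purely programmatic."""
    # Assumed convention: the model ends its reasoning with
    # "Answer: <number>".
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == ground_truth else 0.0

# Score a batch of sampled completions against one problem.
samples = ["... so 12 * 7 = 84. Answer: 84", "I think Answer: 85"]
rewards = [math_reward(s, "84") for s in samples]
print(rewards)  # [1.0, 0.0]
```

A small team only needs problems with known answers; the verifier replaces the labeling pipeline entirely.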
On top of that, V3 embraced multi-token prediction (MTP). Inspired by ideas from Meta's FAIR (Fundamental AI Research) team's paper "Better & Faster Large Language Models via Multi-token Prediction," it predicts several tokens at once rather than one word at a time. Finally, a trick called FP8 training …
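The shape of the MTP idea can be sketched in a few lines of NumPy. All sizes and weights below are made up for illustration; the point is only that each position gets k output heads, one per future offset, instead of a single next-token head:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, k = 100, 16, 2  # k = how many future tokens each position predicts

hidden = rng.normal(size=(8, d_model))        # hidden states for 8 positions
heads = rng.normal(size=(k, d_model, vocab))  # one output head per future offset

# Standard next-token prediction uses only heads[0]; MTP adds heads
# for offsets +2, +3, ..., so every position emits k distributions
# and the training loss sums over all of them.
logits = np.einsum("td,kdv->ktv", hidden, heads)
print(logits.shape)  # (2, 8, 100): k predictions at each of 8 positions
```

Denser supervision per forward pass is what makes MTP a training-efficiency win.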
R1's leap in capability and efficiency wouldn't be possible without its foundation model, DeepSeek-V3, released in December 2024. V3 itself is big: 671 billion parameters (by comparison, GPT-4o is rumored to have 1.8 trillion, roughly three times as many), yet it's surprisingly cost-effective to run. That's because V3 uses a mixture-of-experts (MoE) …
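The reason MoE makes a 671B-parameter model cheap to run is that a router activates only a few experts per token, so most weights sit idle on any given forward pass. A toy sketch with made-up sizes (V3's real expert counts and routing differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 4    # illustrative sizes, far smaller than V3's

x = rng.normal(size=d)                        # one token's hidden state
router = rng.normal(size=(d, n_experts))      # learned gating weights
experts = rng.normal(size=(n_experts, d, d))  # each expert reduced to one matrix here

# The router scores every expert, but only the top_k actually run.
scores = x @ router
chosen = np.argsort(scores)[-top_k:]
gates = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()  # softmax over winners

# Output is the gate-weighted mix of the chosen experts' outputs.
y = sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))
print(f"ran {top_k}/{n_experts} experts")  # the other experts' parameters stay idle
```

Compute per token scales with the experts you activate, not with the total parameter count.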
DeepSeek’s distillation techniques let R1’s capabilities trickle down into smaller, more budget-friendly versions of the model. You can even run a distilled variant locally on your MacBook Pro with just one line of code.
Perhaps R1's biggest breakthrough is the confirmation that you no longer need enormous data centers or thousands of labelers to push the limits of LLMs. If you can define what "correctness" means in your domain, whether that's coding, finance, medical diagnostics, or creative writing, you can apply reasoning-oriented RL to train or fine-tune your own models.
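One way to picture this, loosely modeled on DeepSeek's GRPO (Group Relative Policy Optimization): sample several completions per prompt, score each with your domain's correctness check, and compute each sample's advantage relative to its group instead of training a separate value model. A minimal sketch of the advantage step (the normalization constant is an assumption for numerical safety):

```python
import numpy as np

def group_advantages(rewards):
    """Group-relative advantages in the spirit of GRPO: each sampled
    completion is scored against its group's mean reward, so no
    learned value model is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon is an assumed safeguard

# Four sampled answers to one prompt, scored by a domain-specific
# correctness check (1 = verified correct, 0 = wrong).
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # correct answers get positive advantage, wrong ones negative
```

These advantages then weight the policy-gradient update, pushing the model toward whatever your verifier calls correct.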
Most large language models (LLMs) rely on reinforcement learning (RL) to refine how "helpful and harmless" they sound. Notoriously, OpenAI used low-paid workers in Kenya to label and filter toxic outputs, fine-tuning its models to produce more acceptable language.
DeepSeek R1 took a different path: instead of focusing on sounding right, it zeroes in …