#ai
We trained GPT‐4.1 to reliably attend to information across the full 1 million token context length. We've also trained it to be far more reliable than GPT‐4o at noticing relevant text and ignoring distractors across long and short context lengths.
Introducing GPT-4.1 in the API
We also recommend using Predicted Outputs to reduce the latency of full file rewrites.
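As a rough sketch, a full-file rewrite with Predicted Outputs might look like the following. The `prediction` parameter and model name reflect the API, but the file name, prompt wording, and edit request are illustrative only.

```python
# Sketch: using Predicted Outputs to speed up a full-file rewrite.
# The existing file contents are passed as the prediction, so the model
# only spends latency on the tokens it actually changes.
from openai import OpenAI

client = OpenAI()

with open("app.py") as f:  # illustrative file name
    original_code = f.read()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "Rewrite the file per the user's request. Output only code."},
        {"role": "user", "content": f"Rename the class User to Account.\n\n{original_code}"},
    ],
    prediction={"type": "content", "content": original_code},
)

print(response.choices[0].message.content)
```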
GPT‐4.1 also scores 87.4% on IFEval, compared to 81.0% for GPT‐4o. IFEval uses prompts with verifiable instructions (for example, specifying content length or avoiding certain terms or formats).
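Because IFEval's instructions are verifiable, compliance can be checked with simple programmatic rules rather than a human or LLM judge. A minimal sketch of two such checks follows; the word-count threshold and banned terms are made up for illustration.

```python
# Sketch: programmatic checks in the spirit of IFEval's verifiable instructions.
# The specific constraints below (word count, forbidden terms) are illustrative only.
def meets_length_instruction(answer: str, min_words: int = 400) -> bool:
    """Check an instruction like 'write in more than 400 words'."""
    return len(answer.split()) > min_words

def avoids_forbidden_terms(answer: str, forbidden: tuple[str, ...] = ("basically", "very")) -> bool:
    """Check an instruction like 'do not use the following words'."""
    lowered = answer.lower()
    return not any(term in lowered for term in forbidden)

answer = "..."  # a model response to be scored
passed = meets_length_instruction(answer) and avoids_forbidden_terms(answer)
```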
For API developers looking to edit large files, GPT‐4.1 is much more reliable at code diffs across a range of formats. GPT‐4.1 more than doubles GPT‐4o's score on Aider's polyglot diff benchmark, and even beats GPT‐4.5 by 8% absolute.
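Diff-style editing means the model emits only the changed hunks rather than rewriting the whole file. As a rough illustration (not Aider's exact format), a search/replace style edit might be applied like this; the edit payload and file name are placeholders.

```python
# Sketch: applying a search/replace style edit, loosely in the spirit of
# diff-based editing formats such as Aider's. Not the exact format any tool emits.
edit = {
    "path": "calculator.py",
    "search": "def add(a, b):\n    return a - b\n",   # buggy original block
    "replace": "def add(a, b):\n    return a + b\n",  # corrected block
}

with open(edit["path"]) as f:
    source = f.read()

if edit["search"] not in source:
    raise ValueError("search block not found; edit cannot be applied cleanly")

with open(edit["path"], "w") as f:
    f.write(source.replace(edit["search"], edit["replace"], 1))
```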
Through efficiency improvements to our inference systems, we've been able to offer lower prices on the GPT‐4.1 series. GPT‐4.1 is 26% less expensive than GPT‐4o for median queries, and GPT‐4.1 nano is our cheapest and fastest model ever.
In IFEval, models must generate answers that comply with various instructions.
For tasks that demand low latency, GPT‐4.1 nano is our fastest and cheapest model available. It delivers exceptional performance at a small size with its 1 million token context window, and scores 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider polyglot coding—even higher than GPT‐4o mini. It’s ideal for tasks like classification or autocompletion.
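A low-latency classification call routed to GPT‐4.1 nano might look roughly like the sketch below; the label set and prompt wording are placeholders rather than a prescribed recipe.

```python
# Sketch: a simple classification task using GPT-4.1 nano.
# Labels and prompt are illustrative assumptions, not part of the API itself.
from openai import OpenAI

client = OpenAI()

def classify_ticket(text: str) -> str:
    """Return one of a few support categories for an incoming ticket."""
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system",
             "content": "Classify the ticket as exactly one of: billing, bug, feature_request, other."},
            {"role": "user", "content": text},
        ],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()

print(classify_ticket("I was charged twice for my subscription this month."))
```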
We will also begin deprecating GPT‐4.5 Preview in the API, as GPT‐4.1 offers improved or similar performance on many key capabilities at much lower cost and latency.
One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, IFEval evaluates models on a set of verifiable instructions whose compliance can be checked programmatically.