Sublime

An inspiration engine for ideas

DeFi AI

Closing the Gap to Closed Source LLMs – Open-Sourcing 70B Abacus Giraffe! The best-performing open-source model on MT-bench in key categories. We are super excited to be open-sourcing our best model for Enterprise AI use cases - the 70B 32K context length Abacus Giraffe model!...

Bindu Reddy

x.com

would love if you could bench deepseek r1 distills. you can do this with llama.cpp (more accurate) or https://t.co/POZPRrv92O. also would be interesting if there is any difference in speed between gguf and mlx runtimes. if your time allows, i'd love to see numbers for 14b, 32b, 70b models, at different quantizations (say, 4, 8 bits)...

banteg

x.com

Ragas: LLM evaluation tools https://t.co/DQSGByCiB1

Tom Dörr

x.com

GPT-5 Pro Is Now On LiveBench. GPT-5 Pro and High are both top of LiveBench and are similar to each other. Overall, GPT-5 thinking models are excellent at problem-solving, but can become unbearably slow as they occasionally enter a death loop. And yeah, we haven't...

Bindu Reddy

x.com

We’re releasing PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research, as part of our Preparedness Framework. Agents must replicate top ICML 2024 papers, including understanding the paper, writing code, and executing experiments. https://t.co/CvYcDdk0nI

OpenAI

x.com

🚀 Introducing The LLM Inference Provider Leaderboard https://t.co/f1LzVce9yb - a live-updated, unbiased eval of API Inference products. Featuring: @abacusai, @anyscalecompute, @DeepInfra, @DecartAI, @FireworksAI_HQ, @LeptonAI, @togethercompute, @perplexity_ai, @replicate, as well as @OpenAI and @AnthropicAI...

Martian

x.com

After thinking about this problem for months, I am so happy to finally introduce DetailBench! It answers a simple question: How good are current LLMs at finding small errors, when they are *not* explicitly asked to do so? (Yes, the graph is right!) https://t.co/I9EIq1CF3W

Xeophon

x.com

Noticed a spike in @Zai_org's new GLM-4.5 model usage on @OpenRouterAI, so we ran our own eval using Roo Code. ✅ Scored 86, slightly better than Qwen3-Coder. 💸 Cost us about $27 to run. 📊 Solid value for the performance. Are you seeing similar results?...

Roo Code

x.com