If you use the same open-source model, you’d expect identical performance wherever you run it. Turns out not. A team at Artificial Analysis ran GPQA Diamond (16×), AIME25 (32×), and IFBench (8×) to compare performance; the numbers here show the median score across runs. What’s surprising is that AWS and Azure are consistently behind on performance....