For example, GPT‑4.1 scores 54.6% on SWE-bench Verified, an improvement of 21.4 percentage points over GPT‑4o and 26.6 points over GPT‑4.5. The other benchmarks shared by OpenAI show similar results, but how does it compare against Google's Gemini models?

According to benchmarks shared by Stagehand, a production-ready browser automation framework, Gemini 2.0 Flash has the lowest error rate (6.67%) along with the highest exact‑match score (90%), and it's also cheap and fast. GPT‑4.1, on the other hand, has a higher error rate (16.67%) and costs over 10 times more than Gemini 2.0 Flash.

In separate data shared by Pierre Bongrand, a scientist working on RNA at Harvard, GPT‑4.1 offers poorer cost-effectiveness than competing models. According to these benchmarks, the new GPT‑4.1 models are far better than the existing GPT‑4o and GPT‑4o mini, particularly in coding, but models like Gemini 2.0 Flash, Gemini 2.5 Pro, and even DeepSeek or o3-mini lie closer to or on the cost-performance frontier, which suggests they deliver higher performance at a lower or comparable cost.

We're seeing similar results in coding benchmarks, with Aider's polyglot benchmark listing GPT-4.1 at a 52% score, while Gemini 2.5 Pro is miles ahead at 73%.
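To make the cost-effectiveness comparison concrete, here is a minimal Python sketch that turns the Stagehand figures above into a rough accuracy-per-dollar ratio. The accuracy values come from the cited benchmarks (using one minus the reported error rate as a proxy for GPT‑4.1, whose exact-match score isn't given); the dollar costs are illustrative assumptions chosen only to reflect the reported "over 10 times" price gap, not published pricing.

```python
# A rough sketch of the cost/performance frontier idea, using the
# Stagehand figures cited above. Accuracy numbers are from the article;
# per-run costs are ILLUSTRATIVE ASSUMPTIONS, not published pricing,
# chosen only to preserve the reported "over 10x" cost ratio.

models = {
    # name: (accuracy, assumed_cost_per_run_usd)
    "Gemini 2.0 Flash": (0.90, 0.01),    # 90% exact match, assumed unit cost
    "GPT-4.1": (0.8333, 0.11),           # 1 - 16.67% error rate, ~11x the cost
}

for name, (accuracy, cost) in models.items():
    # A higher accuracy-per-dollar ratio puts a model closer to the
    # cost-performance frontier described in the benchmarks.
    print(f"{name}: {accuracy / cost:.1f} accuracy points per dollar")
```

Under these assumed prices, Gemini 2.0 Flash comes out roughly an order of magnitude more cost-effective, which matches the frontier placement the benchmarks describe.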