Gemini 2.5 Pro and Claude remain the best models for coding, but that could change when xAI ships Grok 4 Code in August. Grok 4 is a huge leap from Grok 3, but how does it compare with other models on the market, such as Gemini 2.5 Pro? We now have answers, thanks to new independent benchmarks from LMArena.ai, an open platform for crowdsourced AI benchmarking.

According to LMArena's results, the Grok 4 API model (grok-4-0709) has received more than 4,000 community votes and ranks #3 overall in the Text Arena, a big jump from Grok 3, which ranked 8th. Grok 4 also scores in the top three across all categories: #1 in Math, #2 in Coding, and #3 in Hard Prompts.

It is worth noting that the tested model is Grok 4, not Grok 4 Heavy, which uses multiple agents to think through a problem and compare results. While both are reasoning models, Grok 4 Heavy is significantly better, but it is not yet available on the API platform, so the numbers could look different once it can be benchmarked. Meanwhile, Grok 4 Code is optimised for coding, and we're also expecting a CLI, similar to Gemini CLI and Claude Code.
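For readers who want to try the same model LMArena evaluated, here is a minimal sketch of how one might query grok-4-0709. It assumes xAI exposes an OpenAI-compatible chat completions endpoint at https://api.x.ai/v1 and that an API key is stored in an XAI_API_KEY environment variable; adjust for your own setup.

```python
# Minimal sketch: querying the Grok 4 API model (grok-4-0709) mentioned above.
# Assumes an OpenAI-compatible endpoint at https://api.x.ai/v1 and an API key
# in the XAI_API_KEY environment variable.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],  # assumed environment variable
    base_url="https://api.x.ai/v1",     # assumed OpenAI-compatible base URL
)

response = client.chat.completions.create(
    model="grok-4-0709",                # model identifier cited by LMArena
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
)

print(response.choices[0].message.content)
```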