Google’s AI Overviews is still wrong 10% of the time, which is a lot of lies


Google’s AI Overviews has been a mess since day one. Remember when it told people to put glue on pizza? Or when it suggested eating rocks for nutrition? Those were fun times. But Google has been quietly improving the thing, and according to a recent New York Times analysis, it’s now right about 90 percent of the time.

That sounds good until you realize what 10 percent wrong means at Google’s scale. The Times worked with a startup called Oumi to test AI Overviews using SimpleQA, an OpenAI benchmark from 2024 that throws 4,000+ verifiable questions at an AI. Oumi ran the test last year when Gemini 2.5 was Google’s top model and got 85 percent accuracy. After the Gemini 3 update, that climbed to 91 percent.

So roughly one in ten answers is wrong. Google processes billions of searches per day, and even if only a fraction of them trigger an Overview, do the math. That’s tens of millions of incorrect answers daily, which works out to hundreds of thousands per hour at the very least. “Lies” might be a strong word since the AI isn’t malicious, but it’s definitely broadcasting a lot of confidently wrong garbage.
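To make that math concrete, here is a back-of-the-envelope sketch in Python. Only the 9 percent error rate comes from the numbers above (91 percent accuracy after the Gemini 3 update). The 5 billion searches per day and the 10 percent share of searches that show an Overview are placeholder assumptions picked for illustration, not reported figures, and the function names are made up for this example.

```python
# Back-of-the-envelope estimate of wrong AI Overviews answers.
# Every input except the error rate is an illustrative assumption.

HOURS_PER_DAY = 24


def wrong_answers_per_day(searches_per_day: int, overview_pct: int, error_pct: int) -> int:
    """Estimate daily wrong answers from whole-number percentages.

    Integer math keeps the result exact (no floating-point rounding).
    """
    with_overview = searches_per_day * overview_pct // 100
    return with_overview * error_pct // 100


def per_hour(daily_count: int) -> int:
    """Spread a daily count evenly across 24 hours (rounded down)."""
    return daily_count // HOURS_PER_DAY


if __name__ == "__main__":
    searches = 5_000_000_000  # ASSUMPTION: stand-in for "billions per day"
    overview_pct = 10         # ASSUMPTION: share of searches showing an Overview
    error_pct = 9             # from the 91 percent accuracy Oumi measured

    daily = wrong_answers_per_day(searches, overview_pct, error_pct)
    print(f"{daily:,} wrong answers per day")             # 45,000,000
    print(f"{per_hour(daily):,} wrong answers per hour")  # 1,875,000
```

With these placeholder inputs the estimate lands at 45 million wrong answers per day. Set `overview_pct` to 100 and it jumps to 450 million, so the real figure depends heavily on how often an Overview actually appears.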

I’ve been using AI Overviews since the beta, and my experience matches these numbers. For simple facts like “what year did the Berlin Wall fall” it’s fine. But ask it anything nuanced about a niche topic, and you’re gambling. It recently gave me a hilariously wrong answer about a local hiking trail, describing a route that hasn’t been maintained since 2019 as if it were still usable. If I’d followed that advice, I’d have been bushwhacking for hours.

The thing is, 90 percent accuracy is considered decent for a large language model. But Google isn’t a chatbot you chat with for fun. It’s the world’s primary information retrieval system. When people search “how to treat a burn” or “what medication interacts with this,” they expect the truth, not a statistical guess.

Google has been slowly rolling out improvements. They’ve added citation buttons, limited the types of queries that trigger Overviews, and apparently improved the underlying model significantly. But 10 percent wrong is still a disaster for a service that reaches half the planet.

The AI industry loves to benchmark and claim victory when accuracy goes from 85 to 91 percent. That’s a real improvement, no doubt. But for Google specifically, the margin for error is zero. Every wrong answer that goes viral erodes trust. And once trust is gone, getting it back is harder than training a model to stop recommending glue recipes.
