We're not taking the fact-checking powers of AI seriously enough. It's past time to start.
Some notes on the Nobel Prize hallucination that wasn't
A few days ago, I was reading a Stephen Harrison post about Grokipedia. He mentioned Grokipedia had hallucinated a fact about the Nobel Prize.
That fact? That “physics is traditionally the first award presented in the Nobel Prize ceremony,” a sentence apparently added by Grokipedia.1 His source regarding the gaffe was PolitiFact, which had checked this as part of a larger review of Grokipedia and determined that it was false, at least in recent years.
From PolitiFact’s November reporting:
PolitiFact found at least one instance when Grokipedia introduced misleading information. The Grokipedia and Wikipedia articles for “Nobel Prize in Physics” are largely the same, but one sentence Grokipedia added said, “Physics is traditionally the first award presented in the Nobel Prize ceremony.” It did not provide a citation, and it appears to be wrong: In at least the past few years, the Nobel Prize for Physiology or Medicine was awarded first.
As readers of this blog know, I’ve been writing a piece on the “three moves for AI-based investigation” for students, and I thought this might make a great example of why you always need to track crucial facts in AI responses down to their sources. I’ve been using the Deloitte example, along with some examples from my own experience with AI, and a Grokipedia hallucination would have rounded out the example set nicely.
Of course, I assumed PolitiFact was likely correct about the error, but I wanted to know the larger context behind it before I used it as an example. So I reached for Deep Background, the AI prompt I built last spring that provides excellent contextualization of most textual claims (it’s a bit iffy on images, like all LLMs, but incredibly solid on text).
I put in the claim:
The tool then did its thing. First it did an implicit claim analysis to make sure we were seeing the larger picture...
Then it formulated a search strategy, executed the searches, and analyzed the results.
Deep Background uses an off-the-shelf LLM (Claude’s paid version, though I’ve also used it successfully with Gemini 3 and ChatGPT 5.x). It then supplements that with an extensive set of prompting instructions based on what I’ve learned from teaching students to fact-check over the past fifteen years.
The prompt is currently 3,500 words long (down from its peak, actually), and it took a few months to perfect. It pulls together a decade and a half of my work in information literacy and argumentation theory. It also takes a bit of time to run. It is very token- and search-intensive, requiring a paid account to run properly. But what it gives you if you run all four phases is not just an answer, but a well-sourced answer that tends to surface the most authoritative sources, the most compelling rebuttals, and the most pertinent missing context.
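For the technically inclined, here is roughly what the setup looks like if you drive it through an API rather than a chat window. This is a sketch, not my actual workflow (which is just the prompt pasted into a paid chat app, with that app’s built-in web search doing the retrieval); the file name, model string, and claim below are placeholders, and the example uses the Anthropic Python SDK.

```python
# A minimal sketch only, not the actual Deep Background workflow. Assumes the
# Anthropic Python SDK ("pip install anthropic") and an API key in the
# ANTHROPIC_API_KEY environment variable. The prompt file, model string, and
# claim are placeholders; a bare API call like this also lacks the built-in
# web search the paid chat apps provide, which the real workflow leans on.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("deep_background_prompt.txt", encoding="utf-8") as f:
    system_prompt = f.read()  # stands in for the ~3,500-word instruction set

claim = ("Physics is traditionally the first award presented "
         "in the Nobel Prize ceremony.")

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; any current paid-tier model
    max_tokens=4000,
    system=system_prompt,              # the long fact-checking instructions
    messages=[{"role": "user", "content": claim}],  # the bare claim to check
)

print(response.content[0].text)  # the round-one analysis
```

In practice I run it in the chat apps, both because they supply the search layer and because I want to read each phase before kicking off the next.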
Anyway, after a bit of search processing Deep Background spat out the results.
Which, I have to say, were a bit of a surprise:
(You can view the full session with links here)
That’s right, Deep Background told me that rather than this quote being fabricated by Grokipedia, Grokipedia was right.
I clicked through the supporting link Deep Background provided of the 2024 ceremony order. The program seemed to confirm it:
Another Round
Core to Deep Background’s effectiveness is a multi-round process. In the first round, the prompt makes an initial foray into the information environment for an answer. The second round stress-tests the answer from the first round. I ran the second round by putting in the pre-programmed keyword that kicks it off:
While most of the results of the second round served only to strengthen the original claim, the response did highlight a possible confusion on the part of PolitiFact: the announcement order (not mentioned by Grokipedia) differs from the presentation order (mentioned by Grokipedia).
See point five:
(Again, you can view the full session with links here)
Given this distinction, I decided to do one more round, this time taking the PolitiFact claim about award order as the counterclaim to investigate.
Deep Background then resolved the contradiction:
It provided a new sources table with announcement links…
…then made a decent guess at the reason for the difference:
The announcement order appears to be an administrative/logistical arrangement — each awarding institution announces on its own schedule. The Karolinska Institutet simply happens to announce first (on Monday), while the Royal Swedish Academy of Sciences announces the next day.
The ceremony order, by contrast, follows the traditional protocol established in 1901, which tracks Nobel’s will: Physics, Chemistry, Medicine, Literature.
(I’m doubtful of course that each institution “announces on its own schedule”, but the sequence does make logistical sense. The two Nobels announced by the Academy of Sciences are on adjacent days; Medicine, which is announced by the Secretary-General of the Nobel Assembly, kicks off the week. Simpler for all involved.)
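If you were scripting this multi-round flow rather than running it in a chat window, each additional round is simply another user turn appended to the same conversation. The sketch below continues the hypothetical API example from earlier; the real stress-test keyword lives inside the (unpublished) prompt, so the follow-up strings here are stand-ins.

```python
# Continues the hypothetical sketch above; "client", "system_prompt", "claim",
# and "response" are assumed to exist from that snippet. The follow-up strings
# are stand-ins, not the real pre-programmed keywords.
conversation = [
    {"role": "user", "content": claim},
    {"role": "assistant", "content": response.content[0].text},  # round one
]

follow_ups = [
    "STRESS_TEST",  # stand-in for the real round-two keyword
    ("Counterclaim to investigate: PolitiFact reports that in recent years "
     "the Nobel Prize in Physiology or Medicine was awarded first."),  # round three
]

for turn in follow_ups:
    conversation.append({"role": "user", "content": turn})
    reply = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model string
        max_tokens=4000,
        system=system_prompt,
        messages=conversation,
    )
    conversation.append({"role": "assistant", "content": reply.content[0].text})
    print(reply.content[0].text)
```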
A final check
While I did confirm the order on the event agenda, I noted that the agenda did not have timings on it, and since I was actually going up against the finding of a respected fact-checking organization it seemed worth going an extra step. So I asked Deep Background for a link:
Then I used the AI-powered “Ask” feature on YouTube to get the timestamps of each award in the ceremony and clicked through them in sequence, as shown in the short one-minute screencast below:
And yeah, Physics is first. Physics has always been first. Physics will continue to be first.
Grokipedia was right.
First, you absolutely do not have to hand it to Grokipedia
Let me say this first, in case you landed here to gloat on Grokipedia’s behalf: Grokipedia is, in my opinion, ridiculous. It’s a parasitic, ideological project built by one of the world’s least trustworthy information companies in a way that fails to comprehend what made Wikipedia special in the first place. If you want to read about some of the actual errors contained in it, you can read The Atlantic or The Guardian.
So this is not a defense of Grokipedia.
But there are lessons here worth talking about. I’m actually not going to talk here about anything deep or complex. (If you want deeper, try this).
Rather here are three dirt simple takeaways that can improve both your analysis of (potential) AI error, and improve your use of AI to do that analysis.
Takeaway one: Stop assuming AI errors are hallucinations
Here’s the first one: thinking of all LLM errors as “hallucinations” makes us worse at finding the truth.
I’ve talked about this before. Most AI errors are not hallucinations in the sense that term was originally meant. As one example, a much more common error is conflation, where two unlike things are stitched into one. Take, for example, this AI Overview that creates a season two finale for the show LOST by pulling in scenes (the church, the flash-sideways) from the series finale, creating a weird Frankensteined episode:
This is not hallucinated in the original sense of the word. It’s muddled in a way that we can strive to comprehend, recognize, and ultimately straighten out.
Another common error LLMs make, particularly in products relying on web search, is overreliance on unreliable or intentionally deceptive sources. NewsGuard has tracked how Russia’s disinformation machine has corrupted AI output by flooding the zone with literally millions of posts and articles, the content of which gets uncritically pulled into responses where coverage is sparse. Renee DiResta, in her Atlantic article, notes how Grokipedia uncritically repeated easily debunked claims about her work. Those claims appear to have been sourced to material entered into congressional proceedings, a common way politicians attempt to launder social media rumor into fact. In my own work, I’ve shown repeatedly how the first pass of LLMs on popular scientific subjects overindexes on social media treatments of those subjects, often at the expense of verifiable fact or expert insight.
I mention that these are not “hallucinations” because of how I discovered this PolitiFact error in the first place. I had decided that if I was going to use what PolitiFact reported as an example of a Grokipedia mistake, it was important for me to understand the type of error it was, just as I would with a human-produced error. It looked like Grokipedia got something wrong — but what sort of wrong?
Was it conflating? Leaning too heavily on social media sources? Misreading a primary document? If it was “muddled,” where did that muddle likely originate? Or was it truly a “hallucination” in the classic sense, bearing no discernible relation to any set of known claims or documents?
So why did PolitiFact not try to figure out where the “error” came from, a process that would have revealed its own error in a matter of minutes?
I have no idea, really, in this instance. I can say what I see in the behavior of others online, and maybe it applies here. And I think this is something important for students to understand as well.
The ability to call errors in AI output “hallucinations” and the endless talk of “stochastic parrots” have caused a lot of smart people (and the people at PolitiFact are incredibly smart, I assure you) to bring their investigative process to a premature close. To use the term hallucination is to give yourself permission to disengage, throw up your hands, and say “LLMs, amirite?”
Why is this thing in the output? Randomness! Hallucination! Mystery solved, end of story.
One of the most rewarding habits I have developed when it comes to approaching AI error (and if you watch my daily walkthrough series you’ll see I have discovered a lot of AI error) is using the category of “hallucination” as a last resort description, applying it only after I have been able to eliminate other likely descriptors. This is especially important with search-assisted AIs. In my experience, the vast majority of errors coming out of AI are understandable as bad synthesis of search results and bad weighting of sources pulled in by search. And as a result of taking the error seriously, you sometimes learn it is not an error at all.
“But Mike!” you protest, “You’re so naive! LLMs don’t synthesize, they don’t weight sources, they are just three levels of math in a trenchcoat, plowing through autocomplete mechanics!”
Here’s what I’d say about that. My set of understandings led me to discover a substantial error by a news organization that, if search hasn’t failed me, seems to have been missed by everyone up until now.
What has “I can’t analyze the output because it’s meaningless fancy autocomplete” done for you?
Takeaway two: Add AI to your toolkit to answer the questions you didn’t think to ask
One of the obvious replies to this write-up is that most of the process I went through is pretty run-of-the-mill once you realize the key error was the fact-checker misreading the sentence:
Physics is traditionally the first award presented in the Nobel Prize ceremony.
As something like:
Physics is traditionally the first award announced.
It’s so obvious when you read it now, after all. And of course once you know that you could track down the documents to support it. If you’re a fact-checker you could make the call. So (the thinking goes) what we actually need is just for people to start reading sentences correctly; we don’t need AI for fact-checking at all.
This is backwards. It’s our tendency to get trapped in our own interpretations that makes LLMs such valuable tools.
As a person who has fact-checked thousands of things, I can tell you that what likely happened here is the same thing that has gone wrong so many times before. And, unfortunately, traditional search can compound it. In traditional search, we are often looking for an authoritative document. When we see (and misinterpret) something like
Physics is traditionally the first award presented in the Nobel Prize ceremony.
We think “I know what sort of document I need,” and type in something like “nobel prize announcement schedule”. Google dutifully complies.
Our assumption about the claim forms our idea of the type of evidence we are looking for, the retrieval of which often sinks us further into our assumptions. This is not a novel insight of mine; it is one of the most reproduced findings in the study of search and misinformation. We’ve got a whole chapter on the issue in Verified.
AI, when used effectively, doesn’t work like that.2 It is excellent at asking not only the questions we want answered, but the ones we didn’t even know we wanted answered.
Let me give a simple example. If you saw the image below and were thinking of sharing it, you might go to Google and ask whether the Benin walls really are the largest single archeological phenomenon on the planet:
To you, that’s the question that requires checking. But if you just upload the image and text into AI Mode (paid thinking version), you get this:
Yep — the claim is debatable, and I’ve got my eye on that shift from “the largest” to “one of the largest”. But notably, the image here is wrong. The piece we weren’t even thinking about! The AI response isn’t perfect, but it did something invaluable. It got us to zoom out when we were too zoomed in.
AI works best when you don’t force it into a corner on these things, and let it start by just giving first impressions of the claim or artifact itself. But even if you push it a bit the frontier models are remarkably good at intuiting not just the context you are asking for but the context you need. That’s invaluable.
Takeaway three: Pay for a model already
This point comes from having talked to a lot of reporters. I remain shocked at how many reporters and even fact-checkers do not use a paid model of AI — or even know it makes a difference.
It makes a huge difference!
As an example of how much using a paid model improves results, here we put the Nobel claim into the free version of AI Mode:
It fails! Not great! This is one of those conflation errors I talked about above: just as the LOST example collapsed details from two separate finales, this one makes a false synthesis out of the announcement order and the presentation order. This happens with the free versions of LLMs quite a bit, or at least enough to matter.3
Now, if we’ve paid 20 bucks a month and selected thinking mode, we get this:
This is correct! It’s not as well-sourced an answer as Deep Background gives, and it doesn’t figure out the confusion with the announcement order. But it does surface the right answer to the right question. It does provide a link you can start with. That’s from just putting in the bare claim into a $20/month service. And it really doesn’t matter which service. Here it taps Gemini on the back end, but there would be little functional difference with Claude or ChatGPT. Choose one! As long as it is paid, your chance of discovering something you’ve been missing is going to go way up.
I spend a lot of my mornings doing fact-checking walkthroughs using the free tools that the public uses, and it kind of kills me I have to work around their quirks. Most things that are difficult in those walkthroughs would not be difficult at all if people would just pay for the tech. But I use the free tools because I know that’s what the average person is using, and that’s who I’m trying to help.
But reporters aren’t average people. They are professionals. And yet I can’t tell you how many phone calls I’ve had with journalists where a reporter will reveal they do not pay for an LLM. To me, that’s like a reporter in 2005 not paying for internet.
Into the future
I worried a bit about this post. Lord knows that fact-checkers don’t need any additional headaches. The error here turns out not to have been broadly disseminated; for whatever reason this particular article’s example wasn’t picked up as much as some other journalistic accounts of the Grokipedia launch.4 The particular piece is not technically a fact-check, though I imagine if my post went viral it would be erroneously described that way. I have also had enough of the false concern of people proclaiming loftily, with much drama, “But who will fact-check the fact-checkers?”5 when what they really mean is “It should be illegal for you to say I am wrong.” If that happens here, with people using this post as “evidence,” it’s going to depress me.
In the end, though, I do think mistakes like this are educative.
Maybe my specific insights here are wrong. I have zero direct insight into how this happened, and I hope readers (and especially the fact-checkers involved) won’t read my hypothesizing as arrogance.
But I remain convinced that there is no future of verification and contextualization that doesn’t involve both a better understanding of LLMs and more efficacious use of them. The three simple suggestions here — don’t prematurely dismiss errors as hallucinations, do use LLMs to surface the “unknown unknowns,” and do pay for the better models — are all pleas to engage more fully with this tech and to develop better (and more up-to-date) understandings of what these systems do well and what they do poorly, going beyond whether they fail or succeed to how they fail and how they succeed. This is also what we need to do in education. If we can start from there, I’m confident our effort and engagement will be well rewarded.
Note: As a result of my discovery, PolitiFact did additional research and confirmed directly with the Nobel Foundation staff that physics is traditionally first. The article has been updated.
1. I only say “apparently” because the sentence is not currently in Grokipedia, and there is no edit history showing its removal — which is a problem in itself!
2. I know that “effectively” is doing a lot of work here, but I am working on my get it in, track it down, follow up model where the “get it in” move shows how to tap this particular strength of AI effectively.
3. Note that conflation errors are not unique to LLMs — this entire post is really about a conflation error by PolitiFact.
4. Also, the main point of the paragraph in which the error occurs is that if you’re going to say something like “traditionally this is the order,” you need to link a source for that, and Grokipedia did not. Which is part of the reason for this whole mess.
5. One of the many reasons “who will fact-check the fact-checkers” is so tiresome is that different fact-checkers quite often disagree. They aren’t monolithic, and most checks are covered by multiple entities. Additionally, there’s an entire universe of reporters, bloggers, op-ed writers, and ordinary people online who critique fact-checks, and if issues are discovered, fact-checkers usually do pretty well at responding. I know people think this is a clever saying, but it feels to me like someone saying “Aha, you read a book, but who will book the book?” You mean who will critique the book? Anyone who wants to, really.
On the other hand, if you mean “Why isn’t there someone making sure fact-checkers give equal credence to my views?” that’s a different question.