8 Comments
Stephen Fitzpatrick

Another superb post. I just don't think many folks are testing these models in the way that you are, but I would think that someone, somewhere should be doing this. I've used your prompt in Claude multiple times and the results are fantastic. What do you make of the new Google AI Search mode? Ridiculously primitive compared to what you're doing, but do you think it will continue to get better and better? As you say, it would seem like many of these problems should be fixable. Nice work!

aholan@poynter.org

Great post. I am very interested in thinking through how we can do more systematic testing of hallucinations and sourcing.

Heide Estes

Thanks so much for this comprehensive discussion. This is such important work.

Roger Schibli

Love this so much!

Dean Lingley

Great post, Mike! Thanks for your work and for sharing! Interesting that the link hallucinations in Gemini break around 10. I saw a similar issue when asking it to create a YouTube Music playlist for me from a list of songs. It would get the first 10 perfect and then start going off the rails... must be something about the number 10...

Mike Caulfield

That's very interesting, actually. I wonder if there's a background limit on page fetches, and after that it YOLOs it?

Denis Setiawan

Have you tried this in NotebookLM? Would the result differ from Gemini's?

Mike Caulfield

NotebookLM wouldn't really be able to do this: the challenge of fact-checking is that you don't know in advance which documents you need as sources or how to weight them, which is one reason approaches that don't use the SIFT Toolbox fail.
