Another superb post. I just don't think many folks are testing these models in the way that you are, but I would think that someone, somewhere should be doing this. I've used your prompt in Claude multiple times and the results are fantastic. What do you make of the new Google AI Search mode? Ridiculously primitive compared to what you're doing, but do you think it will continue to get better and better? As you say, it would seem like many of these problems should be fixable. Nice work!
Great post. I am very interested in thinking through how we can do more systematic testing of hallucinations and sourcing.
Thanks so much for this comprehensive discussion. This is such important work.
Love this so much!
Great post, Mike! Thanks for your work and for sharing! Interesting that the link hallucinations in Gemini break down around 10 — I saw a similar issue when asking it to create a YouTube Music playlist for me from a list of songs. It would get the first 10 perfect and then start going off the rails... must be something about the number 10...
That's very interesting, actually. I wonder if there's a background limit on page fetches, and after that it just YOLOs it?
Have you tried this in NotebookLM? Would the result differ from Gemini?
NotebookLM wouldn't really be able to do this -- the challenge of fact-checking is that you don't know in advance which documents you need as sources or how to weight them, which is one reason why approaches that don't use the SIFT Toolbox fail.