Caption Files and Attribution Reversal in LLMs
Another win for the "fancy search result" framework
I’ve been trying to figure out why Gemini (and I imagine other LLMs) will sometimes get an exact line from a film right, but attribute it to the wrong person. For instance, if you put a Clark Gable line “he’d bite your finger off just for fun” from Mogambo into AI Overview, you get this, which attributes it to Ava Gardner:

There are a bunch of errors here, including the idea that the warning about getting a finger bitten off is about a panther (it’s actually about a chimp). But the dialogue attribution really doesn’t make any sense. Clark Gable is the person who owns the animals here. Why would Ava Gardner, a globe-trotting “playgirl” fresh off the boat from New York City, be warning him?
I set up an inspector on my Arc film fact-checker to see what grounding searches were bringing back. It took a while but the inspector is going to be invaluable, and I am embarrassed I didn’t set it up before. Here is the search grounding for the related quote “We dine at nine,” which comes right after the “bite your finger off” quote:

Do you see item #4 there? SubtitleCat? That’s your answer to why the attribution reversal is happening.
The answer to how the error comes about is actually ridiculously simple, and it might explain a lot of film errors in LLMs.
Here’s what’s happening. For most films of note there is an “srt” file that someone somewhere has uploaded to the internet. This is a file of captions and timings, but no character names:

When you’re watching the film, of course, it’s obvious who says what to whom. But left with only a caption file and a smattering of online conversation about surrounding scenes in the film, the LLM does its best at mapping lines to speakers.
Think about this, though. It means that for many films the most detailed description of what happens on screen has no character names or actions in it at all. Everything has to be intuited from lines that aren’t associated with characters, show no scene breaks, and have no settings other than what characters might mention about them. We start to understand how my fact-checker can precisely tell you how many minutes into a film a specific scene is, but then place it at a military fort instead of a church.
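To make concrete what a caption file does and does not carry, here is a minimal Python sketch that parses SRT-style text. The two lines of dialogue are the ones quoted in this post, but the cue numbers and timestamps are invented for illustration, and the `Cue` and `parse_srt` names are my own, not part of Arc or any library.

```python
import re
from dataclasses import dataclass

# Illustrative SRT text. The dialogue is quoted in this post; the cue
# numbers and timestamps are made up for the example.
SAMPLE_SRT = """\
1
00:10:00,000 --> 00:10:03,000
He'd bite your finger off just for fun.

2
00:10:04,000 --> 00:10:05,500
We dine at nine.
"""

TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


@dataclass
class Cue:
    index: int
    start: str
    end: str
    text: str  # the format gives us nothing to put in a "speaker" field


def parse_srt(srt_text: str) -> list[Cue]:
    """Split SRT text into cues: a number, a timing line, and caption text."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # not a complete cue
        match = TIMING.match(lines[1].strip())
        if not match:
            continue  # skip blocks without a valid timing line
        text = " ".join(lines[2:])
        cues.append(Cue(int(lines[0].strip()), match.group(1), match.group(2), text))
    return cues


cues = parse_srt(SAMPLE_SRT)
for cue in cues:
    print(cue.start, cue.text)
```

Everything the parser recovers is a timestamp and a string. Who is speaking, where the scene is set, and where one scene ends and the next begins all have to be inferred from somewhere else.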
What’s the solution? For Arc, my filmic fact-checker, I think it is probably to have it:
Make its dialogue attributions in the scene explicit — internally produce the sequence of attributed lines.
Use a variety of search strategies to test those attributions and reattribute as necessary, including checking random, less authoritative sites where the lines are attributed, and using reasoning to ask whether the attributions make sense.
Only then move on to the description.
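The three steps above could be sketched roughly as follows. This is not Arc's actual code, just one way the flow might look. The names `AttributedLine`, `verify_attributions`, `find_attributions`, and `is_plausible` are all hypothetical, and the search and reasoning steps are stubbed out with hard-coded stand-ins where a real version would call a search tool and a model.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AttributedLine:
    text: str
    speaker: str  # current best guess at who says the line
    notes: list[str] = field(default_factory=list)


def verify_attributions(
    draft: list[AttributedLine],
    find_attributions: Callable[[str], list[str]],
    is_plausible: Callable[[str, str], bool],
) -> list[AttributedLine]:
    """Test each explicit attribution against search hits and a sanity check."""
    checked = []
    for line in draft:
        speaker = line.speaker
        notes = list(line.notes)
        # Step 2a: tally the speakers named by sources that attribute the line.
        votes = Counter(find_attributions(line.text))
        if votes:
            top_speaker, count = votes.most_common(1)[0]
            if top_speaker != speaker:
                notes.append(
                    f"reattributed from {speaker} to {top_speaker} ({count} source(s))"
                )
                speaker = top_speaker
        else:
            notes.append("no attributed source found")
        # Step 2b: ask whether the attribution makes sense in context.
        if not is_plausible(speaker, line.text):
            notes.append("fails plausibility check; needs review")
        checked.append(AttributedLine(line.text, speaker, notes))
    return checked


# Step 1: the explicit draft attribution (the mistaken one from this post).
draft = [AttributedLine("He'd bite your finger off just for fun.", "Ava Gardner")]


def stub_search(text: str) -> list[str]:
    # Hypothetical search results; a real version would query the web.
    return ["Clark Gable", "Clark Gable", "Ava Gardner"]


def stub_plausible(speaker: str, text: str) -> bool:
    # Stand-in for reasoning such as "the person who owns the animals
    # is the one likely to give the warning".
    return speaker == "Clark Gable"


checked = verify_attributions(draft, stub_search, stub_plausible)
for line in checked:
    print(line.speaker, "|", line.text, "|", line.notes)
```

Keeping the notes alongside each line means the later description step (step 3) can see which attributions were corrected and which are still shaky, rather than treating them all as equally settled.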
I probably won’t get to it until this weekend. But I have a hunch that the caption file problem is also behind some misattribution of action as well, and may be responsible for a whole host of weirdness. We’ll find out!
A Broader Lesson
I know many of my readers are not technical, and maybe a lot of people have not made it this far. But there is a lesson here! There’s a lot of talk about how LLMs “think” and that’s well and good. But so much of the conversation is based on the LLMs of 2023, before search grounding. When people talk about stochastic parrots or autocomplete on steroids, that’s what they are talking about. When people say “No one knows where the information is coming from” that’s also what people are talking about.
That’s not how these things work anymore! Yes, the language model is still the underlying technology that pulls the whole apparatus together, and yes, that part is opaque. But tool use, like the search grounding calls, is inspectable. Rather than trying to guess where in a training set of billions of documents the information came from, for that portion we can see a very small set of inputs into the answer and make fairly sound inferences about what is going on.
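As a small illustration of what inspecting those inputs can look like, here is a hedged Python sketch that scans a list of grounding results and flags any that appear to come from a caption-file site. The result format, the `.example` URLs, and the hostname hint are all assumptions for the example; a real grounding API will return its own structure.

```python
from urllib.parse import urlparse

# Assumption: caption-file sites can be spotted by a substring in the hostname.
# This is a crude heuristic, not a complete list.
CAPTION_HOST_HINTS = ("subtitle",)


def flag_caption_sources(results: list[dict]) -> list[tuple[int, str]]:
    """Return (rank, hostname) for results that look like caption-file sites."""
    flagged = []
    for rank, result in enumerate(results, start=1):
        host = (urlparse(result["url"]).hostname or "").lower()
        if any(hint in host for hint in CAPTION_HOST_HINTS):
            flagged.append((rank, host))
    return flagged


# Hypothetical grounding results, using reserved .example hostnames.
grounding_results = [
    {"title": "Film encyclopedia entry", "url": "https://encyclopedia.example/mogambo"},
    {"title": "Review", "url": "https://reviews.example/mogambo-1953"},
    {"title": "Fan forum thread", "url": "https://forum.example/thread/123"},
    {"title": "Caption file", "url": "https://subtitlecat.example/mogambo.srt"},
]

flagged = flag_caption_sources(grounding_results)
for rank, host in flagged:
    print(f"result #{rank} looks like a caption file source: {host}")
```

A flag like this does not say the answer is wrong. It only says that some of the dialogue arrived without speaker names attached, which is a good reason to double-check any attribution built on it.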
I don’t mean to undersell the notion that so much of how these things work remains a mystery. But there is a sort of learned helplessness even among critics that assumes reasoning in these things is completely uninspectable. That is not true, not by a long shot. And if that’s what we’re teaching our students we are doing them a grave disservice.

A key improvement to LLMs might have to be the LLM having a sense of uncertainty and replying "I don't know", instead of giving uncertain answers without ever being in doubt. And I doubt that is possible, given the techniques involved, but some improvement in this direction may be possible.
Second, given how much work you have to do to get more reliability of results (and that only in a specific domain you know well) aren't you wondering about how much you are putting in and how much the LLM is putting in?