9 Comments
Stephen Fitzpatrick

The scenario you describe is also discussed by Robert McNamara in the Fog of War film. I think part of the argument, whether you call them hallucinations or whatever term you choose, is that some people will say the mistakes make them unusable while others claim they are still useful despite hallucinations.

Mike Caulfield

Yeah, maybe the broadest point is that even if your argument is that the systems err too much to be useful, the fact that they produce error through the same process as non-error is not really evidence that they can't be fixed. Most systems produce error through the same (general) method as non-error. It doesn't really provide any insight.

Mike Caulfield

Also I need to watch Fog of War I guess!

Michael Berman

You would be fascinated by it

Stephen Fitzpatrick

I agree. But AI opponents almost always use evidence of "mistakes" as their first go-to in condemning the technology. I've heard (read?) that the solution may be to layer LLMs and (eventually) additional forms of AI to check outputs before revealing them to users, but that still acknowledges the underlying issue, which is that LLMs make "everything" up - it's just that 95+% of it is coherent and intelligible to us. I don't know if this will work or how long it may take, but if there is a solution and we see AIs that basically don't make mistakes, then we may be in another era. 800 million weekly users clearly feel the "hallucination" issue is not nearly significant enough to hinder their benefit.
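
A minimal sketch of that kind of layered "generate, then check" pipeline, with generate() and verify() as hypothetical stand-ins for two separate models rather than any real vendor API, might look like this:

```python
# Hypothetical "generate, then check" layer: a second model (or a retrieval
# step) gates what the user actually sees. generate() and verify() are
# illustrative stubs, not calls to any real library.

def generate(prompt: str) -> str:
    """Draft an answer with a primary model (stub for illustration)."""
    raise NotImplementedError("call your drafting model here")

def verify(prompt: str, draft: str) -> bool:
    """Ask a second model or a source check whether the draft is supported (stub)."""
    raise NotImplementedError("call your checking model here")

def answer(prompt: str, max_attempts: int = 3) -> str:
    """Only reveal a draft to the user once the checker accepts it."""
    for _ in range(max_attempts):
        draft = generate(prompt)
        if verify(prompt, draft):
            return draft
    return "No draft passed the check; defer to a human."
```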

Timothy Burke

I had a long dialogue with both Gemini and Claude in my recent testing about the epistemological implications of generative AI and I will admit that it was clarifying for me both in the substance of the responses and in the patterning they demonstrated.

One of the major threads in those conversations concerned the epistemological patterns that derive from the training corpus and from the training feedback provided by developers and users. To wit: the high (and increasing) probability of generative AI responses that have truth value is a reflection of the degree to which the vast expanse of digitized text used for training favors those truths. That's where your radar analogy kicks in, I think: that radar operators get used to "reading" the pings from things that are really, ontologically there, but also understand that the system pings in response to both mechanical and environmental glitches that are not "there" in the same sense.

The difference might be that the only reason the training corpus produces more truthful or accurate information is that the textuality of the 20th Century and early 21st Century overwhelmingly dominates the corpus, and work that is scholarly, or at least informed by scholarship, in turn dominates that just in terms of pure percentages. Even works of fiction are in many cases shaped by the secular, factual, liberal, rationalist norms of 20th Century expression and thinking.

We're fortunate in some sense that the generative AIs that best simulate human expression require the most comprehensive textuality--it wouldn't have been possible to produce Gemini, Claude or ChatGPT from a corpus just limited to religious texts, to conspiracy theory, to fringe philosophies, to experimental fiction, and so on. If that had been possible, they would "hallucinate" much more both in the sense of struggling to make sense in natural language to queries that weren't posed within those epistemological frames and in producing nonsense or falsehoods that mirrored the narrower corpus.

But this is where the analogy to radar (sort of) breaks, in that it would be harder to become a skilled interpreter of radar signals if the reality of the material world could potentially shift on an ongoing basis, or if the radar designers could "relax" the working physics of the radar in order to detect more ephemeral kinds of material noise in the world. It's enough of a challenge for radar to keep up with the shifting nature of military technologies that are in some cases designed to confuse or evade radar. With generative AI, the character of the training corpora is the only reason that everything the AI says is not a "hallucination"--it is reproducing patterns of language that were not produced as hallucinations in the first place. Which is important--but I fear that a lot of generative AI producers actually don't understand that the improvements they're seeking rest on the maintenance of and continued production of knowledge in digital texts by human beings who are governed by fidelity to accuracy, evidence and truth.

Gerben Wierda

Indeed. The phrase ‘hallucination’ has been an unfortunate choice, as it was used to mark these outputs as ‘bugs’ or ‘errors’ and thus as something that was not ‘normal’.

A far better phrase, I think, is 'bad approximations'. Everything out of these statistical models is an approximation — a more neutral term. Many of these approximations are useful/meaningful/trustworthy; some are wrong.

https://ea.rna.nl/2023/11/01/the-hidden-meaning-of-the-errors-of-chatgpt-and-friends/ makes that argument and gives a clear example from the simpler GPT3 days when showing what actually happens was much easier.

The weakness of these current systems is not that their approximations can fail; it is that they take place at the pixel or token level *only*, and that getting to human levels requires an amount of calculation that is massively underrated. That is, we overrate our own intelligence and underrate the calculation required to get to our level. So no AGI, but that doesn't mean useless.
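
To make the "token level only" point concrete, here is a toy next-token step with an invented four-word vocabulary and made-up logits; real models do the same thing over tens of thousands of tokens, and the same sampling procedure produces accurate and inaccurate continuations alike:

```python
import math
import random

# Toy next-token step. The model only ever turns scores (logits) into a
# probability distribution over tokens and samples from it; nothing in this
# mechanism distinguishes a "true" continuation from a "hallucinated" one.
# The vocabulary and logits below are invented for illustration.
vocab = ["Paris", "Lyon", "Berlin", "banana"]
logits = [4.0, 1.5, 1.0, -2.0]

exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]  # softmax: an approximation, not a lookup

next_token = random.choices(vocab, weights=probs, k=1)[0]
print({t: round(p, 3) for t, p in zip(vocab, probs)}, "->", next_token)
```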

William of Hammock

Consider this counterargument, which moves beyond the "conversation ending" element of the proposition that "everything is hallucination."

Part of the original use of wartime radar, along with the risks of false positives and errors of commission which equate to violence, is governed by humans navigating the moral and ethical paradoxes of using violence to prevent violence. The "technical errors" are thereby nested in a broader consideration by the only kind of mind equipped to navigate it, for better or for worse.

Consider then the potential (what I consider to be likely) outcome of the combined use of radar and AI. Now we have added at least one extra layer of nesting (and, importantly, we have no good way of accounting for how many "layers" any given AI adds), as if we are using radar to detect statistical patterns of statistical errors (of commission and omission). How do human minds respond to an increase in sophisticated outcome simulation and a decrease in "proximity" to the decisions being made? Optimally, the system would just be a better radar system, and therefore the ethical calculus could be performed on a more refined picture. However, how do we categorize deviations from this optimal condition? What went wrong? Statistically speaking, well... nothing went wrong; there's just a new data point for training. This is not to say that a human will not be meaningfully conflicted by bad outcomes; it just complicates weighing statistical refinement against moral proximity, which compounds risks of runaway deference based on sunk costs or statistical dogmatics. It also compounds the risk of bifurcating the statistically literate and illiterate, weakening incentives to understand statistics relative to the seeming sufficiency of a system that, if it is meaningfully successful at all, will be much better at statistics than any human.

The consequences are such that your grievance with how "helpful" a runaway aphorism might be may rest on a preconceived notion that more refined information is a good for which one pays in costs that are also captured in informational terms. The intransigence of naysayers may signal, in smoke-detector-principle fashion, that one is introducing noise into the moral calculus. The demand that they first master statistical analyses and present their grievances in information-theoretic terms may be evidence of the kind of statistical deference that will only compoundingly lead to reliance on AI outsourcing. If the statistical product cannot be cleanly separated from the moral calculus, then whether we are prepared to partition such nested models of deference so that they are well fitted to human limitations is of immediate concern, just as climate tipping points are of concern long before they are imminent.

All this to say: that the output of LLMs is often coherent with few if any meaningful checks on correspondence, and that human minds might find coherence sufficient in practice when they would not think so in principle, together mean one should not background the sentiment based on its aphoristic decay. The decay products of human narratives belie the very processual limitations that must serve as a check and balance on narrative-generating machines that have effectively trained on the "rhetorical force" embedded in human language.

Swen Werner

That critique is not valid. An AI has no ability to determine the difference, since it lacks the ability to perceive, which is a precondition for hallucinating, i.e. perceiving without stimulus. It is also true that we find output that is obviously incorrect, but from the model's perspective what it does for correct and hallucinated output is identical. That is all, and it is important to keep in mind, because it means that all the fears about AGI enslaving the human race are unwarranted under such circumstances.
