The Apple "Reasoning Collapse" Paper Is Even Dumber Than You Think
We're this far into reasoners and neither hypesters nor skeptics really understand their significance. Also: Read Toulmin.
I want to talk about how the recent paper from Apple, “The Illusion of Thinking,” suffers from a wrong (and somewhat ridiculous) view of human cognition, but to do that I’m going to have to talk a bit about what human reasoning looks like.
I used to have a little joke when giving keynotes. I’d put up a slide titled something like “The World’s Most Amazing Conversation”. And on the slide would be an exchange like this:
H: (grabbing keys) Heading to class a bit early, 99 is under construction.
A: 99? You know you could just take Meridian, right?
H: Huh. And then get on I-5?
It’s amazing because what you see here is the engine of almost everything that humanity has built, whether it’s social progress or solar panels.
You might think what builds solar panels is math. And of course it is, partly. But what builds math? How did we collectively discover and agree on the validity of various solar panel-building formulas? In fact, what builds anything that goes beyond what came before?
It’s what you see above, in the DNA of every conversation, so common to our humanity we don’t even notice it. A person wants to take an action. They could just take it. If they were any other animal on the face of this earth, they’d just do it.
But they are human, and they want to be seen as reasonable. So when they say or do something — like leave surprisingly early — they sometimes feel compelled to explain it. It’s a weird compulsion. But it’s a key to progress. You want to do something and appear reasonable, so you give reasons. To do that you look at what you are expressing and ask yourself how people might respond to it. I’m leaving now — well, why so early? I’m worried about construction.
Once you give reasons, someone can poke at your reasons. Have you considered this? And maybe you have, and maybe you haven’t. Maybe you double down, or maybe you change course. Maybe you find a better way to do it. Maybe in a future conversation you pass that way on to someone else.
This is both how we think, and how we think together, and it’s the foundation of human cognition. Again, math is cool. But even math was built one “why don’t you try it this way?” at a time.
We think together even when we think alone
Ah, people say, but we often think alone. And that’s true. But our model of thinking alone, for almost all of the world, is relatively close to our model of thinking together. I am sitting here typing this, in an almost automatic way, and as the words come out on the screen I become my own reader, thinking, what would someone reading this be confused by? To what might they object? I edit and revise. Hopefully my reasoning gets sharper and my views become more well-grounded.
This is very different from looking at a physical problem and thinking “If I do this what might happen?” There was a meme a while back about being a “shape rotator” — someone who can do those IQ-test shape-rotation tasks well — and the idea was that this is a particularly high level of brilliance. It’s certainly a neat trick. But if human cognition were wired that way we’d have crow brains. We’d be infinitely better at confronting a series of small puzzles, but you wouldn’t be reading this essay right now. Instead we’ve got these huge heads devoted to language, endlessly listening to ourselves speak and asking “does this sound reasonable?”
We devalue that ability relative to the other stuff because it is so common in humans. It’s not even just an ability; as I mentioned above, it’s a compulsion. Taken alone, rotating shapes improves your proficiency at some things somewhat. By contrast, our process of giving reasons for things and then poking at them underlies everything humans do and accomplish as humans. Complex shape rotation is impressive not because it is some pinnacle of human cognition but because it is not a core part of our stunning native strengths.
A toolbox for fuzzy problems
I realize by now that you are probably wondering what this has to do with the Apple paper about LLMs failing to solve certain logic problems. I promise I am getting to that almost immediately, and all of this lead-up is relevant. You are almost there.
But what reasoning does for humans — reasoning in the Toulminian sense — is allow us to approach any problem that is fuzzy around the edges. Think about cutting down a tree. There’s a piece of this that is mathematics and visualization — if I cut it this way how will it fall? That’s important stuff. But everything around that is about raising the sorts of questions that fit into the explanations/arguments category:
If the tree is diseased is that going to make its fall unpredictable?
Would it make more sense to just pay a tree removal service?
Would there be any harm in waiting until next year?
I know I keep coming back to this but for God’s sake read Toulmin. As Toulmin points out repeatedly, there is a cult of rationality, one that was bolstered by the accomplishments of Newtonian physics, that believes all those bullet point questions are just degraded forms of reasoning compared to computing the fall of the tree. For Toulmin this was the wrong turn that analytic philosophy had made in moving away from the moorings of classical philosophy.
The thing is, we’ve evolved to answer these sorts of fuzzy questions, so they feel natural, but the question of “Would there be any harm in waiting until next year?” is not a simpler question than “Where will a tree cut this way fall?” It is in fact infinitely more complex.
Faced with the impossibility of answering this question like a math problem, Toulmin points out we move out of the realm of rationality and into the far more expansive and complex realm of reasonableness. We can’t know for certain whether there will be harm in waiting. It’s even worse than that in fact — even at the end of next year, having observed what happened — there is no definitive answer to the question of whether one made the right decision to wait.
Toulmin came into epistemology at a time (the 1950s) when the prevailing view was that conversational argument was a sort of degraded syllogism. And what Toulmin said was no, conversational argument was a different sort of thing entirely. There is no way of getting to a definitive answer, so the question people address is something like:
under the norms we have agreed on for evidence
and given what we know
what set of beliefs can be claimed to be reasonable.
We understand this as a structure of argument and persuasion but it is also (as seen in the car route conversation) the structure of explanation and sensemaking. Toulmin’s later textbook is not called “An introduction to argument”. It is quite rightly called An Introduction to Reasoning.
When LLM developers incorporated things like Chain-of-Thought into reasoning models, what they did was bring a simulation of that sense-making engine into the LLM. The system examines your text (a prompt) and produces a tentative reply. It then searches for the sorts of objections someone might raise to that reply. It then looks for rebuttals to those objections, and so on. It says the sorts of things someone might say about your evidence, given an assumed set of norms. This is all fake in one sense of course — it’s not *really* thinking — but it turns out to be an incredibly useful way to produce a simulation of thought. It helps these systems provide well-reasoned answers to questions, even if the system doesn’t technically reason (just as climate models can produce simulations of climate outcomes without producing weather — this shouldn’t be that hard to understand!).
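If it helps to see the shape of that loop, here is a minimal sketch in Python. To be clear, this is my own simplification, not anyone’s actual architecture: `generate` stands in for any call that maps a prompt to a model reply, and real reasoning models fold the whole cycle into one long sampled “thinking” trace rather than explicit function calls.

```python
from typing import Callable

# A rough sketch of the draft -> objection -> revision loop described above.
# `generate` is any function that maps a prompt string to a model reply;
# this illustrates the shape of the process, not an actual implementation.

def simulated_reasoning(question: str,
                        generate: Callable[[str], str],
                        rounds: int = 3) -> str:
    # Start with a tentative reply, the way a person blurts out a first take.
    draft = generate(f"Give a tentative answer to: {question}")

    for _ in range(rounds):
        # What might a careful reader object to in this draft?
        objection = generate(
            f"Question: {question}\nDraft answer: {draft}\n"
            "Raise the strongest objection a careful reader might make."
        )
        # Rebut the objection or concede the point, and revise the draft.
        draft = generate(
            f"Question: {question}\nDraft answer: {draft}\n"
            f"Objection: {objection}\n"
            "Revise the answer, rebutting the objection or conceding it."
        )
    return draft
```

The point is not the code; it is that the loop is Toulminian in shape: claim, objection, rebuttal, revision.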
All of this is exciting not because a computer can suddenly play chess (a problem solved ages ago) but because it allows computers to help solve a set of “fuzzy” problems that is exponentially larger than the set of non-fuzzy problems that conventional ones-and-zeros computation tends to be good at. That’s the whole point. That’s why people see promise in the technology.
Why would you use an LLM to play mathematical games?
People have wondered how Apple — with all their resources — could be so far behind the rest of the industry when it comes to LLMs. It’s truly puzzling! But if you’re looking for a reason, there is perhaps no better indication of what’s going wrong at Apple than this: some of their top researchers were trying to figure out what would be a good test of capability for an LLM, and what they came up with was “Let’s have it play checkers.”
It turns out a system built to simulate reasoning in the Toulminian sense doesn’t play pure logic games very well. It can talk its way through the easier problems, but it isn’t really up for narrating its thinking, coming up with rebuttals and counter-rebuttals, and sifting through evidence across a few hundred rounds of Towers of Hanoi. From this they deduce that these systems are inherently limited. This is their supposedly stunning finding.
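To give a sense of scale for that last point: the minimal Tower of Hanoi solution for n disks takes 2^n - 1 moves, so a move-by-move narration grows exponentially with puzzle size, and ten disks is already over a thousand moves, each of which has to be stated without a single slip. A quick sketch of the standard recursive solution, just to show the arithmetic:

```python
# The classic recursive Tower of Hanoi solution. Solving n disks takes
# 2**n - 1 moves, so a move-by-move narration grows exponentially with
# puzzle size.

def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)
    moves.append(f"move disk {n} from {source} to {target}")
    hanoi(n - 1, spare, target, source, moves)

for n in (3, 7, 10):
    moves = []
    hanoi(n, "A", "C", "B", moves)
    print(f"{n} disks: {len(moves)} moves")  # 7, 127, 1023
```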
I probably should end this post here, because really — what do you say after that? I’m not just being snarky. Building a reasoning system — even if the reasoning is simulated — requires an understanding of what your model of reasoning is, and in particular which elements of reasoning LLMs are poised to assist with. What this shows is something else.
But to tie it back together, it reminds me — as many things do, I suppose — of Toulmin. Particularly Toulmin in 1958, trying to explain to a bunch of philosophers who were analyzing conversation with formal logic that a) that is not how conversational argument works, and b) that is not because conversation is “degraded”; it is because conversation does things that are impossible to do with logic alone, and eats problems that logic finds impossible for breakfast.
In terms of our tree-removal analogy, checkers is at most predicting the path of the tree. Everything else is so much more complex than that, and so much fuzzier. If we could build assessments that tested the ability of these systems to navigate that fuzziness, to find and present to us a reasonable range of beliefs together with well-grounded reasoning to evaluate, and to present ranges of possibilities instead of either accepting anything or resolving to a single truth, then perhaps we would get somewhere interesting. But it’s pretty clear at this point that the interesting thing will not come from Apple.
Thank you Mike! I too am so tired of both sides (boomers and doomers) missing almost every point, fact and reality when it comes to AI. So much of the confusion would be cleared up if they invested more time in attempting to accurately explain what IS happening instead of conjuring prophecies about what WILL happen.
Of course, this paper is testing something that is happening. But it's an extremely limited slice of reality. This slice has now been seized upon and generalised into another example of another tired old prophecy.
What about an LLM makes you think it will be capable of the kind of human reasoning you describe?
I agree: I found it unsurprising that they can't follow instructions to play Towers of Hanoi - but I think it still exposes and tests the limitations of their ability to "reason" in this way. And I also think it's useful to know that they solve the simple versions - but then fail as Apple ratcheted up the complexity - showing that a lot of the "reasoning" they appear to do is just regurgitating data they were trained on.
What I don't follow is why you think they might be useful at "fuzzy" problems.