Thank you Mike! I too am so tired of both sides (boomers and doomers) missing almost every point, fact and reality when it comes to AI. So much of the confusion would be cleared up if they invested more time in attempting to accurately explain what IS happening instead of conjuring prophecies about what WILL happen.
Of course, this paper is testing something that is happening. But it's an extremely limited slice of reality. This slice has now been seized upon and generalised into another example of another tired old prophecy.
With Sam Altman saying that reasoners are System 2 thinking (another bad analogy), it's kind of inevitable.
What about an LLM makes you think it will be capable of the kind of human reasoning you describe?
I agree - I found it unsurprising that they can't follow instructions to play Towers of Hanoi - but I think it still exposes and tests the limitations of their ability to "reason" in this way. And I also think it's useful to know that they solve the simple versions but then fail as Apple ratcheted up the complexity, showing that a lot of the "reasoning" they appear to have is just regurgitating data they were trained on.
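For what it's worth, the procedure itself never changes as disks are added; only the number of moves grows (2^n - 1). Here is a minimal Python sketch of the textbook recursive solution (purely illustrative, not anything from the Apple paper), just to show what "following the algorithm" would mean and why it should scale with disk count:

```python
def hanoi(n, source, target, spare, moves=None):
    """Standard recursive Tower of Hanoi: returns the full list of moves."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # clear the top n-1 disks out of the way
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top
    return moves

# The rule is identical for 3 disks and 10; only the output length explodes (2**n - 1 moves).
for n in (3, 7, 10):
    print(n, "disks ->", len(hanoi(n, "A", "C", "B")), "moves")
```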
What I don't follow is why you think they might be useful at "fuzzy" problems.
I don't think about whether they "reason" -- that's meaningless. I don't understand what a statement like that would mean. But they produce stunningly good chains of reasoning and good simulations of reasoning, which turns out to be really useful. They do that in more or less the way I describe. You can see that in my previous work where they analyze 57 claims in the MAHA report.
So it's not really a prediction - they can simulate this sort of thinking now. Is a lot of it pulled from things other people have said, or the sort of things people say? Of course. That's how we think.
It feels like you're cheating a bit to say that they don't reason but do a great job simulating reasoning? That to me is a distinction without much of a difference. At least we agree they don't reason. I say they do a bad - and, more specifically, unreliable - job of simulating it too.
"Not really a prediction" I don't follow - they are language models built around predicting the next token? And yes they have an insane amount of training data and large context windows, but they're still just manipulating the form of thought (words/tokens) without understanding their meaning and even "chain of thought" just feeds in more tokens into the context window to predict against. And the “chain of thought” they produce doesn’t seem to match what they’re actually doing? e.g.
https://www.anthropic.com/research/reasoning-models-dont-say-think
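To make the "just feeds in more tokens" point concrete, here is a toy sketch of autoregressive decoding (purely illustrative - the stub distribution stands in for a real model, and nothing here is any vendor's actual API). The "chain of thought" is simply more generated tokens accumulating in the same context that the next prediction conditions on:

```python
import random

def next_token_distribution(context_tokens):
    """Stand-in for a trained model: returns (token, probability) pairs.
    A real LLM computes this from its weights and the context; this stub ignores the context."""
    vocab = ["so", "the", "answer", "is", "42", "<eos>"]
    weights = [random.random() for _ in vocab]
    total = sum(weights)
    return list(zip(vocab, (w / total for w in weights)))

def generate(prompt_tokens, max_new_tokens=20):
    """Greedy next-token decoding: a "chain of thought" is just more of these
    generated tokens piling up in the context window before the final answer."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token, _prob = max(next_token_distribution(context), key=lambda pair: pair[1])
        if token == "<eos>":
            break
        context.append(token)  # each emitted token becomes input to the next prediction
    return context

print(generate(["solve", "this", "step", "by", "step", ":"]))
```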
I looked at your MAHA post. "I don’t claim that everything in it is right — in fact I’m sure there’s a bunch wrong. But I think it’s still quite useful."
That's quite the caveat! I'm not saying it's *useless* to use AI this way, but like it'd be better to just have a team of humans read the links? Which is what happened with the NOTUS reporting, etc.
Where someone might be like "hey this sorta worked there's some good stuff in here", I am more "hey this didn't really work - there's a lot of made up wrong stuff here."
Long term with LLMs, I just don't see how anyone builds a reliable, dependable, and profitable product on top of these (very expensive to maintain) models given their inherent instability and unreliability. Whatever OpenAI is charging per token, it's not nearly enough to pay the bills. These models need to deliver enough value to justify their cost (both in pure $ from a profit/loss perspective and for society in terms of energy and the environment!). No surprise, but I don't think they do.
But the Apple paper is attempting to analyse reasoning traces in a controlled environment and argues that what it finds is not reasoning but memorisation. I am not sure how it contradicts your point.
So I think the whole thing goes much deeper than this. There are two paradigms for how humans develop capabilities: the scientist, doing endless experiments and deducing from the results, and the student, doing things first with others and then practicing alone.
I think LLMs, insofar as they resemble human knowledge acquisition at all, are modelled on the scientist. It seems that the latest efforts to improve abilities may double down on this by emphasising knowledge of the physical world and agency in that world. This would be very much how Piaget thought of things.
If I am right, though, what LLMs need (and lack) is learning by being taught. This involves conversation, which becomes the prototype for deliberative thought. We are taught how to talk by a myriad of conversational interactions with our caregivers, and then we "think" by talking to ourselves (you can see young children doing this).
I don't think a big context window is quite it - sadly, I think it's training from the get-go in a dialogic way and alteration of weights on the fly.
My guess is that our LLMs are finding that hard and suffer horrible training collapses because they are miracles of compression with the minimal number of nodes/connections to store the data.
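If it helps to pin the distinction down, here is a toy one-parameter contrast (entirely made up, nothing like a real LLM): conditioning on a bigger context leaves the weights untouched once the context is gone, whereas "alteration of weights on the fly" means the lesson persists in the parameters themselves.

```python
# Toy contrast, purely illustrative: a one-parameter "model" y = w * x
# taught from worked examples (x, target).

def answer(w, x):
    return w * x

def in_context(w, worked_examples, x):
    """'Bigger context window' route: the worked examples are just extra input for a
    frozen model to condition on; w itself is unchanged afterwards (a real model would
    attend over the examples - the point here is only that the weights don't move)."""
    return answer(w, x), w

def taught(w, worked_examples, x, lr=0.1):
    """'Alteration of weights on the fly' route: each example nudges w via one
    gradient step on squared error, and the change persists after the conversation."""
    for example_x, target in worked_examples:
        error = answer(w, example_x) - target
        w -= lr * 2 * error * example_x
    return answer(w, x), w

w0 = 0.5
examples = [(1.0, 2.0), (2.0, 4.0)]      # the "teacher" demonstrates y = 2x
print(in_context(w0, examples, 3.0))     # (1.5, 0.5): weights untouched
print(taught(w0, examples, 3.0))         # (5.28, 1.76): w has moved toward 2.0
```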
However, I'm a psychiatrist, not a computer scientist, and so very much unqualified to really know.
I appreciate your thoughts on this! I will mention that people appreciate this kind of research because it keeps Sam Altman and others like him in check. Claiming that AGI is near only serves to boost funding from investors, and it just feels fraudulent to make claims like that. Research like Apple's is what I want to keep seeing because, I'm sorry, if someone's so-called on-the-cusp-of-AGI chatbot can't do Towers of Hanoi or even checkers, we are far from achieving AGI.
The thing is, the inability of LLMs to do simple things like checkers alongside the complex conversational reasoning makes them inherently vulnerable. At present LLMs are like a mad brain in a jar that needs goons to help it survive. We know how that story ends: the hero outwits the goons and pulls the plug.
I was directed to this blog by someone who read my takes on this (also aimed at 'what is really happening'; not on Substack). I must say, this is an excellent article. I have never read Toulmin (but I did read the (later) Wittgenstein and Dreyfus, and have been a critic of 'intelligence is discrete logic' since I had my first job (in AI)). It seems I must read Toulmin. Any tip on where to start? (E.g. I suggest Hacker to people who want to get into Wittgenstein.)
As Dreyfus argued, our cultural paradigm has long been (Parmenides, Socrates, Plato already) that logic is key to intelligence (Whitehead's "2500 years of footnotes to Plato"), but humans are 'better at frisbee than at logic' (Andy Clark, if I recall correctly). Add to that that evolutionary constraints (energy efficiency and speed) require us to be very good at fast estimation and decision making, which means that much of our intelligence is 'mental automation' (convictions and such). Our convictions (on AI also) do not come so much from our observations and our reasonings; rather, our observations and our reasonings are strongly influenced (filtered, for instance) by our convictions. And we're not just evolved for individual success but for 'group' (tribal) success (so, for instance, we need convictions for personal efficiency, but we also need stable convictions because otherwise the group would be ineffective). All of these are key aspects of human intelligence.
The thing that I guess is ignored in almost any discussion I see is our fundamental property of reacting not so much to what is there as to what *could* be there (opportunities, risks): our *imagination*. I suspect that a self-driving car doesn't just need to sense and react; it needs to sense, infer potentials, and react (see https://ea.rna.nl/2025/01/08/lets-call-gpt-and-friends-wide-ai-and-not-agi/#imagination). Without the middle step the result is limited. And if you go cheap on sensors (like Tesla does), it can even be very risky.
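To caricature that middle step in code (a deliberately toy sketch - the scene flags and thresholds are invented, nothing from a real driving stack): reacting only to what is sensed versus also reacting to what could plausibly be there.

```python
def react_only(obstacle_distance_m, speed_mps):
    """Sense-and-react: brake only for what is actually detected right now."""
    time_to_obstacle = obstacle_distance_m / max(speed_mps, 0.1)
    return "brake" if time_to_obstacle < 2.0 else "cruise"

def infer_potentials(scene):
    """The 'imagination' step: enumerate things that *could* be there.
    Hand-written toy rules, not a real perception or prediction module."""
    potentials = []
    if scene.get("parked_van_blocking_view"):
        potentials.append("someone could step out from behind the van")
    if scene.get("ball_rolled_into_road"):
        potentials.append("a child could follow the ball")
    return potentials

def sense_infer_react(scene, speed_mps):
    """Sense, infer potentials, then react: slow down for the merely possible too."""
    if infer_potentials(scene):
        return "slow down"
    return react_only(scene.get("obstacle_distance_m", 1000.0), speed_mps)

print(react_only(obstacle_distance_m=50.0, speed_mps=10.0))                    # cruise
print(sense_infer_react({"parked_van_blocking_view": True}, speed_mps=10.0))   # slow down
```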
In my view, we overestimate what LLMs and such can do, and we overestimate what human minds can do, and we underestimate what is needed for our brains to do what they can do.
A key question is not so much how we get it from bit, but how we get bit (discrete logic) from an engine that isn't discrete at all (and not only that, but one that may even rely on nonlinear/chaotic effects). We know so little; it is embarrassing to see us draw such large conclusions.
GenAI is probably going to be pretty disruptive and there will be many valuable scenarios, but AGI is — as far as I'm currently estimating — not on the cards.