10 Thoughts On "The Illusion of Thinking"
No one should be taking any victory laps
The Lab Report
I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: News and analysis at the intersection of computing technology and policy.
If you’re new to the Lab Report you can subscribe here.
Apple published a new paper provocatively titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” I’ve seen a lot of hot takes, ranging from “this research is nothing but sour grapes from Apple” to “this is a deathblow for LLMs.” My hot take? Neither of those is true.
In brief, the paper challenges “reasoning models,” including Claude 3.7 Sonnet with Extended Thinking and OpenAI’s o3-mini, to solve puzzles of increasing complexity. One such puzzle is the well-known Towers of Hanoi, where you must move a set of disks from a starting peg to a final peg, moving one disk at a time and never stacking a larger disk on top of a smaller one. Here’s an online version of the puzzle, if you care to play.

The starting point for Towers of Hanoi with 3 disks.
All of the reasoning models seem to collapse utterly once these problems pass a certain level of complexity. For Towers of Hanoi, the “reasoning” models start failing when they have to move 5 or 6 disks, but they solve the puzzle fairly reliably before that. There are some other interesting findings, but that’s the main thrust.
Here are my 10 takes of moderate temperature:
1.) “Thinking” and “reasoning” are definitely misnomers in these models. The so-called “reasoning” approach iteratively prompts the LLM, using the LLM itself to generate interim prompts, give itself feedback, adjust those prompts based on that feedback, and decide when to stop (a rough sketch of the loop is below). But crucially, all of the final output could have been produced with a single prompt if only you knew it ahead of time. Therefore, it is much better to think of test time compute as a search of the underlying LLM’s latent space, not as “reasoning” or “thinking.”
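To make that loop concrete, here is a minimal sketch of a test-time compute loop. The `generate` function, the prompts, and the stopping check are hypothetical stand-ins, not any vendor’s actual API; the point is that every step is ordinary next-token prediction from the same underlying model.

```python
# Hypothetical sketch of a test-time compute ("reasoning") loop.
# `generate(prompt)` stands in for a single sampling call to the base LLM;
# it is not a real vendor API.

def generate(prompt: str) -> str:
    """Placeholder for one call to the underlying LLM."""
    raise NotImplementedError

def reason(question: str, max_steps: int = 10) -> str:
    scratchpad = question
    for _ in range(max_steps):
        # The model drafts an intermediate step ("thought").
        thought = generate(f"{scratchpad}\n\nThink step by step:")
        # The same model critiques its own draft.
        critique = generate(f"{thought}\n\nIs this correct? If not, revise:")
        scratchpad = f"{scratchpad}\n{thought}\n{critique}"
        # The same model decides whether to stop.
        if "FINAL ANSWER" in critique:
            break
    # Everything above is ordinary next-token prediction; the final answer
    # could, in principle, have come from one sufficiently good prompt.
    return generate(f"{scratchpad}\n\nGive the final answer:")
```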
2.) Point 1 is pretty widely accepted in the research community, but the marketing language used to sell LLM products does not reflect this reality. I am once again begging researchers, executives, and marketing folks alike to stop using vague, misleading, bombastic, and anthropomorphizing language to describe these systems.
3.) Accusations that Apple has “sour grapes” are kinda fair. They are way behind the curve on LLMs, and seem to want to keep betting against the technology. Notably Apple did not tout Apple Intelligence much this week at WWDC. But the paper also shows real limitations and weaknesses with using test time compute. It’s more than just sour grapes.
4.) Could this just be the interpolation vs. extrapolation problem plus data contamination? Machine learning models are notorious for failing to generalize. That is, they are good at making predictions within the distribution of their training data and bad at making predictions outside of it. To support this argument the authors evaluate performance on two older math benchmarks, MATH-500 and AIME24, and one newer math benchmark, AIME25. They write:
[Recent studies] have shown that under equivalent inference token budgets, non-thinking LLMs can eventually reach performance comparable to thinking models on benchmarks like MATH500 [40] and AIME24 [41]. We also conducted our comparative analysis of frontier LRMs like Claude-3.7-Sonnet (with vs. without thinking) and DeepSeek (R1 vs. V3). Our results (shown in Fig. 2) confirm that, on the MATH500 dataset, the pass@k performance of thinking models is comparable to their non-thinking counterparts when provided with the same inference token budget. However, we observed that this performance gap widens on the AIME24 benchmark and widens further on AIME25. This widening gap presents an interpretive challenge. It could be attributed to either: (1) increasing complexity requiring more sophisticated reasoning processes, thus revealing genuine advantages of the thinking models for more complex problems, or (2) reduced data contamination in newer benchmarks (particularly AIME25). Interestingly, human performance on AIME25 was actually higher than on AIME24 [42, 43], suggesting that AIME25 might be less complex. Yet models perform worse on AIME25 than AIME24—potentially suggesting data contamination during the training of frontier LRMs.

Figure 2 from the same paper, showing performance (y-axis) of “reasoning” models (blue lines) versus non-reasoning models built on the same backing LLM (pink lines) on the three benchmarks, as the inference token budget increases (x-axis).
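For readers unfamiliar with the pass@k metric mentioned in that quote: it’s the probability that at least one of k sampled attempts solves the problem. The paper doesn’t publish its estimation code; the snippet below is the standard unbiased estimator popularized by OpenAI’s Codex paper, included only to clarify the metric.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n total samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k) in a numerically stable way.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 16 samples, 4 of them correct.
print(pass_at_k(n=16, c=4, k=1))  # -> 0.25
print(pass_at_k(n=16, c=4, k=5))  # -> ~0.82, since 5 tries get more chances
```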
5.) Future and current models will benefit substantially from “tool use” and code execution. The industry is already moving in this direction, and the first attempt to standardize a protocol — Anthropic’s MCP — is gaining steam. In this paradigm, instead of relying on the mechanics of next-token prediction to perform the steps of these puzzles, the LLM generates and then executes an algorithm, which is also what a human would probably do to solve these puzzles at scale.
State of the art models can already produce working algorithms that solve the Towers of Hanoi problem, in part because it is a very popular introductory CS problem with countless solutions found online and in textbooks (the classic recursive solution is sketched below). The models might struggle to generate novel algorithms, but I still think this approach would work better than the current “reasoning model” approach for the problems Apple tested.
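For reference, here is the textbook recursive solution to Towers of Hanoi. Something like this is what a code-executing model would write and run, rather than enumerating moves token by token. This is the standard algorithm, not code from Apple’s paper.

```python
def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list[tuple[str, str]]:
    """Return the optimal sequence of moves for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    moves = hanoi(n - 1, source, spare, target)   # park the top n-1 disks on the spare peg
    moves.append((source, target))                # move the largest disk to the target
    moves += hanoi(n - 1, spare, target, source)  # bring the n-1 disks back on top of it
    return moves

moves = hanoi(10)
print(len(moves))  # 1023 moves (2**10 - 1): trivial for code, hard to track token by token
```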
6.) Relatedly, it was interesting — but not surprising to me — that the model would not reliably follow the steps of an algorithm provided by the authors in the prompt. Models aren’t trained to do that. Long-running examples of following an algorithm are unlikely to be well-represented in training sets. It would actually be quite impressive and interesting emergent behavior if the models reliably followed the steps of novel algorithms provided at inference time, especially over long runs of the algorithm.
This is probably just the extrapolation problem again: there simply isn’t much text demonstrating really long runs of algorithmic steps. Here are the authors again:
In the Tower of Hanoi environment, the model’s first error in the proposed solution often occurs much later, e.g., around move 100 for (N=10), compared to the River Crossing environment, where the model can only produce a valid solution until move 4. Note that this model also achieves near-perfect accuracy when solving the Tower of Hanoi with (N=5), which requires 31 moves, while it fails to solve the River Crossing puzzle when (N=3), which has a solution of 11 moves. This likely suggests that examples of River Crossing with N>2 are scarce on the web, meaning LRMs may not have frequently encountered or memorized such instances during training.

A portion of Figure 4, from the paper.
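The “first error” framing in that quote implies a simple check: simulate the puzzle and find the first proposed move that breaks the rules. The paper describes validating solutions with simulators; the sketch below is my own illustrative version for Towers of Hanoi, not Apple’s evaluation code.

```python
def first_invalid_move(n: int, moves: list[tuple[int, int]]) -> int | None:
    """Simulate Towers of Hanoi and return the index of the first illegal move, or None.

    Pegs are numbered 0-2; a move is (from_peg, to_peg). Disk 1 is the smallest.
    """
    pegs = [list(range(n, 0, -1)), [], []]  # all disks start on peg 0, largest at the bottom
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return i  # tried to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return i  # tried to place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return None  # every move was legal (this does not check that the puzzle was solved)

# Example: the second move below illegally stacks disk 2 on disk 1.
print(first_invalid_move(3, [(0, 2), (0, 2)]))  # -> 1
```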
7.) I suspect this problem could be easily “papered over” if the training data included several such long running “step by step” outputs from an algorithm. Even so, that wouldn’t solve the main contention that the models aren’t “reasoning” under any prior formal definition of the term — a conclusion I agree with. That said, it’s possible new methods will be developed to overcome the issue more robustly (see point 5).
Those declaring that this is a deathblow for LLMs are forgetting that this is just how progress gets made: you build something, you identify limitations, you overcome them, repeat. That said, those pretending this isn’t a significant limitation and that it should have no meaningful impact on LLM research are equally deluded.
It’s not a foregone conclusion that the problem can be solved, but a lot of smart people are going to spend billions of dollars trying. So I won’t count them out just yet.
8.) I think the path to some kind of recursively self improving “AGI” is still unknown, and I think this paper does actually provide some more evidence for this claim. A machine that had significant formal reasoning capability and the enormous amounts of computational power available to the tested models would do a much better job on these puzzles even at higher complexity. To me, neither scaling up the models nor searching the latent space of a pre-trained transformer model is likely to yield such a system. New methods will be needed for that.
9.) Sometimes these next token predictors are right for the wrong reasons. For example, in Apple’s paper the authors examined the so-called reasoning traces — the intermediate text produced by the model during the test time compute process — and found that, “For simpler problems, reasoning models often find the correct solution early in their thinking but then continue exploring incorrect solutions.”
This is consistent with other research. Anthropic’s own mechanistic interpretability research suggests that LLMs’ “reasoning traces” are sometimes inconsistent with the actual internal mechanisms at play, or contradict the final output produced. That same paper confirms other research that found, “LLMs perform arithmetic using neither robust algorithms nor memorization; rather, they rely on a ‘bag of heuristics.’”
It’s hard to know how often this is happening, and other research shows that the internal mechanisms in LLMs probably do sometimes encode meaningful and accurate “world models.” If anything, this demonstrates just how much more work needs to be done on interpretability.
10.) The above notwithstanding, the current slate of models can do some pretty useful things. One thing some AI detractors get wrong is an unwillingness to acknowledge that, treating every new critique as a chance to say, “see, this stuff is total garbage that will never be useful!”
People are using current models to brainstorm, write code, argue, persuade, draft corporate copy, and more. Some of these uses might not be wise. Some of them — like the heavily automated spam/scam ecosystem — are frankly quite horrible. We do ourselves a disservice and damage our credibility when we don’t acknowledge that something like 91% of developers are using some kind of code generation tools, and that they mostly like using them.
There is clearly some “there” there with LLMs, but I am continually more persuaded by the notion that AI will progress as “normal” technology, and won’t be something that upends the entire economy in the next two years.
Themes in the News
Copyright and Training Data Drama

Disney has entered the chat. Here’s their opening salvo in a brand new copyright lawsuit against the image generator Midjourney:
By helping itself to Plaintiffs’ copyrighted works, and then distributing images (and soon videos) that blatantly incorporate and copy Disney’s and Universal’s famous characters—without investing a penny in their creation—Midjourney is the quintessential copyright free-rider and a bottomless pit of plagiarism.
Like most of these lawsuits, Disney is alleging direct infringement over Midjourney’s consumption of their IP as training data. Unlike most of the other complaints so far, Disney is also alleging widespread “secondary infringement,” also called “contributory infringement” — meaning Midjourney has assisted others in infringing Disney’s copyrights. The complaint includes a lot of exhibits clearly demonstrating both the model’s ability to copy Disney’s works and evidence that users are producing infringing works at large scale. This website has helpfully collected many of the exhibits in one place; I encourage you to take a quick scroll.
Disney’s reputation in the world of IP litigation is legendary, so this development is probably bad news for GenAI firms.
Across the pond, the UK version of Getty Images vs Stability AI has gone to trial. As far as I know this is the first AI copyright case to make it to trial, although there are nearly 40 such cases pending in the USA alone. Collectively, the outcomes of these cases will massively shape the future of the AI industry, as well as the creative industries from which training data is pulled.
Finally, Reddit is suing Anthropic, who they claim is in breach of contract for scraping the media site for training data. Like Disney’s, Reddit’s complaint is fiery and excoriating. Here’s a snip:
Anthropic suffers from corporate cognitive dissonance—its actions do not mirror its claimed values. This case is about the two faces of Anthropic: the public face that attempts to ingratiate itself into the consumer’s consciousness with claims of righteousness and respect for boundaries and the law, and the private face that ignores any rules that interfere with its attempts to further line its pockets. Reddit brings this action to stop Anthropic—who tells the world that it does not intend to train its models with stolen data—from doing just that.
Perhaps this is fair criticism, but it is a little ironic coming from a company whose User Agreement contains this language:
When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world.
In other words…

A still from the movie The Princess Bride. The character Vizzini is saying, “you’re trying to kidnap what I’ve rightfully stolen.”
AI Causes Job Loss, But Also Loses its Job
A couple weeks ago Dario Amodei, CEO of Anthropic, told Axios he believes that AI could wipe out half of white collar jobs within the next five years. Three months ago he said that 90% of all software would be written by AI within 3-6 months (so 3 months remain for his prediction to pan out). Meanwhile, his company appears to have fired its own chatbot, Claude, after letting it run the Claude Explains blog for about a month.
It’s not the first time the Silicon Valley dream of automating white collar work has fallen flat. The Chicago Sun-Times is embroiled in controversy after an AI tool made up books to review and cited “experts” who didn’t exist, an oversight that apparently wasn’t caught by editorial staff. Then there’s Klarna, which laid off 700 workers on the theory that AI could handle their work and is now actively rehiring for many of the roles it eliminated.
I think AI is quite likely to have a meaningful impact on the economy. I also think the transformation will be slower than the Silicon Valley hypesters are promising, and riddled with failed experiments, including companies adopting too early and too aggressively.
Remember the dot-com bubble? The lesson wasn’t that the internet was useless technology, it was that a lot of investors and business owners got way out ahead of their skis.
Teb’s Tidbits
Yandex and Meta are de-anonymizing Android users, another reminder of the huge and growing “surveillance economy” — a major source of AI training data by the way.
“Trusting your own judgement on ‘AI’ is a huge risk.” A provocative call for much more rigorous assessments of LLMs and “AI” in general, but largely focused on code generation.
News of the weird: AI slop was found squatting on dozens of abandoned domains, including government websites, an Nvidia URL, and more.
Licensing might not be the answer to AI copyright issues, and a mandatory licensing scheme might primarily serve the interests of entrenched parties like Disney and The New York Times, argues Derek Slater, who also provides some alternative strategies to ensure fairness for copyright holders.