October 07, 2025
Richard Sutton’s “Bitter Lesson” has become one of the most influential essays in AI, arguing that general methods leveraging computation ultimately outperform approaches that build in human knowledge. Yet successful modern architectures like CNNs and transformer variants seem to violate this principle by encoding strong structural assumptions. How do we reconcile this apparent contradiction?
The answer lies in understanding what kinds of biases we’re introducing and at what level they operate. David Marr’s three-level framework for understanding information processing systems provides the key to this puzzle.
Marr’s Three Levels: A Lens for Understanding Biases
David Marr proposed that any information processing system should be understood at three distinct levels:

- The computational level: what problem is the system solving, and what constraints does the problem itself impose on any solution?
- The algorithmic (representational) level: what representations and procedures are used to solve that problem?
- The implementational level: how are those representations and procedures physically realized?
This framework, originally developed for understanding vision, turns out to be crucial for understanding when inductive biases help versus hinder learning systems.
The Bitter Lesson: What It Really Critiques
Sutton’s bitter lesson specifically targets our tendency to encode human knowledge at the algorithmic level. The history of AI is littered with attempts to shortcut learning by hard-coding human insights:

- Chess and Go programs built around hand-crafted evaluation functions and expert heuristics, eventually eclipsed by deep search and self-play learning
- Speech recognition systems built on linguistic knowledge of phonemes and words, overtaken by statistical and then neural methods trained on data
- Computer vision pipelines based on hand-designed features such as edges, corners, and SIFT descriptors, displaced by learned representations
These approaches initially showed promise but were eventually outperformed by general methods that learned from data given sufficient compute. The lesson seemed clear: stop trying to inject human knowledge and let the machines learn.
Why CNNs Don’t Violate the Bitter Lesson
But wait—don’t convolutional neural networks encode the human insight that visual features should be translation-invariant? Doesn’t this violate the bitter lesson?
No, and the distinction is subtle but crucial. CNNs encode a computational-level constraint about the problem domain itself: visual patterns mean the same thing regardless of where they appear in an image. This isn’t a shortcut or a specific solution—it’s a statement about the mathematical structure of the problem space.
Consider the difference:

- An algorithmic-level bias hands the network a specific solution: hand-designed edge and corner detectors, fixed feature pipelines, rules about which patterns matter.
- A computational-level bias, like the CNN’s weight sharing, merely asserts that whatever feature proves useful in one location is equally useful in every other location.
The CNN doesn’t tell the network what patterns to look for. It simply ensures that whatever patterns it learns will be spatially consistent. The actual feature learning still happens through gradient descent on data, not through human specification.
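To make this concrete, here is a minimal numpy sketch (my own illustration, not any particular library’s API) of a 1-D convolution. Nothing in the structure specifies what the filter detects; the values of `w` would be learned from data. What the structure does guarantee is translation equivariance: shift the input and the output shifts with it, whatever the filter happens to be.

```python
import numpy as np

def conv1d_valid(x, w):
    """1-D cross-correlation with 'valid' padding: the same weights w
    are applied at every position of x (weight sharing)."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

rng = np.random.default_rng(0)
w = rng.normal(size=3)      # stand-in for a *learned* filter: the structure
                            # says nothing about what pattern it detects
x = rng.normal(size=16)

shift = 4
x_shifted = np.roll(x, shift)

y = conv1d_valid(x, w)
y_shifted = conv1d_valid(x_shifted, w)

# Translation equivariance: shifting the input shifts the output by the
# same amount (comparing the slices unaffected by the circular wrap).
print(np.allclose(y[:len(y) - shift], y_shifted[shift:]))  # True
```

A fully connected layer, by contrast, learns a separate weight for every pair of input and output positions and offers no such guarantee; it would have to rediscover the same pattern at every location from data.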
Modern Examples: Geometric Deep Learning
This distinction becomes even clearer with recent advances in geometric deep learning. Models like GATr (Geometric Algebra Transformers) encode E(3) equivariance—the property that 3D relationships should respect rotations, reflections, and translations.
Again, this isn’t injecting human knowledge about what 3D relationships matter. It’s encoding the mathematical fact that physical laws don’t change when you rotate your coordinate system. The model still needs to learn which relationships are important for the task at hand.
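To illustrate the same principle with a toy construction (my own sketch, not GATr’s actual architecture), consider a “model” whose only inputs are pairwise distances between 3D points. Its output is invariant to any rotation, reflection, or translation by construction, while its parameters remain entirely free to be learned:

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_distances(points):
    """E(3)-invariant features: distances are unchanged by any rotation,
    reflection, or translation of the point cloud."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def toy_energy(points, weights):
    """A stand-in 'model': a weighted sum of a simple nonlinearity of the
    pairwise distances. The symmetry comes from the features alone; the
    weights (which would be learned) are unconstrained."""
    return np.sum(weights * np.exp(-pairwise_distances(points)))

points = rng.normal(size=(5, 3))
weights = rng.normal(size=(5, 5))          # "learned" parameters

# Apply a random orthogonal transform (rotation/reflection) and a translation.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
transformed = points @ Q.T + rng.normal(size=3)

print(np.isclose(toy_energy(points, weights),
                 toy_energy(transformed, weights)))   # True
```

GATr itself goes further: its intermediate representations are equivariant (they transform along with the input) rather than merely invariant. But the division of labor is the same: symmetry comes from structure, content comes from learning.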
Revisiting the Bitter Lesson in the Era of LLMs
A recent podcast conversation between Richard Sutton and Dwarkesh Patel has reignited discussions about the Bitter Lesson, particularly as it applies to large language models (LLMs). In the episode, Sutton revisits his original essay, emphasizing how LLMs, despite their scale, still incorporate elements that deviate from the pure compute-driven ethos. He critiques the reliance on human-generated data and pretraining paradigms as introducing biases that “taint” the learning process with human priors, rather than allowing systems to discover knowledge from raw, interactive experience.
Building on this, AI researcher Kevin Murphy shared a detailed tweet thread analyzing the podcast, highlighting key differences between current LLM training and Sutton’s vision of classic model-free reinforcement learning (RL). Murphy points out that LLMs often start with supervised pretraining on human text, which averages over hidden human actions and goals, creating a “marginalized” world model. He also notes the lack of continual learning in LLMs, predicting that future progress will come from adaptive systems that keep updating in non-stationary environments.
One of Murphy’s most insightful observations addresses the nature of observations in LLMs. Human language already “carves nature at its joints,” providing pre-packaged abstractions that skip the hard work of learning from raw streams. He sees this as a core limitation, advocating for multimodal agents (e.g., visual GUI-users) that learn their own abstractions in multi-agent settings.
This perspective fits neatly with the view that the Bitter Lesson operates primarily at the representational level. Human words inject representational biases (pre-structured concepts), counter to the Bitter Lesson’s call for compute to discover patterns unaided. Murphy explicitly calls for systems that “learn its own abstractions over time,” which is a Bitter Lesson application at the representational level—let compute handle feature extraction from raw data. At the computational level, though, he assumes an inductive bias like RL in interactive, goal-agnostic environments (with intrinsic rewards like curiosity), which is needed to specify the problem of “carving nature” through experience rather than pure supervision.
Murphy’s analysis reinforces that while LLMs have scaled impressively, their dependence on linguistic shortcuts at the representational level may limit their generality, aligning with Sutton’s warnings. Future systems, as envisioned, could better embody the Bitter Lesson by minimizing these human-injected abstractions while retaining essential computational biases.
The Taxonomy of Biases
We can classify inductive biases along two dimensions:
By Level:

- Computational-level biases encode constraints of the problem itself (translation symmetry in images, permutation symmetry over graph nodes, E(3) symmetry in physical systems).
- Algorithmic/representational-level biases dictate how the solution is computed or represented (hand-crafted features, hard-coded heuristics, fixed pipelines).
- Implementational-level biases come from the hardware and numerical substrate.

By Effect:

- Representational biases restrict which functions the model can express at all (weight sharing, equivariant layers).
- Search biases steer how a solution is found within that space (hand-tuned heuristics, curricula, hard-coded search procedures).
The bitter lesson primarily warns against algorithmic-level and search biases—attempts to replace learning with human-designed solutions. Computational-level representational biases that reflect genuine mathematical structure in the domain are not only acceptable but essential for efficient learning.
Practical Guidelines
When designing learning systems, ask yourself:
1. Does this bias reflect an invariant property of the problem domain? (A small empirical sanity check is sketched after this list.)
2. Does this bias constrain what can be learned, or how learning proceeds?
3. Would this bias remain valid even if the specific task changes?
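For the first question, one rough empirical check, sketched here under the assumption that you have some labeled examples and can apply the candidate transformation to them, is to measure how often the ground-truth label survives the transformation. If it essentially always does, the symmetry plausibly belongs to the problem domain rather than to your preferred solution. The helper and oracle names below are hypothetical.

```python
import numpy as np

def label_invariance_rate(inputs, labels, label_fn, transform):
    """Fraction of examples whose ground-truth label is unchanged when a
    candidate symmetry `transform` is applied to the input. `label_fn`
    stands in for whatever oracle assigns labels (hypothetical here).
    A rate near 1.0 suggests the symmetry is a property of the problem
    domain, making it a candidate computational-level bias."""
    unchanged = [label_fn(transform(x)) == y for x, y in zip(inputs, labels)]
    return float(np.mean(unchanged))

# Toy example: for the task "is the sum of the vector positive?",
# permuting the entries is a genuine domain invariance, so the rate is 1.0.
rng = np.random.default_rng(0)
xs = [rng.normal(size=8) for _ in range(100)]
label_fn = lambda x: bool(x.sum() > 0)
ys = [label_fn(x) for x in xs]
print(label_invariance_rate(xs, ys, label_fn, rng.permutation))  # 1.0
```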
The Sweet Synthesis
The real lesson isn’t bitter at all; it’s nuanced. The most successful modern AI systems combine:

- Computational-level structure that encodes genuine invariances of the problem domain (translation equivariance in CNNs, permutation invariance in graph networks, E(3) equivariance in physical models)
- General-purpose learning and large-scale compute to discover, within that constrained space, everything the structure leaves unspecified
This synthesis explains why transformers with positional encodings dominate NLP, why graph neural networks excel at molecular property prediction, and why equivariant networks are revolutionizing computational physics.
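Positional encodings illustrate this division of labor well: they inject only the computational-level fact that tokens occupy positions in a sequence, while what to do with order is left to learning. A minimal numpy sketch of the sinusoidal scheme from the original Transformer paper:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings in the style of 'Attention Is All
    You Need': they assert *that* order exists and give each position a
    distinct, smoothly varying code, but say nothing about which
    positional relationships matter for the task."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_positional_encoding(seq_len=8, d_model=16).shape)  # (8, 16)
```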
Looking Forward
As we develop increasingly powerful AI systems, the key isn’t to abandon all inductive biases in favor of pure tabula rasa learning. Instead, we need to:

- Identify the genuine invariances and symmetries of the domains we care about
- Encode them at the computational level, where they constrain the hypothesis space without dictating solutions
- Resist the temptation to hard-code algorithmic shortcuts that substitute human guesses for learning
- Let data, compute, and general-purpose optimization fill in everything else
The bitter lesson, properly understood, doesn’t advocate for bias-free learning. It advocates for the right kinds of biases—those that reflect the true structure of reality rather than our human preconceptions about how to solve problems.
The future of AI lies not in the false dichotomy between structure and learning, but in their harmonious integration. By encoding what must be true while learning what happens to be true, we can build systems that are both efficient and general, both principled and powerful.