Toward a Unified Theory

This essay synthesizes ideas from ongoing research on information compression theory, neural network approximation, and AI for science. It is based on my (unorganized) research notes on AI for science, written in a mix of both Korean and English. The writing has been auto-generated by Claude 4.5 Sonnet, and has been reviewed and revised by myself. Special thanks to Hongchul Nam, Rocky Kim, Seunghwan Jang, and other collaborators for discussions on Gordon’s escape theorem, information bottleneck, and optimal transport approaches.

EDIT: Title & description edited on Oct 17, 2025

Introduction: A Question of Representation

What does it mean to understand something? At its core, understanding requires finding an efficient representation—a way to capture the essence of a phenomenon while discarding irrelevant details. This principle appears everywhere: in how we communicate through language, how we compress data, and perhaps most intriguingly, in how neural networks learn.

Consider two seemingly different compression algorithms. Huffman coding eliminates redundancy by representing frequently repeated characters with fewer bits—a lossless technique where no information is lost. JPEG compression, by contrast, removes high-frequency signals that humans cannot readily perceive, achieving much higher compression ratios at the cost of some information loss. Both succeed because they identify and exploit patterns: one in the statistics of character frequency, the other in the structure of human perception.

Now consider a more provocative question: What if neural networks are performing a form of nonlinear compression?

It is well established that sufficiently large neural networks can approximate any continuous function on a compact domain to arbitrary accuracy: this is the universal approximation theorem. But here’s the twist: the number of weights needed to approximate a function within a given error bound might serve as a measure of that function’s intrinsic complexity. Just as Huffman coding reveals the statistical structure of text, and JPEG reveals the perceptual structure of images, perhaps neural network capacity reveals something fundamental about the information content of functions themselves.

This observation opens a deeper question. In linear algebra, once bases are fixed, there is a one-to-one correspondence between matrices and linear transformations between finite-dimensional spaces: every linear function is, ultimately, just an arrangement of numbers. If this principle extends to nonlinear functions, then information compression theory and neural network approximation theory might be two facets of the same underlying phenomenon. Such a unification would not merely be elegant; it could fundamentally change how we understand learning, representation, and even scientific discovery itself.

Intelligence as Compression: The Principle of Condensed Expression

If compression algorithms reveal structure in data, perhaps intelligence itself is fundamentally about discovering and exploiting structure. I propose that a central function of intelligence is the generation of condensed expressions—representations that capture essential patterns while eliminating redundancy.

This principle manifests at multiple levels:

Level 1: Statistical Redundancy

The most basic form of condensation eliminates repetition. Just as Huffman coding represents frequent characters with fewer bits, saying “repeat ‘a’ 20 times” is more efficient than writing out “aaaaaaaaaaaaaaaaaaaa.” This isn’t merely about efficiency—it’s about recognizing that the pattern itself (repeated ‘a’) contains the essential information, not its mechanical expansion.
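The “repeat ‘a’ 20 times” idea is run-length encoding. The sketch below is a minimal illustration of it (the function names `rle_encode` and `rle_decode` are hypothetical helpers, not part of any library), showing that the compressed form round-trips back to the original string without loss.

```python
from itertools import groupby

def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)

message = "a" * 20
encoded = rle_encode(message)      # one pair instead of twenty characters
restored = rle_decode(encoded)     # lossless: identical to the input
```

The encoding only pays off when the input actually contains long runs; on text without repetition it is longer than the original, which is the sense in which compression depends on structure being present.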

Level 2: Shared Knowledge as Implicit Compression

Here’s a deeper insight: we can eliminate redundancy not just within a message, but between minds. When you and I communicate, we implicitly compress messages by omitting our shared knowledge. A single word (“home,” “danger,” “beautiful”) can invoke vast networks of shared experience and understanding without transmitting the underlying data.

This explains phenomena that seem puzzling from a pure information-theoretic perspective:

Why communication across cultural contexts is so difficult (different shared knowledge = different compression schemes)
Why expert teams achieve seemingly telepathic coordination (extensive shared knowledge enables extreme compression)
Why the same phrase can mean entirely different things to different people (decompression depends on the receiver’s knowledge base)

Level 3: Context as Adaptive Compression

Not all information is equally relevant. A waiter focuses on menu items, not customer names. JPEG compression discards high-frequency visual information that humans barely perceive. Both examples illustrate context-dependent compression—the ability to adaptively discard information based on its relevance to current goals.

This suggests something profound: what counts as “information” is not absolute but goal-relative. The same physical stimulus contains different amounts of relevant information depending on what you’re trying to achieve. Intelligence, then, involves not just pattern recognition but pattern relevance assessment.

Level 4: World Models as Compressed Reality

Finally, consider the grandest form of condensation: our internal models of reality. The universe is incomprehensibly complex—approximately $10^{80}$ atoms, each with position, momentum, quantum state. Yet our brains, with their roughly $10^{11}$ neurons and $10^{15}$ synapses, can predict, plan, and navigate this complexity.

How? By learning a compressed representation—a world model that captures causal structure, regularities, and patterns while discarding the vast majority of microscopic details. From an evolutionary perspective, consciousness itself may have emerged as an adaptive compression algorithm: organisms that could efficiently represent absent prey, remember past events, and imagine future scenarios had survival advantages.

Crucially, humans don’t just compress—we share compressed representations through language, enabling collective intelligence that transcends individual cognitive limits. A scientific theory is perhaps the ultimate condensed expression: a few equations capturing patterns that span countless observations.

This multilevel view of intelligence-as-compression naturally leads us to ask: How do these principles manifest in artificial neural networks?

Neural Networks as Nonlinear Basis Learners

The connection between compression and neural networks becomes clearer when we examine linear algebra through a compression lens. Consider two classical techniques:

Diagonalization finds a basis in which a (diagonalizable) matrix becomes diagonal, so the linear transformation is represented by its diagonal entries alone. This is lossless: every bit of information is preserved, just reorganized for maximum parsimony.

Singular Value Decomposition (SVD) goes further: it identifies principal components ordered by importance. By truncating small singular values, we achieve lossy compression—trading perfect accuracy for massive dimension reduction.

These aren’t just computational tricks. They reveal something fundamental: finding the right basis is equivalent to discovering compressible structure. Diagonalization finds a basis revealing perfect sparsity. SVD finds a basis revealing approximate low-rank structure.
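As a small sketch of SVD truncation as lossy compression, the NumPy example below builds a hypothetical nearly rank-3 matrix (the sizes, seed, and noise level are assumptions chosen for illustration), keeps the top three singular triplets, and compares the reconstruction error with the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a rank-3 matrix plus a little noise.
A = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))
A += 1e-3 * rng.normal(size=A.shape)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]       # rank-k reconstruction

err = np.linalg.norm(A - A_k)              # Frobenius-norm error
tail = np.sqrt(np.sum(s[k:] ** 2))         # energy in the discarded singular values
stored = k * (A.shape[0] + A.shape[1] + 1) # numbers kept by the truncated SVD
```

Here `err` matches `tail` (the Eckart-Young result for the Frobenius norm), and `stored` is 273 numbers against the 2,000 entries of `A`: the basis found by SVD exposes the low-rank structure that makes the matrix compressible.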

But here’s the limitation: these techniques only work for linear transformations. The world, however, is decidedly nonlinear.

The Nonlinear Generalization

This raises a tantalizing question: What would “SVD for nonlinear functions” look like?

I hypothesize that neural networks are precisely this generalization. Just as SVD learns an optimal linear basis for data compression, neural networks learn optimal nonlinear bases—hierarchical representations that progressively extract and compress task-relevant features. The network’s architecture and weights together define an adaptive, nonlinear coordinate system optimized for the task at hand.

This perspective reframes fundamental questions in deep learning:

Network capacity → How much can we compress in this representation space?
Training → Finding the basis that maximizes compression of training data
Generalization → Whether the learned compression captures true underlying structure or merely memorizes noise
Transfer learning → Reusing a learned basis across related compression problems

Toward a Mathematical Framework

To formalize this intuition, consider rate-distortion theory. For a matrix $X$ sampled from distribution $\mathcal{D}$ and compressed representation $\tilde{X}$ (e.g., quantized singular values):

\[\begin{aligned} &\underset{p(\tilde{x}|x)}{\text{minimize}}\;I(X;\tilde{X})\newline &\text{subject to} \; \left< d(x,\tilde{x}) \right>_{p(x,\tilde{x})}\le D \end{aligned}\]

For deterministic SVD, $p(\tilde{x}|x)$ is deterministic, so $I(X; \tilde{X})=H(\tilde{X})-H(\tilde{X}|X)=H(\tilde{X})$; that is, we simply minimize the entropy of our representation. But introducing stochasticity creates an interesting tradeoff: randomness enables exploring alternative compressions, potentially discovering better representations at the cost of slightly higher mutual information.
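The identity $I(X;\tilde{X})=H(\tilde{X})$ for a deterministic encoder can be checked numerically on a toy discrete source. The sketch below assumes a hypothetical uniform 8-state source and a quantizer that merges neighboring states into 4 bins; neither comes from the essay, they only make the bookkeeping concrete.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability table (zeros are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

n_x, n_t = 8, 4
p_x = np.full(n_x, 1.0 / n_x)                 # uniform source X

# Deterministic encoder p(t|x): state x is mapped to bin x // 2.
encoder = np.zeros((n_x, n_t))
encoder[np.arange(n_x), np.arange(n_x) // 2] = 1.0

joint = p_x[:, None] * encoder                # joint table p(x, t)
H_t = entropy_bits(joint.sum(axis=0))         # H(T)
H_t_given_x = entropy_bits(joint) - entropy_bits(p_x)   # H(T|X) = H(X,T) - H(X)
I_xt = H_t - H_t_given_x                      # I(X;T)
```

With this encoder the conditional entropy vanishes and the mutual information equals the 2 bits needed to name one of four equally likely bins.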

More ambitiously, we might extend this framework to nonlinear function spaces. The key challenge: how do we define “mutual information” between a function and its neural network approximation? Answering this could provide a rigorous foundation for understanding neural network capacity in information-theoretic terms, potentially unifying compression theory and approximation theory.

Learning Task-Adaptive Bases

Here’s where the compression perspective reveals deeper structure. Consider not one task, but a distribution of related tasks—for instance, various manipulation tasks (sipping coffee, watering plants, opening doors) that share common sub-components like “grasping” and “moving smoothly.”

A powerful insight emerges: the optimal representation for a task distribution should decompose tasks into shared, reusable components. This is meta-learning from a compression perspective.

Given a distribution of tasks $\mathcal{D}$, where each sampled task $F \sim \mathcal{D}$ is a Lebesgue-integrable function from $\mathbb{R}^n$ to $\mathbb{R}^m$, and given the norm $\|\cdot\|$ defined by the inner product $\left<f,g\right>=\int_{\mathbb{R}^n}w(\mathbf{x}){f(\mathbf{x})\cdot g(\mathbf{x})} d\mathbf{x}$, what is the most efficient parametrized basis $\mathcal{B}_\theta = \{f_1(\theta), f_2(\theta), \cdots,f_d(\theta) \}$, i.e.,

\[\underset{\theta \in \mathbb{R}^p} {\textrm{minimize}} \;\;\mathbb{E}_{F \sim \mathcal{D}} \left[ \left\| \sum_{i=1}^dC_i(\theta)f_i(\theta) - F \right\|^2 \right]\]

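where $C_i(\theta):=\left<f_i(\theta), F\right>$ is the projection coefficient of the sampled task $F$ onto $f_i(\theta)$. This choice of coefficients is the optimal one when the basis is orthonormal.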
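A linear, discretized instance of this objective can be solved in closed form, which gives a feel for what the parametrized version is after. The sketch below makes several assumptions that are not in the essay: functions are sampled on a 1D grid, the weight $w$ is uniform, the inner product is a plain dot product, and the task family is a hypothetical mix of three shared components. Under those assumptions the best orthonormal basis of size $d$ consists of the top right singular vectors of the stacked tasks.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)

# Hypothetical task family: random mixtures of three shared components.
components = np.stack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x), x ** 2])
tasks = rng.normal(size=(500, 3)) @ components      # one sampled task F per row

def expected_error(basis, tasks):
    """Mean squared residual after projecting tasks onto orthonormal basis rows."""
    C = tasks @ basis.T                  # C_i = <f_i, F> for every task
    recon = C @ basis                    # sum_i C_i f_i
    return float(np.mean(np.sum((recon - tasks) ** 2, axis=1)))

d = 3
_, _, Vt = np.linalg.svd(tasks, full_matrices=False)
learned_basis = Vt[:d]                   # top-d right singular vectors

Q, _ = np.linalg.qr(rng.normal(size=(x.size, d)))
random_basis = Q.T                       # baseline: a random orthonormal basis

err_learned = expected_error(learned_basis, tasks)
err_random = expected_error(random_basis, tasks)
```

The learned basis reconstructs every task almost exactly because the family really is three-dimensional, while a random basis of the same size captures only a small fraction of the energy. The open question in the text is what replaces the SVD when the basis functions depend nonlinearly on $\theta$.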

A Radical Perspective on Scientific Notation

This view of basis learning illuminates a profound aspect of physics itself. Consider quantum perturbation theory: we expand perturbed states using the unperturbed Hamiltonian’s eigenstates. But why commit to this basis? Is it optimal in an information-theoretic sense?

Perhaps the need to solve the Schrödinger equation analytically reflects a limitation of mathematical notation rather than fundamental necessity. What if there existed a richer mathematical language where the appropriate basis emerges naturally from the potential’s structure? Computers offer precisely such richness—their representation space is vast and adaptable.

This suggests a provocative reinterpretation of scientific formulas themselves. Consider $\mathbf{F}=m\mathbf{a}$. This formula appears information-rich, but actually relies heavily on implicit context: the meaning of equality, the physical interpretation of symbols, the calculus of derivatives. The formula is not knowledge itself, but a compressed pointer to knowledge stored in trained minds. It’s a trigger for decompressing vast networks of understanding.

From this perspective, human scientific knowledge is itself a compression scheme: we develop notation systems that maximally compress patterns in nature given the constraint that other trained humans must be able to decompress them. Different fields develop different “compression codebooks”—the vocabulary and notation that enables efficient communication among practitioners.

Could AI develop superior notation systems? Systems that compress physical laws more efficiently than human-readable equations? This isn’t science fiction—it’s already happening. Neural network policies in robotics often cannot be “read” in human terms, yet they encode compressed motor skills that work. The question is whether we can extend this to theoretical physics: having AI discover not just solutions, but entirely new ways of formulating problems.

AI for Science: Compression Meets Discovery

The compression perspective transforms how we think about scientific discovery itself. If theories are compressed representations of empirical patterns, then scientific progress can be understood as progressively discovering better compression schemes for natural phenomena.

From Experiment to Simulation: Compressing the Cost of Discovery

Physical experiments are expensive—grotesquely so. High-throughput screening costs dollars per pipette action. CERN’s Large Hadron Collider consumed billions in construction. Gravitational wave detectors require exquisite precision over kilometer-long installations. Every physical interaction with nature carries significant cost.

Computation, by contrast, is cheap and becoming cheaper. Moore’s law has given every researcher access to Einstein-level thought experimentation: the ability to explore “what-if” scenarios without physical implementation. But viewing simulation merely as “cheap experimentation” misses the profound shift happening now.

AI as Active Scientist: Beyond Passive Simulation

Modern AI systems don’t just simulate—they actively compress scientific knowledge in ways humans cannot. Consider:

Solving intractable PDEs: Neural networks can approximate solutions to differential equations that resist analytical solution, effectively compressing infinite-dimensional function spaces into finite parameter sets.
Autonomous circuit design: AI explores design spaces too vast for human search, compressing engineering knowledge into optimized structures.
Theory formation: Large language models trained on scientific corpora can propose hypotheses, design experiments, and even formulate mathematical relationships—performing inductive compression from observation to theory.

This represents a qualitative change: AI is becoming a scientific agent, not merely a tool. Research groups that harness this—that treat AI as colleague rather than calculator—will have profound advantages.

The Epistemology of Machine Science

This raises deep questions about the nature of scientific knowledge:

What is “knowledge” in a quantifiable sense? Consider an AI chemist connected to laboratory robotics, autonomously running experiments. Its objective should be maximizing knowledge gain—but how do we formalize this?

One approach: knowledge is compressible surprise. High knowledge means the ability to predict novel phenomena with compressed models. An AI chemist should seek experiments that maximally reduce uncertainty about chemical space, preferentially exploring regions where current models compress poorly.

This connects to active learning and optimal experimental design, but reframes them in information-theoretic terms: research is the art of efficiently compressing nature’s patterns through strategic interaction.

How do humans know what they don’t know? Metacognition—awareness of ignorance—seems uniquely human. Yet it’s crucial for directing curiosity and research. Can we instill this in AI? Perhaps through explicit uncertainty quantification: teaching models to recognize when their compressions break down, when their basis functions fail to capture observed structure.

Recent work by Chlon et al. (2024) provides precisely this kind of framework. Their analysis reveals that hallucinations in large language models are predictable compression failures—occurring when models minimize expected conditional description length but encounter data structures their learned bases cannot adequately represent. They show that LLMs are “Bayesian in expectation, not in realization,” leading to systematic deviations when permutation-dependent compressions fail. Critically, they introduce quantifiable metrics for detecting when a model’s information budget is insufficient for reliable decompression. This transforms uncertainty from post-hoc error detection to pre-emptive epistemic honesty: an AI scientist that recognizes its compression is failing can say “I need more experimental data to build an adequate basis” rather than confabulating plausible-sounding theories. This is precisely the metacognitive awareness needed for autonomous scientific discovery—knowing not just what you know, but when your basis functions are inadequate.

Cloning vs. Approximation: The Quantum Distinction

Here’s a crucial distinction: Quantum computers clone; neural networks approximate.

Quantum simulators maintain direct physical correspondence—one quantum system representing another with perfect fidelity. Neural networks, by contrast, learn compressed approximations—capturing behavioral patterns without necessarily preserving microscopic structure.

This suggests complementary roles: quantum computers for faithful simulation of quantum systems, neural networks for discovering compressed effective theories that capture relevant behavior at the scale of interest. The future may involve hybrid approaches: quantum hardware providing high-fidelity data, classical AI discovering compressed models that generalize beyond specific instances.

Collective Intelligence: Multi-Agent Compression

A tantalizing possibility: Can multiple AI agents collaboratively discover better compressions than single agents?

Imagine AI researchers gathered at a virtual blackboard, proposing models, critiquing, building on each other’s insights—a reinforcement learning game where the objective is joint knowledge compression. This mirrors human scientific communities, where collective intelligence emerges from communication and competition.

The compression perspective suggests such collaboration could be formalized: agents maintain individual compression schemes (world models) but share compressed communications (hypotheses, data, critiques). The system evolves toward consensus compressions that capture shared structure while specializing in complementary aspects.

This is speculative but points toward a future where scientific discovery itself becomes scalable through AI collaboration, moving beyond the cognitive limits of individual human researchers.

AI for Quantum Mechanics: The Ultimate Compression Challenge

A Professor’s Challenge

In my sophomore year, eager to dive into quantum mechanics, I enrolled in an advanced course a year early. On the first day, the professor made a bold claim: even machine learning could never discover the Schrödinger equation.

This assertion fascinated me. What makes quantum mechanics special? Why should it resist machine discovery when AI excels at pattern recognition?

Let’s be precise about what “discovering the Schrödinger equation” means: identifying patterns in quantum wave phenomena from experimental data and expressing them in communicable mathematical form. If a machine could do this—finding invariants that govern quantum dynamics—we would have to acknowledge genuine scientific discovery by AI.

The question connects directly to our compression theme: Is the Schrödinger equation the maximally compressed representation of quantum phenomenology? Or might there exist alternative formulations—perhaps ones natural to AI but alien to human physicists—that compress quantum mechanics more efficiently?

Symbolic Regression: Learning Physical Laws from Data

Symbolic regression provides a methodology for exactly this kind of discovery. Rather than fitting predefined function forms, symbolic regression autonomously generates candidate equations, testing them against data. The approach was pioneered by John Koza in the early 1990s using genetic programming; modern variants like AIFeynman leverage deep learning to search equation space more efficiently.

Consider rediscovering Newton’s second law. Given time-series data of force $\mathbf{F}(t)$, position $\mathbf{x}(t)$, and mass $m$, can an algorithm discover that $\mathbf{F}-m\ddot{\mathbf{x}}=\mathbf{0}$ always holds? This is discovering an invariant—a conserved pattern amid changing observations.

Therefore, symbolic regression can be thought of as minimizing the following loss function:

\[\mathcal{L}(f_\text{expr}):=\|f_\text{expr}( \mathbf{F},\mathbf{x}, m)\|^2\]

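Here $\mathcal{F} := \{ \mathbf{v}: [t_i, t_f] \rightarrow \mathbb{R}^3 \}$ is the space of trajectories, $\mathbf{F},\mathbf{x}\in\mathcal{F}$ are functions of time $t\in [t_i,t_f]$ that can be differentiated as many times as needed, and $f_{\text{expr}}:\mathcal{F}\times\mathcal{F}\times\mathbb{R}^+ \rightarrow \mathcal{F}$ is a well-formed expression built from known operators such as addition, multiplication, and differentiation. The trivial expression $f_\text{expr}\equiv\mathbf{0}$ also minimizes this loss, so it has to be excluded, for instance by a normalization constraint on the expression.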
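To make the loss concrete, the sketch below evaluates it for a few hand-picked candidate expressions on simulated data. Everything specific here is an assumption for illustration: the data come from a 1D harmonic oscillator with hypothetical values of $m$ and $k$, derivatives are taken by finite differences, and the candidate list is fixed by hand rather than generated by a search, so this shows only the scoring step of symbolic regression.

```python
import numpy as np

m, k = 2.0, 3.0                              # hypothetical mass and spring constant
t = np.linspace(0.0, 10.0, 2001)
omega = np.sqrt(k / m)
x = np.cos(omega * t)                        # simulated trajectory x(t)
F = -k * x                                   # simulated force F(t)

dt = t[1] - t[0]
x_dot = np.gradient(x, dt)                   # finite-difference velocity
x_ddot = np.gradient(x_dot, dt)              # finite-difference acceleration

# Candidate residuals f_expr(F, x, m); a true invariant stays near zero.
candidates = {
    "F - m*x_ddot": F - m * x_ddot,
    "F - m*x_dot": F - m * x_dot,
    "F - m*x": F - m * x,
}

interior = slice(2, -2)                      # drop points hit by one-sided differences
losses = {name: float(np.mean(r[interior] ** 2)) for name, r in candidates.items()}
best = min(losses, key=losses.get)
```

Only the Newtonian candidate drives the residual to (numerically) zero; the others leave residuals of order one. A real system would have to generate and rank such candidates automatically, and would also need to guard against the trivial zero expression.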

The Equation Complexity Problem

Here’s the fundamental tradeoff: simpler equations compress better but may fit worse; complex equations fit better but don’t compress.

This is precisely analogous to the bias-variance tradeoff in machine learning, but now applied to equation space. To prevent overfitting—discovering spurious patterns in measurement noise—we need a complexity penalty on $f_\text{expr}$.
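One simple way to write such a penalty is to score each candidate by its data loss plus a multiple of its expression size. The sketch below assumes a hypothetical encoding of expressions as nested tuples `(operator, *arguments)`, uses node count as the complexity measure, and plugs in made-up loss values; the weight `lam` is likewise an arbitrary illustrative choice.

```python
def complexity(expr) -> int:
    """Count the nodes of an expression tree given as nested tuples (op, *args)."""
    if isinstance(expr, str):                # a leaf: variable or constant name
        return 1
    return 1 + sum(complexity(arg) for arg in expr[1:])

def score(loss: float, expr, lam: float = 0.01) -> float:
    """Penalized objective: data misfit plus lam times expression size."""
    return loss + lam * complexity(expr)

# F - m * x''
newton = ("sub", "F", ("mul", "m", ("d2", "x")))
# F - (m * x'' + eps * x'''): fits slightly better, but is larger
baroque = ("sub", "F", ("add", ("mul", "m", ("d2", "x")), ("mul", "eps", ("d3", "x"))))

# Hypothetical data losses for the two candidates.
candidates = {"newton": (1e-4, newton), "baroque": (9e-5, baroque)}
scores = {name: score(loss, expr) for name, (loss, expr) in candidates.items()}
best = min(scores, key=scores.get)
```

With these numbers the shorter expression wins even though the longer one fits marginally better. How to choose the complexity measure and the weight is exactly the open question raised here.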

This raises deep questions:

How do we measure equation complexity? String length? Number of operators? Kolmogorov complexity of the expression tree? Each choice embodies assumptions about what makes theories “elegant.”
Can we constrain AI to generate only well-formed expressions? Large language models trained on scientific literature learn implicit syntax of equations. Can we architect them to guarantee mathematical validity—generating only expressions with proper units, matched dimensions, sensible operator precedence?
Do equations have meaningful embeddings? Can we create a latent space where nearby points represent similar physical laws? If so, we could search equation space the way CLIP searches image-text space—by navigating a learned manifold of meaning.

Answering these questions could enable AI systems that not only discover equations but do so with scientific taste—preferring simple, elegant compressions over baroque memorization.

Quantum Computers as Discovery Engines

Here’s a crucial insight: AI needs data to learn, and quantum experiments are exponentially expensive to simulate classically. This creates a beautiful synergy: quantum computers as experimental playgrounds for AI scientists.

Humans discovered quantum mechanics through centuries of experimental interaction—double-slit experiments, spectroscopy, the photoelectric effect. We didn’t derive quantum mechanics from first principles; we discovered it by playing with nature. Why should AI be different?

If we expect AI to discover the Schrödinger equation from minimal data, we’re setting an impossible bar—like expecting humans to derive quantum mechanics from a handful of observations without experimental apparatus. But give AI access to a quantum computer, and it can conduct millions of quantum experiments, exploring parameter spaces inaccessible to human experimenters, potentially discovering patterns we’ve missed.

This isn’t science fiction. The ingredients exist:

Quantum hardware: Noisy Intermediate-Scale Quantum (NISQ) devices enable controlled quantum experiments
Symbolic regression algorithms: AIFeynman and successors can search equation space efficiently
Reinforcement learning: AI can learn to design informative experiments, not just analyze given data

The paradox: quantum brains don’t seem necessary for biological intelligence (our neurons appear classical), yet quantum computers may be necessary for AI to truly understand quantum mechanics. The difference? Data accessibility. Humans evolved in a classical-appearing world but built quantum instruments. AI needs direct quantum playgrounds to compress quantum patterns efficiently.

Neural Networks for Quantum Eigenvalue Problems

Let’s devise a simple symbolic regression methodology that uses machine learning for quantum computation. First, writing the Schrödinger equation:

\[\hat{H} \Psi (\mathbf{r}, t) = i\hbar \frac{\partial}{\partial t} \Psi(\mathbf{r}, t)\]

When the Hamiltonian is invariant with respect to time, we find solutions to the eigenvalue problem $\hat{H}\psi(\mathbf{r})= E\psi(\mathbf{r})$ and multiply by the phase factor $e^{-iEt/\hbar}$ to evolve them—game over.
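The recipe (diagonalize once, then attach phase factors) can be sketched numerically. The example below assumes a 1D harmonic oscillator in units where $\hbar = m = 1$, a finite-difference Hamiltonian on a hypothetical grid, and a displaced Gaussian as the initial state; these choices are for illustration only.

```python
import numpy as np

hbar, mass = 1.0, 1.0
n = 400
x = np.linspace(-10.0, 10.0, n)
dx = x[1] - x[0]

# Finite-difference Hamiltonian H = -(hbar^2 / 2m) d^2/dx^2 + x^2 / 2.
off_diag = np.full(n - 1, 1.0)
laplacian = (np.diag(off_diag, -1) - 2.0 * np.eye(n) + np.diag(off_diag, 1)) / dx ** 2
H = -(hbar ** 2) / (2.0 * mass) * laplacian + np.diag(0.5 * x ** 2)

E, phi = np.linalg.eigh(H)           # energies and orthonormal eigenvectors (columns)

def evolve(psi0, t):
    """Expand psi0 in eigenvectors, then multiply each by exp(-i E t / hbar)."""
    c = phi.T @ psi0                 # coefficients (phi is real, so phi.T is its adjoint)
    return phi @ (np.exp(-1j * E * t / hbar) * c)

psi0 = np.exp(-0.5 * (x - 1.0) ** 2).astype(complex)
psi0 /= np.linalg.norm(psi0)
psi_t = evolve(psi0, t=2.0)
```

The lowest eigenvalue lands close to the exact ground-state energy of 0.5, and the evolved state keeps unit norm because each eigencomponent only acquires a phase.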

The action of an operator on a wave function can be interpreted as a linear transformation acting on a vector. And the eigenvalue problem can be thought of as finding the axis of symmetry whose direction is invariant before and after applying the transformation. Can we approximate and obtain these “axes of symmetry,” i.e., eigenfunctions, with neural networks?

Replace the wave function $\psi:\mathbb{R}^n\rightarrow\mathbb{C}$ satisfying the eigenvalue equation $\hat{H}\psi = E\psi$ with the neural network $\psi_\theta$. Then the loss function can be expressed as follows for some norm $\|\cdot\|$:

\[\mathcal{L}(\theta, E)=\|\hat{H}\psi_\theta - E\psi_\theta\|^2\]

There are still some unresolved problems with the above approach. For example, how do we define the above norm? One possibility is to define the norm of function $f$ as $\|f\|=\sqrt{\mathbb{E}\left[|f(X)|^2\right]},\;\;X \sim \mathcal{N}(\mathbf{0}, I_{n\times n})$. However, the wave function $\psi_\theta$ defined by a neural network is extremely complex, and considerable computational resources are consumed to calculate the expectation value used in the norm.
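A Monte Carlo estimate of this loss can be sketched on a case where the answer is known. The example below is an illustration under stated assumptions, not the method itself: it replaces the neural network with a one-parameter Gaussian trial family $\psi_a(x)=e^{-a x^2/2}$, uses the 1D harmonic oscillator with $\hbar=m=1$ so that $\hat{H}\psi_a$ is available in closed form, and draws samples from a standard normal as in the norm defined above.

```python
import numpy as np

def psi(x, a):
    """Trial wave function exp(-a x^2 / 2), standing in for a neural network."""
    return np.exp(-0.5 * a * x ** 2)

def H_psi(x, a):
    """Apply H = -0.5 d^2/dx^2 + 0.5 x^2, using psi'' = (a^2 x^2 - a) psi."""
    return (0.5 * a + 0.5 * (1.0 - a ** 2) * x ** 2) * psi(x, a)

def loss(a, E, samples):
    """Monte Carlo estimate of E[ |H psi - E psi|^2 ] over the given samples."""
    residual = H_psi(samples, a) - E * psi(samples, a)
    return float(np.mean(np.abs(residual) ** 2))

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)       # X ~ N(0, 1)

loss_exact = loss(a=1.0, E=0.5, samples=samples)   # true ground state
loss_off = loss(a=2.0, E=0.5, samples=samples)     # wrong width
```

The loss vanishes at the true ground state ($a=1$, $E=1/2$) and is clearly positive away from it. Note that $\psi\equiv 0$ also gives zero loss, so a practical version needs a normalization constraint; with a real network, $\hat{H}\psi_\theta$ would have to come from automatic differentiation rather than a closed form.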

Additionally, we have not yet specified how to compute the action of the Hamiltonian on the (neural network-defined) wave function. Suppose we approximate the Hamiltonian, too, with a neural network. When the dimension of $\theta$ is $N$, we would need to define a new operator $\hat{H}_\Theta: \mathbb{R}^N \rightarrow \mathbb{R}^N$, itself parametrized by a neural network.

We can see that the computational cost grows exponentially with the complexity of the system being simulated. What does this mean? With artificial intelligence, physics at the level of small molecules can be simulated without much difficulty. However, it does not seem possible to simulate the dynamics of larger quantum systems, at the ~10,000 scale, without compromising on accuracy.

The Paradox of Quantum Computing for AI

If a learned simulation function is enough to describe quantum mechanics, there is no need to insist on quantum computers. Paradoxically, however, the cheapest way to obtain large-scale quantum experimental data is quantum computing. We must explore whether quantum parallelism can provide practically significant help, and if so, how much.

What we know so far about neural networks is that, through training, they can learn patterns and structures embedded in high-dimensional data and represent that information compactly through dimensionality reduction. It also seems quite possible to map information stored as vectors in this way into human-readable formulas using natural language processing.

The claim that machines cannot discover quantum phenomena on their own seems rather implausible given the speed of AI development. But considering that the computational complexity of quantum phenomena grows exponentially with the size of the system, the professor’s statement may not be so wrong after all.

Research Questions and Future Directions

Gordon’s Escape Theorem and Dataset Intrinsic Dimension

Our current research direction combines Gordon’s escape theorem with the intrinsic dimension of the dataset. We need practical algorithms for estimating a dataset’s intrinsic dimension (e.g., PCA), so that researchers have a tool for estimating the minimum number of parameters needed to train for a given task.
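As one possible baseline for the PCA route, the sketch below estimates intrinsic dimension as the number of principal components needed to explain a chosen fraction of the variance. The function name `pca_intrinsic_dimension`, the 99% threshold, and the synthetic dataset (a 5-dimensional latent space linearly embedded in 50 dimensions with small noise) are all assumptions made for illustration.

```python
import numpy as np

def pca_intrinsic_dimension(X, variance_threshold=0.99):
    """Number of principal components needed to reach the variance threshold."""
    Xc = X - X.mean(axis=0)                          # center the data
    s = np.linalg.svd(Xc, compute_uv=False)          # singular values, descending
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)   # cumulative variance ratio
    return int(np.searchsorted(explained, variance_threshold) + 1)

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 5))                  # hypothetical 5-dim latent factors
embedding = rng.normal(size=(5, 50))                 # linear map into 50 dimensions
X = latent @ embedding + 1e-3 * rng.normal(size=(1000, 50))

dim = pca_intrinsic_dimension(X)
```

On this linear example the estimate recovers the latent dimension of 5. PCA only sees linear structure, so on a curved data manifold it will generally overestimate the intrinsic dimension, which is one reason nonlinear estimators are worth comparing against.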

Gordon’s escape theorem states that, in a high-dimensional space, a random subspace of sufficiently large codimension will, with high probability, “escape” (fail to intersect) any subset of the unit sphere whose complexity, as measured by its Gaussian width, is small enough. This theorem provides a powerful tool for understanding the behavior of random projections and has important applications in compressed sensing and dimensionality reduction.
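Stated a little more concretely, and only as a hedged sketch that omits the exact constants and probability bound: let $S$ be a closed subset of the unit sphere $\mathbb{S}^{n-1}$, let $w(S)$ be its Gaussian width, and let $E$ be a uniformly random subspace of codimension $k$. The notation $S$, $E$, $k$, and $w$ is introduced here for illustration.

\[\begin{aligned} &w(S) := \mathbb{E}_{g\sim\mathcal{N}(\mathbf{0}, I_{n\times n})}\left[\sup_{\mathbf{x}\in S}\left<g,\mathbf{x}\right>\right], \newline &\sqrt{k}\gtrsim w(S) \;\;\Longrightarrow\;\; E\cap S=\emptyset \;\;\text{with high probability.} \end{aligned}\]

Read this way, $w(S)^2$ plays the role of an effective dimension of $S$: roughly that many random constraints are enough to miss the set, which is what makes the theorem a candidate tool for parameter-count estimates.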

Information Bottleneck and Optimal Transport

The Information Bottleneck Method introduces the bottleneck $\tilde{X}$, a compressed representation of $X$ that depends on the target $Y$ only through $X$ (the Markov chain $Y \rightarrow X \rightarrow \tilde{X}$), and drawing ideas from rate-distortion theory we obtain:

\[\underset{p(\tilde{x}|x)}{\text{minimize}}\;I(X;\tilde{X})-\beta I(\tilde{X};Y)\]
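The trade-off can be seen on a toy discrete problem. The sketch below uses the standard form of the information bottleneck objective, $I(X;\tilde{X})-\beta I(\tilde{X};Y)$, and evaluates it for three hand-written encoders. The joint distribution, the encoders, and the value of $\beta$ are hypothetical choices for illustration.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information in bits of a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

def ib_objective(p_xy, encoder, beta):
    """I(X;T) - beta * I(T;Y), where encoder[x, t] = p(t|x)."""
    p_x = p_xy.sum(axis=1)
    joint_xt = p_x[:, None] * encoder        # p(x, t)
    joint_ty = encoder.T @ p_xy              # p(t, y) = sum_x p(t|x) p(x, y)
    return mutual_information(joint_xt) - beta * mutual_information(joint_ty)

# Hypothetical source: X has 4 equally likely states, Y records which half X is in.
p_xy = np.zeros((4, 2))
p_xy[[0, 1], 0] = 0.25
p_xy[[2, 3], 1] = 0.25

identity = np.eye(4)                                              # keep everything
merged = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # keep only the half
constant = np.ones((4, 1))                                        # keep nothing

beta = 2.0
obj_identity = ib_objective(p_xy, identity, beta)
obj_merged = ib_objective(p_xy, merged, beta)
obj_constant = ib_objective(p_xy, constant, beta)
```

The encoder that merges states with the same label achieves the lowest objective: it spends 1 bit on the representation and keeps the full 1 bit of information about $Y$, while the identity encoder spends 2 bits for the same predictive information and the constant encoder keeps none.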

An alternative approach is to explore optimal transport. If the data distribution is a measure supported on a manifold, it can be described through the manifold’s structure, and we can look for the optimal transport map that moves it. This is significant in that it explicitly incorporates the manifold hypothesis into the study of generalization. However, there has been little discussion of how this relates to predicting neural network parameters.

Fractal Structures and Compression

Complex structures like the Mandelbrot set or a bifurcation diagram are generated by extremely simple formulas. Can we devise information compression algorithms that borrow such structures? Can we analyze fractal structures or bifurcation diagrams with neural networks to derive insights into chaotic systems? Fractal compression and the collage theorem are worth exploring.
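How little description a fractal needs can be shown with the usual escape-time test for the Mandelbrot set, which iterates $z \mapsto z^2 + c$ and checks whether $|z|$ ever exceeds 2. The sketch below is a minimal version; the iteration cap `MAX_ITER`, the grid resolution, and the ASCII rendering are arbitrary illustrative choices.

```python
MAX_ITER = 50

def escape_time(c: complex, max_iter: int = MAX_ITER) -> int:
    """Iterations of z -> z*z + c before |z| exceeds 2 (max_iter if it never does)."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter

# Coarse ASCII picture over the region [-2, 0.5] x [-1, 1].
grid = [
    "".join(
        "#" if escape_time(complex(-2.0 + 2.5 * col / 59, -1.0 + 2.0 * row / 19)) == MAX_ITER
        else "."
        for col in range(60)
    )
    for row in range(20)
]
picture = "\n".join(grid)
```

Printing `picture` gives a rough outline of the set. The generating rule fits in a few lines, while the boundary it produces has detail at every scale, which is the asymmetry that fractal compression tries to exploit in the other direction.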

Connection to Complexity Theory

There are $O(n^2)$ and $O(n\log n)$ algorithms that perform the same task: sorting. Can we argue that the faster one is a lossless compression of the slower one, since it uses less computation?

Conclusion: The Compression Paradigm for Intelligence and Discovery

We began with a simple observation: compression algorithms reveal structure. We end with a radical hypothesis: intelligence itself is compression, and scientific discovery is the search for maximally compressed representations of natural patterns.

This perspective unifies disparate phenomena:

Intelligence as Multilevel Compression: From statistical redundancy removal to context-dependent information filtering to world model construction, intelligence operates by finding and exploiting compressible structure. Human communication achieves efficiency through shared knowledge—implicit compression between minds. Consciousness itself may be an evolutionary compression mechanism: organisms that efficiently represented their environments survived and reproduced.

Neural Networks as Nonlinear Basis Learners: SVD reveals that finding optimal linear bases is equivalent to discovering compressible structure. I hypothesize that neural networks generalize this: they learn optimal nonlinear bases—hierarchical coordinate systems that progressively compress task-relevant features. This reframes deep learning’s core questions: capacity becomes compression capability, training becomes basis optimization, and generalization becomes distinguishing true structure from noise.

Scientific Theories as Compressed Predictors: Physics formulas are not knowledge themselves but compressed pointers to knowledge—triggers for decompressing understanding stored in trained minds. Different scientific fields develop specialized “compression codebooks” (notations, concepts) that enable efficient communication among practitioners. Could AI discover superior compression schemes—theoretical frameworks more compact than human-readable equations yet equally predictive?

Quantum Mechanics as Ultimate Compression Challenge: The exponential scaling of quantum systems—computational complexity growing with system size—makes them the ultimate test for compression-based AI. Classical simulation quickly becomes intractable. Yet quantum computers offer a solution: experimental playgrounds where AI can gather quantum data cheaply, potentially discovering patterns (and compressions) that humans, constrained to classical intuition, have missed.

A Research Vision

This compression paradigm suggests concrete research directions:

Formalizing neural network capacity in information-theoretic terms: Extending mutual information to function spaces, connecting Gordon’s escape theorem with dataset intrinsic dimension, developing tools that predict minimum parameter counts for tasks.
Equation embeddings and learned equation spaces: Creating latent spaces where physics laws cluster by similarity, enabling search through theory space guided by both empirical fit and compression criteria.
AI-quantum synergy for discovery: Coupling symbolic regression with quantum experimental hardware, letting AI autonomously design and execute quantum experiments, searching for compressed descriptions of quantum phenomenology.
Multi-agent collaborative compression: Formalizing scientific communities as distributed compression systems, where agents with specialized bases share compressed communications, evolving toward consensus theories.
Meta-learning across task distributions: Discovering parametric bases that optimally decompose task families into reusable components—the mathematics of transfer learning and few-shot generalization.

The Deeper Question

My professor claimed that machine learning could never discover the Schrödinger equation. Having explored the landscape, I believe the claim reveals something profound—not about AI’s limitations, but about the nature of understanding itself.

Perhaps understanding isn’t about possessing a formula, but about having the right compression scheme. Humans “understand” quantum mechanics not because we can write $\hat{H}\Psi = i\hbar \partial_t \Psi$, but because centuries of training have given us a decompression algorithm: when we see this notation, vast networks of meaning activate—Hilbert spaces, measurement, superposition, entanglement.

Can AI understand quantum mechanics differently than humans? Not by learning our compression scheme, but by discovering its own—one that perhaps compresses quantum patterns more efficiently but maps poorly to human notation? If AI discovers a theory that predicts quantum phenomena better than the Schrödinger equation but expresses it in 100,000 neural network parameters, have we succeeded or failed?

This brings us full circle: What counts as understanding? If understanding is successful compression enabling prediction and manipulation, then AI understanding need not mirror human understanding. The question isn’t whether AI can discover the Schrödinger equation, but whether it can discover something better—a more compressed, more predictive representation of quantum reality.

The integration of information compression theory and neural network approximation theory isn’t just possible—it may be necessary for understanding intelligence itself. And pursuing this integration might not only revolutionize AI and science, but fundamentally transform what we mean by knowledge, discovery, and understanding.

References

Compression and Hallucinations:

Chlon, L., Karim, A., & Chlon, M. (2025). Predictable Compression Failures: Why Language Models Actually Hallucinate. arXiv:2509.11208. Available at: https://arxiv.org/abs/2509.11208

Neural Network Theory and Capacity:

How many degrees of freedom do we need to train deep networks: a loss landscape perspective. arXiv:2107.05802
Intrinsic dimension of data representations in deep neural networks. arXiv:1905.12784
Generalization bounds for deep learning. arXiv:2012.04115
Gordon’s escape theorem and related work on high-dimensional geometry

Information Theory:

The information bottleneck method. arXiv:physics/0004057
Deep Learning and the Information Bottleneck Principle. arXiv:1503.02406

AI for Science:

Symbolic regression literature including AI Feynman
Neural network methods for solving differential equations