Research Papers
212 papers in library
A correspondence problem for mathematical proof
Mathematical proofs are often said to justify their conclusions by indicating the existence of a corresponding formal derivation. We argue that this widespread view relies on an under-examined notion ...
Reviving the Philosophy of Geometry
In the Anglophone world, the philosophical treatment of geometry has fallen on hard times. This chapter argues that philosophy can come to a better understanding of mathematics by providing an account...
Topological constraints on self-organization in locally interacting systems
All intelligence is collective intelligence, in the sense that it is made of parts that must align with respect to system-level goals. Understanding the dynamics that facilitate or limit navigation of...
A correspondence problem for mathematical proof
Mathematical proofs are often said to justify their conclusions by indicating the existence of a corresponding formal derivation. We argue that this widespread view relies on an under-examined notion ...
Position: Categorical Deep Learning is an Algebraic Theory of All Architectures
We present our position on the elusive quest for a general-purpose framework for specifying and studying deep learning architectures. Our opinion is that the key attempts made so far lack a coherent b...
Categorical Proof-Theoretic Semantics
In proof-theoretic semantics, model-theoretic validity is replaced by prooftheoretic validity. Validity of formulae is defined inductively from a base giving the validity of atoms using inductive claus...
Topological Data Analysis and Topological Deep Learning Beyond Persistent Homology -- A Review
Topological data analysis (TDA) is a rapidly evolving field in applied mathematics and data science that leverages tools from topology to uncover robust, shape-driven insights in complex datasets. The...
Introduction to Homotopy Type Theory
This is an introductory textbook to univalent mathematics and homotopy type theory, a mathematical foundation that takes advantage of the structural nature of mathematical definitions and construction...
Graph Representations for Higher-Order Logic and Theorem Proving
This paper presents the first use of graph neural networks (GNNs) for higher-order proof search and demonstrates that GNNs can improve upon state-of-the-art results in this domain. Interactive, higher...
EvolProver: Advancing Automated Theorem Proving by Evolving Formalized Problems via Symmetry and Difficulty
Large Language Models (LLMs) for formal theorem proving have shown significant promise, yet they often lack generalizability and are fragile to even minor transformations of problem statements. To add...
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to b...
Position: Categorical Deep Learning is an Algebraic Theory of All Architectures
We present our position on the elusive quest for a general-purpose framework for specifying and studying deep learning architectures. Our opinion is that the key attempts made so far lack a coherent b...
LeanTree: Accelerating White-Box Proof Search with Factorized States in Lean 4
Automated theorem proving (ATP) has been a classical problem in artificial intelligence since its inception, yet it remains challenging due to its vast state and action space. Large language models (L...
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. Pre-trai...
Aristotle: IMO-level Automated Theorem Proving
We introduce Aristotle, an AI system that combines formal verification with informal reasoning, achieving gold-medal-equivalent performance on the 2025 International Mathematical Olympiad problems. Ar...
Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean
Neural theorem proving combines large language models (LLMs) with proof assistants such as Lean, where the correctness of formal proofs can be rigorously verified, leaving no room for hallucination. W...
Leanabell-Prover: Posttraining Scaling in Formal Reasoning
Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 codes. However, ATP has not yet be revolutionized by the recent posttrain...
Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally predictab...
Sheaf Theory through Examples (Abridged Version)
This book provides an inviting tour through sheaf theory, from the perspective of applied category theory and pitched at a less specialized audience than is typical with introductions to sheaves. The ...
Group Theory in a Nutshell for Physicists
Cover -- Title -- Copyright -- Dedication -- Contents -- Preface -- A Brief Review of Linear Algebra -- Part I: Groups: Discrete or Continuous, Finite or Infinite -- I.1 Symmetry and Groups -- I.2 Fin...
Mind Everywhere: A Framework for Conceptualizing Goal-Directedness in Biology and Other Domains—Part One
What makes a system—evolved, engineered, or hybrid—describable by teleological and mentalistic terms such as intelligent, goal-directed, cognitive, and intentional? In this two-part article, we review...
Mind Everywhere: A Framework for Conceptualizing Goal-Directedness in Biology and Other Domains—Part Two
What makes a system—evolved, engineered, or hybrid—describable by teleological and mentalistic terms such as intelligent, goal-directed, cognitive, and intentional? In this two-part article, we review...
What does it mean for a system to compute?
Abstract Many real-world dynamic systems, both natural and artificial, are understood to be performing computations. For artificial dynamic systems, explicitly designed to p...
A CATEGORICAL INTRODUCTION TO SHEAVES
Sheaf is a very useful notion when defining and computing many different cohomology theories over topological spaces. There are several ways to build up sheaf theory with different axioms; however, some ...
Cognition all the way down 2.0: neuroscience beyond neurons in the diverse intelligence era
Abstract This paper formalizes biological intelligence as search efficiency in multi-scale problem spaces, aiming to resolve epistemic deadlocks in the basal “cognition wars...
In-Context Algebra
We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While pr...
An Introduction to Topos Theory
The purpose of this text is to equip the reader with an intuitive but precise understanding of elementary structures of category and topos theory. In order to achieve this goal, we provide a guided to...
A Universal Approach to Self-Referential Paradoxes, Incompleteness and Fixed Points
Following F. William Lawvere, we show that many self-referential paradoxes, incompleteness theorems and fixed point theorems fall out of the same simple scheme. We demonstrate these similarities by sh...
A Type Theory with a Tiny Object
We present an extension of Martin-Löf Type Theory that contains a tiny object; a type for which there is a right adjoint to the formation of function types as well as the expected left adjoint. We dem...
Classical Sorting Algorithms as a Model of Morphogenesis: self-sorting arrays reveal unexpected competencies in a minimal model of basal intelligence
The emerging field of Diverse Intelligence seeks to identify, formalize, and understand commonalities in behavioral competencies across a wide range of implementations. Especially interesting are simp...
Flow-Lenia: Emergent evolutionary dynamics in mass conservative continuous cellular automata
Central to the artificial life endeavour is the creation of artificial systems spontaneously generating properties found in the living world such as autopoiesis, self-replication, evolution and open-e...
Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally predictab...
A Background Independent Algebra in Quantum Gravity
We propose an algebra of operators along an observer's worldline as a background-independent algebra in quantum gravity. In that context, it is natural to think of the Hartle-Hawking no boundary state...
LLMSTEP: LLM proofstep suggestions in Lean
We present LLMSTEP, a tool for integrating a language model into the Lean proof assistant. LLMSTEP is a Lean 4 tactic that sends a user's proof state to a server hosting a language model. The language...
miniCTX: Neural Theorem Proving with (Long-)Contexts
Real-world formal theorem proving often depends on a wealth of context, including definitions, lemmas, comments, file structure, and other information. We introduce miniCTX, which tests a model's abil...
AI-Assisted Generation of Difficult Math Questions
Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying sole...
Categorical Proof-Theoretic Semantics
In proof-theoretic semantics, model-theoretic validity is replaced by proof-theoretic validity. Validity of formulae is defined inductively from a base giving the validity of atoms using inductive cla...
Graph Representations for Higher-Order Logic and Theorem Proving
This paper presents the first use of graph neural networks (GNNs) for higher-order proof search and demonstrates that GNNs can improve upon state-of-the-art results in this domain. Interactive, higher...
EvolProver: Advancing Automated Theorem Proving by Evolving Formalized Problems via Symmetry and Difficulty
Large Language Models (LLMs) for formal theorem proving have shown significant promise, yet they often lack generalizability and are fragile to even minor transformations of problem statements. To add...
Categorical Proof-Theoretic Semantics
In proof-theoretic semantics, model-theoretic validity is replaced by proof-theoretic validity. Validity of formulae is defined inductively from a base giving the validity of atoms using inductive cla...
EvolProver: Advancing Automated Theorem Proving by Evolving Formalized Problems via Symmetry and Difficulty
Large Language Models (LLMs) for formal theorem proving have shown significant promise, yet they often lack generalizability and are fragile to even minor transformations of problem statements. To add...
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to b...
Hilbert $*$-categories: Where limits in analysis and category theory meet
This article introduces Hilbert $*$-categories: an abstraction of categories with similar algebraic and analytic properties to the categories of real, complex, and quaternionic Hilbert spaces and boun...
Basic Category Theory
This short introductory category theory textbook is for readers with relatively little mathematical background (e.g. the first half of an undergraduate mathematics degree). At its heart is the concept...
Coend calculus
The book formerly known as "This is the (co)end, my only (co)friend".
LeanTree: Accelerating White-Box Proof Search with Factorized States in Lean 4
Automated theorem proving (ATP) has been a classical problem in artificial intelligence since its inception, yet it remains challenging due to its vast state and action space. Large language models (L...
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. Pre-trai...
Aristotle: IMO-level Automated Theorem Proving
We introduce Aristotle, an AI system that combines formal verification with informal reasoning, achieving gold-medal-equivalent performance on the 2025 International Mathematical Olympiad problems. Ar...
Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean
Neural theorem proving combines large language models (LLMs) with proof assistants such as Lean, where the correctness of formal proofs can be rigorously verified, leaving no room for hallucination. W...
Leanabell-Prover: Posttraining Scaling in Formal Reasoning
Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 codes. However, ATP has not yet be revolutionized by the recent posttrain...
Premise Selection for a Lean Hammer
Neural methods are transforming automated reasoning for proof assistants, yet integrating these advances into practical verification workflows remains challenging. Hammers are tools that interface wit...
LeanDojo: Theorem Proving with Retrieval-Augmented Language Models
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. However, existing methods are difficult to reproduce or build on, due to private code, d...
Position: Categorical Deep Learning is an Algebraic Theory of All Architectures
We present our position on the elusive quest for a general-purpose framework for specifying and studying deep learning architectures. Our opinion is that the key attempts made so far lack a coherent b...
Recursive Language Models
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy...
Swarms of Large Language Model Agents for Protein Sequence Design with Experimental Validation
Designing proteins de novo with tailored structural, physicochemical, and functional properties remains a grand challenge in biotechnology, medicine, and materials science, due to the vastness of sequ...
Closing the loop on morphogenesis: a mathematical model of morphogenesis by closed-loop reaction-diffusion
Morphogenesis, the establishment and repair of emergent complex anatomy by groups of cells, is a fascinating and biomedically-relevant problem. One of its most fascinating aspects is that a developing...
Category Theory for Programming
In these lecture notes, we give a brief introduction to some elements of category theory. The choice of topics is guided by applications to functional programming. Firstly, we study initial algebras, ...
Deep Manifold Part 2: Neural Network Mathematics
This work develops the global equations of neural networks through stacked piecewise manifolds, fixed-point theory, and boundary-conditioned iteration. Once fixed coordinates and operators are removed...
On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models
This paper addresses the general problem of reinforcement learning (RL) in partially observable environments. In 2013, our large RL recurrent neural networks (RNNs) learned from scratch to drive simul...
Transformers know more than they can tell -- Learning the Collatz sequence
We investigate transformer prediction of long Collatz steps, a complex arithmetic function that maps odd integers to their distant successors in the Collatz sequence ( $u_{n+1}=u_n/2$ if $u_n$ is even...
Learning in Transcriptional Network Models: Computational Discovery of Pathway-Level Memory and Effective Interventions
Trainability, in any substrate, refers to the ability to change future behavior based on past experiences. An understanding of such capacity within biological cells and tissues would enable a particul...
Top-down models in biology: explanation and control of complex living systems above the molecular level
It is widely assumed in developmental biology and bioengineering that optimal understanding and control of complex living systems follows from models of molecular events. The success of reductionism h...
Darwin’s agential materials: evolutionary implications of multiscale competency in developmental biology
A critical aspect of evolution is the layer of developmental physiology that operates between the genotype and the anatomical phenotype. While much work has addressed the evolution of developmental me...
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds
Synthetic biology and bioengineering provide the opportunity to create novel embodied cognitive systems (otherwise known as minds) in a very wide variety of chimeric architectures combining evolved an...
The Universal Weight Subspace Hypothesis
We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that demonstrates th...
Meta Large Language Model Compiler: Foundation Models of Compiler Optimization
Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. However, their application in the domain of code and compiler optimiza...
Microbenchmarking NVIDIA's Blackwell Architecture: An in-depth Architectural Analysis
As GPU architectures rapidly evolve to meet the overcoming demands of exascale computing and machine learning, the performance implications of architectural innovations remain poorly understood across...
Classical Sorting Algorithms as a Model of Morphogenesis: self-sorting arrays reveal unexpected competencies in a minimal model of basal intelligence
The emerging field of Diverse Intelligence seeks to identify, formalize, and understand commonalities in behavioral competencies across a wide range of implementations. Especially interesting are simp...
What neuroscience can tell AI about learning in continuously changing environments
Modern artificial intelligence (AI) models, such as large language models, are usually trained once on a huge corpus of data, potentially fine-tuned for a specific task and then deployed with fixed pa...
Origins of the brain networks for advanced mathematics in expert mathematicians
The origins of human abilities for mathematics are debated: Some theories suggest that they are founded upon evolutionarily ancient brain circuits for number and space and others that they are grounde...
Collective intelligence: A unifying concept for integrating biology across scales and substrates
A defining feature of biology is the use of a multiscale architecture, ranging from molecular networks to cells, tissues, organs, whole bodies, and swarms. Crucially however, biology is not only neste...
It’s not the thought that counts: Allostasis at the core of brain function
In psychology and neuroscience, scientific questions are often framed in terms of mental activity (e.g., cognition, emotion, and perception); however, the brain is an organ with a particular function ...
Proof Theory
Proof theory is not an esoteric technical subject that was invented tosupport a formalist doctrine in the philosophy of mathematics; rather,it has been developed as an attempt to analyze aspects of ma...
Cache-to-Cache: Direct Semantic Communication Between Large Language Models
Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate...
Continual Learning: Fast and Slow
According to the Complementary Learning Systems (CLS) theory~\cite{mcclelland1995there} in neuroscience, humans do effective \emph{continual learning} through two complementary systems: a fast learnin...
The Future of Continual Learning in the Era of Foundation Models: Three Key Directions
Continual learning--the ability to acquire, retain, and refine knowledge over time--has always been fundamental to intelligence, both human and artificial. Historically, different AI paradigms have ac...
Fantastic Pretraining Optimizers and Where to Find Them
AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2x speedup. We posit that two methodological shortcomings hav...
Autonomous Code Evolution Meets NP-Completeness
Large language models (LLMs) have recently shown strong coding abilities, enabling not only static code generation but also iterative code self-evolving through agentic frameworks. Recently, AlphaEvol...
Complexity Classes as Mathematical Axioms
Treating a conjecture, P^#P != NP, on the separation of complexity classes as an axiom, an implication is found in three manifold topology with little obvious connection to complexity theory. This is ...
NP-complete Problems and Physical Reality
Can NP-complete problems be solved efficiently in the physical universe? I survey proposals including soap bubbles, protein folding, quantum computing, quantum advice, quantum adiabatic algorithms, qu...
Future of Quantum Computing
On Tuesday 26th November 2024, four discussants participated in a moderated virtual panel titled Future of Quantum Computing as one session of the 8th International Conference on Quantum Techniques in...
The Ghost in the Quantum Turing Machine
In honor of Alan Turing's hundredth birthday, I unwisely set out some thoughts about one of Turing's obsessions throughout his life, the question of physics and free will. I focus relatively narrowly ...
Learning Universal Predictors
Meta-learning has emerged as a powerful approach to train neural networks to learn new tasks quickly from limited data. Broad exposure to different tasks leads to versatile representations enabling ge...
Large Language Models as Computable Approximations to Solomonoff Induction
The rapid advancement of large language models (LLMs) calls for a rigorous theoretical framework to explain their empirical success. While significant progress has been made in understanding LLM behav...
From Theory to Practice: Plug and Play with Succinct Data Structures
Engineering efficient implementations of compact and succinct structures is a time-consuming and challenging task, since there is no standard library of easy-to- use, highly optimized, and composable ...
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals...
DataRater: Meta-Learned Dataset Curation
The quality of foundation models depends heavily on their training data. Consequently, great efforts have been put into dataset curation. Yet most approaches rely on manual tuning of coarse-grained mi...
Diffusion Beats Autoregressive in Data-Constrained Settings
Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promis...
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (su...
LLM4Decompile: Decompiling Binary Code with Large Language Models
Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in La...
Large Language Models and Emergence: A Complex Systems Perspective
Emergence is a concept in complexity science that describes how many-body systems manifest novel higher-level properties, properties that can be described by replacing high-dimensional mechanisms with...
Transformers are Efficient Compilers, Provably
Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks, including programming language understanding and generat...
Why Philosophers Should Care about Computational Complexity
One might think that, once we know something is computable, how efficiently it can be computed is a practical question with little further philosophical importance. In this essay, I offer a detailed case...
Fast and Simplex: 2-Simplicial Attention in Triton
Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count toge...
Simple linear attention language models balance the recall-throughput tradeoff
Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is...
Dimension Mixer: A Generalized Method for Structured Sparsity in Deep Neural Networks
The recent success of multiple neural architectures like CNNs, Transformers, and MLP-Mixers motivated us to look for similarities and differences between them. We found that these architectures can be...
Intelligence at the Edge of Chaos
We explore the emergence of intelligent behavior in artificial systems by investigating how the complexity of rule-based systems influences the capabilities of models trained to predict these rules. O...
Magistral
We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior mod...
Reinforcement Pre-Training
In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a rea...
Trends in AI Supercomputers
Frontier AI development relies on powerful AI supercomputers, yet analysis of these systems is limited. We create a dataset of 500 AI supercomputers from 2019 to 2025 and analyze key trends in perform...
eGPU: Extending eBPF Programmability and Observability to GPUs
Precise GPU observability and programmability are essential for optimizing performance in AI workloads and other computationally intensive high-performance computing (HPC) applications. In this paper,...
Attention-Level Speculation
As Large Language Models (LLMs) grow in size and context length, efficient inference strategies are essential to maintain low-latency token generation. Unfortunately, conventional tensor and data para...
Qwen3 Technical Report
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capa...
LazyLog: A New Shared Log Abstraction for Low-Latency Applications
Shared logs offer linearizable total order across storage shards. However, they enforce this order eagerly upon ingestion, leading to high latencies. We observe that in many modern shared-log applicat...
How Much Knowledge Can You Pack Into the Parameters of a Language Model?
It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the p...
The empirical status of predictive coding and active inference
Research on predictive processing models has focused largely on two specific algorithmic theories: Predictive Coding for perception and Active Inference for decision-making. While these interconnected...
Predictive eye movements are adjusted in a Bayes-optimal fashion in response to unexpectedly changing environmental probabilities
This study examined the application of active inference to dynamic visuomotor control. Active inference proposes that actions are dynamically planned according to uncertainty about sensory information...
Continuous Thought Machines
Biological brains demonstrate complex neural activity, where the timing and interplay between neurons is critical to how brains process information. Most deep learning architectures simplify neural ac...
A tutorial on the free-energy framework for modelling perception and learning
This paper provides an easy to follow tutorial on the free-energy framework for modelling perception developed by Friston, which extends the predictive coding model of Rao and Ballard. These models as...
Canonical Microcircuits for Predictive Coding
This Perspective considers the influential notion of a canonical (cortical) microcircuit in light of recent theories about neuronal processing. Specifically, we conciliate quantitative studies of micr...
Language Models use Lookbacks to Track Beliefs
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilitie...
The Diffusion Duality
Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and ma...
Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily refle...
Self-Adapting Language Models
Large language models (LLMs) are powerful but static; they lack mechanisms to adapt their weights in response to new tasks, knowledge, or examples. We introduce Self-Adapting LLMs (SEAL), a framework ...
Chain-of-Thought Reasoning is a Policy Improvement Operator
Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-gen...
General agents need world models
Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of gene...
Reasoning with Language Model is Planning with World Model
Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still stru...
Emergent Abilities of Large Language Models
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we...
Observer Theory and the Ruliad: An Extension to the Wolfram Model
This paper presents an extension of Observer Theory within the context of the Ruliad, using a mathematically rigorous formalization with category theory as the unifying framework. This paper demonstra...
Minimum Description Length and Generalization Guarantees for Representation Learning
A major challenge in designing efficient statistical supervised learning algorithms is finding representations that perform well not only on available training samples but also on unseen data. While t...
Compute-Optimal LLMs Provably Generalize Better With Scale
Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regi...
Large Language Model Compression with Global Rank and Sparsity Optimization
Low-rank and sparse composite approximation is a natural idea to compress Large Language Models (LLMs). However, such an idea faces two primary challenges that adversely affect the performance of exis...
A Survey to Recent Progress Towards Understanding In-Context Learning
In-Context Learning (ICL) empowers Large Language Models (LLMs) with the ability to learn from a few examples provided in the prompt, enabling downstream generalization without the requirement for gra...
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would acceler...
The Complexity Dynamics of Grokking
We investigate the phenomenon of generalization through the lens of compression. In particular, we study the complexity dynamics of neural networks to explain grokking, where networks suddenly transit...
Denoising Diffusion Probabilistic Models
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results ...
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still a...
Mechanistic Design and Scaling of Hybrid Architectures
The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and e...
Mathematical discoveries from program search with large language models
Large language models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from conf...
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to b...
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well ...
Voyager: An Open-Ended Embodied Agent with Large Language Models
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human inter...
Trade-offs in Data Memorization via Strong Data Processing Inequalities
Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sen...
What Formal Languages Can Transformers Express? A Survey
As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal language...
The Illusion of State in State-Space Models
State-space models (SSMs) have emerged as a potential alternative to transformers. One theoretical weakness of transformers is that they cannot express certain kinds of sequential computation and stat...
Emerging Properties in Unified Multimodal Pretraining
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that nativel...
How much do language models memorize?
We propose a new method for estimating how much a model ``knows'' about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have stru...
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to addr...
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with parti...
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whe...
Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs
Even after fine-tuning and reinforcement learning, large language models (LLMs) can be difficult, if not impossible, to control reliably with prompts alone. We propose a new inference-time approach to...
AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of front...
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and ...
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well ...
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains in...
Learning to Model the World with Language
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world. While current agents can learn to execute simple langua...
Hardware-Efficient Attention for Fast Decoding
LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decodi...
FP8 Formats for Deep Learning
FP8 is a natural progression for accelerating deep learning training inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary inte...
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
Reinforcement learning (RL) yields substantial improvements in large language models (LLMs) downstream task performance and alignment with human values. Surprisingly, such large gains result from upda...
Deep Reinforcement Learning, a textbook
Deep reinforcement learning has gathered much attention recently. Impressive results were achieved in activities as diverse as autonomous driving, game playing, molecular recombination, and robotics. ...
Large Language Diffusion Models
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre...
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR wo...
Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment
This paper investigates approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents using Reinforcement Learning (RL). Specifically, we focus on multi-turn tool-use scenarios...
Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs)
Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as $\sf Muon$ and $\sf Scion$. After over a ...
Visual Planning: Let's Think Only with Images
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely...
The Platonic Representation Hypothesis
We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways...
Round and Round We Go! What makes Rotary Positional Encodings useful?
Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most p...
HiRoPE: Length Extrapolation for Code Models Using Hierarchical Position
Addressing the limitation of context length in large language models for code-related tasks is the primary focus of this paper. Existing LLMs are constrained by their pre-trained context lengths, lead...
Efficient Memory Management for Large Language Model Serving with PagedAttention
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for eac...
Transformers Represent Belief State Geometry in their Residual Stream
Produced while being an affiliate at PIBBSS[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS.…
Consequences of the Moosbauer-Poole Algorithms
Moosbauer and Poole have recently shown that the multiplication of two $5\times 5$ matrices requires no more than 93 multiplications in the (possibly non-commutative) coefficient ring, and that the mu...
Illuminating search spaces by mapping elites
Many fields use search algorithms, which automatically explore a search space to find high-performing solutions: chemists search through the space of molecules to discover new drugs; engineers search ...
Iteratively reweighted kernel machines efficiently learn sparse functions
The impressive practical performance of neural networks is often attributed to their ability to learn low-dimensional data representations and hierarchical structure directly from data. In this work, ...
Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features
In recent years neural networks have achieved impressive results on many technological and scientific tasks. Yet, the mechanism through which these models automatically select features, or patterns in...
Byte Latent Transformer: Patches Scale Better Than Tokens
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inferen...
Voyager: An Open-Ended Embodied Agent with Large Language Models
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human inter...
Denoising Diffusion Probabilistic Models
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results ...
A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27
How could machines learn as efficiently as humans and animals? How could machines learn to reason and plan? How could machines learn representations of percepts and action plans at multiple levels of ab...
Learning high-level visual representations from a child's perspective without strong inductive biases
Young children develop sophisticated internal models of the world based on their visual experience. Can such models be learned from a child's visual experience without strong inductive biases? To inve...
Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
In this paper, I introduce the retrieval problem, a simple yet common reasoning task that can be solved only by transformers with a minimum number of layers, which grows logarithmically with the input...
Scaling Laws for Precision
Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for b...
FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation
This work presents a Fully BInarized Large Language Model (FBI-LLM), demonstrating for the first time how to train a large-scale binary language model from scratch (not the partial binary or ternary L...
Learning to Reason for Long-Form Story Generation
Generating high-quality stories spanning thousands of tokens requires competency across a variety of skills, from tracking plot and character arcs to keeping a consistent and engaging style. Due to th...
Diffusion Models are Evolutionary Algorithms
In a convergence of machine learning and biology, we reveal that diffusion models are evolutionary algorithms. By considering evolution as a denoising process and reversed evolution as diffusion, we m...
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are mul...
Similarity of Neural Network Representations Revisited
Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network r...
Layers at Similar Depths Generate Similar Activations Across LLM Architectures
How do the latent spaces used by independently-trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs, and fi...
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS)—also referred to as “test-time computing”—has emerged as a prominent re...
TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
This paper focuses on multimodal alignment within the realm of Artificial Intelligence, particularly in text and image modalities. The semantic gap between the textual and visual modality poses a disc...
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Archi...
History, Development, and Principles of Large Language Models-An Introductory Survey
Language models serve as a cornerstone in natural language processing (NLP), utilizing mathematical methods to generalize language laws and knowledge for prediction and generation. Over extensive rese...
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
Mathematical reasoning tasks pose significant challenges for large language models (LLMs) because they require precise logical deduction and sequence analysis. In this work, we introduce the concept o...
A mathematical theory of semantic development in deep neural networks
An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fund...
Progress measures for grokking via mechanistic interpretability
Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding em...
Towards Automated Circuit Discovery for Mechanistic Interpretability
Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process the...
Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes
I argue that data becomes temporarily interesting by itself to some self-improving, but computationally limited, subjective observer once he learns to predict or compress the data in a better way, thu...
Circuit Tracing: Revealing Computational Graphs in Language Models
We describe an approach to tracing the “step-by-step” computation involved when a model responds to a single prompt.
On the Biology of a Large Language Model
We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
Generalization through variance: how noise shapes inductive biases in diffusion models
How diffusion models generalize beyond their training set is not known, and is somewhat mysterious given two facts: the optimum of the denoising score matching (DSM) objective usually used to train di...
GENERALIZATION THROUGH VARIANCE: HOW NOISE SHAPES INDUCTIVE BIASES IN DIFFUSION MODELS
How diffusion models generalize beyond their training set is not known, and is somewhat mysterious given two facts: the optimum of the denoising score matching (DSM) objective usually used to train di...
Training Large Language Models to Reason in a Continuous Latent Space
Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. Ho...
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language -- a question of key importance for understanding how language mo...
On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
Recent advancements in AI, such as OpenAI’s new o models, Google’s Gemini Thinking model, and Deepseek R1, are transforming LLMs into LRMs (Large Reasoning Models). Unlike LLMs, LRMs perform thinking ...
On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
Recent AI advancements, such as OpenAI's new models, are transforming LLMs into LRMs (Large Reasoning Models) that perform reasoning during inference, taking extra time and compute for higher-quality ...