Search Results
Research Papers
114 papers in library
Diffusion Beats Autoregressive in Data-Constrained Settings
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
LLM4Decompile: Decompiling Binary Code with Large Language Models
Large Language Models and Emergence: A Complex Systems Perspective
Transformers are Efficient Compilers, Provably
Fast and Simplex: 2-Simplicial Attention in Triton
Simple linear attention language models balance the recall-throughput tradeoff
Dimension Mixer: A Generalized Method for Structured Sparsity in Deep Neural Networks
Intelligence at the Edge of Chaos
Trends in AI Supercomputers
How Much Knowledge Can You Pack Into the Parameters of a Language Model?
The empirical status of predictive coding and active inference
Predictive eye movements are adjusted in a Bayes-optimal fashion in response to unexpectedly changing environmental probabilities
Continuous Thought Machines
A tutorial on the free-energy framework for modelling perception and learning
Language Models use Lookbacks to Track Beliefs
The Diffusion Duality
Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Chain-of-Thought Reasoning is a Policy Improvement Operator
General agents need world models
Reasoning with Language Model is Planning with World Model
Minimum Description Length and Generalization Guarantees for Representation Learning
Compute-Optimal LLMs Provably Generalize Better With Scale
Large Language Model Compression with Global Rank and Sparsity Optimization
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
Mechanistic Design and Scaling of Hybrid Architectures
Mathematical discoveries from program search with large language models
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Trade-offs in Data Memorization via Strong Data Processing Inequalities
Emerging Properties in Unified Multimodal Pretraining
How much do language models memorize?
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs
AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Learning to Model the World with Language
Hardware-Efficient Attention for Fast Decoding
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
Deep Reinforcement Learning, a textbook
Large Language Diffusion Models
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment
Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs)
Visual Planning: Let's Think Only with Images
The Platonic Representation Hypothesis
Round and Round We Go! What makes Rotary Positional Encodings useful?
HiRoPE: Length Extrapolation for Code Models Using Hierarchical Position
Efficient Memory Management for Large Language Model Serving with PagedAttention
Consequences of the Moosbauer-Poole Algorithms
Illuminating search spaces by mapping elites
Iteratively reweighted kernel machines efficiently learn sparse functions
Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features
Byte Latent Transformer: Patches Scale Better Than Tokens
Voyager: An Open-Ended Embodied Agent with Large Language Models
Denoising Diffusion Probabilistic Models
Learning high-level visual representations from a child's perspective without strong inductive biases
Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
Scaling Laws for Precision
FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation
Learning to Reason for Long-Form Story Generation
Diffusion Models are Evolutionary Algorithms
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
Similarity of Neural Network Representations Revisited
Layers at Similar Depths Generate Similar Activations Across LLM Architectures
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
History, Development, and Principles of Large Language Models-An Introductory Survey
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
A mathematical theory of semantic development in deep neural networks
Progress measures for grokking via mechanistic interpretability
Towards Automated Circuit Discovery for Mechanistic Interpretability
Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes
Generalization through variance: how noise shapes inductive biases in diffusion models
GENERALIZATION THROUGH VARIANCE: HOW NOISE SHAPES INDUCTIVE BIASES IN DIFFUSION MODELS
Training Large Language Models to Reason in a Continuous Latent Space
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
Papers by Tags
- Active Inference (2)
- Bayesian (1)
- Bayesian Brain (1)
- Computer science (1)
- Pure mathematics (1)
- Virtual reality (1)
- Visuomotor (1)
Timeline
July 2025
10 papers
Diffusion Beats Autoregressive in Data-Constrained Settings
Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promis...
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (su...
LLM4Decompile: Decompiling Binary Code with Large Language Models
Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in La...
Large Language Models and Emergence: A Complex Systems Perspective
Emergence is a concept in complexity science that describes how many-body systems manifest novel higher-level properties, properties that can be described by replacing high-dimensional mechanisms with...
Transformers are Efficient Compilers, Provably
Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks, including programming language understanding and generat...
Why Philosophers Should Care about Computational Complexity
One might think that, once we know something is computable, how efficiently it can be computed is a practical question with little further philosophical importance. In this essay, I offer a detailed case...
Fast and Simplex: 2-Simplicial Attention in Triton
Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count toge...
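The power law this snippet refers to is usually written in the compute-optimal ("Chinchilla"-style) parametric form; as a sketch, with N the parameter count, D the number of training tokens, and E, A, B, alpha, beta fitted constants (not values from this paper):

```latex
% Standard compute-optimal scaling form (constants are fitted, not from this paper).
% N = parameter count, D = number of training tokens.
\[
  L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]
```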
Simple linear attention language models balance the recall-throughput tradeoff
Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is...
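As a point of reference for the recall-throughput trade-off, here is a minimal sketch of the generic linear-attention recurrence such models build on: the key-value state is a fixed-size d x d matrix, so per-token decoding cost does not grow with context length. A generic illustration, not this paper's specific architecture.

```python
# Generic linear-attention decoding: the KV state is a fixed-size d x d matrix,
# so per-token cost is independent of context length (the throughput side of
# the trade-off); recall is limited by what fits in that state.
import numpy as np

def linear_attention_decode(qs, ks, vs):
    d = qs.shape[-1]
    S = np.zeros((d, d))      # running sum of outer(k, v): the recurrent state
    z = np.zeros(d)           # running sum of keys, for normalization
    outs = []
    for q, k, v in zip(qs, ks, vs):
        S += np.outer(k, v)
        z += k
        outs.append(S.T @ q / (z @ q + 1e-6))
    return np.stack(outs)

rng = np.random.default_rng(0)
T, d = 8, 4
q, k, v = np.abs(rng.normal(size=(3, T, d)))   # positive "feature-mapped" q/k
print(linear_attention_decode(q, k, v).shape)  # (8, 4)
```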
Dimension Mixer: A Generalized Method for Structured Sparsity in Deep Neural Networks
The recent success of multiple neural architectures like CNNs, Transformers, and MLP-Mixers motivated us to look for similarities and differences between them. We found that these architectures can be...
Intelligence at the Edge of Chaos
We explore the emergence of intelligent behavior in artificial systems by investigating how the complexity of rule-based systems influences the capabilities of models trained to predict these rules. O...
June 2025
53 papers
Magistral
We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior mod...
Reinforcement Pre-Training
In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a rea...
Trends in AI Supercomputers
Frontier AI development relies on powerful AI supercomputers, yet analysis of these systems is limited. We create a dataset of 500 AI supercomputers from 2019 to 2025 and analyze key trends in perform...
eGPU: Extending eBPF Programmability and Observability to GPUs
Precise GPU observability and programmability are essential for optimizing performance in AI workloads and other computationally intensive high-performance computing (HPC) applications. In this paper,...
Attention-Level Speculation
As Large Language Models (LLMs) grow in size and context length, efficient inference strategies are essential to maintain low-latency token generation. Unfortunately, conventional tensor and data para...
Qwen3 Technical Report
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capa...
LazyLog: A New Shared Log Abstraction for Low-Latency Applications
Shared logs offer linearizable total order across storage shards. However, they enforce this order eagerly upon ingestion, leading to high latencies. We observe that in many modern shared-log applicat...
How Much Knowledge Can You Pack Into the Parameters of a Language Model?
It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the p...
The empirical status of predictive coding and active inference
Research on predictive processing models has focused largely on two specific algorithmic theories: Predictive Coding for perception and Active Inference for decision-making. While these interconnected...
Predictive eye movements are adjusted in a Bayes-optimal fashion in response to unexpectedly changing environmental probabilities
This study examined the application of active inference to dynamic visuomotor control. Active inference proposes that actions are dynamically planned according to uncertainty about sensory information...
Continuous Thought Machines
Biological brains demonstrate complex neural activity, where the timing and interplay between neurons is critical to how brains process information. Most deep learning architectures simplify neural ac...
A tutorial on the free-energy framework for modelling perception and learning
This paper provides an easy to follow tutorial on the free-energy framework for modelling perception developed by Friston, which extends the predictive coding model of Rao and Ballard. These models as...
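A minimal sketch in the spirit of the tutorial's introductory example: infer a single hidden cause by gradient descent on precision-weighted prediction errors. The observation function g and all numbers below are illustrative assumptions, not taken from the paper.

```python
# Predictive-coding style inference for one hidden variable phi:
# observation u ~ N(g(phi), Sigma_u), prior phi ~ N(v_p, Sigma_p).
# phi is updated by gradient descent on the free energy, i.e. it is
# driven by the two prediction errors.
def g(phi):                 # generative (observation) function -- illustrative
    return phi ** 2

def dg(phi):
    return 2 * phi

u = 2.0                     # observed sensory input
v_p, Sigma_p = 3.0, 1.0     # prior mean and variance of the hidden cause
Sigma_u = 1.0               # observation noise variance

phi, lr = v_p, 0.01
for _ in range(500):
    eps_u = (u - g(phi)) / Sigma_u    # sensory prediction error
    eps_p = (phi - v_p) / Sigma_p     # prior prediction error
    phi += lr * (eps_u * dg(phi) - eps_p)

print(round(phi, 3))        # approximate posterior mode of the hidden cause
```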
Canonical Microcircuits for Predictive Coding
This Perspective considers the influential notion of a canonical (cortical) microcircuit in light of recent theories about neuronal processing. Specifically, we conciliate quantitative studies of micr...
Language Models use Lookbacks to Track Beliefs
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilitie...
The Diffusion Duality
Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and ma...
Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily refle...
Self-Adapting Language Models
Large language models (LLMs) are powerful but static; they lack mechanisms to adapt their weights in response to new tasks, knowledge, or examples. We introduce Self-Adapting LLMs (SEAL), a framework ...
Chain-of-Thought Reasoning is a Policy Improvement Operator
Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-gen...
General agents need world models
Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of gene...
Reasoning with Language Model is Planning with World Model
Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still stru...
Emergent Abilities of Large Language Models
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we...
Observer Theory and the Ruliad: An Extension to the Wolfram Model
This paper presents an extension of Observer Theory within the context of the Ruliad, using a mathematically rigorous formalization with category theory as the unifying framework. This paper demonstra...
eGPU: Extending eBPF Programmability and Observability to GPUs
Minimum Description Length and Generalization Guarantees for Representation Learning
A major challenge in designing efficient statistical supervised learning algorithms is finding representations that perform well not only on available training samples but also on unseen data. While t...
Compute-Optimal LLMs Provably Generalize Better With Scale
Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regi...
Large Language Model Compression with Global Rank and Sparsity Optimization
Low-rank and sparse composite approximation is a natural idea to compress Large Language Models (LLMs). However, such an idea faces two primary challenges that adversely affect the performance of exis...
A Survey to Recent Progress Towards Understanding In-Context Learning
In-Context Learning (ICL) empowers Large Language Models (LLMs) with the ability to learn from a few examples provided in the prompt, enabling downstream generalization without the requirement for gra...
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would acceler...
The Complexity Dynamics of Grokking
We investigate the phenomenon of generalization through the lens of compression. In particular, we study the complexity dynamics of neural networks to explain grokking, where networks suddenly transit...
Denoising Diffusion Probabilistic Models
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results ...
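The training step this snippet implies can be written down compactly: corrupt clean data with Gaussian noise at a random timestep and regress the noise. A minimal sketch with a toy batch and a stand-in for the noise-prediction network:

```python
# DDPM-style training step sketch: sample t, form
# x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,
# and minimize ||eps - eps_theta(x_t, t)||^2.
# The linear beta schedule and the zero "network" are stand-ins.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)            # \bar{alpha}_t

def forward_noise(x0, t, rng):
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))                    # toy batch of "images"
t = rng.integers(0, T)
xt, eps = forward_noise(x0, t, rng)
eps_pred = np.zeros_like(eps)                   # stand-in for eps_theta(x_t, t)
loss = np.mean((eps - eps_pred) ** 2)           # the "simple" DDPM objective
print(int(t), round(float(loss), 3))
```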
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still a...
Mechanistic Design and Scaling of Hybrid Architectures
The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and e...
Mathematical discoveries from program search with large language models
Large language models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from conf...
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to b...
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well ...
Voyager: An Open-Ended Embodied Agent with Large Language Models
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human inter...
Trade-offs in Data Memorization via Strong Data Processing Inequalities
Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sen...
What Formal Languages Can Transformers Express? A Survey
As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal language...
The Illusion of State in State-Space Models
State-space models (SSMs) have emerged as a potential alternative to transformers. One theoretical weakness of transformers is that they cannot express certain kinds of sequential computation and stat...
Emerging Properties in Unified Multimodal Pretraining
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that nativel...
How much do language models memorize?
We propose a new method for estimating how much a model ``knows'' about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have stru...
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to addr...
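For reference, the baseline computation whose N x N score matrix is what the IO-aware kernel avoids materializing in high-bandwidth memory; this sketch is the standard attention it improves on, not the FlashAttention algorithm itself:

```python
# Standard softmax attention: materializes an (N, N) score matrix, which is the
# quadratic memory traffic that FlashAttention's tiled kernel avoids.
import numpy as np

def naive_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (N, N): the bottleneck
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

rng = np.random.default_rng(0)
N, d = 128, 16
Q, K, V = rng.normal(size=(3, N, d))
print(naive_attention(Q, K, V).shape)              # (128, 16)
```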
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with parti...
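For context, a minimal sketch of the BPE merge-learning loop itself (repeatedly merge the most frequent adjacent symbol pair); the toy corpus and merge count are illustrative, and the pretokenization rules this paper actually studies are ignored here:

```python
# Minimal BPE training loop: count adjacent symbol pairs, merge the most
# frequent pair, repeat. Toy corpus; no pretokenization handling.
from collections import Counter

def learn_bpe(words, num_merges):
    vocab = Counter(words)                  # words: tuples of symbols (chars here)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w, freq in vocab.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for w, freq in vocab.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1]); i += 2
                else:
                    out.append(w[i]); i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = [tuple("lower"), tuple("lowest"), tuple("low"), tuple("newer")]
print(learn_bpe(corpus, 5))   # e.g. [('l', 'o'), ('lo', 'w'), ...]
```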
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whe...
Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs
Even after fine-tuning and reinforcement learning, large language models (LLMs) can be difficult, if not impossible, to control reliably with prompts alone. We propose a new inference-time approach to...
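A toy sequential Monte Carlo sketch of the general steering idea (propose continuations from the model, reweight particles by a constraint potential, resample when weights degenerate); the stand-in "language model", vocabulary, and constraint below are illustrative and not the paper's probabilistic-program interface:

```python
# Toy SMC steering: particles are partial sequences, extended by a stand-in LM,
# reweighted by a soft constraint, and resampled on low effective sample size.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = np.array(list("abc "))

def lm_probs(prefix):
    return np.full(len(VOCAB), 1.0 / len(VOCAB))   # stand-in LM: uniform

def potential(prefix):
    return 0.1 if " " in prefix else 1.0           # soft constraint: avoid spaces

def smc_steer(n_particles=32, length=8):
    particles = [""] * n_particles
    weights = np.full(n_particles, 1.0 / n_particles)
    for _ in range(length):
        for i, p in enumerate(particles):
            tok = rng.choice(VOCAB, p=lm_probs(p))
            particles[i] = p + tok
            weights[i] *= potential(particles[i]) / potential(p)  # incremental weight
        weights /= weights.sum()
        if 1.0 / np.sum(weights ** 2) < n_particles / 2:   # low effective sample size
            idx = rng.choice(n_particles, size=n_particles, p=weights)
            particles = [particles[i] for i in idx]
            weights = np.full(n_particles, 1.0 / n_particles)
    return particles[int(np.argmax(weights))]

print(smc_steer())
```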
AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of front...
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and ...
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well ...
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains in...
Learning to Model the World with Language
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world. While current agents can learn to execute simple langua...
May 2025
37 papers
Hardware-Efficient Attention for Fast Decoding
LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decodi...
FP8 Formats for Deep Learning
FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary inte...
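A rough simulation of rounding to the E4M3 variant (one of the two formats the paper proposes), keeping only the headline properties: clamp to the max normal value 448 and keep about three mantissa bits. Subnormals, NaN encoding, and exact tie-breaking are ignored; this is a sketch, not a reference codec.

```python
# Fake-quantize float64 values to roughly E4M3 precision: clamp to +-448 and
# round the significand to 1 implicit + 3 stored bits. Not a bit-exact codec.
import numpy as np

def fake_quantize_e4m3(x):
    x = np.clip(np.asarray(x, dtype=np.float64), -448.0, 448.0)
    m, e = np.frexp(x)                 # x = m * 2**e, with |m| in [0.5, 1)
    m = np.round(m * 16.0) / 16.0      # keep 4 significant bits of the significand
    return np.ldexp(m, e)

vals = np.array([0.1, 1.0, 3.14159, 100.0, 1000.0])
print(fake_quantize_e4m3(vals))        # e.g. 3.14159 -> 3.25, 1000 -> 448
```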
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
Reinforcement learning (RL) yields substantial improvements in large language models (LLMs) downstream task performance and alignment with human values. Surprisingly, such large gains result from upda...
Deep Reinforcement Learning, a textbook
Deep reinforcement learning has gathered much attention recently. Impressive results were achieved in activities as diverse as autonomous driving, game playing, molecular recombination, and robotics. ...
Large Language Diffusion Models
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre...
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR wo...
Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment
This paper investigates approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents using Reinforcement Learning (RL). Specifically, we focus on multi-turn tool-use scenarios...
Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs)
Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as $\sf Muon$ and $\sf Scion$. After over a ...
Visual Planning: Let's Think Only with Images
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely...
The Platonic Representation Hypothesis
We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways...
Round and Round We Go! What makes Rotary Positional Encodings useful?
Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most p...
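As background for what the paper analyzes, a sketch of applying rotary position embeddings: each feature pair is rotated by an angle proportional to the token position, so query-key dot products depend only on relative position. The base 10000 and the half-split pairing are common implementation defaults assumed here.

```python
# Rotary position embedding sketch (half-split pairing, base 10000). The final
# print shows the relative-position property: the dot product depends only on
# the position difference m - n.
import numpy as np

def rope(x, pos, base=10000.0):
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)
    theta = pos * inv_freq
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate(
        [x1 * np.cos(theta) - x2 * np.sin(theta),
         x1 * np.sin(theta) + x2 * np.cos(theta)], axis=-1)

q, k = np.ones(8), np.ones(8)
print(np.dot(rope(q, 5), rope(k, 3)), np.dot(rope(q, 7), rope(k, 5)))  # equal
```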
HiRoPE: Length Extrapolation for Code Models Using Hierarchical Position
Addressing the limitation of context length in large language models for code-related tasks is the primary focus of this paper. Existing LLMs are constrained by their pre-trained context lengths, lead...
Efficient Memory Management for Large Language Model Serving with PagedAttention
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for eac...
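A data-structure sketch of the paging idea: the KV cache lives in fixed-size physical blocks, and each request keeps a block table from logical token positions to physical slots, so memory is allocated on demand rather than reserved contiguously. The block size and the trivial free-list policy are illustrative.

```python
# Block-table sketch for a paged KV cache: append_token() returns the physical
# (block, slot) where this token's key/value would be written.
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}    # seq_id -> list of physical block ids
        self.lengths = {}         # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                   # current block full (or none yet)
            table.append(self.free_blocks.pop())  # allocate a new physical block
        self.lengths[seq_id] = n + 1
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE

cache = PagedKVCache(num_blocks=8)
slots = [cache.append_token("req-0") for _ in range(20)]
print(slots[0], slots[15], slots[16])   # token 16 starts a second physical block
```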
Transformers Represent Belief State Geometry in their Residual Stream
Produced while being an affiliate at PIBBSS[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS.…
Consequences of the Moosbauer-Poole Algorithms
Moosbauer and Poole have recently shown that the multiplication of two $5\times 5$ matrices requires no more than 93 multiplications in the (possibly non-commutative) coefficient ring, and that the mu...
Illuminating search spaces by mapping elites
Many fields use search algorithms, which automatically explore a search space to find high-performing solutions: chemists search through the space of molecules to discover new drugs; engineers search ...
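A minimal MAP-Elites loop matching the description above: keep the best solution found for each cell of a low-dimensional behavior-descriptor space, and generate candidates by mutating randomly chosen elites. The toy objective and descriptor are illustrative.

```python
# Minimal MAP-Elites: the archive maps descriptor cells to the fittest solution seen.
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):                 # quality to maximize (toy)
    return -np.sum(x ** 2)

def descriptor(x):              # behavior descriptor (toy: first two coordinates)
    return x[:2]

def cell(desc, bins=10, low=-2.0, high=2.0):
    idx = np.clip(((desc - low) / (high - low) * bins).astype(int), 0, bins - 1)
    return tuple(idx)

archive = {}                    # cell -> (fitness, solution)
for _ in range(5000):
    if archive and rng.random() < 0.9:
        key = list(archive.keys())[rng.integers(len(archive))]
        x = archive[key][1] + 0.1 * rng.normal(size=5)   # mutate a random elite
    else:
        x = rng.uniform(-2, 2, size=5)                   # random bootstrap
    f, c = fitness(x), cell(descriptor(x))
    if c not in archive or f > archive[c][0]:            # keep the best per cell
        archive[c] = (f, x)

print(len(archive), "cells filled")
```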
Iteratively reweighted kernel machines efficiently learn sparse functions
The impressive practical performance of neural networks is often attributed to their ability to learn low-dimensional data representations and hierarchical structure directly from data. In this work, ...
Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features
In recent years neural networks have achieved impressive results on many technological and scientific tasks. Yet, the mechanism through which these models automatically select features, or patterns in...
Byte Latent Transformer: Patches Scale Better Than Tokens
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inferen...
Voyager: An Open-Ended Embodied Agent with Large Language Models
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human inter...
Denoising Diffusion Probabilistic Models
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results ...
A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27
How could machines learn as efficiently as humans and animals? How could machines learn to reason and plan? How could machines learn representations of percepts and action plans at multiple levels of ab...
Learning high-level visual representations from a child's perspective without strong inductive biases
Young children develop sophisticated internal models of the world based on their visual experience. Can such models be learned from a child's visual experience without strong inductive biases? To inve...
Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
In this paper, I introduce the retrieval problem, a simple yet common reasoning task that can be solved only by transformers with a minimum number of layers, which grows logarithmically with the input...
Scaling Laws for Precision
Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for b...
FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation
This work presents a Fully BInarized Large Language Model (FBI-LLM), demonstrating for the first time how to train a large-scale binary language model from scratch (not the partial binary or ternary L...
Learning to Reason for Long-Form Story Generation
Generating high-quality stories spanning thousands of tokens requires competency across a variety of skills, from tracking plot and character arcs to keeping a consistent and engaging style. Due to th...
Diffusion Models are Evolutionary Algorithms
In a convergence of machine learning and biology, we reveal that diffusion models are evolutionary algorithms. By considering evolution as a denoising process and reversed evolution as diffusion, we m...
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are mul...
Similarity of Neural Network Representations Revisited
Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network r...
Layers at Similar Depths Generate Similar Activations Across LLM Architectures
How do the latent spaces used by independently-trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs, and fi...
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS)—also referred to as “test-time computing”—has emerged as a prominent re...
TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
This paper focuses on multimodal alignment within the realm of Artificial Intelligence, particularly in text and image modalities. The semantic gap between the textual and visual modality poses a disc...
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Archi...
History, Development, and Principles of Large Language Models-An Introductory Survey
Language models serve as a cornerstone in natural language processing (NLP), utilizing mathematical methods to generalize language laws and knowledge for prediction and generation. Over extensive rese...
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
Mathematical reasoning tasks pose significant challenges for large language models (LLMs) because they require precise logical deduction and sequence analysis. In this work, we introduce the concept o...
April 2025
12 papers
A mathematical theory of semantic development in deep neural networks
An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fund...
Progress measures for grokking via mechanistic interpretability
Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding em...
Towards Automated Circuit Discovery for Mechanistic Interpretability
Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process the...
Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes
I argue that data becomes temporarily interesting by itself to some self-improving, but computationally limited, subjective observer once he learns to predict or compress the data in a better way, thu...
Circuit Tracing: Revealing Computational Graphs in Language Models
We describe an approach to tracing the “step-by-step” computation involved when a model responds to a single prompt.
On the Biology of a Large Language Model
We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
Generalization through variance: how noise shapes inductive biases in diffusion models
How diffusion models generalize beyond their training set is not known, and is somewhat mysterious given two facts: the optimum of the denoising score matching (DSM) objective usually used to train di...
GENERALIZATION THROUGH VARIANCE: HOW NOISE SHAPES INDUCTIVE BIASES IN DIFFUSION MODELS
How diffusion models generalize beyond their training set is not known, and is somewhat mysterious given two facts: the optimum of the denoising score matching (DSM) objective usually used to train di...
Training Large Language Models to Reason in a Continuous Latent Space
Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. Ho...
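A toy sketch of the "continuous latent" idea described above: instead of decoding a token at each reasoning step, the model's last hidden state is fed back as the next input embedding. The tiny random recurrence below is a stand-in for an LLM, used only to make the loop concrete.

```python
# Continuous-thought loop sketch: the hidden state is fed back as the next input
# embedding for a few latent steps, with no tokens emitted. The random
# recurrent "model" is a stand-in, not an actual language model.
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_h = rng.normal(size=(d, d)) / np.sqrt(d)
W_x = rng.normal(size=(d, d)) / np.sqrt(d)

def model_step(h, x):
    return np.tanh(W_h @ h + W_x @ x)

h = np.zeros(d)
x = rng.normal(size=d)        # embedding of the last prompt token
for _ in range(4):            # four latent "thought" steps
    h = model_step(h, x)
    x = h                     # last hidden state becomes the next input embedding

print(np.round(h[:4], 3))     # latent state that would condition the final answer
```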
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language -- a question of key importance for understanding how language mo...
February 2025
2 papers
On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
Recent advancements in AI, such as OpenAI’s new o models, Google’s Gemini Thinking model, and Deepseek R1, are transforming LLMs into LRMs (Large Reasoning Models). Unlike LLMs, LRMs perform thinking ...
On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
Recent AI advancements, such as OpenAI's new models, are transforming LLMs into LRMs (Large Reasoning Models) that perform reasoning during inference, taking extra time and compute for higher-quality ...
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
Mathematical reasoning tasks pose significant challenges for large language models (LLMs) because they require precise logical deduction and sequence analysis. In this work, we introduce the concept o...
A mathematical theory of semantic development in deep neural networks
An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fund...
Progress measures for grokking via mechanistic interpretability
Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding em...
Towards Automated Circuit Discovery for Mechanistic Interpretability
Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process the...
Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes
I argue that data becomes temporarily interesting by itself to some self-improving, but computationally limited, subjective observer once he learns to predict or compress the data in a better way, thu...
Circuit Tracing: Revealing Computational Graphs in Language Models
We describe an approach to tracing the “step-by-step” computation involved when a model responds to a single prompt.
On the Biology of a Large Language Model
We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
Generalization through variance: how noise shapes inductive biases in diffusion models
How diffusion models generalize beyond their training set is not known, and is somewhat mysterious given two facts: the optimum of the denoising score matching (DSM) objective usually used to train di...
Training Large Language Models to Reason in a Continuous Latent Space
Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. Ho...
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language -- a question of key importance for understanding how language mo...
On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
Recent advancements in AI, such as OpenAI’s new o models, Google’s Gemini Thinking model, and Deepseek R1, are transforming LLMs into LRMs (Large Reasoning Models). Unlike LLMs, LRMs perform thinking ...