Ludwig - Research Papers

Diffusion Beats Autoregressive in Data-Constrained Settings

Mihir Prabhudesai, Menging Wu, Amir Zadeh, et al. · (2025) · DOI: 10.48550/arXiv.2507.15857

Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promis...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Robotics Computer Science - Computer Vision and Pattern Recognition

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

Alex Cloud, Minh Le, James Chua, et al. · (2025) · DOI: 10.48550/arXiv.2507.14805

We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (su...

Computer Science - Artificial Intelligence Computer Science - Machine Learning

LLM4Decompile: Decompiling Binary Code with Large Language Models

Hanzhuo Tan, Qi Luo, Jing Li, et al. · (2024) · DOI: 10.48550/arXiv.2403.05286

Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in La...

Computer Science - Computation and Language Computer Science - Programming Languages

Large Language Models and Emergence: A Complex Systems Perspective

David C. Krakauer, John W. Krakauer, Melanie Mitchell · (2025) · DOI: 10.48550/arXiv.2506.11135

Emergence is a concept in complexity science that describes how many-body systems manifest novel higher-level properties, properties that can be described by replacing high-dimensional mechanisms with...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language Computer Science - Neural and Evolutionary Computing

Transformers are Efficient Compilers, Provably

Xiyu Zhai, Runlong Zhou, Liao Zhang, et al. · (2025) · DOI: 10.48550/arXiv.2410.14706

Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks, including programming language understanding and generat...

Computer Science - Machine Learning Computer Science - Programming Languages

Why Philosophers Should Care about Computational Complexity

Scott Aaronson · (2013)

One might think that, once we know something is computable, how eﬃciently it can be computed is a practical question with little further philosophical importance. In this essay, I oﬀer a detailed case...

Fast and Simplex: 2-Simplicial Attention in Triton

Aurko Roy, Timothy Chou, Sai Surya Duvvuri, et al. · (2025) · DOI: 10.48550/arXiv.2507.02754

Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count toge...

Computer Science - Artificial Intelligence Computer Science - Machine Learning

Simple linear attention language models balance the recall-throughput tradeoff

Simran Arora, Sabri Eyuboglu, Michael Zhang, et al. · (2025) · DOI: 10.48550/arXiv.2402.18668

Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is...

Computer Science - Machine Learning Computer Science - Computation and Language

Dimension Mixer: A Generalized Method for Structured Sparsity in Deep Neural Networks

Suman Sapkota, Binod Bhattarai · · (2023) · DOI: 10.48550/arXiv.2311.18735

The recent success of multiple neural architectures like CNNs, Transformers, and MLP-Mixers motivated us to look for similarities and differences between them. We found that these architectures can be...

Computer Science - Machine Learning

Intelligence at the Edge of Chaos

Shiyang Zhang, Aakash Patel, Syed A. Rizvi, et al. · · (2025) · DOI: 10.48550/arXiv.2410.02536

We explore the emergence of intelligent behavior in artificial systems by investigating how the complexity of rule-based systems influences the capabilities of models trained to predict these rules. O...

Computer Science - Artificial Intelligence Computer Science - Neural and Evolutionary Computing

Magistral

Mistral-AI, Abhinav Rastogi, Albert Q. Jiang, et al. · · (2025) · DOI: 10.48550/arXiv.2506.10910

We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior mod...

Computer Science - Computation and Language

Reinforcement Pre-Training

Qingxiu Dong, Li Dong, Yao Tang, et al. · · (2025) · DOI: 10.48550/arXiv.2506.08007

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a rea...

Computer Science - Computation and Language

Trends in AI Supercomputers

Konstantin F. Pilz, James Sanders, Robi Rahman, et al. · · (2025) · DOI: 10.48550/arXiv.2504.16026

Frontier AI development relies on powerful AI supercomputers, yet analysis of these systems is limited. We create a dataset of 500 AI supercomputers from 2019 to 2025 and analyze key trends in perform...

Computer Science - Artificial Intelligence Computer Science - Computers and Society

eGPU: Extending eBPF Programmability and Observability to GPUs

Yiwei Yang, Tong Yu, Yusheng Zheng, et al. · · DOI: 10.1145/3723851.3726984

Precise GPU observability and programmability are essential for optimizing performance in AI workloads and other computationally intensive high-performance computing (HPC) applications. In this paper,...

Attention-Level Speculation

Jack Cai, Ammar Vora, Randolph Zhang, et al. · · (2025) · DOI:

As Large Language Models (LLMs) grow in size and context length, efficient inference strategies are essential to maintain low-latency token generation. Unfortunately, conventional tensor and data para...

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, et al. · · (2025) · DOI: 10.48550/arXiv.2505.09388

In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capa...

Computer Science - Computation and Language

LazyLog: A New Shared Log Abstraction for Low-Latency Applications

Xuhao Luo, Shreesha G. Bhat, Jiyu Hu, et al. · · DOI: 10.1145/3694715.3695983

Shared logs offer linearizable total order across storage shards. However, they enforce this order eagerly upon ingestion, leading to high latencies. We observe that in many modern shared-log applicat...

How Much Knowledge Can You Pack Into the Parameters of a Language Model?

Adam Roberts, Colin Raffel, Noam Shazeer · · (2020) · DOI: 10.48550/arXiv.2002.08910

It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the p...

Computer Science - Machine Learning Computer Science - Computation and Language Statistics - Machine Learning

The empirical status of predictive coding and active inference

Rowan Hodson, Marishka Mehta, Ryan Smith · Neuroscience & Biobehavioral Reviews · (2024) · DOI: 10.1016/j.neubiorev.2023.105473

Research on predictive processing models has focused largely on two specific algorithmic theories: Predictive Coding for perception and Active Inference for decision-making. While these interconnected...

Active Inference Bayesian Brain Computational Modeling Predictive Coding Predictive Processing

Predictive eye movements are adjusted in a Bayes-optimal fashion in response to unexpectedly changing environmental probabilities

Tom Arthur, David J. Harris · Cortex · (2021) · DOI: 10.1016/j.cortex.2021.09.017

This study examined the application of active inference to dynamic visuomotor control. Active inference proposes that actions are dynamically planned according to uncertainty about sensory information...

Active Inference Predictive Coding Bayesian Virtual reality Visuomotor

Continuous Thought Machines

Luke Darlow, Ciaran Regan, Sebastian Risi, et al. · · (2025) · DOI: 10.48550/arXiv.2505.05522

Biological brains demonstrate complex neural activity, where the timing and interplay between neurons is critical to how brains process information. Most deep learning architectures simplify neural ac...

Computer Science - Artificial Intelligence Computer Science - Machine Learning

A tutorial on the free-energy framework for modelling perception and learning

Rafal Bogacz · Journal of Mathematical Psychology · (2017) · DOI: 10.1016/j.jmp.2015.11.003

This paper provides an easy to follow tutorial on the free-energy framework for modelling perception developed by Friston, which extends the predictive coding model of Rao and Ballard. These models as...

Canonical Microcircuits for Predictive Coding

Andre M. Bastos, W. Martin Usrey, Rick A. Adams, et al. · Neuron · (2012) · DOI: 10.1016/j.neuron.2012.10.038

This Perspective considers the influential notion of a canonical (cortical) microcircuit in light of recent theories about neuronal processing. Specifically, we conciliate quantitative studies of micr...

Language Models use Lookbacks to Track Beliefs

Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, et al. · · (2025) · DOI: 10.48550/arXiv.2505.14685

How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilitie...

Computer Science - Computation and Language

The Diffusion Duality

Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, et al. · · (2025) · DOI: 10.48550/arXiv.2506.10892

Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and ma...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language

Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

C. Opus, A. Lawsen · · (2025) · DOI: 10.48550/arXiv.2506.09250

Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily refle...

Computer Science - Artificial Intelligence Computer Science - Machine Learning

Self-Adapting Language Models

Adam Zweiger, Jyothish Pari, Han Guo, et al. · · (2025) · DOI:

Large language models (LLMs) are powerful but static; they lack mechanisms to adapt their weights in response to new tasks, knowledge, or examples. We introduce Self-Adapting LLMs (SEAL), a framework ...

Chain-of-Thought Reasoning is a Policy Improvement Operator

Hugh Zhang, David C. Parkes · · (2023) · DOI: 10.48550/arXiv.2309.08589

Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-gen...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language

General agents need world models

Jonathan Richens, David Abel, Alexis Bellot, et al. · · (2025) · DOI: 10.48550/arXiv.2506.01622

Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of gene...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Statistics - Machine Learning Computer Science - Robotics

Reasoning with Language Model is Planning with World Model

Shibo Hao, Yi Gu, Haodi Ma, et al. · · (2023) · DOI: 10.48550/arXiv.2305.14992

Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still stru...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, et al. · · (2022) · DOI:

Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we...

[2506.08007] Reinforcement Pre-Training

· · DOI:

Observer Theory and the Ruliad: An Extension to the Wolfram Model

Sam Senchal · Observer Theory · (2025) · DOI:

This paper presents an extension of Observer Theory within the context of the Ruliad, using a mathematically rigorous formalization with category theory as the unifying framework. This paper demonstra...

eGPU: Extending eBPF Programmability and Observability to GPUs

Yiwei Yang, Tong Yu, Yusheng Zheng, et al. · · (2025) · DOI: 10.1145/3723851.3726984

Minimum Description Length and Generalization Guarantees for Representation Learning

Milad Sefidgaran, Abdellatif Zaidi, Piotr Krasnowski · · DOI:

A major challenge in designing efficient statistical supervised learning algorithms is finding representations that perform well not only on available training samples but also on unseen data. While t...

Compute-Optimal LLMs Provably Generalize Better With Scale

Marc Finzi, Sanyam Kapoor, Diego Granziol, et al. · · (2025) · DOI: 10.48550/arXiv.2504.15208

Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regi...

Computer Science - Artificial Intelligence Computer Science - Machine Learning

Large Language Model Compression with Global Rank and Sparsity Optimization

Changhai Zhou, Qian Qiao, Weizhong Zhang, et al. · · (2025) · DOI: 10.48550/arXiv.2505.03801

Low-rank and sparse composite approximation is a natural idea to compress Large Language Models (LLMs). However, such an idea faces two primary challenges that adversely affect the performance of exis...

Computer Science - Artificial Intelligence Computer Science - Machine Learning

A Survey to Recent Progress Towards Understanding In-Context Learning

Haitao Mao, Guangliang Liu, Yao Ma, et al. · · (2025) · DOI:

In-Context Learning (ICL) empowers Large Language Models (LLMs) with the ability to learn from a few examples provided in the prompt, enabling downstream generalization without the requirement for gra...

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Jenny Zhang, Shengran Hu, Cong Lu, et al. · · (2025) · DOI: 10.48550/arXiv.2505.22954

Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would acceler...

Computer Science - Artificial Intelligence

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngân Vu, Marvin Eisenberger, et al. · · DOI:

The Complexity Dynamics of Grokking

Branton DeMoss, Silvia Sapora, Jakob Foerster, et al. · · (2024) · DOI: 10.48550/arXiv.2412.09810

We investigate the phenomenon of generalization through the lens of compression. In particular, we study the complexity dynamics of neural networks to explain grokking, where networks suddenly transit...

Computer Science - Machine Learning

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, Pieter Abbeel · · (2020) · DOI: 10.48550/arXiv.2006.11239

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results ...

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, et al. · · (2015) · DOI: 10.48550/arXiv.1503.03585

A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still a...

Computer Science - Machine Learning Statistics - Machine Learning Condensed Matter - Disordered Systems and Neural Networks Quantitative Biology - Neurons and Cognition

Mechanistic Design and Scaling of Hybrid Architectures

Michael Poli, Armin W. Thomas, Eric Nguyen, et al. · · (2024) · DOI: 10.48550/arXiv.2403.17844

The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and e...

Computer Science - Machine Learning

Mathematical discoveries from program search with large language models

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, et al. · Nature · (2024) · DOI: 10.1038/s41586-023-06924-6

Large language models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from conf...

Computer science Pure mathematics

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

Michael M. Bronstein, Joan Bruna, Taco Cohen, et al. · · (2021) · DOI: 10.48550/arXiv.2104.13478

The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to b...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Statistics - Machine Learning Computer Science - Computational Geometry Computer Science - Computer Vision and Pattern Recognition

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, et al. · · (2025) · DOI: 10.48550/arXiv.2506.01939

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well ...

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, et al. · · (2023) · DOI: 10.48550/arXiv.2305.16291

We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human inter...

Trade-offs in Data Memorization via Strong Data Processing Inequalities

Vitaly Feldman, Guy Kornowski, Xin Lyu · · (2025) · DOI: 10.48550/arXiv.2506.01855

Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sen...

Computer Science - Machine Learning Statistics - Machine Learning Computer Science - Information Theory Mathematics - Information Theory

What Formal Languages Can Transformers Express? A Survey

Lena Strobl, William Merrill, Gail Weiss, et al. · Transactions of the Association for Computational Linguistics · (2024) · DOI: 10.1162/tacl_a_00663

As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal language...

The Illusion of State in State-Space Models

William Merrill, Jackson Petty, Ashish Sabharwal · · DOI:

State-space models (SSMs) have emerged as a potential alternative to transformers. One theoretical weakness of transformers is that they cannot express certain kinds of sequential computation and stat...

Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models

Jacob Pfau, William Merrill, Samuel R Bowman · · (2024) · DOI:

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, et al. · · (2025) · DOI: 10.48550/arXiv.2505.14683

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that nativel...

Computer Science - Computer Vision and Pattern Recognition

How much do language models memorize?

John X. Morris, Chawin Sitawarin, Chuan Guo, et al. · · (2025) · DOI: 10.48550/arXiv.2505.24832

We propose a new method for estimating how much a model ``knows'' about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have stru...

Computer Science - Computation and Language

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, et al. · · (2022) · DOI: 10.48550/arXiv.2205.14135

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to addr...

Computer Science - Machine Learning

BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

Sander Land, Catherine Arnett · · (2025) · DOI: 10.48550/arXiv.2505.24689

Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with parti...

Computer Science - Computation and Language

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Mingjie Liu, Shizhe Diao, Ximing Lu, et al. · · (2025) · DOI: 10.48550/arXiv.2505.24864

Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whe...

Computer Science - Artificial Intelligence Computer Science - Computation and Language

Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs

Alexander K. Lew, Tan Zhi-Xuan, Gabriel Grand, et al. · · (2023) · DOI: 10.48550/arXiv.2306.03081

Even after fine-tuning and reinforcement learning, large language models (LLMs) can be difficult, if not impossible, to control reliably with prompts alone. We propose a new inference-time approach to...

Computer Science - Artificial Intelligence Computer Science - Computation and Language Computer Science - Programming Languages Statistics - Computation

AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

Yang Chen, Zhuolin Yang, Zihan Liu, et al. · · (2025) · DOI: 10.48550/arXiv.2505.16400

Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of front...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, et al. · · (2025) · DOI: 10.48550/arXiv.2504.13837

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and ...

Computer Science - Artificial Intelligence Computer Science - Computation and Language Computer Science - Computer Vision and Pattern Recognition

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, et al. · · (2025) · DOI: 10.48550/arXiv.2506.01939

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well ...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language

REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Zafir Stojanovski, Oliver Stanley, Joe Sharratt, et al. · · (2025) · DOI: 10.48550/arXiv.2505.24760

We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains in...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language

Learning to Model the World with Language

Jessy Lin, Yuqing Du, Olivia Watkins, et al. · · (2024) · DOI: 10.48550/arXiv.2308.01399

To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world. While current agents can learn to execute simple langua...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language

Hardware-Efficient Attention for Fast Decoding

Ted Zadouri, Hubert Strauss, Tri Dao · · (2025) · DOI: 10.48550/arXiv.2505.21487

LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decodi...

Computer Science - Machine Learning Computer Science - Computation and Language

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, et al. · · (2022) · DOI: 10.48550/arXiv.2209.05433

FP8 is a natural progression for accelerating deep learning training inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary inte...

Computer Science - Machine Learning

Reinforcement Learning Finetunes Small Subnetworks in Large Language Models

Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur, et al. · · (2025) · DOI: 10.48550/arXiv.2505.11711

Reinforcement learning (RL) yields substantial improvements in large language models (LLMs) downstream task performance and alignment with human values. Surprisingly, such large gains result from upda...

Computer Science - Machine Learning

Deep Reinforcement Learning, a textbook

Aske Plaat · · (2022) · DOI:

Deep reinforcement learning has gathered much attention recently. Impressive results were achieved in activities as diverse as autonomous driving, game playing, molecular recombination, and robotics. ...

Computer Science - Artificial Intelligence Computer Science - Machine Learning

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, et al. · · (2025) · DOI: 10.48550/arXiv.2502.09992

Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre...

Computer Science - Machine Learning Computer Science - Computation and Language

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Andrew Zhao, Yiran Wu, Yang Yue, et al. · · (2025) · DOI: 10.48550/arXiv.2505.03335

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR wo...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language

Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment

Siliang Zeng, Quan Wei, William Brown, et al. · · (2025) · DOI: 10.48550/arXiv.2505.11821

This paper investigates approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents using Reinforcement Learning (RL). Specifically, we focus on multi-turn tool-use scenarios...

Computer Science - Machine Learning

Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs)

Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, et al. · · (2025) · DOI: 10.48550/arXiv.2505.13416

Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as $\sf Muon$ and $\sf Scion$. After over a ...

Computer Science - Machine Learning Statistics - Machine Learning Mathematics - Optimization and Control

Visual Planning: Let's Think Only with Images

Yi Xu, Chengzu Li, Han Zhou, et al. · · (2025) · DOI: 10.48550/arXiv.2505.11409

Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language Computer Science - Computer Vision and Pattern Recognition

The Platonic Representation Hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, et al. · · (2024) · DOI: 10.48550/arXiv.2405.07987

We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Neural and Evolutionary Computing Computer Science - Computer Vision and Pattern Recognition

Round and Round We Go! What makes Rotary Positional Encodings useful?

Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, et al. · · (2025) · DOI: 10.48550/arXiv.2410.06205

Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most p...

Computer Science - Machine Learning Computer Science - Computation and Language

HiRoPE: Length Extrapolation for Code Models Using Hierarchical Position

Kechi Zhang, Ge Li, Huangzhao Zhang, et al. · · (2024) · DOI: 10.48550/arXiv.2403.19115

Addressing the limitation of context length in large language models for code-related tasks is the primary focus of this paper. Existing LLMs are constrained by their pre-trained context lengths, lead...

Computer Science - Software Engineering

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al. · · (2023) · DOI: 10.48550/arXiv.2309.06180

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for eac...

Computer Science - Machine Learning Computer Science - Distributed, Parallel, and Cluster Computing

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · · (2024) · DOI:

Produced while being an affiliate at PIBBSS[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS.…

Consequences of the Moosbauer-Poole Algorithms

Manuel Kauers, Isaac Wood · · (2025) · DOI: 10.48550/arXiv.2505.05896

Moosbauer and Poole have recently shown that the multiplication of two $5\times 5$ matrices requires no more than 93 multiplications in the (possibly non-commutative) coefficient ring, and that the mu...

Computer Science - Symbolic Computation

Emergence of Language in the Developing Brain

Linnea Evanson, Christine Bulteau, Mathilde Chipaux, et al. · · DOI:

Illuminating search spaces by mapping elites

Jean-Baptiste Mouret, Jeff Clune · · (2015) · DOI: 10.48550/arXiv.1504.04909

Many fields use search algorithms, which automatically explore a search space to find high-performing solutions: chemists search through the space of molecules to discover new drugs; engineers search ...

Computer Science - Artificial Intelligence Computer Science - Neural and Evolutionary Computing Computer Science - Robotics Quantitative Biology - Populations and Evolution

Iteratively reweighted kernel machines efficiently learn sparse functions

Libin Zhu, Damek Davis, Dmitriy Drusvyatskiy, et al. · · (2025) · DOI: 10.48550/arXiv.2505.08277

The impressive practical performance of neural networks is often attributed to their ability to learn low-dimensional data representations and hierarchical structure directly from data. In this work, ...

Computer Science - Machine Learning Statistics - Machine Learning Mathematics - Optimization and Control Mathematics - Statistics Theory Statistics - Statistics Theory

Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features

Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, et al. · · (2023) · DOI: 10.48550/arXiv.2212.13881

In recent years neural networks have achieved impressive results on many technological and scientific tasks. Yet, the mechanism through which these models automatically select features, or patterns in...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Statistics - Machine Learning

Byte Latent Transformer: Patches Scale Better Than Tokens

Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, et al. · · (2024) · DOI: 10.48550/arXiv.2412.09871

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inferen...

Computer Science - Computation and Language

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, et al. · · (2023) · DOI: 10.48550/arXiv.2305.16291

We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human inter...

Computer Science - Artificial Intelligence Computer Science - Machine Learning

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, Pieter Abbeel · · (2020) · DOI: 10.48550/arXiv.2006.11239

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results ...

Computer Science - Machine Learning Statistics - Machine Learning

A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27

Yann LeCun · · DOI:

How could machines learn as eﬃciently as humans and animals? How could machines learn to reason and plan? How could machines learn representations of percepts and action plans at multiple levels of ab...

Learning high-level visual representations from a child's perspective without strong inductive biases

A. Emin Orhan, Brenden M. Lake · · (2023) · DOI: 10.48550/arXiv.2305.15372

Young children develop sophisticated internal models of the world based on their visual experience. Can such models be learned from a child's visual experience without strong inductive biases? To inve...

Computer Science - Machine Learning Computer Science - Neural and Evolutionary Computing Quantitative Biology - Neurons and Cognition Computer Science - Computer Vision and Pattern Recognition

Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

Tiberiu Musat · · (2025) · DOI: 10.48550/arXiv.2411.12118

In this paper, I introduce the retrieval problem, a simple yet common reasoning task that can be solved only by transformers with a minimum number of layers, which grows logarithmically with the input...

Computer Science - Machine Learning Computer Science - Computation and Language

Scaling Laws for Precision

Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, et al. · · (2024) · DOI: 10.48550/arXiv.2411.04330

Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for b...

Computer Science - Machine Learning Computer Science - Computation and Language

FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation

Liqun Ma, Mingjie Sun, Zhiqiang Shen · · (2024) · DOI: 10.48550/arXiv.2407.07093

This work presents a Fully BInarized Large Language Model (FBI-LLM), demonstrating for the first time how to train a large-scale binary language model from scratch (not the partial binary or ternary L...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language

Learning to Reason for Long-Form Story Generation

Alexander Gurung, Mirella Lapata · · (2025) · DOI: 10.48550/arXiv.2503.22828

Generating high-quality stories spanning thousands of tokens requires competency across a variety of skills, from tracking plot and character arcs to keeping a consistent and engaging style. Due to th...

Computer Science - Computation and Language

Diffusion Models are Evolutionary Algorithms

Yanbo Zhang, Benedikt Hartl, Hananel Hazan, et al. · · (2024) · DOI: 10.48550/arXiv.2410.02543

In a convergence of machine learning and biology, we reveal that diffusion models are evolutionary algorithms. By considering evolution as a denoising process and reversed evolution as diffusion, we m...

Computer Science - Machine Learning Computer Science - Neural and Evolutionary Computing

Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages

Jannik Brinkmann, Chris Wendler, Christian Bartelt, et al. · · (2025) · DOI: 10.48550/arXiv.2501.06346

Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are mul...

Computer Science - Computation and Language

Similarity of Neural Network Representations Revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, et al. · · (2019) · DOI: 10.48550/arXiv.1905.00414

Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network r...

Computer Science - Machine Learning Statistics - Machine Learning Quantitative Biology - Neurons and Cognition

Layers at Similar Depths Generate Similar Activations Across LLM Architectures

Christopher Wolfram, Aaron Schein · · (2025) · DOI: 10.48550/arXiv.2504.08775

How do the latent spaces used by independently-trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs, and fi...

Computer Science - Artificial Intelligence Computer Science - Computation and Language

What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, et al. · · (2025) · DOI: 10.48550/arXiv.2503.24235

As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS)—also referred to as “test-time computing”—has emerged as a prominent re...

Computer Science - Artificial Intelligence Computer Science - Computation and Language

TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems

Khang H. N. Vo, Duc P. T. Nguyen, Thong Nguyen, et al. · · (2025) · DOI: 10.48550/arXiv.2503.06380

This paper focuses on multimodal alignment within the realm of Artificial Intelligence, particularly in text and image modalities. The semantic gap between the textual and visual modality poses a disc...

Computer Science - Computation and Language Computer Science - Computer Vision and Pattern Recognition

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, et al. · · (2023) · DOI: 10.48550/arXiv.2301.08243

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Archi...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computer Vision and Pattern Recognition Electrical Engineering and Systems Science - Image and Video Processing

History, Development, and Principles of Large Language Models-An Introductory Survey

Zichong Wang, Zhibo Chu, Thang Viet Doan, et al. · · (2024) · DOI: 10.48550/arXiv.2402.06853

Language models serve as a cornerstone in natural language processing (NLP), utilizing mathematical methods to generalize language laws and knowledge for prediction and generation. Over extensive rese...

Computer Science - Computation and Language

Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

Zicheng Lin, Tian Liang, Jiahao Xu, et al. · · (2025) · DOI: 10.48550/arXiv.2411.19943

Mathematical reasoning tasks pose significant challenges for large language models (LLMs) because they require precise logical deduction and sequence analysis. In this work, we introduce the concept o...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language

A mathematical theory of semantic development in deep neural networks

Andrew M. Saxe, James L. McClelland, Surya Ganguli · Proceedings of the National Academy of Sciences · (2019) · DOI: 10.1073/pnas.1820226116

An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fund...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Statistics - Machine Learning Quantitative Biology - Neurons and Cognition

Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum, et al. · · (2023) · DOI: 10.48550/arXiv.2301.05217

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding em...

Computer Science - Artificial Intelligence Computer Science - Machine Learning

Towards Automated Circuit Discovery for Mechanistic Interpretability

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, et al. · · (2023) · DOI: 10.48550/arXiv.2304.14997

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process the...

Computer Science - Machine Learning

Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes

Juergen Schmidhuber · · (2009) · DOI: 10.48550/arXiv.0812.4360

I argue that data becomes temporarily interesting by itself to some self-improving, but computationally limited, subjective observer once he learns to predict or compress the data in a better way, thu...

Computer Science - Artificial Intelligence Computer Science - Neural and Evolutionary Computing

Circuit Tracing: Revealing Computational Graphs in Language Models

Authors Affiliations Published Not published yet DOI No DOI yet · · DOI:

We describe an approach to tracing the “step-by-step” computation involved when a model responds to a single prompt.

On the Biology of a Large Language Model

Authors Affiliations Published Not published yet DOI No DOI yet · · DOI:

We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.

Generalization through variance: how noise shapes inductive biases in diffusion models

John Vastola · · (2024) · DOI:

How diffusion models generalize beyond their training set is not known, and is somewhat mysterious given two facts: the optimum of the denoising score matching (DSM) objective usually used to train di...

GENERALIZATION THROUGH VARIANCE: HOW NOISE SHAPES INDUCTIVE BIASES IN DIFFUSION MODELS

John J Vastola · · (2025) · DOI:

How diffusion models generalize beyond their training set is not known, and is somewhat mysterious given two facts: the optimum of the denoising score matching (DSM) objective usually used to train di...

Training Large Language Models to Reason in a Continuous Latent Space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, et al. · · (2024) · DOI: 10.48550/arXiv.2412.06769

Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. Ho...

Computer Science - Computation and Language

Contextualization Machines

· · DOI:

Training Large Language Models to Reason in a Continuous Latent Space

· · DOI:

Do Llamas Work in English? On the Latent Language of Multilingual Transformers

Chris Wendler, Veniamin Veselovsky, Giovanni Monea, et al. · · (2024) · DOI: 10.48550/arXiv.2402.10588

We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language -- a question of key importance for understanding how language mo...

Computer Science - Computation and Language Computer Science - Computers and Society

On the Emergence of Thinking in LLMs I: Searching for the Right Intuition

Guanghao Ye, Khiem Duc Pham, Xinzhi Zhang, et al. · · (2025) · DOI: 10.48550/arXiv.2502.06773

Recent advancements in AI, such as OpenAI’s new o models, Google’s Gemini Thinking model, and Deepseek R1, are transforming LLMs into LRMs (Large Reasoning Models). Unlike LLMs, LRMs perform thinking ...

Computer Science - Artificial Intelligence Computer Science - Machine Learning Computer Science - Computation and Language

On the Emergence of Thinking in LLMs I: Searching for the Right Intuition

Guanghao Ye, Khiem Duc Pham, Xinzhi Zhang, et al. · · (2025) · DOI:

Recent AI advancements, such as OpenAI's new models, are transforming LLMs into LRMs (Large Reasoning Models) that perform reasoning during inference, taking extra time and compute for higher-quality ...

Search Results

Research Papers

Papers by Tags

Timeline

July 2025

June 2025