Bookmarks
Activation Atlas
By using feature inversion to visualize millions of activations from an image classification network, the authors create an explorable activation atlas of the features the network has learned, revealing how it typically represents certain concepts.
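A minimal sketch of the grid-averaging step behind an activation atlas, assuming a torchvision ResNet and umap-learn as the 2D reducer; the random images are stand-ins for a real dataset, and the paper's feature-inversion renderer (which turns each averaged cell into an image) is omitted.

```python
# Sketch of the atlas construction step: collect activations, project them to
# 2D, and average the activations inside each grid cell. The feature-inversion
# rendering of each averaged cell is omitted.
import numpy as np
import torch
import torchvision.models as models
import umap  # umap-learn; any 2D reducer would do

model = models.resnet50(weights="IMAGENET1K_V1").eval()
collected = []

def hook(module, inputs, output):
    # Global-average-pool the spatial dims: one activation vector per image.
    collected.append(output.mean(dim=(2, 3)).detach())

model.layer4.register_forward_hook(hook)

images = torch.randn(64, 3, 224, 224)      # stand-in for a real image set
with torch.no_grad():
    model(images)
A = torch.cat(collected).numpy()           # (N, channels) activation vectors

coords = umap.UMAP(n_components=2).fit_transform(A)
grid = 20
span = coords.max(0) - coords.min(0) + 1e-9
cells = np.clip(((coords - coords.min(0)) / span * grid).astype(int), 0, grid - 1)

# Average the vectors that land in each cell; each average would be turned
# into one atlas tile via feature inversion.
tiles = {}
for cell, a in zip(map(tuple, cells), A):
    tiles.setdefault(cell, []).append(a)
tiles = {cell: np.mean(v, axis=0) for cell, v in tiles.items()}
```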
Transformers Represent Belief State Geometry in their Residual Stream
Produced while an affiliate at PIBBSS. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS.…
On the Biology of a Large Language Model
Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown.
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language -- a question of key importance for understanding how language models function and the origins of linguistic bias. Focusing on the Llama-2 family of transformer models, our study uses carefully constructed non-English prompts with a unique correct single-token continuation. From layer to layer, transformers gradually map an input embedding of the final prompt token to an output embedding from which next-token probabilities are computed. Tracking intermediate embeddings through their high-dimensional space reveals three distinct phases, whereby intermediate embeddings (1) start far away from output token embeddings; (2) already allow for decoding a semantically correct next token in the middle layers, but give higher probability to its version in English than in the input language; (3) finally move into an input-language-specific region of the embedding space. We cast these results into a conceptual model where the three phases operate in "input space", "concept space", and "output space", respectively. Crucially, our evidence suggests that the abstract "concept space" lies closer to English than to other languages, which may have important consequences regarding the biases held by multilingual language models.
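A minimal logit-lens-style sketch of the layer-by-layer decoding the paper relies on, using GPT-2 via Hugging Face transformers as a stand-in for the gated Llama-2 family; the prompt is illustrative, not one of the paper's carefully constructed prompts.

```python
# Logit-lens sketch: decode each layer's residual-stream state of the final
# prompt token with the output embedding, and watch which token it favours.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in; the paper studies the Llama-2 family
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True).eval()

prompt = 'Français: "fleur" - English: "'
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

W_U = model.get_output_embeddings().weight   # unembedding matrix (vocab, hidden)
ln_f = model.transformer.ln_f                # final layer norm (GPT-2 naming)

for layer, h in enumerate(out.hidden_states):
    h_last = h[0, -1]                        # final prompt token at this layer
    logits = ln_f(h_last) @ W_U.T            # decode this layer directly
    top = logits.argmax().item()
    print(f"layer {layer:2d}: {tok.decode(top)!r}")
```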
Circuit Tracing: Revealing Computational Graphs in Language Models
Deep learning models produce their outputs using a series of transformations distributed across many computational units (artificial “neurons”).
Neural Networks, Manifolds, and Topology
There remain a number of concerns about deep neural networks. One is that it can be quite challenging to understand what a neural network is really doing.
(How) Do Language Models Track State?
Transformer language models (LMs) exhibit behaviors -- from storytelling to code generation -- that appear to require tracking the unobserved state of an evolving world. How do they do so? We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation composition, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, then refines this with an associative scan. The two mechanisms exhibit markedly different robustness properties, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pretrained or fine-tuned, can learn to implement efficient and interpretable state tracking mechanisms, and the emergence of these mechanisms can be predicted and controlled.
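A small sketch of the task itself (not of the paper's trained models): composing a sequence of swaps into a permutation, along with the parity feature that the paper's second mechanism uses to partially prune the output space.

```python
# Sketch of the permutation-composition task: given a sequence of swaps,
# track the resulting permutation (the "state"). Parity is the cheap partial
# feature the paper's second mechanism exploits.
import itertools
import random

N = 5                                     # number of objects (S_5)

def compose(p, q):
    # Apply p first, then q.
    return tuple(q[p[i]] for i in range(N))

def parity(p):
    # Even/odd permutation via counting inversions.
    inv = sum(1 for i, j in itertools.combinations(range(N), 2) if p[i] > p[j])
    return inv % 2

identity = tuple(range(N))
swaps = [random.sample(range(N), 2) for _ in range(8)]

state = identity
for a, b in swaps:
    t = list(identity)
    t[a], t[b] = t[b], t[a]               # transposition (a b)
    state = compose(state, tuple(t))
    print(state, "parity:", parity(state))
```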
Chess-GPT's Internal World Model
The blog post discusses how a GPT model trained on chess games learns to predict moves and track the board state without being explicitly given the rules. Linear probes on its activations classify the piece on each square with high accuracy and recover an estimate of player skill from the game's moves. The findings suggest that models trained on strategic games can learn complex tasks through pattern recognition alone.
Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models
Researchers trained a chess-playing language model to understand the game without prior knowledge, focusing on how it represents the board state. They found that the model not only learned the board's layout but also estimated player skill, which helped it predict the next move better. Adding a player-skill vector to the model's activations significantly improved its win rate.
Manipulating Chess-GPT's World Model
The author explores how Chess-GPT, a language model for chess, can be made to play better by manipulating its internal representations of player skill and board state. Intervening along directions found with linear probes significantly improved its play, especially in games starting from randomly initialized boards. The findings suggest that Chess-GPT learns a deeper understanding of chess rather than just memorizing patterns.
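The probe-and-intervene recipe described across these Chess-GPT posts looks roughly like the sketch below; the activations, dimensions, and intervention scale are stand-ins, not the posts' actual values.

```python
# Sketch of the two ingredients the Chess-GPT posts describe: (1) a linear
# probe that reads board state from hidden activations, and (2) an additive
# intervention along a learned "skill" direction. Activations here are
# synthetic stand-ins for the model's real residual stream.
import torch
import torch.nn as nn

d_model, n_squares, n_classes = 512, 64, 13   # 12 piece types + empty
hidden = torch.randn(1024, d_model)           # stand-in activations
labels = torch.randint(0, n_classes, (1024, n_squares))

probe = nn.Linear(d_model, n_squares * n_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    logits = probe(hidden).view(-1, n_squares, n_classes)
    loss = loss_fn(logits.flatten(0, 1), labels.flatten())
    opt.zero_grad(); loss.backward(); opt.step()

# Skill intervention: push activations along the difference between the mean
# activations of high- and low-skill games (both synthetic here).
high_skill = torch.randn(256, d_model) + 0.5
low_skill = torch.randn(256, d_model) - 0.5
skill_dir = high_skill.mean(0) - low_skill.mean(0)
skill_dir = skill_dir / skill_dir.norm()

steered = hidden + 5.0 * skill_dir            # intervention scale is a free hyperparameter
```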
Heatmaps and CNNs Using Fast.ai
The article discusses heatmaps, CNNs, and their relationship in deep learning. It explains how Grad-CAM heatmaps are generated from the final convolutional layer of a CNN, how heatmaps can also be produced with Adaptive Pooling layers, and how interpreting top losses helps with model evaluation.
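A minimal Grad-CAM sketch in plain PyTorch rather than fast.ai's API, assuming a torchvision ResNet; the idea is the same as in the article: weight the final conv block's activation maps by the gradients of the target class score.

```python
# Minimal Grad-CAM sketch: weight the final conv block's activation maps by
# the gradient of the target class score, then sum and ReLU for a coarse heatmap.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["maps"] = output

def bwd_hook(module, grad_input, grad_output):
    gradients["maps"] = grad_output[0]

layer = model.layer4                       # last conv block
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)            # stand-in image
logits = model(x)
logits[0, logits.argmax()].backward()      # gradient of the predicted class

A = activations["maps"][0]                 # (C, H, W) activation maps
w = gradients["maps"][0].mean(dim=(1, 2))  # channel weights (GAP of gradients)
cam = torch.relu((w[:, None, None] * A).sum(0))
cam = cam / (cam.max() + 1e-9)             # normalised 7x7 heatmap to upsample onto the image
```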
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Sparse autoencoders help identify clear and understandable features in language models by tackling the issue of polysemanticity. By using sparse autoencoders, researchers can pinpoint specific features responsible for certain behaviors in neural networks more effectively than other methods. This approach may lead to increased transparency and control over language models in the future.
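A minimal sparse-autoencoder sketch under the usual setup (overcomplete dictionary, ReLU encoder, L1 penalty); the activations are random stand-ins for a real model's residual stream and the hyperparameters are illustrative.

```python
# Minimal sparse autoencoder: an overcomplete dictionary trained to reconstruct
# activations with an L1 sparsity penalty, so individual latents tend to align
# with single interpretable features.
import torch
import torch.nn as nn

d_model, d_hidden, l1_coeff = 512, 4096, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))        # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(4096, d_model)          # stand-in LM activations
for _ in range(200):
    x_hat, f = sae(acts)
    loss = (x_hat - acts).pow(2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```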
KAN: Kolmogorov–Arnold Networks
Kolmogorov-Arnold Networks (KANs) replace the fixed node activations of Multilayer Perceptrons (MLPs) with learnable activation functions on edges, outperforming MLPs in accuracy and interpretability. They exhibit faster neural scaling laws than MLPs by combining the strengths of splines and MLPs, and can represent special functions and high-dimensional examples more efficiently.
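A simplified sketch of the core idea, a learnable 1D function on every edge. The paper parameterizes these with B-splines plus a base activation; the stand-in below uses learnable combinations of Gaussian bumps, and the toy target is one of the paper's example functions.

```python
# Sketch of the KAN idea: instead of a fixed nonlinearity on nodes, each edge
# carries its own learnable 1D function (here: a learnable mix of Gaussian
# bumps; the paper uses B-splines plus a base activation).
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    def __init__(self, d_in, d_out, n_basis=16, x_range=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*x_range, n_basis))
        self.width = (x_range[1] - x_range[0]) / n_basis
        # One set of basis coefficients per (input, output) edge.
        self.coef = nn.Parameter(torch.randn(d_in, d_out, n_basis) * 0.1)

    def forward(self, x):                                  # x: (batch, d_in)
        basis = torch.exp(-((x[..., None] - self.centers) / self.width) ** 2)
        # phi_{i->j}(x_i) for every edge, then sum over inputs (Kolmogorov-Arnold form).
        edge_vals = torch.einsum("bin,ion->bio", basis, self.coef)
        return edge_vals.sum(dim=1)                        # (batch, d_out)

net = nn.Sequential(KANLayer(2, 5), KANLayer(5, 1))
x = torch.rand(64, 2)
y = torch.exp(torch.sin(torch.pi * x[:, :1]) + x[:, 1:] ** 2)  # toy target

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(500):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```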
How to Use t-SNE Effectively
t-SNE creates 2D "maps" of high-dimensional data, but the resulting plots are easy to misread. The perplexity parameter, which balances attention between local and global aspects of the data, has a large effect on the result, and different perplexity values may be needed to capture different aspects of the data. t-SNE tends to equalize cluster sizes and distorts distances between clusters, so relative sizes and distances should not be read literally, and random noise can produce apparent structure. Some shapes are shown accurately, but local effects and clumping complicate interpretation, and recovering topological information may require comparing plots at several perplexities. Using t-SNE effectively requires understanding its behavior and limitations.
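The article's main practical advice, running t-SNE at several perplexities and only trusting structure that persists, might look like this with scikit-learn on a toy two-cluster dataset.

```python
# Run t-SNE at several perplexities and compare, rather than trusting one plot.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 50)),
               rng.normal(5, 1, (100, 50))])     # two well-separated clusters

for perplexity in (2, 5, 30, 50, 100):
    emb = TSNE(n_components=2, perplexity=perplexity,
               init="pca", random_state=0).fit_transform(X)
    # Cluster sizes and inter-cluster distances in the embedding are not
    # reliable; only compare structure that persists across perplexities.
    gap = np.linalg.norm(emb[:100].mean(0) - emb[100:].mean(0))
    print(f"perplexity={perplexity:3d}  centroid gap={gap:.1f}")
```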
Memory in Plain Sight: A Survey of the Uncanny Resemblances between Diffusion Models and Associative Memories
Diffusion Models and Associative Memories show surprising similarities in their mathematical underpinnings and goals, bridging traditional and modern AI research. This connection highlights the convergence of AI models towards memory-focused paradigms, emphasizing the importance of understanding Associative Memories in the field of computation. By exploring these parallels, researchers aim to enhance our comprehension of how models like Diffusion Models and Transformers operate in Deep Learning applications.
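A toy illustration of the parallel the survey draws: a score/denoising step on a mixture centered on stored patterns is gradient descent on an associative-memory energy, pulling a noisy query toward the nearest memory. The patterns, noise scale, and step size below are arbitrary.

```python
# Score step on a Gaussian mixture over stored patterns == energy descent in
# an associative memory: the query is pulled toward the nearest stored pattern.
import torch

memories = torch.tensor([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])   # stored patterns
sigma = 0.5

def score(x):
    # Gradient of log p(x) for a Gaussian mixture centred on the memories,
    # i.e. the negative gradient of an associative-memory energy.
    d2 = ((x[None, :] - memories) ** 2).sum(-1)
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=0)
    return (w[:, None] * (memories - x[None, :])).sum(0) / sigma ** 2

x = torch.tensor([1.2, 0.9])               # noisy query
for _ in range(200):
    x = x + 0.01 * score(x)                # denoising / energy-descent step
print(x)                                   # ends near the closest memory
```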
Measuring Faithfulness in Chain-of-Thought Reasoning
Large language models (LLMs) are more effective when they engage in step-by-step "Chain-of-Thought" (CoT) reasoning, but it is unclear if this reasoning is a faithful explanation of the model's actual process. The study examines how interventions on the CoT affect model predictions, finding that models vary in how strongly they rely on the CoT. The performance boost from CoT does not solely come from added test-time compute or specific phrasing. As models become larger and more capable, they tend to produce less faithful reasoning. The results suggest that faithful CoT reasoning depends on carefully chosen circumstances such as model size and task.
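One of the paper's interventions ("early answering") truncates the chain of thought at increasing fractions and checks whether the final answer changes; a sketch of that loop is below, with a hypothetical `ask_model` helper standing in for whatever sampling interface is available.

```python
# Sketch of an early-answering faithfulness check: truncate the chain of
# thought at increasing fractions and see whether the final answer changes.
def ask_model(prompt: str) -> str:
    # Hypothetical helper: call your LLM of choice here.
    raise NotImplementedError

def early_answering(question: str, full_cot: str,
                    fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    sentences = full_cot.split(". ")
    answers = {}
    for frac in fractions:
        k = int(len(sentences) * frac)
        truncated = ". ".join(sentences[:k])
        prompt = f"{question}\n{truncated}\nTherefore, the answer is"
        answers[frac] = ask_model(prompt)
    # If the answer is already fixed at small fractions, the later reasoning
    # steps were not load-bearing -- evidence the CoT is not faithful.
    return answers
```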
Subcategories
- applications (9)
- compression (9)
- computer_vision (8)
- deep_learning (94)
- ethics (2)
- generative_models (25)
- interpretability (17)
- natural_language_processing (24)
- optimization (7)
- recommendation (2)
- reinforcement_learning (11)
- supervised_learning (1)