Bookmarks
Activation Atlas
By using feature inversion to visualize millions of activations from an image classification network, the authors create an explorable activation atlas of the features the network has learned, revealing how it typically represents certain concepts.
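A minimal sketch of the grid-averaging step behind an activation atlas, assuming a torchvision ResNet and umap-learn as the 2D reducer; the random images are stand-ins for a real dataset, and the paper's feature-inversion renderer (which turns each averaged cell into an image) is omitted.

```python
# Sketch of the atlas construction step: collect activations, project them to
# 2D, and average the activations inside each grid cell. The feature-inversion
# rendering of each averaged cell is omitted.
import numpy as np
import torch
import torchvision.models as models
import umap  # umap-learn; any 2D reducer would do

model = models.resnet50(weights="IMAGENET1K_V1").eval()
collected = []

def hook(module, inputs, output):
    # Global-average-pool the spatial dims: one activation vector per image.
    collected.append(output.mean(dim=(2, 3)).detach())

model.layer4.register_forward_hook(hook)

images = torch.randn(64, 3, 224, 224)      # stand-in for a real image set
with torch.no_grad():
    model(images)
A = torch.cat(collected).numpy()           # (N, channels) activation vectors

coords = umap.UMAP(n_components=2).fit_transform(A)
grid = 20
span = coords.max(0) - coords.min(0) + 1e-9
cells = np.clip(((coords - coords.min(0)) / span * grid).astype(int), 0, grid - 1)

# Average the vectors that land in each cell; each average would be turned
# into one atlas tile via feature inversion.
tiles = {}
for cell, a in zip(map(tuple, cells), A):
    tiles.setdefault(cell, []).append(a)
tiles = {cell: np.mean(v, axis=0) for cell, v in tiles.items()}
```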
Transformers Represent Belief State Geometry in their Residual Stream
Produced while an affiliate at PIBBSS. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS.…
On the Biology of a Large Language Model
Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown.
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language -- a question of key importance for understanding how language models function and the origins of linguistic bias. Focusing on the Llama-2 family of transformer models, our study uses carefully constructed non-English prompts with a unique correct single-token continuation. From layer to layer, transformers gradually map an input embedding of the final prompt token to an output embedding from which next-token probabilities are computed. Tracking intermediate embeddings through their high-dimensional space reveals three distinct phases, whereby intermediate embeddings (1) start far away from output token embeddings; (2) already allow for decoding a semantically correct next token in the middle layers, but give higher probability to its version in English than in the input language; (3) finally move into an input-language-specific region of the embedding space. We cast these results into a conceptual model where the three phases operate in "input space", "concept space", and "output space", respectively. Crucially, our evidence suggests that the abstract "concept space" lies closer to English than to other languages, which may have important consequences regarding the biases held by multilingual language models.
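A minimal logit-lens-style sketch of the layer-by-layer decoding the paper relies on, using GPT-2 via Hugging Face transformers as a stand-in for the gated Llama-2 family; the prompt is illustrative, not one of the paper's carefully constructed prompts.

```python
# Logit-lens sketch: decode each layer's residual-stream state of the final
# prompt token with the output embedding, and watch which token it favours.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in; the paper studies the Llama-2 family
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True).eval()

prompt = 'Français: "fleur" - English: "'
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

W_U = model.get_output_embeddings().weight   # unembedding matrix (vocab, hidden)
ln_f = model.transformer.ln_f                # final layer norm (GPT-2 naming)

for layer, h in enumerate(out.hidden_states):
    h_last = h[0, -1]                        # final prompt token at this layer
    logits = ln_f(h_last) @ W_U.T            # decode this layer directly
    top = logits.argmax().item()
    print(f"layer {layer:2d}: {tok.decode(top)!r}")
```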
Circuit Tracing: Revealing Computational Graphs in Language Models
Deep learning models produce their outputs using a series of transformations distributed across many computational units (artificial “neurons”).
Neural Networks, Manifolds, and Topology
There remain a number of concerns about deep neural networks. One is that it can be quite challenging to understand what a neural network is really doing.
(How) Do Language Models Track State?
Transformer language models (LMs) exhibit behaviors -- from storytelling to code generation -- that appear to require tracking the unobserved state of an evolving world. How do they do so? We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation composition, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, then refines this with an associative scan. The two mechanisms exhibit markedly different robustness properties, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pretrained or fine-tuned, can learn to implement efficient and interpretable state tracking mechanisms, and the emergence of these mechanisms can be predicted and controlled.
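A small sketch of the task itself (not of the paper's trained models): composing a sequence of swaps into a permutation, along with the parity feature that the paper's second mechanism uses to partially prune the output space.

```python
# Sketch of the permutation-composition task: given a sequence of swaps,
# track the resulting permutation (the "state"). Parity is the cheap partial
# feature the paper's second mechanism exploits.
import itertools
import random

N = 5                                     # number of objects (S_5)

def compose(p, q):
    # Apply p first, then q.
    return tuple(q[p[i]] for i in range(N))

def parity(p):
    # Even/odd permutation via counting inversions.
    inv = sum(1 for i, j in itertools.combinations(range(N), 2) if p[i] > p[j])
    return inv % 2

identity = tuple(range(N))
swaps = [random.sample(range(N), 2) for _ in range(8)]

state = identity
for a, b in swaps:
    t = list(identity)
    t[a], t[b] = t[b], t[a]               # transposition (a b)
    state = compose(state, tuple(t))
    print(state, "parity:", parity(state))
```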
Chess-GPT's Internal World Model
The blog post discusses how a GPT model trained on chess games learns to predict moves and track the board state without being explicitly given the rules. Linear probes on its activations classify the piece on each square with high accuracy and recover an estimate of player skill from the game's moves. The findings suggest that models trained on strategic games can learn complex tasks through pattern recognition alone.
Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models
Researchers trained a chess-playing language model to understand the game without prior knowledge, focusing on how it represents the board state. They found that the model not only learned the board's layout but also estimated player skill, which helped it predict the next move better. Adding a player-skill vector to the model's activations significantly improved its win rate.
Manipulating Chess-GPT's World Model
The author explores how Chess-GPT, a language model for chess, can be made to play better by manipulating its internal representations of player skill and board state. Intervening along directions found with linear probes significantly improved its play, especially in games starting from randomly initialized boards. The findings suggest that Chess-GPT learns a deeper understanding of chess rather than just memorizing patterns.
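The probe-and-intervene recipe described across these Chess-GPT posts looks roughly like the sketch below; the activations, dimensions, and intervention scale are stand-ins, not the posts' actual values.

```python
# Sketch of the two ingredients the Chess-GPT posts describe: (1) a linear
# probe that reads board state from hidden activations, and (2) an additive
# intervention along a learned "skill" direction. Activations here are
# synthetic stand-ins for the model's real residual stream.
import torch
import torch.nn as nn

d_model, n_squares, n_classes = 512, 64, 13   # 12 piece types + empty
hidden = torch.randn(1024, d_model)           # stand-in activations
labels = torch.randint(0, n_classes, (1024, n_squares))

probe = nn.Linear(d_model, n_squares * n_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    logits = probe(hidden).view(-1, n_squares, n_classes)
    loss = loss_fn(logits.flatten(0, 1), labels.flatten())
    opt.zero_grad(); loss.backward(); opt.step()

# Skill intervention: push activations along the difference between the mean
# activations of high- and low-skill games (both synthetic here).
high_skill = torch.randn(256, d_model) + 0.5
low_skill = torch.randn(256, d_model) - 0.5
skill_dir = high_skill.mean(0) - low_skill.mean(0)
skill_dir = skill_dir / skill_dir.norm()

steered = hidden + 5.0 * skill_dir            # intervention scale is a free hyperparameter
```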
Heatmaps and CNNs Using Fast.ai
The article discusses heatmaps, CNNs, and their relationship in deep learning. It explains how Grad-CAM heatmaps are generated from the final convolutional layer of a CNN, how heatmaps can also be produced with Adaptive Pooling layers, and how interpreting top losses helps with model evaluation.
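A minimal Grad-CAM sketch in plain PyTorch rather than fast.ai's API, assuming a torchvision ResNet; the idea is the same as in the article: weight the final conv block's activation maps by the gradients of the target class score.

```python
# Minimal Grad-CAM sketch: weight the final conv block's activation maps by
# the gradient of the target class score, then sum and ReLU for a coarse heatmap.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["maps"] = output

def bwd_hook(module, grad_input, grad_output):
    gradients["maps"] = grad_output[0]

layer = model.layer4                       # last conv block
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)            # stand-in image
logits = model(x)
logits[0, logits.argmax()].backward()      # gradient of the predicted class

A = activations["maps"][0]                 # (C, H, W) activation maps
w = gradients["maps"][0].mean(dim=(1, 2))  # channel weights (GAP of gradients)
cam = torch.relu((w[:, None, None] * A).sum(0))
cam = cam / (cam.max() + 1e-9)             # normalised 7x7 heatmap to upsample onto the image
```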
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Sparse autoencoders help identify clear and understandable features in language models by tackling the issue of polysemanticity. By using sparse autoencoders, researchers can pinpoint specific features responsible for certain behaviors in neural networks more effectively than other methods. This approach may lead to increased transparency and control over language models in the future.
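A minimal sparse-autoencoder sketch under the usual setup (overcomplete dictionary, ReLU encoder, L1 penalty); the activations are random stand-ins for a real model's residual stream and the hyperparameters are illustrative.

```python
# Minimal sparse autoencoder: an overcomplete dictionary trained to reconstruct
# activations with an L1 sparsity penalty, so individual latents tend to align
# with single interpretable features.
import torch
import torch.nn as nn

d_model, d_hidden, l1_coeff = 512, 4096, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))        # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(4096, d_model)          # stand-in LM activations
for _ in range(200):
    x_hat, f = sae(acts)
    loss = (x_hat - acts).pow(2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```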
KAN: Kolmogorov–Arnold Networks
Kolmogorov-Arnold Networks (KANs) replace the fixed node activations of Multilayer Perceptrons (MLPs) with learnable activation functions on edges, outperforming MLPs in accuracy and interpretability. They exhibit faster neural scaling laws than MLPs by combining the strengths of splines and MLPs, and can represent special functions and high-dimensional examples more efficiently.
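A simplified sketch of the core idea, a learnable 1D function on every edge. The paper parameterizes these with B-splines plus a base activation; the stand-in below uses learnable combinations of Gaussian bumps, and the toy target is one of the paper's example functions.

```python
# Sketch of the KAN idea: instead of a fixed nonlinearity on nodes, each edge
# carries its own learnable 1D function (here: a learnable mix of Gaussian
# bumps; the paper uses B-splines plus a base activation).
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    def __init__(self, d_in, d_out, n_basis=16, x_range=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*x_range, n_basis))
        self.width = (x_range[1] - x_range[0]) / n_basis
        # One set of basis coefficients per (input, output) edge.
        self.coef = nn.Parameter(torch.randn(d_in, d_out, n_basis) * 0.1)

    def forward(self, x):                                  # x: (batch, d_in)
        basis = torch.exp(-((x[..., None] - self.centers) / self.width) ** 2)
        # phi_{i->j}(x_i) for every edge, then sum over inputs (Kolmogorov-Arnold form).
        edge_vals = torch.einsum("bin,ion->bio", basis, self.coef)
        return edge_vals.sum(dim=1)                        # (batch, d_out)

net = nn.Sequential(KANLayer(2, 5), KANLayer(5, 1))
x = torch.rand(64, 2)
y = torch.exp(torch.sin(torch.pi * x[:, :1]) + x[:, 1:] ** 2)  # toy target

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(500):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```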
How to Use t-SNE Effectively
t-SNE creates 2D "maps" of high-dimensional data, but the resulting plots are easy to misread. The perplexity parameter, which balances attention between local and global aspects of the data, has a large effect on the result, and different perplexity values may be needed to capture different aspects of the data. t-SNE tends to equalize cluster sizes and distorts distances between clusters, so relative sizes and distances should not be read literally, and random noise can produce apparent structure. Some shapes are shown accurately, but local effects and clumping complicate interpretation, and recovering topological information may require comparing plots at several perplexities. Using t-SNE effectively requires understanding its behavior and limitations.
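The article's main practical advice, running t-SNE at several perplexities and only trusting structure that persists, might look like this with scikit-learn on a toy two-cluster dataset.

```python
# Run t-SNE at several perplexities and compare, rather than trusting one plot.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 50)),
               rng.normal(5, 1, (100, 50))])     # two well-separated clusters

for perplexity in (2, 5, 30, 50, 100):
    emb = TSNE(n_components=2, perplexity=perplexity,
               init="pca", random_state=0).fit_transform(X)
    # Cluster sizes and inter-cluster distances in the embedding are not
    # reliable; only compare structure that persists across perplexities.
    gap = np.linalg.norm(emb[:100].mean(0) - emb[100:].mean(0))
    print(f"perplexity={perplexity:3d}  centroid gap={gap:.1f}")
```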
Memory in Plain Sight: A Survey of the Uncanny Resemblances between Diffusion Models and Associative Memories
Diffusion Models and Associative Memories show surprising similarities in their mathematical underpinnings and goals, bridging traditional and modern AI research. This connection highlights the convergence of AI models towards memory-focused paradigms, emphasizing the importance of understanding Associative Memories in the field of computation. By exploring these parallels, researchers aim to enhance our comprehension of how models like Diffusion Models and Transformers operate in Deep Learning applications.
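A toy illustration of the parallel the survey draws: a score/denoising step on a mixture centered on stored patterns is gradient descent on an associative-memory energy, pulling a noisy query toward the nearest memory. The patterns, noise scale, and step size below are arbitrary.

```python
# Score step on a Gaussian mixture over stored patterns == energy descent in
# an associative memory: the query is pulled toward the nearest stored pattern.
import torch

memories = torch.tensor([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])   # stored patterns
sigma = 0.5

def score(x):
    # Gradient of log p(x) for a Gaussian mixture centred on the memories,
    # i.e. the negative gradient of an associative-memory energy.
    d2 = ((x[None, :] - memories) ** 2).sum(-1)
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=0)
    return (w[:, None] * (memories - x[None, :])).sum(0) / sigma ** 2

x = torch.tensor([1.2, 0.9])               # noisy query
for _ in range(200):
    x = x + 0.01 * score(x)                # denoising / energy-descent step
print(x)                                   # ends near the closest memory
```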
Measuring Faithfulness in Chain-of-Thought Reasoning
Large language models (LLMs) are more effective when they engage in step-by-step "Chain-of-Thought" (CoT) reasoning, but it is unclear if this reasoning is a faithful explanation of the model's actual process. The study examines how interventions on the CoT affect model predictions, finding that models vary in how strongly they rely on the CoT. The performance boost from CoT does not solely come from added test-time compute or specific phrasing. As models become larger and more capable, they tend to produce less faithful reasoning. The results suggest that faithful CoT reasoning depends on carefully chosen circumstances such as model size and task.
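One of the paper's interventions ("early answering") truncates the chain of thought at increasing fractions and checks whether the final answer changes; a sketch of that loop is below, with a hypothetical `ask_model` helper standing in for whatever sampling interface is available.

```python
# Sketch of an early-answering faithfulness check: truncate the chain of
# thought at increasing fractions and see whether the final answer changes.
def ask_model(prompt: str) -> str:
    # Hypothetical helper: call your LLM of choice here.
    raise NotImplementedError

def early_answering(question: str, full_cot: str,
                    fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    sentences = full_cot.split(". ")
    answers = {}
    for frac in fractions:
        k = int(len(sentences) * frac)
        truncated = ". ".join(sentences[:k])
        prompt = f"{question}\n{truncated}\nTherefore, the answer is"
        answers[frac] = ask_model(prompt)
    # If the answer is already fixed at small fractions, the later reasoning
    # steps were not load-bearing -- evidence the CoT is not faithful.
    return answers
```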
Subcategories
- applications (9)
- compression (9)
- computer_vision (8)
- deep_learning (94)
- ethics (2)
- generative_models (25)
- interpretability (17)
- natural_language_processing (24)
- optimization (7)
- recommendation (2)
- reinforcement_learning (11)
- supervised_learning (1)