I thought it would be fun to dump the notes I took while catching up and reading through some of my bookmarks on this fine Saturday. The workflow was something like this:

  • I bookmarked these over time, either after skimming them once (if they were good, via Zotero’s Chrome plugin) or because the title / vibe caught my eye (in which case they end up in my Readwise)
  • I read the PDFs using Zotero
  • I had Obsidian on the side to jot down notes
  • I used SuperWhisper to write my initial thoughts as I was reading
  • I used ChatGPT 4o voice chat all the while to talk about the paper live as I was reading it, mostly asking stupid questions, asking for definitions, or stating my thoughts out loud. 4o often asked questions back, which was good at forcing me to compress or restate specific aspects of the paper.
  • Once done with a paper, I took a short break and revisited my notes before moving on to the next one.

Some of the readings were inspired by conversations I had here, here, here and there. Others are just papers that had been sitting in my backlog for a bit, or were recently sent to me.

The reading list

Today’s program includes:

  • Do Llamas Work in English? On the Latent Language of Multilingual Transformers
    (I had already skimmed this one around the time it released but finally read it top to bottom today.)
  • Progress measures for grokking via mechanistic interpretability
  • Towards Automated Circuit Discovery for Mechanistic Interpretability
    (These two I wanted to refresh myself on, since I first read them when I had basically 0 clue about what I was reading. I was just force-feeding myself papers in the hopes I’d start understanding them.)
  • Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes
    (Already read it a few times; this time I just jotted down some impressions I had when I read it once more today!)
  • Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

Tomorrow’s program is mostly composed of a few CS articles I saved up and wanna get through, but I also want to try and get rid of as many of the papers still sitting in my queue as possible.

Do Llamas Work in English? On the Latent Language of Multilingual Transformers

We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language – a question of key importance for understanding how language models function and the origins of linguistic bias. Focusing on the Llama-2 family of transformer models, our study uses carefully constructed non-English prompts with a unique correct single-token continuation. From layer to layer, transformers gradually map an input embedding of the final prompt token to an output embedding from which next-token probabilities are computed. Tracking intermediate embeddings through their high-dimensional space reveals three distinct phases, whereby intermediate embeddings (1) start far away from output token embeddings; (2) already allow for decoding a semantically correct next token in the middle layers, but give higher probability to its version in English than in the input language; (3) finally move into an input-language-specific region of the embedding space. We cast these results into a conceptual model where the three phases operate in “input space”, “concept space”, and “output space”, respectively. Crucially, our evidence suggests that the abstract “concept space” lies closer to English than to other languages, which may have important consequences regarding the biases held by multilingual language models.

I had skimmed that paper before, but felt like reading it thoroughly after posting this and being linked to it.

The abstract, TL;DR’d: they basically take Llama-2, feed it a few translation tasks from one non-English language to another, then apply the unembedding matrix to the latents after N layers to see whether the model is using English as a pivot while translating from language X to Y.

What they find is that while the model does seem to use English as a pivot language, it’s more that the “concept space” it thinks in is biased toward English: the intermediate layers operate in an abstract conceptual space that just happens to sit closer to English tokens, presumably because the training data contains a whole bunch of English.
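
To make that concrete for myself, here’s a minimal logit-lens-style sketch of the kind of probing this involves (my own illustration, not the authors’ code). It assumes you’ve already extracted the per-layer residual-stream latents for the final prompt token (`hidden_states`), the unembedding matrix `U`, and the model’s final RMSNorm (`final_norm`) – all hypothetical names:

```python
import torch

def logit_lens(hidden_states, U, final_norm):
    """Decode each intermediate latent directly into next-token probabilities.

    hidden_states: list of (d,) latents for the final prompt token, one per layer.
    U: (vocab_size, d) unembedding matrix.
    final_norm: the model's last RMSNorm, normally applied right before the unembedding.
    """
    per_layer_probs = []
    for h in hidden_states:
        logits = final_norm(h) @ U.T                       # dot product with every output token embedding
        per_layer_probs.append(torch.softmax(logits, dim=-1))
    return per_layer_probs

# The paper's question, roughly: on, say, a French -> Chinese translation prompt,
# do the middle layers give more probability to the English version of the correct
# word than to the Chinese one, before the last layers swing it back to Chinese?
# probs = logit_lens(hidden_states, U, final_norm)
# probs[middle_layer][english_token_id] vs. probs[middle_layer][chinese_token_id]
```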

The parts I struggled with:

  • “Output token embeddings (rows of the unembedding matrix U) and latents h cohabitate the same d-dimensional Euclidean space.”
    • fancy way of saying that the latents h and the rows of U (and of E, the input embedding matrix) all have the same dimensionality d and share the same “semantic space”, so you get to use things like cosine similarity between latents and token embeddings to make direct comparisons
  • “Due to RMS-normalization, latents live on a hypersphere of radius √d ≈ 90.1”
    • took me a bit to remember / re-learn that RMSNorm causes all the latents to have the same length of sqrt(d) (where d is the embedding dimension of the model, here 8192), and that’s what “forces” the vectors onto a sphere. It’s basically a geometric consequence of RMSNorm which I never really visualized like that before this paper (there’s a quick numeric check of this in the sketch after this list)
    • it’s also really key to this paper, which is all about angles/alignments/directions, and basically enables the interpretations they proceed with
  • “If a latent h has a component orthogonal to the token subspace, it includes information irrelevant for predicting the next token based on h alone” (since logits are scalar products of latent and token vectors). “The orthogonal component can still be important for the computations carried out by later layers and for predicting the next token in those layers. But the logit lens, which decodes latents into tokens prematurely in intermediate layers, will be blind to the orthogonal component.”
    • it wasn’t obvious why the logit lens would be “blind” to that orthogonal component… but since the logit lens is literally just taking the intermediate latents (aka the residual stream after attention and the FFN) and doing a dot product with the token embeddings, any component that is orthogonal to all the token embeddings contributes exactly 0 to every logit – the lens is “blind” to it
  • Token energy and token angles
    • Token energy:
      • it wasn’t immediately clear what token energy was from their formal definition of it. The way I understood it, it’s basically a quantification of how “aligned” the latent is with the token embeddings, aka how much it “points” toward them. High energy basically means the latent is good for predicting actual tokens; low energy means the latent is mostly orthogonal, aka useless for immediate predictions (but possibly still useful internally) – see the sketch right after this list
      • it’s a “strength” indicator for how much the latent vectors influence token logits
    • Token angles
      • basically measures how close/far the latent vectors are from the output token embeddings
      • small angle = similar to high energy
      • large angle = similar to low energy
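
Since both of these bits (the √d hypersphere and the energy/angle decomposition) are just linear algebra, a tiny sketch helped me pin them down. This is my own rough reconstruction with stand-in tensors, not the paper’s exact definitions or code:

```python
import torch

torch.manual_seed(0)
d, vocab = 512, 2048                 # toy sizes; Llama-2-70B has d = 8192, vocab = 32000

h = torch.randn(d)                   # stand-in for a residual-stream latent
U = torch.randn(vocab, d) * 0.02     # stand-in for the unembedding matrix

# RMS normalization (ignoring the learned gain) rescales any vector to norm sqrt(d):
# that's the geometric fact that puts every latent on the same hypersphere
# (sqrt(8192) ~ 90 in the paper's case).
h_rms = h / h.pow(2).mean().sqrt()
print(h_rms.norm().item(), d ** 0.5)

# Split the latent into the part lying in the span of the token embeddings
# (the only part the logit lens can see, since logits = h @ U.T) and the rest.
# Here I keep only the top singular directions of U as the "token subspace".
k = 64                                                # arbitrary cutoff for this illustration
_, _, Vh = torch.linalg.svd(U, full_matrices=False)
basis = Vh[:k]                                        # (k, d), orthonormal rows
h_in = (h_rms @ basis.T) @ basis                      # projection onto the token subspace
h_out = h_rms - h_in                                  # orthogonal part: invisible to the logit lens

token_energy = (h_in.norm() / h_rms.norm()) ** 2      # how much of the latent "points at" tokens
token_angle = torch.acos(h_in.norm() / h_rms.norm())  # angle between latent and the token subspace
print(token_energy.item(), token_angle.item())
```

High energy and small angle both mean the latent has already committed to actual tokens; low energy / large angle means most of it is “internal” information the lens can’t see.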

I fed these explanations I put together to ChatGPT 4.5 and the summary is very clean:

  • Latent vectors produced by transformer layers live in the same space as output token embeddings.
  • Due to normalization, latent vectors lie on a large hypersphere; token embeddings lie on a smaller hypersphere.
  • Initially, latent vectors contain mostly orthogonal (abstract) information unrelated to actual tokens. Later, latent vectors progressively become aligned with output tokens, discarding irrelevant information and concretely encoding predicted tokens.

Progress measures for grokking via mechanistic interpretability

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of “grokking” exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.

This one I just skimmed. I’m gonna be honest, there was a lot of math, I had already had a long day and I want to play MLB The Show 25.

  • TL;DR: they train small transformers on modular addition and study “grokking” (basically when the model suddenly generalizes right after initially just memorizing)
  • they reverse engineer the transformer and find out it’s doing Cthulhu-level tricks to perform addition, mapping inputs onto a circle and solving it with rotations (see the little sketch after this list)
  • grokking has clear phases through the training epochs, it’s not just a eureka moment
    • memorization (quickly learns the training set, doesn’t generalize)
    • circuit formation (“circuit formation likely happens due to the weight decay. Notably, the circuit is formed well before grokking occurs.”)
    • cleanup (regularization kills the memorization / cleans it up in favor of the generalized circuit)
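
A tiny sketch of the rotation trick, as I understand it (my own toy version, using a single frequency for simplicity; the actual network spreads this over several key frequencies and learns it in its weights):

```python
import numpy as np

p = 113      # the modulus used for the paper's modular-addition task
k = 5        # one "key frequency" (arbitrary pick here; the network uses a handful)

def modular_add_via_rotation(a, b):
    # Map each input onto the circle for frequency k.
    theta_a = 2 * np.pi * k * a / p
    theta_b = 2 * np.pi * k * b / p
    # "Adding" the inputs is composing the two rotations (the network does this
    # with products of sines/cosines, i.e. the angle-addition identities).
    theta_sum = theta_a + theta_b
    # Logit for each candidate answer c: cos(theta_sum - 2*pi*k*c/p) is maximized
    # exactly when c == (a + b) mod p, because gcd(k, p) = 1.
    c = np.arange(p)
    logits = np.cos(theta_sum - 2 * np.pi * k * c / p)
    return int(np.argmax(logits))

assert modular_add_via_rotation(47, 98) == (47 + 98) % p
```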

Towards Automated Circuit Discovery for Mechanistic Interpretability

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process’ steps: to identify the circuit that implements the specified behavior in the model’s computational graph. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work.

Also skimmed. This one feels like the birth of modern mechanistic interpretability (as I know it, at least). 2023 is a lifetime ago in that field and there are tons of more recent papers from Anthropic, so I didn’t spend too much time on it. Some notes:

  • mech. interp. used to be incredibly tedious and manual
  • to scale it up, they came up with a way to automate the discovery of “computational subgraphs” aka circuits (that are responsible for specific behaviors)
  • it’s done by iteratively pruning model components + testing the effects on some metrics (eg KL divergence)
  • basically it tries to remove edges or nodes from the computational graph and keeps only the ones whose removal actually hurts performance – a rough sketch of the loop is right after this list
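
A very rough sketch of that pruning loop (mine, heavily simplified, not the actual ACDC implementation; `run_with_edges` is a hypothetical callable that runs the model with only the given edges active):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) for two discrete distributions given as lists of probabilities.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def acdc_style_prune(run_with_edges, all_edges, threshold):
    """Greedy edge pruning in the spirit of ACDC (simplified).

    run_with_edges(edges) should return the model's output distribution when only
    `edges` are active in the computational graph (everything else ablated/patched).
    """
    kept = set(all_edges)
    reference = run_with_edges(kept)                 # behavior of the full graph
    for edge in sorted(all_edges, reverse=True):     # ACDC sweeps from outputs back toward inputs
        ablated = run_with_edges(kept - {edge})
        if kl_divergence(reference, ablated) < threshold:
            kept.discard(edge)                       # removing it barely changes the output: prune it
    return kept                                      # the surviving subgraph is the "circuit"
```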

Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes

This one is a personal favorite. Caleb put me on to it a couple of weeks ago and I’ve read it a couple of times already, including on the flight back from SF. It’s from 2009, but hits on many of my core mental models of how things work. There are a few Wittgenstein-level hand-waves here and there that I didn’t especially care about, especially around the analogies to “subjective beauty”, art, etc., which is why none of those concepts made it into my notes.

It basically claims that agents seek out data that has regularities that aren’t yet known, data that still has “patterns” left to be compressed. This is what allows for improvements in prediction and compression. The drive rewards discovery of data whose encoding requires fewer and fewer bits over time, increasing some subjective “simplicity”. It’s deeply anchored in information theory and Kolmogorov complexity – the simplest explanation (shortest program) is the most valuable, and this leads agents to seek out environments or experiences where they can keep making these simplifications.

One of the interesting bits is this model of consciousness being “compression-driven”: Schmidhuber’s claim is that consciousness naturally arises as the compressor (eg the brain) develops internal symbolic representations (including self-representations) in order to keep improving encoding efficiency.

For the few who’ve read about my thoughts on consciousness on x.com or heard me speak about it in spaces last year, you can probably tell parts of this (and Friston’s Active Inference etc…) have been big influences on me.

Schmidhuber basically claims consciousness to be a “computationally straightforward byproduct of the ongoing compression process”, an evident necessity to “create some sort of internal symbol or code representing the agent itself” to “efficiently encode the entire data history”.

The idea that consciousness is a simple byproduct of effective compression is something I’ve thrown around off-handedly (1, 2, 3, 4). This is also a very Joscha Bach-esque model of consciousness, which I feel comfortable thinking about (discovering Joscha Bach helped me get some of the necessary vocabulary to formulate my thoughts on this).

One of the things that ends up sticking out is that making your compression / prediction models better and better is effectively what underlies cognition, general intelligence, etc.

“Since short and simple explanations of the past usually reflect some repetitive regularity that helps to predict the future as well, every intelligent system interested in achieving future goals should be motivated to compress the history of raw sensory inputs in response to its actions, simply to improve its ability to plan ahead.”

“The agent should monitor the improvements of the adaptive data compressor: whenever it learns to reduce the number of bits required to encode the historic data, generate an intrinsic reward signal or curiosity reward signal in proportion to the learning progress or compression progress, that is, the number of saved bits.”

“Generally speaking we may say that a major goal of traditional unsupervised learning is to improve the compression of the observed data, by discovering a program that computes and thus explains the history… but is clearly shorter than the shortest previously known program of this kind.”

Another interesting thing that sticks out is how “valuable” noise is here, as it basically signals the edge of compressibility. Pure noise isn’t interesting because it has no compression-progress potential. Fully predictable data is also uninteresting, since it’s already been compressed as efficiently as possible. So the highly interesting data sits in the regions where apparent noise turns out to be compressible once your compression model improves: a sweet spot where data looks random at first and your predictive models can’t compress it well, but where deep hidden regularities are waiting to be revealed, letting you improve your model substantially.
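
A toy illustration of the “saved bits” reward from the second quote above (my own framing, obviously not Schmidhuber’s code – here zlib stands in for whatever adaptive compressor the agent carries around):

```python
import zlib

def compression_progress_reward(history, old_compressor, new_compressor):
    # Intrinsic "curiosity" reward = bits saved when the improved compressor
    # re-encodes the same history. Each compressor is any callable: bytes -> bytes.
    bits_before = 8 * len(old_compressor(history))
    bits_after = 8 * len(new_compressor(history))
    return bits_before - bits_after      # > 0 only if new regularity was actually found

history = bytes(range(256)) * 20         # highly regular data, cheap to encode once you see the pattern
print(compression_progress_reward(
    history,
    lambda h: h,                         # "old" compressor: just stores the raw bytes
    lambda h: zlib.compress(h, 9),       # "new" compressor: has learned the regularity
))
# Pure noise: no compressor improves on it, so the reward stays ~0.
# Already-trivial data: both compress it fully, reward ~0 as well.
# The interesting regime is exactly where the new model finds structure the old one missed.
```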

Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

A very simple, short paper – skimmed to get more food for thought re: circuit creation in transformers. It showcases how attention heads organically “stack” to solve the given chained-reasoning tasks pretty well.

I think it gives a good intuition for the kind of circuitry that attention heads can form for tasks where the setup is something like A = 7, B = A, C = B, D = C and the prompt is “what is the value of D?”. In that case, Layer 1 Head 1 can retrieve and encode the information that A = 7 into the residual stream. Then, Layer 2 Head 1 accesses this encoded representation to encode that B = 7, Layer 3 Head 1 then encodes C = 7, and finally Layer 4 Head 1 uses these accumulated representations to correctly infer D = 7.
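
Purely as a symbolic cartoon of that one-hop-per-layer resolution pattern (and emphatically not what the network does internally, as I note just below):

```python
# Each "layer" resolves exactly one hop of the reference chain, so a chain of
# depth k needs k stacked retrieval steps (heads).
bindings = {"A": "7", "B": "A", "C": "B", "D": "C"}   # the prompt's assignments

def resolve(query, bindings, num_layers):
    value = query
    for _ in range(num_layers):
        # This layer's head looks up what the current symbol is bound to and
        # writes the result back into the "residual stream" (here, just `value`).
        if value in bindings:
            value = bindings[value]
    return value

print(resolve("D", bindings, num_layers=4))           # -> "7": four hops, four stacked lookups
```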

Note: while writing this for myself and then deciding it might be fun to publish for anyone who is interested, I realized that paragraphs like the one describing the A/B/C/D chain are actually dangerous, because I make massive leaps in language when I’m writing to myself. Models don’t actually encode symbolic statements like this, obviously; they operate exclusively numerically, on vectors. But since that’s obvious to me (or to everyone?!), I just skip stating the obvious.