Recent Bookmarks

Matrices and graphs

The single most undervalued fact of linear algebra: matrices are graphs, and graphs are matrices
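
A minimal illustration of the claim (my own toy example, not taken from the post): an adjacency matrix encodes a directed graph, and matrix powers count walks in it.

```python
import numpy as np

# Adjacency matrix of a small directed graph on nodes {0, 1, 2}:
# edges 0->1, 1->2, 2->0 (a directed 3-cycle).
A = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])

# Reading the matrix *as* a graph: A[i, j] == 1 iff there is an edge i -> j.
edges = [(i, j) for i in range(3) for j in range(3) if A[i, j]]
print(edges)  # [(0, 1), (1, 2), (2, 0)]

# Matrix multiplication is walk-counting: (A @ A)[i, j] is the number of
# length-2 walks from i to j.
print(A @ A)
print(np.linalg.matrix_power(A, 3))  # identity: every node returns home in 3 steps
```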

Domain specific architectures for AI inference

fleetwood.dev

DeepSeek-V3 Explained 1: Multi-head Latent Attention

The key architectural innovation behind DeepSeek-V2 and DeepSeek-V3 for faster inference
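
A rough sketch of the idea as I understand it (simplified; the real design also uses a compressed query path and decoupled rotary keys): keys and values are reconstructed from a small shared latent, so only the latent has to be cached.

```python
import numpy as np

d_model, n_heads, d_head, d_latent, seq = 512, 8, 64, 64, 16
rng = np.random.default_rng(0)

# Down-projection to a small KV latent, and per-head up-projections.
W_down = rng.standard_normal((d_model, d_latent)) * 0.02    # hidden -> latent
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

h = rng.standard_normal((seq, d_model))  # token hidden states

# Only this (seq, d_latent) tensor is cached instead of full K and V:
c_kv = h @ W_down

# K and V are reconstructed from the latent on the fly at attention time.
K = (c_kv @ W_up_k).reshape(seq, n_heads, d_head)
V = (c_kv @ W_up_v).reshape(seq, n_heads, d_head)

full_cache = 2 * seq * n_heads * d_head  # standard per-layer KV cache entries
mla_cache = seq * d_latent               # latent cache entries
print(full_cache, mla_cache)             # 16384 vs 1024 in this toy setting
```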

Optimizing Transformer-Based Diffusion Models for Video Generation with NVIDIA TensorRT

State-of-the-art image diffusion models take tens of seconds to process a single image. Video diffusion is even more challenging, demanding significant computational resources and incurring high costs.

You could have designed state of the art positional encoding

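If I remember the post's arc correctly (treat this as an assumption), it builds up step by step to rotary position embeddings; a minimal numpy sketch of that rotation trick and the relative-position property it buys:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate pairs of channels of x by position-dependent angles (minimal RoPE sketch)."""
    d = x.shape[-1]
    half = d // 2
    # One frequency per channel pair, geometrically spaced as in the original formulation.
    freqs = base ** (-np.arange(half) / half)
    angles = positions[:, None] * freqs[None, :]  # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

seq, d = 8, 16
q = np.random.default_rng(0).standard_normal((seq, d))
pos = np.arange(seq, dtype=np.float64)

# The property that matters: dot products of rotated vectors depend only on
# relative position, so shifting all positions by the same offset changes nothing.
q_rot = rope(q, pos)
q_shifted = rope(q, pos + 5)
print(np.allclose(q_rot[2] @ q_rot[6], q_shifted[2] @ q_shifted[6]))  # True
```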

attention is logarithmic, actually

time complexity is a very bad model when working with parallelism. in which i make the case for work-depth analysis instead of time complexity.
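
My compressed reading of the argument (a paraphrase under the usual idealized-parallelism assumptions, not text from the post): measure work $W$ (total operations) and depth $D$ (longest dependency chain) separately, and attention's depth is logarithmic even though its work is quadratic.

```latex
% Work-depth accounting for one attention layer over n tokens of width d,
% assuming unbounded parallelism and tree reductions (idealized model).
\[
\begin{aligned}
W_{\text{attn}}(n, d) &= \Theta(n^2 d) && \text{(all pairwise scores and weighted sums)} \\
D_{QK^\top}(n, d) &= \Theta(\log d) && \text{(each score is a length-$d$ tree reduction)} \\
D_{\text{softmax}}(n) &= \Theta(\log n) && \text{(row max and row sum are length-$n$ reductions)} \\
D_{\text{attn}}(n, d) &= \Theta(\log n + \log d) && \text{(depths of sequential stages add)}
\end{aligned}
\]
```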

AI Arrives In The Middle East: US Strikes A Deal with UAE and KSA – SemiAnalysis

The US has signed two landmark agreements with the United Arab Emirates and the Kingdom of Saudi Arabia (KSA) that will noticeably shift the balance of power. The deals have economic, geopolitical…

Transformers Represent Belief State Geometry in their Residual Stream

Produced while an affiliate at PIBBSS[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS…

Llama from scratch (or how to implement a paper without crying)

I want to share some tips from my experience implementing a paper, covering what I've learned so far from implementing a dramatically scaled-down version…

The Curse of Knowing How, or; Fixing Everything

A reflection on control, burnout, and the strange weight of technical fluency.

The MAP-Elites Algorithm: Finding Optimality Through Diversity

MAP-Elites is a quality-diversity search method, often applied to reinforcement learning problems, that avoids local optima of a search space by storing multiple candidate solutions…
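
A toy sketch of the core loop (my own minimal version with a made-up 1-D behavior descriptor, not code from the linked article): keep one elite per behavior niche, and generate new candidates by mutating randomly chosen elites.

```python
import random

# Toy MAP-Elites: maximize fitness while covering a 1-D behavior space.
# Genome: a list of floats in [0, 1]. Fitness: negative distance to a target.
# Behavior descriptor: mean of the genome, binned into a fixed grid of niches.
N_BINS, DIM, ITERS = 20, 5, 5000
archive = {}  # bin index -> (fitness, genome): one elite per niche

def fitness(g):
    return -sum((x - 0.75) ** 2 for x in g)

def descriptor_bin(g):
    mean = sum(g) / len(g)                 # behavior descriptor in [0, 1]
    return min(N_BINS - 1, int(mean * N_BINS))

def mutate(g):
    return [min(1.0, max(0.0, x + random.gauss(0, 0.1))) for x in g]

for _ in range(ITERS):
    if archive and random.random() < 0.9:
        parent = random.choice(list(archive.values()))[1]  # mutate a stored elite
        child = mutate(parent)
    else:
        child = [random.random() for _ in range(DIM)]      # occasional random restart
    b, f = descriptor_bin(child), fitness(child)
    if b not in archive or f > archive[b][0]:              # keep only the best per niche
        archive[b] = (f, child)

print(f"{len(archive)}/{N_BINS} niches filled; "
      f"best fitness {max(v[0] for v in archive.values()):.4f}")
```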

How To Scale

While there are already excellent posts on scaling, I wanted to share my own understanding and the things I've learned over the past few months, and hopefully spark some discussion. I hope this post can shed some light for anyone navigating the challenges of scaling up neural networks. There may be mistakes or inaccuracies, so if you want to correct me or would like to discuss further, please feel free to DM me on X or leave a comment.

Deep Dive into Yann LeCun’s JEPA

ML blog.

Are Transformers universal approximators of sequence-to-sequence functions?

Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other simpler alternatives to self-attention layers and empirically evaluate them.

A Hugging Face Space by nanotron

The ultimate guide to training LLMs on large GPU clusters
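
As a bare-bones illustration of the simplest technique such a guide starts from (data parallelism with gradient averaging; a toy single-process simulation I wrote, not code from the Space):

```python
import numpy as np

# Single-process simulation of data parallelism: each "worker" holds a data
# shard, computes a local gradient, and gradients are averaged -- the step an
# all-reduce performs across GPUs on a real cluster.
rng = np.random.default_rng(0)
n_workers, n_per_worker, dim = 4, 64, 8
w_true = rng.standard_normal(dim)
shards = []
for _ in range(n_workers):
    X = rng.standard_normal((n_per_worker, dim))
    y = X @ w_true + 0.01 * rng.standard_normal(n_per_worker)
    shards.append((X, y))

w = np.zeros(dim)
for step in range(200):
    local_grads = [2 * X.T @ (X @ w - y) / len(y) for X, y in shards]  # per-worker grads
    grad = np.mean(local_grads, axis=0)                                # "all-reduce" average
    w -= 0.05 * grad                                                   # identical update on every worker
print(np.linalg.norm(w - w_true))  # small: all workers stayed in sync
```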