Bookmarks
Street Fighting Transformers
Sasha Rush presents practical back-of-the-envelope estimation techniques for Transformer/LLM models, useful for ML researchers and practitioners.
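For a flavor of the estimates involved, a quick sketch using two standard rules of thumb (not formulas quoted from the talk): roughly 12·L·d² non-embedding parameters, and about 6 training FLOPs per parameter per token.

```python
n_layers, d_model = 12, 768             # illustrative GPT-2-small-scale config
params = 12 * n_layers * d_model**2     # ~4d^2 attention + ~8d^2 MLP per block
tokens = 10e9                           # hypothetical training-token count
train_flops = 6 * params * tokens       # standard ~6ND training-FLOPs rule
print(f"params ~ {params/1e6:.0f}M, training FLOPs ~ {train_flops:.2e}")
```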
Re-thinking Transformers: Searching for Efficient Linear Layers over a Continuous Space of...
Academic talk from the Simons Institute presenting a unified framework for efficient linear layers in Transformers; relevant to deep-learning researchers and practitioners.
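As a minimal illustration of what "structured" can mean here (one simple point in the space, not the talk's actual framework), a low-rank factorized linear layer:

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Dense layer W ~ U·V, cutting parameters from d_in*d_out to r*(d_in+d_out)."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.V = nn.Linear(d_in, rank, bias=False)   # project down to rank r
        self.U = nn.Linear(rank, d_out, bias=False)  # project back up

    def forward(self, x):
        return self.U(self.V(x))
```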
How DeepSeek Rewrote the Transformer [MLA]
In-depth analysis of DeepSeek's Multi-head Latent Attention (MLA), covering the architecture, its performance characteristics, and the underlying equations.
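The core trick, sketched in PyTorch with illustrative dimensions (real MLA also decouples RoPE and compresses the query path): cache one small latent per token instead of full per-head keys and values.

```python
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 1024, 128, 8, 128    # illustrative sizes
W_dkv = nn.Linear(d_model, d_latent, bias=False)          # down-projection
W_uk = nn.Linear(d_latent, n_heads * d_head, bias=False)  # latent -> keys
W_uv = nn.Linear(d_latent, n_heads * d_head, bias=False)  # latent -> values

x = torch.randn(2, 16, d_model)                  # (batch, seq, d_model)
c_kv = W_dkv(x)                                  # cache this: 128 dims/token
k = W_uk(c_kv).view(2, 16, n_heads, d_head)      # reconstruct keys on the fly
v = W_uv(c_kv).view(2, 16, n_heads, d_head)      # reconstruct values on the fly
```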
What is a Transformer? (Transformer Walkthrough Part 1/2)
In-depth technical walkthrough of the Transformer architecture by an AI researcher; the first half of a two-part series.
The Attention Mechanism in Large Language Models
Visual, high-level explanation of scaled dot-product attention and why it enables large language models to capture long-range dependencies.
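The formula the video builds up to, as a minimal PyTorch sketch of standard scaled dot-product attention:

```python
import math
import torch

def attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v
```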
Large Language Models in Five Formulas
Tutorial that distills LLM behavior into five key formulas: perplexity, attention, GEMM efficiency, scaling laws, and RASP reasoning.
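For reference, the first of the five in its standard form: perplexity as the exponentiated average negative log-likelihood.

```latex
\mathrm{PPL}(x_{1:N}) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})\right)
```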
Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy
Andrej Karpathy kicks off Stanford CS25 with a primer on Transformer architecture, its history, and cross-domain applications.
Transformer Neural Network: Visually Explained
Step-by-step visual walkthrough and PyTorch implementation of the Transformer, covering self-attention, positional encoding, and multi-head attention.
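The positional-encoding piece, in the standard sinusoidal form from the original Transformer paper (a self-contained sketch, not code from the video):

```python
import math
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same)
    pos = torch.arange(seq_len).unsqueeze(1)                      # (seq, 1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                                     # (seq, d_model)
```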