Bookmarks

Street Fighting Transformers

Sasha Rush presents practical back-of-the-envelope estimation techniques for Transformer/LLM models, useful for ML researchers and practitioners.
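
For flavor, a back-of-envelope sketch in the talk's spirit, using two standard rules of thumb (roughly 12·L·d² parameters per Transformer stack, and ~6 FLOPs per parameter per token for training) rather than anything quoted from the video; function names are mine:

```python
def param_count(n_layers: int, d_model: int) -> float:
    # Attention (~4 d^2) + MLP (~8 d^2) weights per block; ignores
    # embeddings, LayerNorms, and biases.
    return 12 * n_layers * d_model**2

def train_flops(n_params: float, n_tokens: float) -> float:
    # ~6 FLOPs per parameter per token (forward + backward pass).
    return 6 * n_params * n_tokens

n = param_count(n_layers=96, d_model=12288)  # ~1.7e11, close to GPT-3's 175B
print(f"params = {n:.2e}, train FLOPs on 300B tokens = {train_flops(n, 3e11):.2e}")
```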

Re-thinking Transformers: Searching for Efficient Linear Layers over a Continuous Space of...

Simons Institute talk presenting a unified framework for searching over efficient linear layers in Transformers; highly relevant to deep-learning researchers and practitioners.
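
As one concrete point in such a search space, a minimal low-rank linear layer in PyTorch; this is an illustrative structured-matrix family of my choosing, not necessarily the parameterization the talk studies:

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """W is factored as V @ U with rank r << min(d_in, d_out), cutting
    parameters and FLOPs from d_in*d_out to r*(d_in + d_out)."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.U = nn.Linear(d_in, rank, bias=False)   # project down to rank r
        self.V = nn.Linear(rank, d_out, bias=False)  # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.V(self.U(x))
```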

How DeepSeek Rewrote the Transformer [MLA]

In-depth analysis of DeepSeek's Multi-head Latent Attention (MLA), covering its architecture, performance trade-offs, and underlying equations; highly relevant deep-learning material.
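
For orientation, the core of MLA as given in the DeepSeek-V2 paper (their notation; the decoupled RoPE key/query path is omitted here for brevity): keys and values are up-projected from a small shared latent, and only that latent needs to be cached, shrinking the KV cache.

```latex
c_t^{KV} = W^{DKV} h_t, \qquad
k_t^{C} = W^{UK} c_t^{KV}, \qquad
v_t^{C} = W^{UV} c_t^{KV}
```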

What is a Transformer? (Transformer Walkthrough Part 1/2)

In-depth technical walkthrough of the Transformer architecture by an AI researcher; the first half of a two-part educational series.

The Attention Mechanism in Large Language Models

Visual, high-level explanation of scaled dot-product attention and why it enables large language models to capture long-range dependencies.
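
The formula the video visualizes, as a minimal PyTorch sketch (recent PyTorch versions also ship this as `torch.nn.functional.scaled_dot_product_attention`):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., T_q, T_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each query attends over all keys
    return weights @ v
```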

Large Language Models in Five Formulas

Tutorial that distills LLM behavior into five key formulas: perplexity, attention, GEMM efficiency, scaling laws, and RASP reasoning.
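
For reference, the first of the five in its standard form: perplexity is the exponentiated average negative log-likelihood of a sequence $w_{1:N}$ under the model $p_\theta$.

```latex
\mathrm{PPL}(w_{1:N}) = \exp\!\Big(-\tfrac{1}{N}\textstyle\sum_{i=1}^{N} \log p_\theta(w_i \mid w_{<i})\Big)
```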

Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy

Andrej Karpathy kicks off Stanford CS25 with a primer on Transformer architecture, its history, and cross-domain applications.

Transformer Neural Network: Visually Explained

Step-by-step visual walkthrough and PyTorch implementation of the Transformer, covering self-attention, positional encoding, and multi-head attention.
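
A sketch of the sinusoidal positional encoding from "Attention Is All You Need", assuming that is the scheme the video uses; its own implementation may differ:

```python
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (T, 1)
    inv_freq = torch.pow(10000.0,
                         -torch.arange(0, d_model, 2).float() / d_model)  # (d/2,)
    angles = pos * inv_freq                                               # (T, d/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe
```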

Let's build GPT: from scratch, in code, spelled out.

End-to-end coding tutorial that builds a minimal GPT-style Transformer from scratch: dataset preparation, a character-level tokenizer, self-attention, and the training loop.
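
The character-level tokenizer is small enough to sketch from memory (not the video's exact code):

```python
# Map each distinct character to an integer id and back.
text = "hello transformer"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)
assert decode(encode(text)) == text
```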

Mixtral of Experts
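
Mistral AI's paper introducing Mixtral 8x7B, a sparse mixture-of-experts model in which a router activates two of eight feed-forward experts per token at each layer.

A simplified sketch of that top-2 routing; a plain MLP stands in for Mixtral's SwiGLU experts, and the explicit loop replaces the batched dispatch real implementations use:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Per-token top-2 routing over expert MLPs, Mixtral-style."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, d_model)
        logits = self.router(x)                           # (n_tokens, n_experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)             # renormalize over the top-2
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                sel = top_idx[:, k] == e                  # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[sel, k:k + 1] * self.experts[e](x[sel])
        return out
```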
