Bookmarks

Street Fighting Transformers

Sasha Rush presents practical back-of-the-envelope estimation techniques for Transformer/LLM models, useful for ML researchers and practitioners.
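
For flavor, a back-of-envelope sketch in the talk's spirit, using two standard rules of thumb (roughly 12·L·d² parameters per Transformer stack, and ~6 FLOPs per parameter per token for training) rather than anything quoted from the video; function names are mine:

```python
def param_count(n_layers: int, d_model: int) -> float:
    # Attention (~4 d^2) + MLP (~8 d^2) weights per block; ignores
    # embeddings, LayerNorms, and biases.
    return 12 * n_layers * d_model**2

def train_flops(n_params: float, n_tokens: float) -> float:
    # ~6 FLOPs per parameter per token (forward + backward pass).
    return 6 * n_params * n_tokens

n = param_count(n_layers=96, d_model=12288)  # ~1.7e11, close to GPT-3's 175B
print(f"params = {n:.2e}, train FLOPs on 300B tokens = {train_flops(n, 3e11):.2e}")
```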

Re-thinking Transformers: Searching for Efficient Linear Layers over a Continuous Space of...

Simons Institute talk presenting a unified framework for searching over efficient linear layers in Transformers; highly relevant to deep-learning researchers and practitioners.
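
As one concrete point in such a search space, a minimal low-rank linear layer in PyTorch; this is an illustrative structured-matrix family of my choosing, not necessarily the parameterization the talk studies:

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """W is factored as V @ U with rank r << min(d_in, d_out), cutting
    parameters and FLOPs from d_in*d_out to r*(d_in + d_out)."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.U = nn.Linear(d_in, rank, bias=False)   # project down to rank r
        self.V = nn.Linear(rank, d_out, bias=False)  # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.V(self.U(x))
```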

How DeepSeek Rewrote the Transformer [MLA]

In-depth analysis of DeepSeek's Multi-head Latent Attention (MLA), covering its architecture, performance trade-offs, and underlying equations; highly relevant deep-learning material.
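
For orientation, the core of MLA as given in the DeepSeek-V2 paper (their notation; the decoupled RoPE key/query path is omitted here for brevity): keys and values are up-projected from a small shared latent, and only that latent needs to be cached, shrinking the KV cache.

```latex
c_t^{KV} = W^{DKV} h_t, \qquad
k_t^{C} = W^{UK} c_t^{KV}, \qquad
v_t^{C} = W^{UV} c_t^{KV}
```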

What is a Transformer? (Transformer Walkthrough Part 1/2)

In-depth technical walkthrough of the Transformer architecture by an AI researcher; the first half of a two-part educational series.

The Attention Mechanism in Large Language Models

Visual, high-level explanation of scaled dot-product attention and why it enables large language models to capture long-range dependencies.
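
The formula the video visualizes, as a minimal PyTorch sketch (recent PyTorch versions also ship this as `torch.nn.functional.scaled_dot_product_attention`):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., T_q, T_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each query attends over all keys
    return weights @ v
```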

Large Language Models in Five Formulas

Tutorial that distills LLM behavior into five key formulas: perplexity, attention, GEMM efficiency, scaling laws, and RASP reasoning.
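
For reference, the first of the five in its standard form: perplexity is the exponentiated average negative log-likelihood of a sequence $w_{1:N}$ under the model $p_\theta$.

```latex
\mathrm{PPL}(w_{1:N}) = \exp\!\Big(-\tfrac{1}{N}\textstyle\sum_{i=1}^{N} \log p_\theta(w_i \mid w_{<i})\Big)
```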

Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy

Andrej Karpathy kicks off Stanford CS25 with a primer on Transformer architecture, its history, and cross-domain applications.

Transformer Neural Network: Visually Explained

Step-by-step visual walkthrough and PyTorch implementation of the Transformer, covering self-attention, positional encoding, and multi-head attention.
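
A sketch of the sinusoidal positional encoding from "Attention Is All You Need", assuming that is the scheme the video uses; its own implementation may differ:

```python
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (T, 1)
    inv_freq = torch.pow(10000.0,
                         -torch.arange(0, d_model, 2).float() / d_model)  # (d/2,)
    angles = pos * inv_freq                                               # (T, d/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe
```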

Let's build GPT: from scratch, in code, spelled out.

End-to-end coding tutorial that builds a minimal GPT-style Transformer from scratch: dataset preparation, a character-level tokenizer, self-attention, and the training loop.
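
The character-level tokenizer is small enough to sketch from memory (not the video's exact code):

```python
# Map each distinct character to an integer id and back.
text = "hello transformer"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)
assert decode(encode(text)) == text
```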

Mixtral of Experts
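
Mistral AI's paper introducing Mixtral 8x7B, a sparse mixture-of-experts model in which a router activates two of eight feed-forward experts per token at each layer.

A simplified sketch of that top-2 routing; a plain MLP stands in for Mixtral's SwiGLU experts, and the explicit loop replaces the batched dispatch real implementations use:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Per-token top-2 routing over expert MLPs, Mixtral-style."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, d_model)
        logits = self.router(x)                           # (n_tokens, n_experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)             # renormalize over the top-2
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                sel = top_idx[:, k] == e                  # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[sel, k:k + 1] * self.experts[e](x[sel])
        return out
```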
