Bookmarks
DeepSeek Debrief: >128 Days Later – SemiAnalysis
It’s been a bit over 150 days since the launc…
FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention
Although these fused attention implementations have substantially improved performance and enabled long contexts, this efficiency has come with a loss of flexibility.
The Illustrated AlphaFold
A visual walkthrough of the AlphaFold3 architecture, with more details and diagrams than you were probably looking for.
You could have designed state of the art positional encoding
attention is logarithmic, actually
Time complexity is a very bad model when working with parallelism; the post makes the case for work-depth analysis instead.
How To Scale
While there are already excellent posts on scaling, I wanted to share my own understanding and the things I've learned over the past few months, and hopefully spark some discussion. I hope this post can shed light for anyone navigating the challenges of scaling up neural networks. There may be mistakes or inaccuracies, so if you want to correct me or discuss further, please feel free to DM me on X or leave a comment.
Multi-layer language heads: the output latent is for text (and nothing else)
The last layer’s hidden state in a transformer is meant only for being decoded into token probabilities. Don’t use it for autoregressive image generation; don’t use it for looped latent transformers; only use it to produce the next token in a language model. It is a compressed representation of the...
CS336: Language Modeling from Scratch
Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general purpose system address a range of downstream tasks.
Contextualization Machines
diffusion transformers
Metaphorically, you can think of Vision Transformers as the eyes of the system, able to understand and contextualize what it sees, while Stable Diffusion is the hand of the system, able to generate and manipulate images based on this understanding.
Softmax Attention is a Fluke
Attention is the magic ingredient of modern neural networks. It is the core of what launched performant language models into the spotlight starting with GPT, and it has since extended across all modalities. There are a number of desirable properties that make attention a first-class building block: it handles variable sequence lengths with ease, and it allows for a global receptive field without needing to scale parameters.
Transformers Laid Out
I have found that there are mainly three types of blogs, videos, and tutorials about transformers.
Attention from Beginners Point of View
Transformers are a type of neural network architecture which is popularly used for text generations, machine translations, etc.
Why Attention Is All You Need
The Transformer architecture introduced in this paper was a major breakthrough in sequence transduction methodologies, particularly within neural machine translation (NMT) and broader natural language processing (NLP).
Unveiling_DeepSeek.pdf
successful modifications since its inception, let alone large-scale validation.
DeepSeek-V3 Explained: A Deep Dive into the Next-Generation AI Model
Artificial Intelligence (AI) is advancing at an unprecedented pace, and the DeepSeek-V3 model is at the forefront of this revolution. As…
Oasis: A Universe in a Transformer
Generating Worlds in Realtime
Humans in 4D: Reconstructing and Tracking Humans with Transformers
MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
MLKV introduces Multi-Layer Key-Value sharing to reduce memory usage in transformer decoding. This approach improves efficiency without sacrificing performance on NLP benchmarks. MLKV significantly reduces memory requirements compared to existing methods like Multi-Query Attention.
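As a rough sketch of the general idea (not the paper's implementation), the cache below lets several consecutive decoder layers read the same key-value entries, so memory scales with the number of KV groups rather than the number of layers; the grouping scheme and class names are illustrative assumptions.

```python
import torch

# Illustrative sketch of cross-layer KV sharing: every `layers_per_kv`
# consecutive layers map to one cache slot. Only the first layer of each
# group writes; the others reuse its entries. This grouping is an assumption
# for illustration, not MLKV's exact configuration.
class SharedKVCache:
    def __init__(self, num_layers: int, layers_per_kv: int):
        self.layers_per_kv = layers_per_kv
        num_groups = (num_layers + layers_per_kv - 1) // layers_per_kv
        self.keys = [[] for _ in range(num_groups)]
        self.values = [[] for _ in range(num_groups)]

    def group(self, layer_idx: int) -> int:
        return layer_idx // self.layers_per_kv

    def append(self, layer_idx: int, k: torch.Tensor, v: torch.Tensor):
        g = self.group(layer_idx)
        if layer_idx % self.layers_per_kv == 0:  # first layer of the group writes
            self.keys[g].append(k)
            self.values[g].append(v)

    def get(self, layer_idx: int):
        g = self.group(layer_idx)
        return torch.stack(self.keys[g]), torch.stack(self.values[g])
```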
Exploring architectures- Transformers II
The text explains how Transformers utilize queries, keys, and values to calculate self-attention weights for tokens. It details the process of obtaining the self-attention weights and generating output tokens through neural networks. The final steps involve calculating loss using cross-entropy and backpropagating to update the weight parameters.
The Annotated Transformer
The text discusses the architecture and training of a Transformer model.
It explains the use of self-attention and feed-forward networks in the encoder and decoder.
The model is demonstrated through examples of prediction and visualization of attention mechanisms.
Root Mean Square Layer Normalization
The text discusses Root Mean Square Layer Normalization (RMSNorm), proposed by Biao Zhang and Rico Sennrich. RMSNorm normalizes a layer's activations by their root mean square and a learned gain, dropping LayerNorm's mean-centering step to reduce computation while keeping comparable performance. The paper is available on arxiv.org.
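A minimal PyTorch sketch of the normalization the paper describes; the epsilon value and module layout are conventional choices, not taken from the paper's reference code.

```python
import torch
import torch.nn as nn

# RMSNorm sketch: scale x by the reciprocal of its root mean square, then by a
# learned gain. Unlike LayerNorm, there is no mean subtraction and no bias.
class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```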
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks
The text discusses a method called Parameter-Efficient Sparsity Crafting (PESC) that enhances sparse models for natural language processing tasks. PESC involves integrating adapters into sparse models, improving performance without changing individual weights. The approach outperforms other sparse models and even competes with GPT-3.5 in various tasks.
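For illustration only, here is a generic bottleneck-adapter module of the kind PESC inserts into experts; the dimensions, activation, and placement are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

# Generic bottleneck adapter: a small down-projection / up-projection pair added
# to a frozen expert, so only the adapter parameters are trained.
class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen expert's output intact.
        return x + self.up(self.act(self.down(x)))
```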
gemini_v1_5_report
Gemini 1.5 Pro is a highly compute-efficient multimodal model that can recall and reason over millions of tokens of context, including long documents, videos, and audio. It achieves near-perfect recall on long-context retrieval tasks and outperforms the state-of-the-art in long-document QA, long-video QA, and long-context ASR. Gemini 1.5 Pro also showcases surprising new capabilities, such as learning to translate a new language from a grammar manual. The model surpasses the previous Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide range of benchmarks while requiring less compute to train.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
BERT and RoBERTa have achieved impressive results on sentence-pair regression tasks like semantic textual similarity, but they have a significant computational overhead when comparing large collections of sentences. To address this, Sentence-BERT (SBERT) has been developed as a modification of BERT that uses siamese and triplet network structures to generate semantically meaningful sentence embeddings. SBERT reduces the time required to find the most similar pair from 65 hours with BERT to just 5 seconds, while maintaining accuracy. SBERT outperforms other state-of-the-art sentence embedding methods on various tasks, including STS and transfer learning.
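A short usage sketch with the sentence-transformers library that grew out of SBERT; the checkpoint name is one public model chosen for illustration, not the paper's original training setup.

```python
from sentence_transformers import SentenceTransformer, util

# Encode sentences once into fixed-size embeddings, then compare cheaply with
# cosine similarity instead of running a cross-encoder on every pair.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A man is eating food.", "A man is eating a piece of bread."]
embeddings = model.encode(sentences, convert_to_tensor=True)

score = util.cos_sim(embeddings[0], embeddings[1])
print(float(score))
```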
Visual Guides to understand the basics of Large Language Models
This article provides a compilation of tools and articles that aim to break down the complicated concepts of Large Language Models (LLMs) in an intuitive way. It acknowledges that many people struggle with understanding the basics of LLMs and offers resources to help solidify their understanding. The article includes a table of contents with links to various resources, such as "The Illustrated Transformer" by Jay Alammar, which provides visualizations to explain the transformer architecture, a fundamental building block of LLMs. The goal is to make the concepts of LLMs easily understood and accessible.
Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs
This article provides a comprehensive understanding and coding guide for self-attention mechanisms in transformer architectures and large language models (LLMs) like GPT-4 and Llama. It covers the concept of self-attention, its importance in NLP, and the implementation of the self-attention mechanism in Python and PyTorch. The article also discusses the scaled dot-product attention, computing unnormalized attention weights, computing attention weights, and computing the context vector. Additionally, it explores multi-head attention and provides code examples for implementing multiple attention heads.
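A compact sketch of the scaled dot-product attention steps the article walks through (unnormalized scores, softmax weights, context vectors), with an optional causal mask; the toy dimensions are arbitrary.

```python
import torch
import torch.nn.functional as F

# Minimal scaled dot-product self-attention: project to queries/keys/values,
# score, optionally mask future positions, normalize, and mix the values.
def self_attention(x, w_q, w_k, w_v, causal: bool = False):
    q, k, v = x @ w_q, x @ w_k, x @ w_v                      # (seq, d_k) each
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5     # unnormalized weights
    if causal:
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                       # attention weights
    return weights @ v                                        # context vectors

# Toy example with random projections (dimensions chosen for illustration).
torch.manual_seed(0)
x = torch.randn(5, 16)                                        # 5 tokens, dim 16
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v, causal=True).shape)    # torch.Size([5, 8])
```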
mlx-examples/lora at main · ml-explore/mlx-examples · GitHub
This document provides an example of using MLX to fine-tune either a Llama 7B or Mistral 7B model with low rank adaptation (LoRA) for a target task. The example demonstrates using the WikiSQL dataset to train the model to generate SQL queries from natural language. It includes instructions for setup, running the script, fine-tuning the model, evaluating the model, generating output, and dealing with memory issues. The document also provides results from the training process and offers tips for reducing memory consumption during fine-tuning.
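This is not the MLX example's code, but a generic PyTorch sketch of the LoRA idea it applies: a frozen linear layer plus a trainable low-rank update; the rank and scaling are illustrative defaults.

```python
import torch
import torch.nn as nn

# LoRA sketch: the base weight stays frozen, and a low-rank update
# (lora_b @ lora_a), scaled by alpha / r, is added to its output.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # only LoRA factors train
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```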
Mixtral of Experts
Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model that outperforms or matches other models like Llama 2 70B and GPT-3.5 across various benchmarks. It has the same architecture as Mistral 7B but uses 8 feedforward blocks (experts) in each layer. A router network selects two experts for each token at each layer, allowing for dynamic selection of different experts at each timestep. This results in each token having access to 47B parameters but only using 13B active parameters during inference. Mixtral also offers a fine-tuned model, Mixtral 8x7B - Instruct, which surpasses other models on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
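A simplified sketch of the top-2 routing described above: a linear gate scores the experts per token, the two highest-scoring experts run, and their outputs are mixed with renormalized softmax weights. The expert feed-forward shape and sizes are illustrative, not Mixtral's exact blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Top-2 mixture-of-experts layer sketch: route each token to its two
# best-scoring experts and combine their outputs by softmax gate weights.
class Top2Router(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (tokens, dim)
        logits = self.gate(x)                                 # (tokens, experts)
        top_vals, top_idx = logits.topk(2, dim=-1)
        weights = F.softmax(top_vals, dim=-1)                 # renormalize over top-2
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```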
An Intuition for Attention
The transformer neural network, used by models like ChatGPT, incorporates an attention mechanism to improve performance. Attention is a key feature of transformers and is defined by an equation that involves the softmax function. Attention can take different forms, but the scaled dot product attention is commonly used. This attention mechanism is based on the idea of key-value lookups, where a query is matched with keys to retrieve corresponding values. The attention scores, which determine how much attention is given to each key-value pair, are computed using dot product similarity and normalized into a probability distribution over the keys using the softmax function. This process allows for meaningful and efficient processing of queries in large language models.
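The equation the post builds intuition for is the standard scaled dot-product attention from "Attention Is All You Need": queries are scored against keys, scaled by the key dimension, normalized with softmax, and used to weight the values.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```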
Transformers From Scratch
This blog provides a step-by-step guide on creating and training a transformer from scratch. The author explains each foundational element and provides a Jupyter notebook with the code for readers to run and experiment with. The blog references a YouTube video and the Attention Is All You Need paper for further understanding. The author also mentions the availability of the final code and a dataset for download.
How GPT3 Works - Visualizations and Animations
The tech world is abuzz with GPT3 hype. Massive language models (like GPT3) are starting to surprise us with their abilities. While not yet completely reliable for most businesses to put in front of their customers, these models are showing sparks of cleverness that are sure to accelerate the march of automation and the possibilities of intelligent computer systems. Let’s remove the aura of mystery around GPT3 and learn how it’s trained and how it works.
A trained language model generates text.
We can optionally pass it some text as input, which influences its output.
The output is generated from what the model “learned” during its training period where it scanned vast amounts of text.
Tensor2Tensor Intro
The Annotated Transformer
"The Annotated Transformer" is a paper that introduces a new architecture for natural language processing tasks, with a focus on translation. The paper provides an annotated version of the original paper, giving a line-by-line implementation of the model. The Transformer model relies on self-attention to compute representations of its input and output without using sequence-aligned recurrent neural networks or convolutions. The model consists of an encoder and decoder stack, each containing self-attention layers and position-wise feed-forward networks. The paper also discusses the use of multi-head attention and positional encoding in the model. The model is trained using the WMT 2014 English-German dataset and the Adam optimizer.
The Illustrated Transformer
"The Illustrated Transformer" is a comprehensive guide to understanding the Transformer model, which utilizes attention to improve the training speed of neural machine translation models. The model consists of stacked encoders and decoders, with each encoder and decoder having self-attention layers. Self-attention allows the model to incorporate information from other words in the input sequence, resulting in better encoding. The model also employs multi-headed attention, which allows it to focus on different positions and creates multiple sets of Query/Key/Value weight matrices. Positional encoding is used to account for the order of words in the input sequence. The architecture includes residual connections and layer normalization for each sub-layer.
The Random Transformer
This blog post provides an end-to-end example of the math within a transformer model, with a focus on the encoder part. The goal is to understand how the model works, and to make it more manageable, simplifications are made and the dimensions of the model are reduced. The post recommends reading "The Illustrated Transformer" blog for a more intuitive explanation of the transformer model. The prerequisites for understanding the content include basic knowledge of linear algebra, machine learning, and deep learning. The post covers the math within a transformer model during inference, attention mechanisms, residual connections and layer normalization, and provides some code to scale it up.
CS25: Transformers United V3
Transformers have revolutionized Natural Language Processing (NLP) and are now being applied in various fields, including Computer Vision, Reinforcement Learning, and Speech. This seminar explores the details of how Transformers work and their applications, with a focus on large language models (LLMs). The seminar includes instructor and guest lectures from experts in Transformers research. The schedule includes topics such as the creation of fine-tuned chat models, low-level embodied intelligence with foundation models, and training helpful chatbots. The seminar also covers the motivations behind Transformers, scaling human-centered machine translation, and going beyond LLMs to explore emergent abilities and intermediate-guided reasoning.
openai/whisper-large-v2
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and demonstrates strong generalization abilities without the need for fine-tuning. The large-v2 model, trained for 2.5x more epochs with added regularization, offers improved performance. The models can be used for transcription and translation tasks, with context tokens indicating the language and task. While the models show robustness and accuracy in many languages, they may exhibit limitations such as generating repetitive texts and hallucinations. The models have potential applications in accessibility tools but also raise concerns about dual use and surveillance capabilities.
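A minimal transcription sketch using the Hugging Face transformers pipeline; the audio path is a placeholder, and language/task options are left at their defaults.

```python
from transformers import pipeline

# Load the ASR pipeline with the large-v2 checkpoint and transcribe one file.
# "audio.mp3" is a placeholder path for illustration.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")
result = asr("audio.mp3")
print(result["text"])
```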
Text Summarization: How to Calculate BertScore
BERTScore is a metric used to measure the quality of text summarization by calculating the similarity between the summary and the original text. It addresses issues that n-gram-based metrics face, such as incorrect matching of paraphrases and the inability to capture long-range dependencies. The BERTScore architecture involves contextual embeddings, cosine similarity, token matching for precision and recall, importance weighting, and baseline rescaling. The metric has the potential to improve various natural language processing tasks and can be applied in domains such as translation quality assessment, text generation, and document comparison. Future developments include broader language coverage and adaptation for multilingual texts.
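A simplified sketch of the token-matching core of the metric, assuming contextual embeddings are already computed; importance weighting and baseline rescaling are omitted.

```python
import torch

# Simplified BERTScore-style matching: greedily match each candidate token to
# its most similar reference token (and vice versa) by cosine similarity, then
# combine the two averages into an F1 score.
def bertscore_f1(cand_emb: torch.Tensor, ref_emb: torch.Tensor) -> float:
    cand = torch.nn.functional.normalize(cand_emb, dim=-1)
    ref = torch.nn.functional.normalize(ref_emb, dim=-1)
    sim = cand @ ref.T                          # (cand_tokens, ref_tokens)
    precision = sim.max(dim=1).values.mean()    # best match per candidate token
    recall = sim.max(dim=0).values.mean()       # best match per reference token
    return float(2 * precision * recall / (precision + recall))

# Toy example with random "embeddings" standing in for BERT outputs.
torch.manual_seed(0)
print(bertscore_f1(torch.randn(6, 32), torch.randn(8, 32)))
```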
MotionGPT: Human Motion as a Foreign Language
MotionGPT is a unified model for language and motion tasks, achieving top performance in text-driven motion generation. It combines natural language models with human motion tasks, benefiting fields like gaming and robotics. The model treats human motion like a foreign language, offering a versatile solution for diverse motion synthesis problems.
Subcategories
- applications (9)
- compression (9)
- computer_vision (8)
- deep_learning (94)
- ethics (2)
- generative_models (25)
- interpretability (17)
- natural_language_processing (24)
- optimization (7)
- recommendation (2)
- reinforcement_learning (11)
- supervised_learning (1)