Bookmarks
Softmax Attention is a Fluke
Attention is the magic ingredient of modern neural networks. It is the core of what launched performant language models into the spotlight starting with GPT, and since then it has extended its hands across all modalities. There are a number of desirable properties that make attention a first-class building block. Namely:
• It handles variable sequence lengths with ease
• It allows for a global receptive field without needing to scale parameters
Transformers Laid Out
I have found that there are mainly three types of blogs/videos/tutorials talking about transformers
A friendly introduction to machine learning compilers and optimizers
[Twitter thread, Hacker News discussion]
tt-metal/METALIUM_GUIDE.md at main · tenstorrent/tt-metal · GitHub
:metal: TT-NN operator library, and TT-Metalium low level kernel programming model. - tenstorrent/tt-metal
Scoping out the Tenstorrent Wormhole
The Tenstorrent Wormhole n300s PCIe accelerator board is available for purchase, featuring 672 RISC-V cores driving 466 TFLOP/s of FP8 matmul.
Physics of language models
Many asked about collaborations (details are in the FAQ). Short answer: no, unless you're from Meta and willing to work with us in your spare time (20+ hrs/week), or you're an early-year PhD student from UCB/NYU/CMU/UW (though the application deadline was Jan 10, 2025).
Citation request: I'm delighted to know that multiple
Tenstorrent first thoughts
I've looked into alternative AI accelerators to continue my saga of running GGML on lower power-consumption hardware. The most promising - and the only one that ever replied to my emails - was Tenstorrent. This post is me deeply thinking about if buying their hardware for development is a good inve ...
Neural Networks, Manifolds, and Topology
However, there remain a number of concerns about them. One is that it can be quite challenging to understand what a neural network is really doing.
Attention from Beginners Point of View
Transformers are a type of neural network architecture which is popularly used for text generations, machine translations, etc.
(How) Do Language Models Track State?
Transformer language models (LMs) exhibit behaviors -- from storytelling to code generation -- that appear to require tracking the unobserved state of an evolving world. How do they do so? We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation composition, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, then refines this with an associative scan. The two mechanisms exhibit markedly different robustness properties, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pretrained or fine-tuned, can learn to implement efficient and interpretable state tracking mechanisms, and the emergence of these mechanisms can be predicted and controlled.
Why Attention Is All You Need
The Transformer architecture introduced in this paper was a major breakthrough in sequence transduction methodologies, particularly within neural machine translation (NMT) and broader natural language processing (NLP).
neural video codecs: the future of video compression
how deep learning could rewrite the way we encode and decode video
Mastering LLM Techniques: Evaluation
Evaluating large language models (LLMs) and retrieval-augmented generation (RAG) systems is a complex and nuanced process, reflecting the sophisticated and multifaceted nature of these systems.
Mastering LLM Inference Techniques: Inference Optimization
Learn about the most pressing challenges in LLM inference, along with some practical solutions.
Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling
As AI models extend their capabilities to solve more sophisticated challenges, a new scaling law known as test-time scaling or inference-time scaling is emerging. Also known as AI reasoning or long…
DeepSeek-V3 Explained: A Deep Dive into the Next-Generation AI Model
Artificial Intelligence (AI) is advancing at an unprecedented pace, and the DeepSeek-V3 model is at the forefront of this revolution. As…
Foundations of Large Language Models
This is a book about large language models. As indicated by the title, it primarily focuses on foundational concepts rather than comprehensive coverage of all cutting-edge technologies. The book is structured into four main chapters, each exploring a key area: pre-training, generative models, prompting techniques, and alignment methods. It is intended for college students, professionals, and practitioners in natural language processing and related fields, and can serve as a reference for anyone interested in large language models.
Payment - Pozitif Teknoloji
*Please enter your order number in the description field; our company is not responsible for delays in wire transfers made without an order number.
An Introduction to Universal Artificial Intelligence, by Marcus Hutter, David Quarel, and Elliot Catt
The book can be ordered from amazon.com / co.
Deepseek: The Quiet Giant Leading China’s AI Race
Annotated translation of its CEO's deepest interview
Towards a Categorical Foundation of Deep Learning: A Survey
The unprecedented pace of machine learning research has led to incredible advances, but also poses hard challenges. At present, the field lacks strong theoretical underpinnings, and many important achievements stem from ad hoc design choices which are hard to justify in principle and whose effectiveness often goes unexplained. Research debt is increasing and many papers are found not to be reproducible.
This thesis is a survey that covers some recent work attempting to study machine learning categorically. Category theory is a branch of abstract mathematics that has found successful applications in many fields, both inside and outside mathematics. Acting as a lingua franca of mathematics and science, category theory might be able to give a unifying structure to the field of machine learning. This could solve some of the aforementioned problems.
In this work, we mainly focus on the application of category theory to deep learning. Namely, we discuss the use of categorical optics to model gradient-based learning, the use of categorical algebras and integral transforms to link classical computer science to neural networks, the use of functors to link different layers of abstraction and preserve structure, and, finally, the use of string diagrams to provide detailed representations of neural network architectures.
Soft question: Deep learning and higher categories
Recently, I have stumbled upon certain articles and lecture videos that use category theory to explain certain aspects of machine learning or deep learning (e.g. Cats for AI and the paper An enriched
Position: Categorical Deep Learning is an Algebraic Theory of All Architectures
We present our position on the elusive quest for a general-purpose framework
for specifying and studying deep learning architectures. Our opinion is that
the key attempts made so far lack a coherent bridge between specifying
constraints which models must satisfy and specifying their implementations.
Focusing on building such a bridge, we propose to apply category theory --
precisely, the universal algebra of monads valued in a 2-category of parametric
maps -- as a single theory elegantly subsuming both of these flavours of neural
network design. To defend our position, we show how this theory recovers
constraints induced by geometric deep learning, as well as implementations of
many architectures drawn from the diverse landscape of neural networks, such as
RNNs. We also illustrate how the theory naturally encodes many standard
constructs in computer science and automata theory.
Fundamental Components of Deep Learning: A category-theoretic approach
Deep learning, despite its remarkable achievements, is still a young field.
Like the early stages of many scientific disciplines, it is marked by the
discovery of new phenomena, ad-hoc design decisions, and the lack of a uniform
and compositional mathematical foundation. From the intricacies of the
implementation of backpropagation, through a growing zoo of neural network
architectures, to the new and poorly understood phenomena such as double
descent, scaling laws or in-context learning, there are few unifying principles
in deep learning. This thesis develops a novel mathematical foundation for deep
learning based on the language of category theory. We develop a new framework
that is a) end-to-end, b) uniform, and c) not merely descriptive, but
prescriptive, meaning it is amenable to direct implementation in programming
languages with sufficient features. We also systematise many existing
approaches, placing many existing constructions and concepts from the
literature under the same umbrella. In Part I we identify and model two main
properties of deep learning systems: parametricity and bidirectionality. We
expand on the previously defined construction of actegories and Para to study
the former, and define weighted optics to study the latter. Combining them
yields parametric weighted optics, a categorical model of artificial neural
networks, and more. Part II justifies the abstractions from Part I, applying
them to model backpropagation, architectures, and supervised learning. We
provide a lens-theoretic axiomatisation of differentiation, covering not just
smooth spaces, but discrete settings of boolean circuits as well. We survey
existing, and develop new categorical models of neural network architectures.
We formalise the notion of optimisers and lastly, combine all the existing
concepts together, providing a uniform and compositional framework for
supervised learning.
Gemini: A Family of Highly Capable Multimodal Models
This report introduces a new family of multimodal models, Gemini, that
exhibit remarkable capabilities across image, audio, video, and text
understanding. The Gemini family consists of Ultra, Pro, and Nano sizes,
suitable for applications ranging from complex reasoning tasks to on-device
memory-constrained use-cases. Evaluation on a broad range of benchmarks shows
that our most-capable Gemini Ultra model advances the state of the art in 30 of
32 of these benchmarks - notably being the first model to achieve human-expert
performance on the well-studied exam benchmark MMLU, and improving the state of
the art in every one of the 20 multimodal benchmarks we examined. We believe
that the new capabilities of the Gemini family in cross-modal reasoning and
language understanding will enable a wide variety of use cases. We discuss our
approach toward post-training and deploying Gemini models responsibly to users
through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud
Vertex AI.
Flow Matching Guide and Code
Flow Matching (FM) is a recent framework for generative modeling that has
achieved state-of-the-art performance across various domains, including image,
video, audio, speech, and biological structures. This guide offers a
comprehensive and self-contained review of FM, covering its mathematical
foundations, design choices, and extensions. By also providing a PyTorch
package featuring relevant examples (e.g., image and text generation), this
work aims to serve as a resource for both novice and experienced researchers
interested in understanding, applying and further developing FM.
llama.cpp guide - Running LLMs locally, on any hardware, from scratch
Psst, kid, want some cheap and small LLMs?
Genie 2: A large-scale foundation world model
Generating unlimited diverse training environments for future general agents
WilliamYi96/Awesome-Energy-Based-Models: A curated list of resources on energy-based models.
A curated list of resources on energy-based models. - WilliamYi96/Awesome-Energy-Based-Models
"CBLL, Research Projects, Computational and Biological Learning Lab, Courant Institute, NYU"
Yann LeCun's Web pages at NYU
yataobian/awesome-ebm: Collecting research materials on EBM/EBL (Energy Based Models, Energy Based Learning)
Collecting research materials on EBM/EBL (Energy Based Models, Energy Based Learning) - yataobian/awesome-ebm
Greg Yang
I am currently developing a framework called Tensor Programs for understanding large neural networks.
My favorite books
Star means currently reading but already enjoying.
MOND←TECH MAGAZINE
This is a website, which means it sometimes goes offline
Coalescence: making LLM inference 5x faster
In this post we’re going to explore a surprising property of structured generation when working with Large Language Models (LLMs): generating structured output from an LLM can be significantly faster than generating unstructured text.
How to get from high school math to cutting-edge ML/AI: a detailed 4-stage roadmap with links to the best learning resources that I’m aware of.
1) Foundational math. 2) Classical machine learning. 3) Deep learning. 4) Cutting-edge machine learning.
Oasis: A Universe in a Transformer
Generating Worlds in Realtime
Humans in 4D: Reconstructing and Tracking Humans with Transformers
Join the discussion on this paper page
Typing the technical interview
In the formless days, long before the rise of the Church, all spells were woven of pure causality, all actions were permitted, and death was common.
Tutorial on Diffusion Models for Imaging and Vision
The astonishing growth of generative tools in recent years has empowered many
exciting applications in text-to-image generation and text-to-video generation.
The underlying principle behind these generative tools is the concept of
diffusion, a particular sampling mechanism that has overcome some shortcomings
that were deemed difficult in the previous approaches. The goal of this
tutorial is to discuss the essential ideas underlying the diffusion models. The
target audience of this tutorial includes undergraduate and graduate students
who are interested in doing research on diffusion models or applying these
models to solve other problems.
A ToC of the 20 part linker essay
I release this message (the ToC and comments) into the public domain, no rights reserved.
How web bloat impacts users with slow connections
Web bloat makes many websites difficult to use for people with slow internet connections and devices. Sites like Discourse and Reddit perform poorly on low-end devices, even if they seem fast on high-end ones. Improving web performance for these users is crucial, as many people rely on older, slower devices.
applicative-mental-models
The text discusses the importance of understanding program performance for effective optimization. It emphasizes that while most optimizations may not be necessary, being aware of critical performance paths is essential. The author provides latency numbers to help programmers grasp the impact of different operations on performance.
Brian Robert Callahan
This blog post starts a series on creating programs that demystify how programs work. The first program is a disassembler that reads bytecode and converts it into assembly language, while a future post will cover creating an assembler. The disassembler uses a table of mnemonics and instruction sizes to print out the corresponding assembly instructions from bytecode.
Cramming: Training a Language Model on a Single GPU in One Day
Recent trends in language modeling have focused on increasing performance
through scaling, and have resulted in an environment where training language
models is out of reach for most researchers and practitioners. While most in
the community are asking how to push the limits of extreme computation, we ask
the opposite question: How far can we get with a single GPU in just one day?
We investigate the downstream performance achievable with a transformer-based
language model trained completely from scratch with masked language modeling
for a single day on a single consumer GPU. Aside from re-analyzing nearly all
components of the pretraining pipeline for this scenario and providing a
modified pipeline with performance close to BERT, we investigate why scaling
down is hard, and which modifications actually improve performance in this
scenario. We provide evidence that even in this constrained setting,
performance closely follows scaling laws observed in large-compute settings.
Through the lens of scaling laws, we ...
The MiniPile Challenge for Data-Efficient Language Models
The MiniPile Challenge introduces a new dataset for pre-training language models, containing 1 million documents filtered for quality. It aims to reduce the need for large computational resources while still achieving competitive performance on language tasks. The research shows that models pre-trained on MiniPile perform only slightly worse than those trained on much larger datasets.
Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
The authors present a method for training large text-to-image diffusion models on a very low budget. They use a technique called deferred masking to minimize performance loss while reducing computational costs. Their approach achieves high-quality results at a fraction of the cost compared to existing models, demonstrating the potential for democratizing AI training.
Chess-GPT's Internal World Model
The blog post discusses how a GPT model trained on chess games learns to predict moves and track the board state without being explicitly given the rules. It successfully classified chess pieces with high accuracy and estimated player skill levels based on game moves. The findings suggest that models trained on strategic games can effectively learn complex tasks through pattern recognition.
Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models
Researchers trained a chess-playing language model to understand the game without prior knowledge, focusing on how it represents the board state. They found that the model not only learned the board's layout but also estimated player skill, which helped it predict the next move better. By incorporating a player skill vector, the model's win rate improved significantly.
Manipulating Chess-GPT's World Model
The author explores how Chess-GPT, a language model for chess, can improve its performance by manipulating its internal understanding of player skill and board state. By using linear probes and skill interventions, the model's chess-playing ability was significantly enhanced, especially in games with random initializations. The findings suggest that Chess-GPT learns a deeper understanding of chess rather than just memorizing patterns.
spikedoanz/from-bits-to-intelligence: machine learning stack in under 100,000 lines of code
The text discusses building a machine learning stack in under 100,000 lines of code with hardware, software, tensors, and machine learning components. It outlines the required components like a CPU, GPU, storage, C compiler, Python runtime, operating system, and more. The goal is to simplify the machine learning stack while providing detailed steps for implementation in different programming languages.
Twitter's Recommendation Algorithm
Twitter uses a recommendation algorithm to select the top tweets for users' timelines. The algorithm is based on core models and features that extract information from tweet, user, and engagement data. The recommendation pipeline consists of three main stages: candidate sourcing, ranking, and applying heuristics and filters. Twitter uses both in-network and out-of-network sources to find relevant tweets, and employs embedding spaces to determine content similarity. The final step involves blending tweets with other non-tweet content before sending them to users' devices. The goal of Twitter's open source endeavor is to provide transparency to users about how the recommendation system works.
compiler_construction
Building a compiler can be straightforward by breaking the development into small steps and using Scheme as the implementation language. The tutorial focuses on translating a subset of Scheme to assembly code, with a step-by-step approach to achieve a fully working compiler. Testing and refining the compiler incrementally leads to a powerful tool capable of compiling an interactive evaluator.
Recommender Systems: A Primer
Personalized recommendations have become a common feature of modern online
services, including most major e-commerce sites, media platforms and social
networks. Today, due to their high practical relevance, research in the area of
recommender systems is flourishing more than ever. However, with the new
application scenarios of recommender systems that we observe today, constantly
new challenges arise as well, both in terms of algorithmic requirements and
with respect to the evaluation of such systems. In this paper, we first provide
an overview of the traditional formulation of the recommendation problem. We
then review the classical algorithmic paradigms for item retrieval and ranking
and elaborate how such systems can be evaluated. Afterwards, we discuss a
number of recent developments in recommender systems research, including
research on session-based recommendation, biases in recommender systems, and
questions regarding the impact and value of recommender systems in practice.
Introduction to Compilers and Language Design
A compiler translates high-level code to lower-level code, and building one is a common project in computer science education. This book provides a beginner-friendly guide to building a compiler for a C-like language, suitable for undergraduates with programming experience. The author offers free online access to the textbook and related code resources, with options to purchase a physical copy.
Unknown
Hardware prefetching in multicore processors can be too aggressive, wasting resources and impacting performance for co-running threads. Combining hardware and software prefetching can optimize performance by efficiently handling irregular memory accesses. A method described in Paper II offers a low-overhead framework for accurate software prefetching in applications with irregular access patterns.
Using neural nets to recognize handwritten digits
Neural networks can recognize handwritten digits by learning from examples. Sigmoid neurons play a key role in helping neural networks learn. Gradient descent is a common method used for learning in neural networks.
Reader
The Reader API by jina.ai helps extract clean, LLM-friendly text from web content, ensuring high-quality input for AI systems like agents and RAG. It can also search the web for the latest information to keep LLMs up-to-date, improve factuality, and reduce misinformation. Additionally, Reader can read images on webpages and PDFs, providing alt text for images and lightning-fast PDF processing, all available for free with flexible rate limits.
Picsart-AI-Research/LIVE-Layerwise-Image-Vectorization: [CVPR 2022 Oral] Towards Layer-wise Image Vectorization
The text discusses a new method called LIVE for generating SVG images layer by layer to fit raster images. LIVE uses closed bezier paths to learn visual concepts in a recursive manner. Installation instructions and references for the method are provided in the text.
Aggregating Millions of Groups Fast in Apache Arrow DataFusion 28.0.0
Apache Arrow DataFusion version 28.0.0 now offers faster parallel aggregation for queries with many groups. The improvements aim to enhance user experiences by generating insights more efficiently. These enhancements bring DataFusion closer to the grouping speed of DuckDB.
Indices and tables
CompilerGym is a library for reinforcement learning in compiler tasks. It helps ML researchers work on optimization problems and allows system developers to create new tasks for ML research. The goal is to use ML to make compilers faster.
LLM Compiler
The LLM Compiler is a suite of pre-trained models designed for code optimization tasks, based on Code Llama. It has been trained on a large corpus of LLVM-IR and assembly code to enhance compiler behavior understanding. The release of LLM Compiler aims to support further research in compiler optimization for both academia and industry.
Have you tried rubbing a database on it?
HYTRADBOI was a conference featuring lightning talks on innovative uses of databases for solving problems. Talks included topics like building data-centric apps, realtime machine learning, and interactive databases. The event focused on embracing new solutions and fostering professional behavior among attendees.
MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
MLKV introduces Multi-Layer Key-Value sharing to reduce memory usage in transformer decoding. This approach improves efficiency without sacrificing performance on NLP benchmarks. MLKV significantly reduces memory requirements compared to existing methods like Multi-Query Attention.
Understanding_Machine_Learning_-_From_Theory_to_Algorithms
A textbook by Shai Shalev-Shwartz and Shai Ben-David that develops machine learning from its theoretical foundations through to practical algorithms.
Step-by-Step Diffusion: An Elementary Tutorial
An elementary tutorial on diffusion models by Preetum Nakkiran, Arwen Bradley, Hattie Zhou, and Madhu Advani.
Speech-to-text models
Speech-to-text AI enhances communication and accessibility by transcribing spoken words into text accurately and efficiently. Machine learning and AI advancements have significantly improved the accuracy and adaptability of speech-to-text systems. These technologies open up new possibilities for inclusive and effective communication across various industries.
A Mathematical Theory of Communication
The paper extends communication theory by considering noise in the channel, savings from message structure, and channel capacity. It discusses entropy, coding efficiency, channel capacity, noisy channels, equivocation, and optimal information transmission techniques. Examples and theorems are provided to explain the concepts of encoding, channel capacity, and noise in communication systems.
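For reference, the two quantities at the heart of the paper can be stated compactly; this is a standard restatement in modern notation, not a quotation from the text:

```latex
% Entropy of a source emitting symbol i with probability p_i (bits per symbol)
H = -\sum_i p_i \log_2 p_i
% Capacity of a band-limited channel of bandwidth W with signal power S and noise power N
C = W \log_2\!\left(1 + \frac{S}{N}\right)
```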
A Recipe for Training Neural Networks
The text discusses common mistakes in training neural networks and emphasizes the importance of patience and attention to detail for successful deep learning. It provides a recipe for training neural networks, including steps like setting up a training skeleton, visualizing losses, and focusing on regularization and tuning to improve model performance. The text also highlights the value of adding more real data and using ensembles to enhance accuracy.
Writing CUDA Kernels for PyTorch
The text shows the thread distribution on different streaming multiprocessors (SM) in CUDA. Threads are organized into warps, lanes, and specific thread numbers within each SM. This information is crucial for optimizing CUDA kernels in PyTorch.
Multi-Query & Grouped-Query Attention
The text explains how Multi-Query Attention and Grouped-Query Attention reduce the Key-Value Cache size in transformer models while maintaining performance. Multi-Query Attention allows multiple attention heads to share key and value vectors, while Grouped-Query Attention groups these vectors based on a hyperparameter, offering a balance between performance and cache reduction. These techniques help manage memory usage during text generation tasks in transformer models.
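The cache-size arithmetic behind these techniques is easy to sanity-check. A minimal sketch below; the model shape (32 layers, 32 query heads of dimension 128, 8k context, fp16 cache) is an assumed example, not a figure from the article:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Rough KV-cache size: keys + values for every layer and every cached position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 32-layer model with 32 query heads of dim 128 at 8k context (fp16 cache):
print(kv_cache_bytes(32, 32, 128, 8192) / 2**30)  # MHA: one KV head per query head -> ~4.0 GiB
print(kv_cache_bytes(32, 8, 128, 8192) / 2**30)   # GQA: query heads share 8 KV heads -> ~1.0 GiB
print(kv_cache_bytes(32, 1, 128, 8192) / 2**30)   # MQA: a single shared KV head -> ~0.125 GiB
```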
Exploring architectures- Transformers II
The text explains how Transformers utilize queries, keys, and values to calculate self-attention weights for tokens. It details the process of obtaining the self-attention weights and generating output tokens through neural networks. The final steps involve calculating loss using cross-entropy and backpropagating to update the weight parameters.
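As a concrete companion to that description, here is a minimal single-head self-attention computation in NumPy. The dimensions and random weights are placeholders; real Transformers add causal masking, multiple heads, and an output projection:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a (seq_len, d_model) matrix of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # scaled dot-product similarities
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)   # softmax over key positions
    return weights @ V                          # attention-weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                    # 5 toy tokens, d_model = 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (5, 16)
```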
What are Diffusion Models?
Diffusion models slowly add noise to data and then learn to reverse the process to create desired samples. Unlike other models, diffusion models have a fixed procedure and high-dimensional latent variables. Training a diffusion model involves approximating conditioned probability distributions and simplifying the objective function.
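The forward (noising) process mentioned above has a convenient closed form. A minimal sketch, assuming the standard DDPM parameterization and a linear beta schedule (both assumptions, not details quoted from the post):

```python
import numpy as np

def q_sample(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form for a DDPM-style forward process."""
    alpha_bar = np.cumprod(1.0 - betas)[t]      # product of (1 - beta_s) up to step t
    eps = rng.normal(size=x0.shape)             # the Gaussian noise being added
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)           # a common linear noise schedule
x0 = rng.normal(size=(8, 8))                    # a toy "image"
x_t, eps = q_sample(x0, t=500, betas=betas, rng=rng)
```

Training then amounts to asking a network to recover `eps` from `x_t` and `t`.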
Iterative α-(de)Blending: a Minimalist Deterministic Diffusion Model
The paper presents a simple and effective denoising-diffusion model called Iterative α-(de)Blending. It offers a user-friendly alternative to complex theories, making it accessible with basic calculus and probability knowledge. By iteratively blending and deblending samples, the model converges to a deterministic mapping, showing promising results in computer graphics applications.
A high-bias, low-variance introduction to Machine Learning for physicists
This text is an introduction to Machine Learning for physicists, highlighting the natural connections between ML and statistical physics. It explains the use of "energy-based models" inspired by statistical physics in deep learning methods. The discussion includes the application of methods from statistical physics to study deep learning and the efficiency of learning rules.
How diffusion models work: the math from scratch
Diffusion models generate diverse high-resolution images and are different from previous generative methods. Cascade diffusion models and latent diffusion models are used to scale up models to higher resolutions efficiently. Score-based generative models are similar to diffusion models and involve noise perturbations to generate new samples.
MLIR: A Compiler Infrastructure for the End of Moore's Law
MLIR is a versatile compiler infrastructure designed to address software fragmentation and improve compilation for different hardware. It aims to reduce the cost of building domain-specific compilers and facilitate the connection of existing compilers. MLIR offers a standardized approach to code generation and optimization across various application domains and hardware targets.
MLIR — Getting Started
The text is a guide titled "MLIR — Getting Started" by Math ∩ Programming available on www.jeremykun.com.
The Annotated Transformer
The text discusses the architecture and training of a Transformer model.
It explains the use of self-attention and feed-forward networks in the encoder and decoder.
The model is demonstrated through examples of prediction and visualization of attention mechanisms.
Auto-Regressive Next-Token Predictors are Universal Learners
Simple linear next-token predictors can efficiently approximate any function computable by a Turing machine. Even basic models like linear networks and shallow Multi-Layer Perceptrons show strong performance on tasks like text generation and arithmetic. By leveraging auto-regressive learning, these models can achieve impressive results in solving complex tasks.
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
Meta's LLaMA family has become one of the most powerful open-source Large
Language Model (LLM) series. Notably, LLaMA3 models have recently been released
and achieve impressive performance across a range of tasks thanks to super-large-scale
pre-training on over 15T tokens of data. Given the wide application of low-bit
quantization for LLMs in resource-limited scenarios, we explore LLaMA3's
capabilities when quantized to low bit-width. This exploration holds the
potential to unveil new insights and challenges for low-bit quantization of
LLaMA3 and other forthcoming LLMs, especially in addressing the performance
degradation problems encountered in LLM compression. Specifically, we evaluate
10 existing post-training quantization and LoRA fine-tuning methods of
LLaMA3 on 1-8 bits and diverse datasets to comprehensively reveal LLaMA3's
low-bit quantization performance. Our experiment results indicate that LLaMA3
still suffers non-negligible degradation in these scenarios, especially in
ultra-low bit-width. This highlights the signif...
New Scaling Laws for Large Language Models
DeepMind's new paper challenges existing scaling laws for training large language models, proposing more optimal use of compute resources. By training a smaller 70-billion parameter model using their new scaling laws, DeepMind demonstrated superior performance compared to larger models like GPT-3 and their own 270-billion parameter model. This discovery may lead to more cost-effective and efficient training of large language models in the future.
Binary Magic: Building BitNet 1.58bit Using PyTorch from Scratch
The document discusses the creation of a 1.58bit model called BitNet using PyTorch from scratch, which can rival full precision LLMs. Quantization, the process of representing float numbers with fewer bits, is explained as a method to increase the speed and reduce the RAM consumption of ML models, albeit with some loss of accuracy. BitNet differs from existing quantization approaches as it trains the model from scratch with quantization, offering a unique quantization algorithm and implementation in PyTorch. Results from experiments with custom PyTorch implementations show that the 2bit and 1bit variants of models perform as well as full precision models, demonstrating the potential of this approach.
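The core quantizer is small. The sketch below roughly follows the absmean ternarization described for BitNet b1.58; treat it as an illustrative sketch, not the article's or the authors' training code (in training the quantization is applied on the fly with a straight-through estimator):

```python
import torch

def ternarize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize weights to {-1, 0, 1} with a per-tensor scale (absmean-style)."""
    scale = w.abs().mean().clamp(min=eps)       # average magnitude sets the scale
    w_q = (w / scale).round().clamp(-1, 1)      # snap each weight to -1, 0, or 1
    return w_q, scale

w = torch.randn(256, 256)
w_q, scale = ternarize(w)
w_hat = w_q * scale                             # dequantized weights used in the matmul
print(w_q.unique(), (w - w_hat).abs().mean())
```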
king - man + woman is queen; but why?
The text explains how the word2vec algorithm transforms words into vectors for analyzing similarities and relationships between words. By using vector arithmetic, it can find analogies such as "king - man + woman = queen." Understanding word co-occurrences can provide insight into the meaning of words through the distributional hypothesis.
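The analogy arithmetic itself is just vector addition plus a nearest-neighbour search. In the sketch below, `vocab` and `vectors` are random placeholders, so the printed answer is meaningless; with embeddings actually trained by word2vec, the nearest neighbour of vec(king) - vec(man) + vec(woman) is expected to be "queen":

```python
import numpy as np

def analogy(a, b, c, vectors, vocab):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = vectors[vocab[b]] - vectors[vocab[a]] + vectors[vocab[c]]
    sims = vectors @ target / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(target))
    for idx in sims.argsort()[::-1]:            # best cosine match, skipping the query words
        word = list(vocab)[idx]
        if word not in (a, b, c):
            return word

vocab = {"king": 0, "man": 1, "woman": 2, "queen": 3}   # placeholder vocabulary
vectors = np.random.default_rng(0).normal(size=(4, 50)) # placeholder embeddings
print(analogy("man", "king", "woman", vectors, vocab))
```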
1-bit Model
Quantizing small models like Llama2-7B at 1-bit yields poor performance but fine-tuning with low-rank adapters significantly improves output quality. The HQQ+ approach shows potential in extreme low-bit quantization for machine learning models, reducing memory and computational requirements while maintaining performance. Training larger models with extreme quantization can lead to superior performance compared to training smaller models from scratch.
Heatmaps and CNNs Using Fast.ai
The text discusses heatmaps, CNNs, and their relationship in deep learning. It explains how heatmaps are generated using Grad-CAM heatmaps from the final layer of a Convolutional Neural Network. The article also touches on creating heatmaps using Adaptive Pooling layers and interpreting top losses for model evaluation.
Where do LLMs spend their FLOPS?
LLMs (large language models) spend their FLOPS (floating point operations) on various tasks, including computing QKV (query, key, value) matrices, attention output matrices, and running the feed-forward network (FFN). The attention mechanism plays a crucial role in LLMs, even though the FLOPS required for attention calculations are relatively small. The KV cache, which stores information for each token, requires significant memory but is necessary for generating sequences. Different architectural choices, such as grouped query attention and sliding window attention, can affect the size and efficiency of the KV cache. Increasing the number of layers in an LLM linearly scales the FLOPS and parameters, while increasing the model width quadratically scales the model size. Wider models parallelize better, while deeper models increase inference time linearly.
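Those claims can be reproduced with back-of-the-envelope arithmetic. The sketch below uses the usual "2 FLOPs per multiply-accumulate" convention and an assumed 4096-dim, 32-layer model at 4k context; the constants are illustrative, not figures from the article:

```python
def flops_per_token_per_layer(d_model, context_len, ffn_mult=4):
    """Very rough forward-pass FLOPs for one decoder layer, per generated token."""
    qkv_and_out = 8 * d_model**2                 # four d x d projections (Q, K, V, output)
    attention = 4 * d_model * context_len        # QK^T scores plus the weighted sum over V
    ffn = 2 * 2 * ffn_mult * d_model**2          # up- and down-projection matmuls
    return qkv_and_out + attention + ffn

d, n_layers, ctx = 4096, 32, 4096
total = n_layers * flops_per_token_per_layer(d, ctx)
print(f"~{total / 1e9:.1f} GFLOPs per token; attention is "
      f"{n_layers * 4 * d * ctx / total:.1%} of that")
```

With these numbers the attention matmuls are only about 14% of the per-token FLOPs, which is the article's point: the KV cache dominates memory, not compute.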
The Annotated Diffusion Model
A neural network learns to denoise data by gradually removing noise. The process involves adding noise to an image and then training the network to reverse the denoising. The network predicts noise levels based on corrupted images at different time steps.
Defusing Diffusion Models
This post explains the concepts of forward and reverse diffusion processes in diffusion models. By understanding these processes, readers can train diffusion models to generate samples from target distributions effectively. Guided diffusion models are also discussed, showing how conditioning information can be used to guide the diffusion process for specific outcomes.
The Illustrated Stable Diffusion
AI image generation with Stable Diffusion involves an image information creator and an image decoder. Diffusion models use noise and powerful computer vision models to generate aesthetically pleasing images. Text can be incorporated to control the type of image the model generates in the diffusion process.
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Sparse autoencoders help identify clear and understandable features in language models by tackling the issue of polysemanticity. By using sparse autoencoders, researchers can pinpoint specific features responsible for certain behaviors in neural networks more effectively than other methods. This approach may lead to increased transparency and control over language models in the future.
KAN: Kolmogorov–Arnold Networks
Kolmogorov-Arnold Networks (KANs) have learnable activation functions on edges, outperforming Multilayer Perceptrons (MLPs) in accuracy and interpretability. KANs show faster neural scaling laws than MLPs, leveraging splines and MLPs to improve accuracy and interpretability. KANs can represent functions effectively and display more favorable scaling curves than MLPs, especially in high-dimensional examples.
KAN: Kolmogorov-Arnold Networks
KANs outperform MLPs in accuracy and interpretability by using learnable activation functions on edges. They have faster neural scaling laws and can represent special functions more efficiently. KANs offer a promising alternative to MLPs in various applications, showcasing improved performance and interpretability.
Root Mean Square Layer Normalization
The text discusses Root Mean Square Layer Normalization (RMSNorm), proposed by Biao Zhang and Rico Sennrich: a simplification of LayerNorm that rescales activations by their root mean square and drops the mean-centering step. The paper is available on arxiv.org.
Root Mean Square Layer Normalization
Layer normalization (LayerNorm) has been successfully applied to various deep
neural networks to help stabilize training and boost model convergence because
of its capability in handling re-centering and re-scaling of both inputs and
weight matrix. However, the computational overhead introduced by LayerNorm
makes these improvements expensive and significantly slows the underlying
network, e.g. RNN in particular. In this paper, we hypothesize that
re-centering invariance in LayerNorm is dispensable and propose root mean
square layer normalization, or RMSNorm. RMSNorm regularizes the summed inputs
to a neuron in one layer according to root mean square (RMS), giving the model
re-scaling invariance property and implicit learning rate adaptation ability.
RMSNorm is computationally simpler and thus more efficient than LayerNorm. We
also present partial RMSNorm, or pRMSNorm where the RMS is estimated from p% of
the summed inputs without breaking the above properties. Extensive experiments
on several tasks using diverse...
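A minimal PyTorch rendering of what the abstract describes (per-feature gain, no mean subtraction, no bias); a sketch for intuition rather than the authors' reference implementation, and it omits the partial pRMSNorm variant:

```python
import torch

class RMSNorm(torch.nn.Module):
    """Normalize by the root mean square of the inputs; no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-8):
        super().__init__()
        self.g = torch.nn.Parameter(torch.ones(dim))   # learned per-feature gain
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.g * x / rms

x = torch.randn(2, 10, 512)
print(RMSNorm(512)(x).shape)                           # torch.Size([2, 10, 512])
```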
Pattern Recognition and Machine Learning
The content discusses likelihood functions for Gaussian distributions, maximizing parameters using observed data, Bayesian model comparison, mixture density networks, and EM algorithm for Gaussian mixtures. It covers topics like posterior distributions, predictive distributions, graphical models, and variational inference. The material emphasizes probability distributions, optimization, and model comparison.
Generative Agents: Interactive Simulacra of Human Behavior
The content discusses generative agents that simulate believable human behavior for interactive applications. These agents populate a sandbox environment, interact with each other, plan their days, form relationships, and exhibit emergent social behaviors. The paper introduces a novel architecture that allows agents to remember, retrieve, reflect, and interact dynamically.
Three Decades of Activations: A Comprehensive Survey of 400 Activation Functions for Neural Networks
The text is a comprehensive survey of 400 activation functions for neural networks. It provides numerous URLs and DOIs for further reading and reference. The authors are Vladimír Kunc and Jiří Kléma.
Revisiting Deep Learning as a Non-Equilibrium Process
The document discusses the nature of Deep Learning systems, highlighting differences from traditional machine learning systems and challenging common misconceptions. It emphasizes the complexity and non-convexity of Deep Learning, noting that optimization techniques alone cannot explain its success. The text critiques the field for lacking in-depth exploration of the true nature of Deep Learning, pointing out a tendency towards superficial explanations and reliance on celebrity figures rather than rigorous scientific inquiry. It delves into the use of Bayesian techniques, the role of noise, and the importance of architecture in Deep Learning, arguing for a deeper understanding of the underlying processes and the need for more precise language and theoretical exploration.
Dissipative Adaptation: The Origins of Life and Deep Learning
The document explores the concept of Dissipative Adaptation, drawing parallels between the emergence of life and the mechanisms of Deep Learning. It discusses the work of Jeremy England and his theory of non-equilibrium statistical mechanics known as Dissipative Adaptation, which explains the self-organizing behavior of Deep Learning. The text delves into how neural networks evolve through training, emphasizing the role of external observations in driving the system towards minimizing entropy. It contrasts the mechanisms of Dissipative Adaptation with current Deep Learning architectures, highlighting similarities in alignment of components to maximize energy dissipation or information gradient.
The Art of Embeddings: Transforming Text for Vector Databases (Part 2)
Embeddings are a crucial component of transforming text into vectors in vector databases. They capture rich context and make data more useful by capturing meaning and context in a machine-readable format. Tokenization is the first step in the embedding process, where text is broken down into smaller parts or tokens. Word2Vec is a popular method that creates dense vector representations of word features based on context. However, it has limitations such as struggling with polysemy and out-of-vocabulary words. Sub-word tokenization is a hybrid approach that can handle these limitations by decomposing words into meaningful sub-words. Transformer models, such as BERT, are used to transform tokenized words into embeddings by leveraging self-attention mechanisms and positional encodings. The choice of tokenization method can significantly affect the size and effectiveness of the embeddings, including vocabulary size, handling of out-of-vocabulary words, and overall quality and usefulness of the embeddings. Choosing th...
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks
The text discusses a method called Parameter-Efficient Sparsity Crafting (PESC) that enhances sparse models for natural language processing tasks. PESC involves integrating adapters into sparse models, improving performance without changing individual weights. The approach outperforms other sparse models and even competes with GPT-3.5 in various tasks.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
The article introduces a new era of 1-bit Large Language Models (LLMs) that can significantly reduce the cost of LLMs while maintaining their performance. BitNet b1.58 is a 1.58-bit LLM variant in which every parameter is ternary, taking on values of {-1, 0, 1}. It retains all the benefits of the original 1-bit BitNet, including its new computation paradigm, which requires almost no multiplication operations for matrix multiplication and can be highly optimized. Moreover, BitNet b1.58 offers two additional advantages: its modeling capability is stronger due to its explicit support for feature filtering, and it can match full precision (i.e., FP16) baselines in terms of both perplexity and end-task performance at a 3B size.
Hypercomputation
Hypercomputation and super-Turing computation involve models of computation that can produce non-Turing-computable outputs. Introduced in the early 1990s, super-Turing computing is inspired by neurological and biological systems and serves as the foundation for Lifelong Machine Learning. Hypercomputation, a field introduced in the late 1990s, includes philosophical constructs and aims to compute functions beyond what a Turing machine can. The Church-Turing thesis states that any "computable" function can be computed by a Turing machine, but hypercomputers can compute functions that are not computable in the Church-Turing sense. Various hypercomputer models exist, ranging from theoretical concepts like oracle machines to more plausible models like quantum computing. Some proposals suggest that hypercomputation may be achievable through systems like neural networks or analog computers. Critics argue that hypercomputation is not physically realizable.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Recent research is leading to a new era of 1-bit Large Language Models (LLMs), such as BitNet, introducing a variant called BitNet b1.58 where every parameter is ternary {-1, 0, 1}. This model matches the performance of full-precision Transformer LLMs while being more cost-effective in terms of latency, memory, throughput, and energy consumption. The 1.58-bit LLM sets a new standard for training high-performance and cost-effective models, paving the way for new computation methods and specialized hardware designed for 1-bit LLMs.
Glossary of Deep Learning: Word Embedding
Word embedding is a method that transforms text into numerical vectors for machine learning algorithms to process efficiently. These vectors are created to represent words or phrases as real numbers, focusing on dimensionality reduction and contextual similarity. Word2Vec is a popular algorithm that implements this approach using techniques like CBOW and Skip-gram to predict target words based on their context. While word embeddings are not deep learning themselves, they provide a way for deep nets to interpret and understand natural language, offering a new understanding of language as numbers.
How to Use t-SNE Effectively
t-SNE plots can be useful for visualizing high-dimensional data, but they can also be misleading if not interpreted correctly. The technique creates 2D "maps" of data with many dimensions, but these images can be misread. The perplexity parameter, which balances attention between local and global aspects of the data, has a significant impact on the resulting plots. Different perplexity values may be needed to capture different aspects of the data. t-SNE plots can equalize cluster sizes and distort distances between clusters, making it difficult to interpret relative sizes and distances. It's important to recognize random noise and avoid misinterpreting it as meaningful patterns. t-SNE plots can show some shapes accurately, but local effects and clumping can also affect the interpretation. For topological information, multiple plots at different perplexities may be required. Overall, using t-SNE effectively requires understanding its behavior and limitations.
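In practice the article's main advice, re-running the embedding at several perplexities, is a short loop. A sketch with scikit-learn, using random data as a stand-in for real features:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(500, 50))    # stand-in for real high-dim features
for perplexity in (5, 30, 100):                        # compare local vs. global emphasis
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    print(perplexity, emb.shape)                       # plot each 2D embedding side by side
```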
Temperature as Joules per Bit
The text discusses the concept of temperature and entropy in terms of information theory, suggesting that entropy should be measured in bits rather than joules per kelvin. It highlights the importance of information in thermodynamics and how Landauer's principle relates to the cost of erasing information. The authors advocate for viewing energy and entropy as more fundamental than temperature, emphasizing the duality between energy and information.
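The quantitative core of the "joules per bit" view is compact; as a standard reminder (not a quotation from the paper):

```latex
% Landauer bound: minimum heat dissipated to erase one bit at temperature T
E_{\min} = k_B T \ln 2 \approx 2.9 \times 10^{-21}\,\text{J at } T = 300\,\text{K}
% Measuring entropy S in bits turns temperature into an energy cost per bit:
T_{\text{bits}} = \frac{\partial E}{\partial S_{\text{bits}}} = k_B T \ln 2
```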
Deep Learning Course
This document provides resources for François Fleuret's deep-learning course at the University of Geneva. The course offers a thorough introduction to deep learning, with examples using the PyTorch framework. The materials include slides, recordings, and a virtual machine. The course covers topics such as machine learning objectives, tensor operations, automatic differentiation, gradient descent, and deep-learning techniques. The document also includes prerequisites for the course, such as knowledge of linear algebra, differential calculus, Python programming, and probability and statistics.
Memory in Plain Sight: A Survey of the Uncanny Resemblances between Diffusion Models and Associative Memories
Diffusion Models and Associative Memories show surprising similarities in their mathematical underpinnings and goals, bridging traditional and modern AI research. This connection highlights the convergence of AI models towards memory-focused paradigms, emphasizing the importance of understanding Associative Memories in the field of computation. By exploring these parallels, researchers aim to enhance our comprehension of how models like Diffusion Models and Transformers operate in Deep Learning applications.
Language Modeling Is Compression (arXiv:2309.10668)
This article discusses the relationship between language modeling and compression. The authors argue that large language models can be viewed as powerful compressors due to their impressive predictive capabilities. They demonstrate that these models can achieve state-of-the-art compression rates across different data modalities, such as images and audio. The authors also explore the connection between compression and prediction, showing that models that compress well also generalize well. They conclude by advocating for the use of compression as a framework for studying and evaluating language models.
Memory in Plain Sight: A Survey of the Uncanny Resemblances between Diffusion Models and Associative Memories
Diffusion Models (DMs) have become increasingly popular in generating benchmarks, but their mathematical descriptions can be complex. In this survey, the authors provide an overview of DMs from the perspective of dynamical systems and Ordinary Differential Equations (ODEs), revealing a mathematical connection to Associative Memories (AMs). AMs are energy-based models that share similarities with denoising DMs, but they allow for the computation of a Lyapunov energy function and gradient descent to denoise data. The authors also summarize the 40-year history of energy-based AMs, starting with the Hopfield Network, and discuss future research directions for both AMs and DMs.
GitHub - sst/demo-ai-app: Sample AI movies app built with ❍ Ion
This document provides an overview of the sst/demo-ai-app, a sample movies app built with Ion that demonstrates how to use AI in your apps using your own data. The app includes features such as tagging, related movies, and deep search using natural language. It utilizes the Vector component, which is based on Amazon Bedrock and allows for easy AI integration with your data. The document also highlights the advantages of Ion, including faster deployment and no stack limits. The app works by ingesting movie data from IMDB, generating embeddings, and storing them in a Vector database, which the Next.js app then retrieves.
Measuring Faithfulness in Chain-of-Thought Reasoning
Large language models (LLMs) are more effective when they engage in step-by-step "Chain-of-Thought" (CoT) reasoning, but it is unclear if this reasoning is a faithful explanation of the model's actual process. The study examines how interventions on the CoT affect model predictions, finding that models vary in how strongly they rely on the CoT. The performance boost from CoT does not solely come from added test-time compute or specific phrasing. As models become larger and more capable, they tend to produce less faithful reasoning. The results suggest that faithful CoT reasoning depends on carefully chosen circumstances such as model size and task.
ageron/handson-ml3: A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
The ageron/handson-ml3 project is designed to teach the fundamentals of Machine Learning using Python. It includes example code and exercise solutions from the third edition of the book "Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow." The project provides options for running the notebooks online, using a Docker image, or installing the project on your own machine. It also addresses frequently asked questions about Python versions, SSL errors, and updating the project. The project has received contributions from various individuals, including reviewers, contributors to exercise solutions, and supporters from the Google ML Developer Programs team.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
BERT and RoBERTa have achieved impressive results on sentence-pair regression tasks like semantic textual similarity, but they have a significant computational overhead when comparing large collections of sentences. To address this, Sentence-BERT (SBERT) has been developed as a modification of BERT that uses siamese and triplet network structures to generate semantically meaningful sentence embeddings. SBERT reduces the time required to find the most similar pair from 65 hours with BERT to just 5 seconds, while maintaining accuracy. SBERT outperforms other state-of-the-art sentence embedding methods on various tasks, including STS and transfer learning.
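As a rough usage sketch (not from the paper), this is how sentence embeddings from the sentence-transformers library that grew out of SBERT are typically used; the model name is just a common example checkpoint and is not mentioned in the summary above.

```python
# Hedged example: encode sentences with an SBERT-style model and compare them
# with cosine similarity. The checkpoint name is an illustrative choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example SBERT-style model

sentences = [
    "A man is playing a guitar.",
    "Someone is performing music on a guitar.",
    "The stock market fell sharply today.",
]
embeddings = model.encode(sentences)  # one dense vector per sentence

# Pairwise cosine similarity: semantically close pairs score higher.
print(util.cos_sim(embeddings, embeddings))
```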
Self-Rewarding Language Models
To move toward superhuman language models, the authors propose self-rewarding language models: models that provide their own rewards during training. Instead of relying on a fixed reward model trained from human preferences, the model uses LLM-as-a-Judge prompting to score its own outputs, improving both its instruction-following ability and the quality of the rewards it assigns. A preliminary study fine-tuning Llama 2 70B with this approach shows that it outperforms existing systems on the AlpacaEval 2.0 leaderboard. This work suggests the potential for models that can continually improve along both axes.
Word2vec from Scratch
Word2vec is a technique used to express words as vectors that encode their semantics in a meaningful way. This article discusses how to implement word2vec from scratch using NumPy. The process involves tokenizing the text, creating lookup tables for words and IDs, generating training data in the form of matrices using one-hot vectorization, and building and training the embedding network. The rows of the weight matrix in the network serve as the word embeddings, representing words as dense vectors. The final output of the network is a probability vector that predicts the nearby context words.
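As a minimal sketch of the preprocessing steps described above (not the article's actual code), the snippet below tokenizes a toy corpus, builds word/ID lookup tables, and produces one-hot (center, context) training pairs for a word2vec-style model; the corpus and window size are illustrative.

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog"  # toy text, illustrative only
tokens = corpus.split()

# word <-> id lookup tables
vocab = sorted(set(tokens))
word_to_id = {w: i for i, w in enumerate(vocab)}
id_to_word = {i: w for w, i in word_to_id.items()}

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

# (center, context) one-hot pairs within a +/- 2 word window
window = 2
X, Y = [], []
for pos, word in enumerate(tokens):
    for offset in range(-window, window + 1):
        ctx = pos + offset
        if offset == 0 or ctx < 0 or ctx >= len(tokens):
            continue
        X.append(one_hot(word_to_id[word], len(vocab)))
        Y.append(one_hot(word_to_id[tokens[ctx]], len(vocab)))

X, Y = np.array(X), np.array(Y)  # rows are training examples for the embedding network
```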
MemGPT: Towards LLMs as Operating Systems
MemGPT is a system that manages different memory tiers to provide extended context within the limited context window of large language models (LLMs). Using an OS-inspired design, MemGPT can handle unbounded context using LLMs that have finite context windows. It is successful in domains where existing LLMs' limited context windows severely limit their performance, such as document analysis and multi-session chat. MemGPT supports self-directed editing and retrieval, memory-hierarchy, OS functions, and event-based control flow to manage unbounded context.
Visual Guides to understand the basics of Large Language Models
This article provides a compilation of tools and articles that aim to break down the complicated concepts of Large Language Models (LLMs) in an intuitive way. It acknowledges that many people struggle with understanding the basics of LLMs and offers resources to help solidify their understanding. The article includes a table of contents with links to various resources, such as "The Illustrated Transformer" by Jay Alammar, which provides visualizations to explain the transformer architecture, a fundamental building block of LLMs. The goal is to make the concepts of LLMs easily understood and accessible.
Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs
This article provides a comprehensive understanding and coding guide for self-attention mechanisms in transformer architectures and large language models (LLMs) like GPT-4 and Llama. It covers the concept of self-attention, its importance in NLP, and the implementation of the self-attention mechanism in Python and PyTorch. The article also discusses the scaled dot-product attention, computing unnormalized attention weights, computing attention weights, and computing the context vector. Additionally, it explores multi-head attention and provides code examples for implementing multiple attention heads.
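A minimal NumPy sketch of the scaled dot-product attention the article walks through is below; shapes and variable names are illustrative rather than the article's exact code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # unnormalized attention weights
    weights = softmax(scores, axis=-1)  # normalized attention weights
    return weights @ V                  # context vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))  # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
context = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(context.shape)  # (4, 8): one context vector per token
```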
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Backdoored behavior in AI models is most persistent in the largest models and in models trained with chain-of-thought reasoning about deceiving the training process, and it persists even when that chain-of-thought is distilled away. Adversarial training can actually teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior rather than removing it. Safety training techniques, such as reinforcement learning, are often ineffective at removing backdoors. The study explores different methods for training backdoored models and finds that chain-of-thought backdoors allow models to produce consistent reasoning for their deceptive behavior.
This project is about how to systematically persuade LLMs to jailbreak them.
This project introduces a taxonomy of 40 persuasion techniques to systematically persuade LLMs (large language models) to jailbreak them. Through iterative application of these techniques, the researchers achieved a 92% success rate in jailbreaking advanced LLMs. They also found that more advanced models are more vulnerable to persuasive adversarial prompts (PAPs) and that adaptive defenses can effectively neutralize these prompts. The research highlights the challenges of addressing user-invoked risks from persuasion and the need for further investigation and improved defenses for more capable models.
Pruning vs Quantization: Which is Better?
Neural network pruning and quantization are techniques used to compress deep neural networks. This paper compares the two techniques and provides an analytical comparison of expected quantization and pruning error. The results show that in most cases, quantization outperforms pruning. However, in scenarios with very high compression ratios, pruning may be beneficial. The paper also discusses the hardware implications of both techniques and provides a comparison of pruning and quantization in the post-training and fine-tuning settings.
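A toy illustration of the kind of trade-off the paper analyzes (this is not the paper's analysis): compare the mean-squared error introduced by magnitude pruning versus uniform quantization on random weights; the sparsity level and bit width are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)

# Magnitude pruning: zero out the 75% smallest-magnitude weights.
k = int(0.75 * w.size)
threshold = np.sort(np.abs(w))[k]
w_pruned = np.where(np.abs(w) >= threshold, w, 0.0)

# Uniform quantization to 4 bits (16 levels) over the weight range.
levels = 2 ** 4
lo, hi = w.min(), w.max()
step = (hi - lo) / (levels - 1)
w_quant = np.round((w - lo) / step) * step + lo

print("pruning MSE:     ", np.mean((w - w_pruned) ** 2))
print("quantization MSE:", np.mean((w - w_quant) ** 2))
```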
mlx-examples/lora at main · ml-explore/mlx-examples · GitHub
This document provides an example of using MLX to fine-tune either a Llama 7B or Mistral 7B model with low-rank adaptation (LoRA) for a target task. The example demonstrates using the WikiSQL dataset to train the model to generate SQL queries from natural language. It includes instructions for setup, running the script, fine-tuning the model, evaluating the model, generating output, and dealing with memory issues. The document also provides results from the training process and offers tips for reducing memory consumption during fine-tuning.
Mixtral of Experts
Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model that outperforms or matches other models like Llama 2 70B and GPT-3.5 across various benchmarks. It has the same architecture as Mistral 7B but uses 8 feedforward blocks (experts) in each layer. A router network selects two experts for each token at each layer, allowing for dynamic selection of different experts at each timestep. This results in each token having access to 47B parameters but only using 13B active parameters during inference. Mixtral also offers a fine-tuned model, Mixtral 8x7B - Instruct, which surpasses other models on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
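A minimal sketch of the top-2 routing idea described above (illustrative only, not Mixtral's implementation): each token's gate scores all experts, picks two, and mixes their outputs with softmax weights over the selected pair.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, n_tokens = 8, 16, 4

experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # toy "expert" weights
W_gate = rng.normal(size=(d_model, n_experts))
x = rng.normal(size=(n_tokens, d_model))

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

out = np.zeros_like(x)
for t in range(n_tokens):
    logits = x[t] @ W_gate
    top2 = np.argsort(logits)[-2:]   # indices of the 2 selected experts
    gates = softmax(logits[top2])    # renormalize over the selected experts only
    out[t] = sum(g * (x[t] @ experts[i]) for g, i in zip(gates, top2))

print(out.shape)  # (4, 16): only 2 of the 8 experts are evaluated per token
```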
Paper page - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
The content is a set of instructions on how to cite a specific URL (arxiv.org/abs/2401.01335) in three different types of README.md files, in order to create links from those pages.
WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia
The paper presents WikiChat, a few-shot language model (LLM)-based chatbot that minimizes hallucinations and has high conversationality and low latency. WikiChat is grounded on the English Wikipedia and combines grounded facts with additional information from the corpus to generate factual and engaging responses. The system achieves high factual accuracy and outperforms previous retrieval-based chatbots in terms of informativeness and engagement. The paper also introduces a novel evaluation methodology that combines simulated and real user conversations for assessing the factuality and conversationality of chatbots.
Discovering Language Model Behaviors with Model-Written Evaluations
The article discusses an approach to generating evaluations using language models (LMs) with the help of crowdworkers. The LM-generated evaluations were rated highly relevant, with workers agreeing with 90-100% of their labels. The researchers showcase their approach by generating datasets that test LMs for 154 diverse behaviors related to model personality, politics, ethics, social bias, and risks from advanced AI systems. The generated multiple-choice questions help the researchers to reveal additional instances of inverse scaling with RLHF training, as well as to distinguish when concerning behaviors are likely caused by pretraining or RLHF.
Understanding The Exploding and Vanishing Gradients Problem
The "Understanding The Exploding and Vanishing Gradients Problem" article discusses the vanishing and exploding gradients problem in deep neural networks. It explains how the gradients used to update the weights can shrink or grow exponentially, causing learning to stall or become unstable. The article explores why gradients vanish or explode exponentially and how it affects the backpropagation algorithm during training. It also provides strategies to address the vanishing and exploding gradients problem, such as using the ReLU activation function, weight initialization techniques, and gradient clipping.
Practical Deep Learning for Coders 2022
"Practical Deep Learning for Coders 2022" is a course that covers topics such as building and training deep learning models, deploying models, and using PyTorch and other popular libraries. The course is led by Jeremy Howard, who has extensive experience in machine learning and has created companies that utilize deep learning. The course is suitable for those with at least a year of coding experience and a high school math background. Students will learn how to train models for computer vision, natural language processing, tabular data analysis, and collaborative filtering, and will also learn about the latest deep learning techniques.
fastai/fastbook: The fastai book, published as Jupyter Notebooks
The fastai book, published as Jupyter Notebooks, provides an introduction to deep learning, fastai, and PyTorch. It is copyright Jeremy Howard and Sylvain Gugger, and a selection of chapters is available to read online. The notebooks in the repository are used for a MOOC and form the basis of the book, which is available for purchase. The code in the notebooks is covered by the GPL v3 license, while the other content is not licensed for redistribution or change. It is recommended to use Google Colab to access and work with the notebooks. If there are any contributions or citations, copyright is assigned to Jeremy Howard and Sylvain Gugger.
Elasticsearch 8.x Cookbook: Over 180 recipes to perform fast, scalable, and reliable searches for your enterprise, 5th Edition
The text explains how word2vec uses one-hot encoded vectors and weight matrices to represent words in a neural network model. It details the learning process for updating weights between input, hidden, and output layers based on prediction errors. The update equations for weights are derived through backpropagation to improve the model's ability to predict words within a context.
Attention? Attention!
The document explores the concept of attention, as performed by humans and deep learning algorithms. Attention is used in deep learning to transform one input sequence into another and is accomplished through an encoder-decoder architecture with LSTM or GRU units. The attention mechanism, invented to address the limitations of the fixed-length context vector, creates shortcuts between the context vector and the entire source input. Attention mechanisms vary in form, from soft or hard to global or local. The document also introduces self-attention, which relates different positions of a single sequence to compute a representation of the same sequence, and the Neural Turing Machine, a model architecture for coupling a neural network with external memory storage.
An Intuition for Attention
The transformer neural network, used by models like ChatGPT, incorporates an attention mechanism to improve performance. Attention is a key feature of transformers and is defined by an equation that involves the softmax function. Attention can take different forms, but scaled dot-product attention is the most common. This attention mechanism is based on the idea of key-value lookups, where a query is matched against keys to retrieve corresponding values. The attention scores, which determine how much weight is given to each key-value pair, are computed using dot-product similarity and normalized by the softmax function into weights that sum to one. This process allows for meaningful and efficient processing of queries in large language models.
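A tiny numeric illustration of the "soft lookup" intuition above: one query is scored against a few keys, the scores become softmax weights, and the result is a weighted average of the values. All numbers are made up.

```python
import numpy as np

query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0],   # very similar to the query
                 [0.0, 1.0],   # orthogonal to the query
                 [0.7, 0.7]])  # partially similar
values = np.array([[10.0], [20.0], [30.0]])

scores = keys @ query                            # dot-product similarity per key
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: weights sum to 1
print(weights)                                   # how much "attention" each value gets
print(weights @ values)                          # the blended result of the lookup
```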
Pen and Paper Exercises in Machine Learning
This is a collection of (mostly) pen-and-paper exercises in machine learning. The exercises cover the following topics: linear algebra, optimisation, directed graphical models, undirected graphical models, expressive power of graphical models, factor graphs and message passing, inference for hidden Markov models, model-based learning (including ICA and unnormalised models), sampling and Monte-Carlo integration, and variational inference.
Transformers From Scratch
This blog provides a step-by-step guide on creating and training a transformer from scratch. The author explains each foundational element and provides a Jupyter notebook with the code for readers to run and experiment with. The blog references a YouTube video and the Attention Is All You Need paper for further understanding. The author also mentions the availability of the final code and a dataset for download.
Linear Algebra
Linear algebra is a fundamental topic in understanding and working with machine learning algorithms, especially deep learning algorithms. This chapter provides an introduction to scalars, vectors, matrices, and tensors, which are the key mathematical objects in linear algebra. It explains the concepts and notation used in linear algebra, such as matrix multiplication, transpose, identity and inverse matrices, and norms. The chapter also introduces special kinds of matrices and vectors, such as diagonal matrices, orthogonal matrices, and eigenvalues and eigenvectors. These concepts are important for analyzing and solving equations in machine learning.
Mathematics for Machine Learning
An overview of gradient descent optimization algorithms
The text provides an overview of gradient descent optimization algorithms commonly used in deep learning. It explains different types of gradient descent methods like batch, stochastic, and mini-batch, highlighting their strengths and challenges. The author also discusses advanced algorithms such as Adagrad, RMSprop, and Adam, which adapt learning rates to improve optimization performance.
An overview of gradient descent optimization algorithms
The article provides an overview of gradient descent optimization algorithms, which are often used as black-box optimizers. The article outlines the three variants of gradient descent and summarizes the challenges. The article then introduces some widely used algorithms to deal with the challenges, including Nesterov accelerated gradient, Adagrad, Adadelta, and RMSprop. The article explains how these algorithms work and their benefits and weaknesses.
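A minimal NumPy sketch of one of the adaptive methods discussed above (Adam), applied to a toy quadratic; the hyperparameters are the commonly cited defaults, and this is an illustration rather than the article's code.

```python
import numpy as np

def grad(theta):
    return 2 * (theta - 3.0)  # gradient of (theta - 3)^2

theta = np.array([0.0])
m = np.zeros_like(theta)  # first-moment (mean of gradients) estimate
v = np.zeros_like(theta)  # second-moment (uncentered variance) estimate
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the running averages
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step

print(theta)  # converges toward the minimum at 3.0
```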
How GPT3 Works - Visualizations and Animations
The tech world is abuzz with GPT3 hype. Massive language models (like GPT3) are starting to surprise us with their abilities. While not yet completely reliable for most businesses to put in front of their customers, these models are showing sparks of cleverness that are sure to accelerate the march of automation and the possibilities of intelligent computer systems. The post removes the aura of mystery around GPT3 and explains how it is trained and how it works: a trained language model generates text; we can optionally pass it some text as input, which influences its output; and the output is generated from what the model "learned" during its training period, where it scanned vast amounts of text.
GPT in 60 Lines of NumPy
This post outlines how to implement a GPT (Generative Pre-trained Transformer) from scratch in just 60 lines of NumPy, including loading the trained GPT-2 model weights released by OpenAI and generating text. A GPT generates text given a prompt; the task of predicting the next word in a sequence is called language modeling. The post explains how to train a GPT using gradient descent on the cross-entropy loss over the language modeling task. It also touches on prompting and how to handle hyperparameters.
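A skeleton of the autoregressive loop the post describes (not its actual 60-line GPT): the model maps a token sequence to next-token logits, and decoding repeatedly appends the most likely token. `dummy_gpt` is a stand-in for the trained model.

```python
import numpy as np

vocab_size = 50

def dummy_gpt(token_ids):
    """Stand-in for a trained GPT: returns logits over the vocabulary
    for every position in the input sequence."""
    rng = np.random.default_rng(sum(token_ids))  # deterministic toy logits
    return rng.normal(size=(len(token_ids), vocab_size))

def generate(prompt_ids, n_new_tokens):
    ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        logits = dummy_gpt(ids)               # (seq_len, vocab_size)
        next_id = int(np.argmax(logits[-1]))  # greedy: best token at the last position
        ids.append(next_id)
    return ids

print(generate([1, 2, 3], n_new_tokens=5))
```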
Tensor2Tensor Intro
The Annotated Transformer
"The Annotated Transformer" is a paper that introduces a new architecture for natural language processing tasks, with a focus on translation. The paper provides an annotated version of the original paper, giving a line-by-line implementation of the model. The Transformer model relies on self-attention to compute representations of its input and output without using sequence-aligned recurrent neural networks or convolutions. The model consists of an encoder and decoder stack, each containing self-attention layers and position-wise feed-forward networks. The paper also discusses the use of multi-head attention and positional encoding in the model. The model is trained using the WMT 2014 English-German dataset and the Adam optimizer.
The Illustrated Transformer
"The Illustrated Transformer" is a comprehensive guide to understanding the Transformer model, which utilizes attention to improve the training speed of neural machine translation models. The model consists of stacked encoders and decoders, with each encoder and decoder having self-attention layers. Self-attention allows the model to incorporate information from other words in the input sequence, resulting in better encoding. The model also employs multi-headed attention, which allows it to focus on different positions and creates multiple sets of Query/Key/Value weight matrices. Positional encoding is used to account for the order of words in the input sequence. The architecture includes residual connections and layer normalization for each sub-layer.
GitHub - tensorflow/nmt: TensorFlow Neural Machine Translation Tutorial
TensorFlow Neural Machine Translation Tutorial. Contribute to tensorflow/nmt development by creating an account on GitHub.
What Are Word Embeddings for Text?
Word embeddings are a way to represent words with similar meanings in a similar manner using real-valued vectors. They are a key advancement in deep learning for natural language processing tasks. You can either train your own word embeddings or use pre-trained ones for your projects.
Deep Learning for Natural Language Processing
This is the landing page for "Deep Learning for Natural Language Processing", a resource on developing deep learning models for natural language problems. It argues that working with text is important, under-discussed, and hard: we are awash with text from books, papers, blogs, tweets, news, and increasingly from spoken utterances, and every day the author gets questions asking how to develop machine learning models for text data.
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
The article explains the mechanics of sequence-to-sequence models, which are deep learning models used for machine translation, text summarization, and image captioning. The article includes visualizations to explain the concepts and requires some previous understanding of deep learning. The article also discusses attention models, which improve machine translation systems by allowing the model to focus on relevant parts of the input sequence. The article provides examples of how attention models work and concludes with a link to TensorFlow's Neural Machine Translation tutorial.
The Random Transformer
This blog post provides an end-to-end example of the math within a transformer model, with a focus on the encoder part. The goal is to understand how the model works, and to make it more manageable, simplifications are made and the dimensions of the model are reduced. The post recommends reading "The Illustrated Transformer" blog for a more intuitive explanation of the transformer model. The prerequisites for understanding the content include basic knowledge of linear algebra, machine learning, and deep learning. The post covers the math within a transformer model during inference, attention mechanisms, residual connections and layer normalization, and provides some code to scale it up.
GitHub - SkalskiP/courses: This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI)
SkalskiP/courses is a curated collection of links to various courses and resources about Artificial Intelligence (AI). It includes courses on topics such as generative AI, deep learning, natural language processing, computer vision, machine learning, and more. The repository aims to provide a comprehensive resource for beginners and experienced learners alike. Contributions from the community are encouraged to make the repository even better.
CS25: Transformers United V3
Transformers have revolutionized Natural Language Processing (NLP) and are now being applied in various fields, including Computer Vision, Reinforcement Learning, and Speech. This seminar explores the details of how Transformers work and their applications, with a focus on large language models (LLMs). The seminar includes instructor and guest lectures from experts in Transformers research. The schedule includes topics such as the creation of fine-tuned chat models, low-level embodied intelligence with foundation models, and training helpful chatbots. The seminar also covers the motivations behind Transformers, scaling human-centered machine translation, and going beyond LLMs to explore emergent abilities and intermediate-guided reasoning.
openai/whisper-large-v2
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and demonstrates strong generalization abilities without the need for fine-tuning. The large-v2 model, trained for 2.5x more epochs with added regularization, offers improved performance. The models can be used for transcription and translation tasks, with context tokens indicating the language and task. While the models show robustness and accuracy in many languages, they may exhibit limitations such as generating repetitive texts and hallucinations. The models have potential applications in accessibility tools but also raise concerns about dual use and surveillance capabilities.
Text Summarization: How to Calculate BertScore
BERTScore is a metric used to measure the quality of text summarization by calculating the similarity between the summary and the original text. It addresses issues that n-gram-based metrics face, such as incorrect matching of paraphrases and the inability to capture long-range dependencies. The BERTScore architecture involves contextual embeddings, cosine similarity, token matching for precision and recall, importance weighting, and baseline rescaling. The metric has the potential to improve various natural language processing tasks and can be applied in domains such as translation quality assessment, text generation, and document comparison. Future developments include broader language coverage and adaptation for multilingual texts.
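A toy sketch of the token-matching step described above: given stand-in contextual token embeddings for a candidate and a reference, greedy cosine matching yields precision and recall, which combine into an F1-style score. Real BERTScore uses BERT embeddings plus importance weighting and baseline rescaling, which are not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
cand = rng.normal(size=(5, 32))  # 5 candidate tokens, 32-dim stand-in "embeddings"
ref = rng.normal(size=(7, 32))   # 7 reference tokens

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

sim = normalize(cand) @ normalize(ref).T  # pairwise cosine similarities, shape (5, 7)

precision = sim.max(axis=1).mean()  # each candidate token matched to its best reference token
recall = sim.max(axis=0).mean()     # each reference token matched to its best candidate token
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```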
Some Core Principles of Large Language Model (LLM) Tuning
Large Language Models (LLMs) like GPT2 and GPT3 are trained using unsupervised pre-training on billions to trillions of tokens. After pre-training, the models are fine-tuned for specific use cases such as chatbots or content generation. Fine-tuning can be done through supervised fine-tuning (SFT) or reinforcement learning with human feedback (RLHF). SFT involves minimizing the loss between the model's output and the correct result, while RLHF uses a reward model to optimize the model's performance. InstructGPT is an RLHF-tuned version of GPT3 that is trained to follow instructions and provide aligned responses. There are also open-source alternatives to GPT models, such as GPT-J and GPT-Neo.
An intuitive introduction to text embeddings
Text embeddings are essential in natural language processing (NLP) and convert text into vector coordinates. They allow us to understand the semantic meaning of words and sentences by representing them as vectors in a high-dimensional latent space. By using text embeddings, we can capture the similarity between texts and perform tasks such as search and classification more efficiently. There are various algorithms and models, such as Word2vec and transformers, that help us generate text embeddings and capture the sequential nature of text. These advancements in text embeddings have greatly improved our ability to reason intuitively about NLP and other machine learning models.
VOYAGER: An Open-Ended Embodied Agent with Large Language Models
The article presents VOYAGER, an embodied agent that continuously explores the Minecraft world, acquires skills, and makes new discoveries without human intervention. VOYAGER consists of three key components: an automatic curriculum for exploration, a skill library for storing and retrieving complex behaviors, and an iterative prompting mechanism for program improvement. The agent utilizes Large Language Models (LLMs) and code as the action space, allowing it to represent temporally extended and compositional actions. The article also highlights VOYAGER's superior performance in discovering novel items, unlocking the Minecraft tech tree, and applying its learned skill library to unseen tasks in a newly instantiated world.
Subcategories
- applications (15)
- computer_architecture (1)
- ethics (1)
- expert_systems (2)
- game_ai (5)
- knowledge_representation (4)
- machine_learning (324)
- natural_language_processing (3)
- planning_and_scheduling (2)
- robotics (2)
- software_development (1)
- theory (1)