Bookmarks
DeepSeek Debrief: >128 Days Later – SemiAnalysis
SemiAnalysis is hiring an analyst in New York City for Core Research, our world-class research product for the finance industry. Please apply here. It’s been a bit over 150 days since the launc…
FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention
Although these fused attention implementations have substantially improved performance and enabled long contexts, this efficiency has come with a loss of flexibility.
The Illustrated AlphaFold
A visual walkthrough of the AlphaFold3 architecture, with more details and diagrams than you were probably looking for.
Continuous Thought Machines
Introducing Continuous Thought Machines: a new kind of neural network model that unfolds and uses neural dynamics as a powerful representation for thought.
Activation Atlas
By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned and what concepts it typically represents.
DeepSeek-V3 Explained 1: Multi-head Latent Attention
Key architecture innovation behind DeepSeek-V2 and DeepSeek-V3 for faster inference
You could have designed state of the art positional encoding
attention is logarithmic, actually
time complexity is a very bad model when working with parallelism. in which i make the case for work-depth analysis instead of time complexity.
How To Scale
While there are already excellent posts on scaling, I wanted to share my own understanding and the things I've learned over the past few months, and hopefully spark some discussion. I hope this post can shed some light for anyone navigating the challenges of scaling up neural networks. There may be mistakes or inaccuracies, so if you want to correct me or would like to discuss further, please feel free to DM me on X or leave a comment.
Deep Dive into Yann LeCun’s JEPA
ML blog.
Are Transformers universal approximators of sequence-to-sequence functions?
Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other simpler alternatives to self-attention layers and empirically evaluate them.
a Hugging Face Space by nanotron
The ultimate guide to training LLMs on large GPU clusters
Training Large Language Models to Reason in a Continuous Latent Space
Large language models (LLMs) are restricted to reason in the “language space”, where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem.
Multi-layer language heads: the output latent is for text (and nothing else)
The last layer’s hidden state in a transformer is meant only for being decoded into token probabilities. Don’t use it for autoregressive image generation. Don’t use it for looped latent transformers. Only use it to produce the next token in a language model. It is a compressed representation of the...
CS336: Language Modeling from Scratch
Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general purpose system address a range of downstream tasks.
Contextualization Machines
diffusion transformers
Metaphorically, you can think of Vision Transformers as the eyes of the system, able to understand and contextualize what it sees, while Stable Diffusion is the hand of the system, able to generate and manipulate images based on this understanding.
Advanced Performance Optimizations for Models
TT-NN operator library and TT-Metalium low-level kernel programming model. - tenstorrent/tt-metal
Softmax Attention is a Fluke
Calibrated Attention (NanoGPT). Attention is the magic ingredient of modern neural networks. It is the core of what has launched performant language models into the spotlight, starting with GPT, and since then it has extended its hands across all modalities. There are a number of desirable properties that make attention a first-class building block: it handles variable sequence lengths with ease, and it allows for a global receptive field without needing to scale parameters.
Transformers Laid Out
I have found that there are mainly three types of blogs/videos/tutorials talking about transformers
A friendly introduction to machine learning compilers and optimizers
Neural Networks, Manifolds, and Topology
However, there remain a number of concerns about them. One is that it can be quite challenging to understand what a neural network is really doing.
Attention from Beginners Point of View
Transformers are a type of neural network architecture which is popularly used for text generations, machine translations, etc.
Why Attention Is All You Need
The Transformer architecture introduced in this paper was a major breakthrough in sequence transduction methodologies, particularly within neural machine translation (NMT) and broader natural language processing (NLP).
How to Think About TPUs
All about how TPUs work, how they're networked together to enable multi-chip training and inference, and how they limit the performance of our favorite algorithms. While this may seem a little dry, it's super important for actually making models efficient.
neural video codecs: the future of video compression
how deep learning could rewrite the way we encode and decode video
Unveiling_DeepSeek.pdf
successful modifications since its inception, let alone large-scale validation.
DeepSeek-V3 Explained: A Deep Dive into the Next-Generation AI Model
Artificial Intelligence (AI) is advancing at an unprecedented pace, and the DeepSeek-V3 model is at the forefront of this revolution. As…
Towards a Categorical Foundation of Deep Learning: A Survey
The unprecedented pace of machine learning research has led to incredible advances, but also poses hard challenges. At present, the field lacks strong theoretical underpinnings, and many important achievements stem from ad hoc design choices which are hard to justify in principle and whose effectiveness often goes unexplained. Research debt is increasing and many papers are found not to be reproducible.
This thesis is a survey that covers some recent work attempting to study machine learning categorically. Category theory is a branch of abstract mathematics that has found successful applications in many fields, both inside and outside mathematics. Acting as a lingua franca of mathematics and science, category theory might be able to give a unifying structure to the field of machine learning. This could solve some of the aforementioned problems.
In this work, we mainly focus on the application of category theory to deep learning. Namely, we discuss the use of categorical optics to model gradient-based learning, the use of categorical algebras and integral transforms to link classical computer science to neural networks, the use of functors to link different layers of abstraction and preserve structure, and, finally, the use of string diagrams to provide detailed representations of neural network architectures.
Soft question: Deep learning and higher categories
Recently, I have stumbled upon certain articles and lecture videos that use category theory to explain certain aspects of machine learning or deep learning (e.g. Cats for AI and the paper An enriched
Position: Categorical Deep Learning is an Algebraic Theory of All Architectures
We present our position on the elusive quest for a general-purpose framework
for specifying and studying deep learning architectures. Our opinion is that
the key attempts made so far lack a coherent bridge between specifying
constraints which models must satisfy and specifying their implementations.
Focusing on building such a bridge, we propose to apply category theory --
precisely, the universal algebra of monads valued in a 2-category of parametric
maps -- as a single theory elegantly subsuming both of these flavours of neural
network design. To defend our position, we show how this theory recovers
constraints induced by geometric deep learning, as well as implementations of
many architectures drawn from the diverse landscape of neural networks, such as
RNNs. We also illustrate how the theory naturally encodes many standard
constructs in computer science and automata theory.
Fundamental Components of Deep Learning: A category-theoretic approach
Deep learning, despite its remarkable achievements, is still a young field.
Like the early stages of many scientific disciplines, it is marked by the
discovery of new phenomena, ad-hoc design decisions, and the lack of a uniform
and compositional mathematical foundation. From the intricacies of the
implementation of backpropagation, through a growing zoo of neural network
architectures, to the new and poorly understood phenomena such as double
descent, scaling laws or in-context learning, there are few unifying principles
in deep learning. This thesis develops a novel mathematical foundation for deep
learning based on the language of category theory. We develop a new framework
that is a) end-to-end, b) uniform, and c) not merely descriptive, but
prescriptive, meaning it is amenable to direct implementation in programming
languages with sufficient features. We also systematise many existing
approaches, placing many existing constructions and concepts from the
literature under the same umbrella. In Part I we identify and model two main
properties of deep learning systems, parametricity and bidirectionality: we
expand on the previously defined construction of actegories and Para to study
the former, and define weighted optics to study the latter. Combining them
yields parametric weighted optics, a categorical model of artificial neural
networks, and more. Part II justifies the abstractions from Part I, applying
them to model backpropagation, architectures, and supervised learning. We
provide a lens-theoretic axiomatisation of differentiation, covering not just
smooth spaces, but discrete settings of boolean circuits as well. We survey
existing, and develop new categorical models of neural network architectures.
We formalise the notion of optimisers and lastly, combine all the existing
concepts together, providing a uniform and compositional framework for
supervised learning.
"CBLL, Research Projects, Computational and Biological Learning Lab, Courant Institute, NYU"
Yann LeCun's Web pages at NYU
yataobian/awesome-ebm: Collecting research materials on EBM/EBL (Energy Based Models, Energy Based Learning)
Collecting research materials on EBM/EBL (Energy Based Models, Energy Based Learning) - yataobian/awesome-ebm
Greg Yang
I am currently developing a framework called Tensor Programs for understanding large neural networks.
How to get from high school math to cutting-edge ML/AI: a detailed 4-stage roadmap with links to the best learning resources that I’m aware of.
1) Foundational math. 2) Classical machine learning. 3) Deep learning. 4) Cutting-edge machine learning.
Oasis: A Universe in a Transformer
Generating Worlds in Realtime
Humans in 4D: Reconstructing and Tracking Humans with Transformers
Join the discussion on this paper page
spikedoanz/from-bits-to-intelligence: machine learning stack in under 100,000 lines of code
The text discusses building a machine learning stack in under 100,000 lines of code with hardware, software, tensors, and machine learning components. It outlines the required components like a CPU, GPU, storage, C compiler, Python runtime, operating system, and more. The goal is to simplify the machine learning stack while providing detailed steps for implementation in different programming languages.
Using neural nets to recognize handwritten digits
Neural networks can recognize handwritten digits by learning from examples. Sigmoid neurons play a key role in helping neural networks learn. Gradient descent is a common method used for learning in neural networks.
MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
MLKV introduces Multi-Layer Key-Value sharing to reduce memory usage in transformer decoding. This approach improves efficiency without sacrificing performance on NLP benchmarks. MLKV significantly reduces memory requirements compared to existing methods like Multi-Query Attention.
A Recipe for Training Neural Networks
The text discusses common mistakes in training neural networks and emphasizes the importance of patience and attention to detail for successful deep learning. It provides a recipe for training neural networks, including steps like setting up a training skeleton, visualizing losses, and focusing on regularization and tuning to improve model performance. The text also highlights the value of adding more real data and using ensembles to enhance accuracy.
Writing CUDA Kernels for PyTorch
The text shows the thread distribution on different streaming multiprocessors (SM) in CUDA. Threads are organized into warps, lanes, and specific thread numbers within each SM. This information is crucial for optimizing CUDA kernels in PyTorch.
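A tiny illustration of the warp/lane bookkeeping the post refers to: CUDA groups the threads of a block into warps of 32, so a thread's warp and lane follow directly from its index within the block. A sketch in plain Python (the real mapping happens on the GPU, per block, not globally):

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def warp_and_lane(thread_idx: int) -> tuple[int, int]:
    """Which warp a thread belongs to within its block, and its lane inside that warp."""
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE

for tid in (0, 31, 32, 100):
    print(tid, warp_and_lane(tid))
# 0 -> warp 0 lane 0; 31 -> warp 0 lane 31; 32 -> warp 1 lane 0; 100 -> warp 3 lane 4
```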
Exploring architectures- Transformers II
The text explains how Transformers utilize queries, keys, and values to calculate self-attention weights for tokens. It details the process of obtaining the self-attention weights and generating output tokens through neural networks. The final steps involve calculating loss using cross-entropy and backpropagating to update the weight parameters.
A high-bias, low-variance introduction to Machine Learning for physicists
This text is an introduction to Machine Learning for physicists, highlighting the natural connections between ML and statistical physics. It explains the use of "energy-based models" inspired by statistical physics in deep learning methods. The discussion includes the application of methods from statistical physics to study deep learning and the efficiency of learning rules.
How diffusion models work: the math from scratch
Diffusion models generate diverse high-resolution images and are different from previous generative methods. Cascade diffusion models and latent diffusion models are used to scale up models to higher resolutions efficiently. Score-based generative models are similar to diffusion models and involve noise perturbations to generate new samples.
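A small sketch of the forward (noising) process such models are trained against, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε with a linear beta schedule; the schedule values and sizes here are illustrative, not taken from the post.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)    # cumulative product of (1 - beta_t)

def q_sample(x0: np.ndarray, t: int, rng=np.random.default_rng(0)) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) in closed form, without iterating over steps."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

x0 = np.ones((8, 8))               # a toy "image"
print(q_sample(x0, t=10).std())    # early step: little added noise
print(q_sample(x0, t=999).std())   # late step: essentially pure noise
```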
The Annotated Transformer
The text discusses the architecture and training of a Transformer model.
It explains the use of self-attention and feed-forward networks in the encoder and decoder.
The model is demonstrated through examples of prediction and visualization of attention mechanisms.
Binary Magic: Building BitNet 1.58bit Using PyTorch from Scratch
The document discusses the creation of a 1.58bit model called BitNet using PyTorch from scratch, which can rival full precision LLMs. Quantization, the process of representing float numbers with fewer bits, is explained as a method to increase the speed and reduce the RAM consumption of ML models, albeit with some loss of accuracy. BitNet differs from existing quantization approaches as it trains the model from scratch with quantization, offering a unique quantization algorithm and implementation in PyTorch. Results from experiments with custom PyTorch implementations show that the 2bit and 1bit variants of models perform as well as full precision models, demonstrating the potential of this approach.
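A rough sketch of the kind of ternary (1.58-bit) weight quantization the post describes, assuming an absmean-style per-tensor scale as in the BitNet b1.58 paper; the function names are illustrative and this is not the article's code.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Map full-precision weights to {-1, 0, +1} plus one per-tensor scale."""
    scale = w.abs().mean().clamp(min=eps)    # per-tensor absmean scale
    w_q = (w / scale).round().clamp(-1, 1)   # ternary values
    return w_q, scale

def dequantize(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_q * scale                       # approximate reconstruction

w = torch.randn(4, 4)
w_q, scale = ternary_quantize(w)
print(w_q)                                         # entries in {-1, 0, 1}
print((w - dequantize(w_q, scale)).abs().mean())   # mean quantization error
```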
Heatmaps and CNNs Using Fast.ai
The text discusses heatmaps, CNNs, and their relationship in deep learning. It explains how heatmaps are generated using Grad-CAM heatmaps from the final layer of a Convolutional Neural Network. The article also touches on creating heatmaps using Adaptive Pooling layers and interpreting top losses for model evaluation.
Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation
Mamba-UNet is a new architecture combining U-Net with Mamba technology for better medical image segmentation performance. It addresses limitations in modeling long-range dependencies within medical images. Results show that Mamba-UNet outperforms other UNet variations in medical image segmentation tasks.
KAN: Kolmogorov–Arnold Networks
Kolmogorov-Arnold Networks (KANs) have learnable activation functions on edges, outperforming Multilayer Perceptrons (MLPs) in accuracy and interpretability. KANs show faster neural scaling laws than MLPs, leveraging splines and MLPs to improve accuracy and interpretability. KANs can represent functions effectively and display more favorable scaling curves than MLPs, especially in high-dimensional examples.
KAN: Kolmogorov-Arnold Networks
KANs outperform MLPs in accuracy and interpretability by using learnable activation functions on edges. They have faster neural scaling laws and can represent special functions more efficiently. KANs offer a promising alternative to MLPs in various applications, showcasing improved performance and interpretability.
Root Mean Square Layer Normalization
The text discusses a technique called Root Mean Square Layer Normalization proposed by Biao Zhang and Rico Sennrich. This technique is likely related to a method for normalizing data in neural networks. The authors' work can be found on arxiv.org.
Root Mean Square Layer Normalization
Layer normalization (LayerNorm) has been successfully applied to various deep
neural networks to help stabilize training and boost model convergence because
of its capability in handling re-centering and re-scaling of both inputs and
weight matrix. However, the computational overhead introduced by LayerNorm
makes these improvements expensive and significantly slows the underlying
network, e.g. RNN in particular. In this paper, we hypothesize that
re-centering invariance in LayerNorm is dispensable and propose root mean
square layer normalization, or RMSNorm. RMSNorm regularizes the summed inputs
to a neuron in one layer according to root mean square (RMS), giving the model
re-scaling invariance property and implicit learning rate adaptation ability.
RMSNorm is computationally simpler and thus more efficient than LayerNorm. We
also present partial RMSNorm, or pRMSNorm where the RMS is estimated from p% of
the summed inputs without breaking the above properties. Extensive experiments
on several tasks using diverse...
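A minimal RMSNorm sketch following the abstract above (normalize by the root mean square over the feature dimension, no re-centering, with a learned gain); assuming PyTorch, and the epsilon value is an illustrative choice.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))  # learned gain, as in LayerNorm

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Root mean square over the feature dimension; no mean subtraction,
        # so the model keeps re-scaling invariance but drops re-centering.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.scale * x / rms

x = torch.randn(2, 5, 16)
print(RMSNorm(16)(x).shape)  # torch.Size([2, 5, 16])
```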
Three Decades of Activations: A Comprehensive Survey of 400 Activation Functions for Neural Networks
The text is a comprehensive survey of 400 activation functions for neural networks. It provides numerous URLs and DOIs for further reading and reference. The authors are Vladimír Kunc and Jiří Kléma.
Revisiting Deep Learning as a Non-Equilibrium Process
The document discusses the nature of Deep Learning systems, highlighting differences from traditional machine learning systems and challenging common misconceptions. It emphasizes the complexity and non-convexity of Deep Learning, noting that optimization techniques alone cannot explain its success. The text critiques the field for lacking in-depth exploration of the true nature of Deep Learning, pointing out a tendency towards superficial explanations and reliance on celebrity figures rather than rigorous scientific inquiry. It delves into the use of Bayesian techniques, the role of noise, and the importance of architecture in Deep Learning, arguing for a deeper understanding of the underlying processes and the need for more precise language and theoretical exploration.
Dissipative Adaptation: The Origins of Life and Deep Learning
The document explores the concept of Dissipative Adaptation, drawing parallels between the emergence of life and the mechanisms of Deep Learning. It discusses the work of Jeremy England and his theory of non-equilibrium statistical mechanics known as Dissipative Adaptation, which explains the self-organizing behavior of Deep Learning. The text delves into how neural networks evolve through training, emphasizing the role of external observations in driving the system towards minimizing entropy. It contrasts the mechanisms of Dissipative Adaptation with current Deep Learning architectures, highlighting similarities in alignment of components to maximize energy dissipation or information gradient.
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks
The text discusses a method called Parameter-Efficient Sparsity Crafting (PESC) that enhances sparse models for natural language processing tasks. PESC involves integrating adapters into sparse models, improving performance without changing individual weights. The approach outperforms other sparse models and even competes with GPT-3.5 in various tasks.
The Little Book of Deep Learning
gemini_v1_5_report
Gemini 1.5 Pro is a highly compute-efficient multimodal model that can recall and reason over millions of tokens of context, including long documents, videos, and audio. It achieves near-perfect recall on long-context retrieval tasks and outperforms the state-of-the-art in long-document QA, long-video QA, and long-context ASR. Gemini 1.5 Pro also showcases surprising new capabilities, such as learning to translate a new language from a grammar manual. The model surpasses the previous Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide range of benchmarks while requiring less compute to train.
Deep Learning Course
This document provides resources for François Fleuret's deep-learning course at the University of Geneva. The course offers a thorough introduction to deep learning, with examples using the PyTorch framework. The materials include slides, recordings, and a virtual machine. The course covers topics such as machine learning objectives, tensor operations, automatic differentiation, gradient descent, and deep-learning techniques. The document also includes prerequisites for the course, such as knowledge of linear algebra, differential calculus, Python programming, and probability and statistics.
ageron/handson-ml3: A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
The ageron/handson-ml3 project is designed to teach the fundamentals of Machine Learning using Python. It includes example code and exercise solutions from the third edition of the book "Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow." The project provides options for running the notebooks online, using a Docker image, or installing the project on your own machine. It also addresses frequently asked questions about Python versions, SSL errors, and updating the project. The project has received contributions from various individuals, including reviewers, contributors to exercise solutions, and supporters from the Google ML Developer Programs team.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
BERT and RoBERTa have achieved impressive results on sentence-pair regression tasks like semantic textual similarity, but they have a significant computational overhead when comparing large collections of sentences. To address this, Sentence-BERT (SBERT) has been developed as a modification of BERT that uses siamese and triplet network structures to generate semantically meaningful sentence embeddings. SBERT reduces the time required to find the most similar pair from 65 hours with BERT to just 5 seconds, while maintaining accuracy. SBERT outperforms other state-of-the-art sentence embedding methods on various tasks, including STS and transfer learning.
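A short usage sketch with the sentence-transformers library that packages SBERT-style models; the checkpoint name "all-MiniLM-L6-v2" is an assumption for illustration, not one mentioned in the paper.

```python
from sentence_transformers import SentenceTransformer, util

# Any SBERT-style checkpoint works here; this one is a common lightweight choice (assumed).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A man is playing a guitar.",
             "Someone is strumming an instrument.",
             "The stock market fell sharply today."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between fixed-size sentence embeddings is cheap, which is
# what makes large-scale comparison tractable relative to cross-encoding with BERT.
print(util.cos_sim(embeddings, embeddings))
```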
Visual Guides to understand the basics of Large Language Models
This article provides a compilation of tools and articles that aim to break down the complicated concepts of Large Language Models (LLMs) in an intuitive way. It acknowledges that many people struggle with understanding the basics of LLMs and offers resources to help solidify their understanding. The article includes a table of contents with links to various resources, such as "The Illustrated Transformer" by Jay Alammar, which provides visualizations to explain the transformer architecture, a fundamental building block of LLMs. The goal is to make the concepts of LLMs easily understood and accessible.
Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs
This article provides a comprehensive understanding and coding guide for self-attention mechanisms in transformer architectures and large language models (LLMs) like GPT-4 and Llama. It covers the concept of self-attention, its importance in NLP, and the implementation of the self-attention mechanism in Python and PyTorch. The article also discusses the scaled dot-product attention, computing unnormalized attention weights, computing attention weights, and computing the context vector. Additionally, it explores multi-head attention and provides code examples for implementing multiple attention heads.
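A compact sketch of causal multi-head attention along the lines the article walks through, assuming PyTorch; the dimensions and class name are illustrative, not the article's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalMultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.h, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projections
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                            # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the model dimension into heads: (batch, heads, seq, d_head).
        shape = (b, t, self.h, self.d_head)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head**0.5
        # Causal mask: each position may only attend to itself and earlier positions.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        ctx = F.softmax(scores, dim=-1) @ v
        return self.out(ctx.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 5, 64)
print(CausalMultiHeadAttention()(x).shape)  # torch.Size([2, 5, 64])
```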
Pruning vs Quantization: Which is Better?
Neural network pruning and quantization are techniques used to compress deep neural networks. This paper compares the two techniques and provides an analytical comparison of expected quantization and pruning error. The results show that in most cases, quantization outperforms pruning. However, in scenarios with very high compression ratios, pruning may be beneficial. The paper also discusses the hardware implications of both techniques and provides a comparison of pruning and quantization in the post-training and fine-tuning settings.
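A toy illustration of the two compression operations being compared: global magnitude pruning (zero out the smallest-magnitude weights) versus uniform quantization to a small number of levels. The sparsity level and bit width are illustrative, and this is not the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000)

# Magnitude pruning: keep only the largest-magnitude 25% of weights.
threshold = np.quantile(np.abs(w), 0.75)
w_pruned = np.where(np.abs(w) >= threshold, w, 0.0)

# Uniform quantization: round each weight to one of 2^bits levels over the weight range.
bits = 4
levels = 2**bits - 1
scale = (w.max() - w.min()) / levels
w_quant = np.round((w - w.min()) / scale) * scale + w.min()

print("pruning error:     ", np.mean((w - w_pruned) ** 2))
print("quantization error:", np.mean((w - w_quant) ** 2))
```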
mlx-examples/lora at main · ml-explore/mlx-examples · GitHub
This document provides an example of using MLX to fine-tune either a Llama 7B1 or Mistral 7B2 model with low rank adaptation (LoRA) for a target task. The example demonstrates using the WikiSQL dataset to train the model to generate SQL queries from natural language. It includes instructions for setup, running the script, fine-tuning the model, evaluating the model, generating output, and dealing with memory issues. The document also provides results from the training process and offers tips for reducing memory consumption during fine-tuning.
Mixtral of Experts
Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model that outperforms or matches other models like Llama 2 70B and GPT-3.5 across various benchmarks. It has the same architecture as Mistral 7B but uses 8 feedforward blocks (experts) in each layer. A router network selects two experts for each token at each layer, allowing for dynamic selection of different experts at each timestep. This results in each token having access to 47B parameters but only using 13B active parameters during inference. Mixtral also offers a fine-tuned model, Mixtral 8x7B - Instruct, which surpasses other models on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
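A rough sketch of the top-2 routing idea described above (a router scores the experts per token and mixes the outputs of the two best); assuming PyTorch, with illustrative sizes, and not the Mixtral implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, dim=32, hidden=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                        # x: (tokens, dim)
        logits = self.router(x)                  # (tokens, n_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)      # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e     # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 32)
print(Top2MoE()(tokens).shape)  # torch.Size([10, 32])
```

Only the selected experts run for each token, which is how a sparse model can hold many more parameters than it activates per step.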
Understanding The Exploding and Vanishing Gradients Problem
The "Understanding The Exploding and Vanishing Gradients Problem" article discusses the vanishing and exploding gradients problem in deep neural networks. It explains how the gradients used to update the weights can shrink or grow exponentially, causing learning to stall or become unstable. The article explores why gradients vanish or explode exponentially and how it affects the backpropagation algorithm during training. It also provides strategies to address the vanishing and exploding gradients problem, such as using the ReLU activation function, weight initialization techniques, and gradient clipping.
Practical Deep Learning for Coders 2022
"Practical Deep Learning for Coders 2022" is a course that covers topics such as building and training deep learning models, deploying models, and using PyTorch and other popular libraries. The course is led by Jeremy Howard, who has extensive experience in machine learning and has created companies that utilize deep learning. The course is suitable for those with at least a year of coding experience and a high school math background. Students will learn how to train models for computer vision, natural language processing, tabular data analysis, and collaborative filtering, and will also learn about the latest deep learning techniques.
fastai/fastbook: The fastai book, published as Jupyter Notebooks
The fastai book, published as Jupyter Notebooks, provides an introduction to deep learning, fastai, and PyTorch. It is copyright Jeremy Howard and Sylvain Gugger, and a selection of chapters is available to read online. The notebooks in the repository are used for a MOOC and form the basis of the book, which is available for purchase. The code in the notebooks is covered by the GPL v3 license, while the other content is not licensed for redistribution or change. It is recommended to use Google Colab to access and work with the notebooks. If there are any contributions or citations, copyright is assigned to Jeremy Howard and Sylvain Gugger.
Attention? Attention!
The document explores the concept of attention, as performed by humans and deep learning algorithms. Attention is used in deep learning to transform one input sequence into another and is accomplished through an encoder-decoder architecture with LSTM or GRU units. The attention mechanism, invented to address the incapability of the fixed-length context vector, creates shortcuts between the context vector and the entire source input. Attention mechanisms vary in form, from soft or hard to global or local. The document also introduces self-attention, which relates different positions of a single sequence to compute a representation of the same sequence, and the Neural Turing Machine, a model architecture for coupling a neural network with external memory storage.
An Intuition for Attention
The transformer neural network, used by models like ChatGPT, incorporates an attention mechanism to improve performance. Attention is a key feature of transformers and is defined by an equation that involves the softmax function. Attention can take different forms, but the scaled dot product attention is commonly used. This attention mechanism is based on the idea of key-value lookups, where a query is matched with keys to retrieve corresponding values. The attention scores, which determine how much attention is given to each key-value pair, are computed using dot product similarity and transformed into decimal percentages using the softmax function. This process allows for meaningful and efficient processing of queries in large language models.
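A minimal sketch of the scaled dot-product attention described above, assuming PyTorch; the tensor names and shapes are illustrative, not taken from the post.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k). Returns (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    # Dot-product similarity between each query and every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / d_k**0.5   # (batch, seq, seq)
    # Softmax turns the scores into attention weights that sum to 1 per query.
    weights = F.softmax(scores, dim=-1)
    # Each output is a weighted average of the values (the "key-value lookup").
    return weights @ v

q = k = v = torch.randn(1, 4, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 4, 8])
```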
Transformers From Scratch
This blog provides a step-by-step guide on creating and training a transformer from scratch. The author explains each foundational element and provides a Jupyter notebook with the code for readers to run and experiment with. The blog references a YouTube video and the Attention Is All You Need paper for further understanding. The author also mentions the availability of the final code and a dataset for download.
An overview of gradient descent optimization algorithms
The text provides an overview of gradient descent optimization algorithms commonly used in deep learning. It explains different types of gradient descent methods like batch, stochastic, and mini-batch, highlighting their strengths and challenges. The author also discusses advanced algorithms such as Adagrad, RMSprop, and Adam, which adapt learning rates to improve optimization performance.
An overview of gradient descent optimization algorithms
The article provides an overview of gradient descent optimization algorithms, which are often used as black-box optimizers. The article outlines the three variants of gradient descent and summarizes the challenges. The article then introduces some widely used algorithms to deal with the challenges, including Nesterov accelerated gradient, Adagrad, Adadelta, and RMSprop. The article explains how these algorithms work and their benefits and weaknesses.
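A minimal sketch of two of the update rules covered by these overviews, vanilla gradient descent and an Adam-style adaptive update, on a toy quadratic; the step sizes and decay constants are illustrative defaults.

```python
import numpy as np

grad = lambda w: 2 * (w - 3.0)   # gradient of the toy objective (w - 3)^2

# Vanilla gradient descent: step against the gradient with a fixed learning rate.
w, lr = 0.0, 0.1
for _ in range(100):
    w -= lr * grad(w)

# Adam-style update: per-parameter step sizes from running moment estimates.
w_adam, m, v = 0.0, 0.0, 0.0
beta1, beta2, eps = 0.9, 0.999, 1e-8
for t in range(1, 101):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g         # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g * g     # second moment (uncentered variance)
    m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)  # bias correction
    w_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(round(w, 3), round(w_adam, 3))  # both approach the minimizer at 3.0
```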
How GPT3 Works - Visualizations and Animations
The tech world is abuzz with GPT3 hype. Massive language models (like GPT3) are starting to surprise us with their abilities. While not yet completely reliable for most businesses to put in front of their customers, these models are showing sparks of cleverness that are sure to accelerate the march of automation and the possibilities of intelligent computer systems. Let’s remove the aura of mystery around GPT3 and learn how it’s trained and how it works.
A trained language model generates text.
We can optionally pass it some text as input, which influences its output.
The output is generated from what the model “learned” during its training period where it scanned vast amounts of text.
Tensor2Tensor Intro
The Annotated Transformer
"The Annotated Transformer" is a paper that introduces a new architecture for natural language processing tasks, with a focus on translation. The paper provides an annotated version of the original paper, giving a line-by-line implementation of the model. The Transformer model relies on self-attention to compute representations of its input and output without using sequence-aligned recurrent neural networks or convolutions. The model consists of an encoder and decoder stack, each containing self-attention layers and position-wise feed-forward networks. The paper also discusses the use of multi-head attention and positional encoding in the model. The model is trained using the WMT 2014 English-German dataset and the Adam optimizer.
The Illustrated Transformer
"The Illustrated Transformer" is a comprehensive guide to understanding the Transformer model, which utilizes attention to improve the training speed of neural machine translation models. The model consists of stacked encoders and decoders, with each encoder and decoder having self-attention layers. Self-attention allows the model to incorporate information from other words in the input sequence, resulting in better encoding. The model also employs multi-headed attention, which allows it to focus on different positions and creates multiple sets of Query/Key/Value weight matrices. Positional encoding is used to account for the order of words in the input sequence. The architecture includes residual connections and layer normalization for each sub-layer.
GitHub - tensorflow/nmt: TensorFlow Neural Machine Translation Tutorial
TensorFlow Neural Machine Translation Tutorial. Contribute to tensorflow/nmt development by creating an account on GitHub.
Deep Learning for Natural Language Processing
Deep Learning for Natural Language Processing Develop Deep Learning Models for your Natural Language Problems Working with Text is… important, under-discussed, and HARD We are awash with text, from books, papers, blogs, tweets, news, and increasingly text from spoken utterances. Every day, I get questions asking how to develop machine learning models for text data. Working […]
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
The article explains the mechanics of sequence-to-sequence models, which are deep learning models used for machine translation, text summarization, and image captioning. The article includes visualizations to explain the concepts and requires some previous understanding of deep learning. The article also discusses attention models, which improve machine translation systems by allowing the model to focus on relevant parts of the input sequence. The article provides examples of how attention models work and concludes with a link to TensorFlow's Neural Machine Translation tutorial.
The Random Transformer
This blog post provides an end-to-end example of the math within a transformer model, with a focus on the encoder part. The goal is to understand how the model works, and to make it more manageable, simplifications are made and the dimensions of the model are reduced. The post recommends reading "The Illustrated Transformer" blog for a more intuitive explanation of the transformer model. The prerequisites for understanding the content include basic knowledge of linear algebra, machine learning, and deep learning. The post covers the math within a transformer model during inference, attention mechanisms, residual connections and layer normalization, and provides some code to scale it up.
CS25: Transformers United V3
Transformers have revolutionized Natural Language Processing (NLP) and are now being applied in various fields, including Computer Vision, Reinforcement Learning, and Speech. This seminar explores the details of how Transformers work and their applications, with a focus on large language models (LLMs). The seminar includes instructor and guest lectures from experts in Transformers research. The schedule includes topics such as the creation of fine-tuned chat models, low-level embodied intelligence with foundation models, and training helpful chatbots. The seminar also covers the motivations behind Transformers, scaling human-centered machine translation, and going beyond LLMs to explore emergent abilities and intermediate-guided reasoning.
openai/whisper-large-v2
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and demonstrates strong generalization abilities without the need for fine-tuning. The large-v2 model, trained for 2.5x more epochs with added regularization, offers improved performance. The models can be used for transcription and translation tasks, with context tokens indicating the language and task. While the models show robustness and accuracy in many languages, they may exhibit limitations such as generating repetitive texts and hallucinations. The models have potential applications in accessibility tools but also raise concerns about dual use and surveillance capabilities.
Text Summarization: How to Calculate BertScore
BERTScore is a metric used to measure the quality of text summarization by calculating the similarity between the summary and the original text. It addresses issues that n-gram-based metrics face, such as incorrect matching of paraphrases and the inability to capture long-range dependencies. The BERTScore architecture involves contextual embeddings, cosine similarity, token matching for precision and recall, importance weighting, and baseline rescaling. The metric has the potential to improve various natural language processing tasks and can be applied in domains such as translation quality assessment, text generation, and document comparison. Future developments include broader language coverage and adaptation for multilingual texts.
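A toy sketch of the greedy token matching at the heart of BERTScore (cosine similarity, then a max over the other side for precision and recall), assuming pre-computed contextual embeddings; importance weighting and baseline rescaling described above are omitted.

```python
import torch
import torch.nn.functional as F

def bertscore_like(cand_emb: torch.Tensor, ref_emb: torch.Tensor):
    """cand_emb: (n_cand_tokens, d), ref_emb: (n_ref_tokens, d) contextual embeddings."""
    cand = F.normalize(cand_emb, dim=-1)
    ref = F.normalize(ref_emb, dim=-1)
    sim = cand @ ref.T                           # cosine similarity matrix
    precision = sim.max(dim=1).values.mean()     # each candidate token -> best reference match
    recall = sim.max(dim=0).values.mean()        # each reference token -> best candidate match
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()

cand, ref = torch.randn(7, 32), torch.randn(9, 32)
print(bertscore_like(cand, ref))
```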
MotionGPT: Human Motion as a Foreign Language
MotionGPT is a unified model for language and motion tasks, achieving top performance in text-driven motion generation. It combines natural language models with human motion tasks, benefiting fields like gaming and robotics. The model treats human motion like a foreign language, offering a versatile solution for diverse motion synthesis problems.
Subcategories
- applications (9)
- compression (9)
- computer_vision (8)
- deep_learning (94)
- ethics (2)
- generative_models (25)
- interpretability (17)
- natural_language_processing (24)
- optimization (7)
- recommendation (2)
- reinforcement_learning (11)
- supervised_learning (1)