Bookmarks

Muon and a Selective Survey on Steepest Descent in Riemannian and Non-Riemannian Manifolds

Muon from first principles, what makes it different from other optimizers, and why it works so well.
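
As a pointer to what the post covers, here is a minimal sketch of the core idea as I understand it, assuming NumPy: a Muon-style step orthogonalizes the momentum matrix before applying it. The real implementation uses a tuned quintic Newton-Schulz polynomial and Nesterov-style momentum; this sketch uses the classic cubic iteration, and the function names and hyperparameters are illustrative, not the post's.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximate the orthogonal polar factor U V^T of g.

    Classic cubic Newton-Schulz iteration; converges when the
    normalized singular values lie in (0, sqrt(3)).
    """
    x = g / (np.linalg.norm(g) + eps)      # Frobenius normalization keeps singular values <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_like_step(w, grad, momentum, lr=0.02, beta=0.95):
    """One hypothetical update: accumulate momentum, orthogonalize it, apply."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    return w - lr * update, momentum

# toy usage on a random weight matrix
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32))
m = np.zeros_like(w)
grad = rng.normal(size=w.shape)
w, m = muon_like_step(w, grad, m)
```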

How To Scale

While there are already excellent posts on scaling, I wanted to share my own understanding and the things I've learned over the past few months, and hopefully spark some discussion. I hope this post can shed some light for anyone navigating the challenges of scaling up neural networks. There may be mistakes or inaccuracies, so if you would like to correct me or discuss further, please feel free to DM me on X or leave a comment.

bytecode interpreters for tiny computers

I've previously come to the conclusion that there's little reason for using bytecode in the modern world, except in order to get more compact code, for which it can be very effective.
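To make the compactness point concrete, here is a toy sketch of my own (not code from the post): a stack-based bytecode interpreter where each instruction is a single byte and an optional one-byte operand follows it. The opcode set and encoding are made up for illustration.

```python
# Toy stack-based bytecode VM; opcodes and encoding are hypothetical.
PUSH, ADD, MUL, PRINT, HALT = range(5)

def run(code):
    stack, pc = [], 0
    while True:
        op = code[pc]; pc += 1
        if op == PUSH:                      # one-byte immediate operand follows
            stack.append(code[pc]); pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == PRINT:
            print(stack[-1])
        elif op == HALT:
            return

# (2 + 3) * 4, encoded in ten bytes
program = bytes([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, PRINT, HALT])
run(program)
```

The compactness comes from the encoding: the whole expression fits in ten bytes, while the dispatch loop is the only machinery paid for once.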

Revisiting Deep Learning as a Non-Equilibrium Process

The document discusses the nature of Deep Learning systems, highlighting how they differ from traditional machine learning and challenging common misconceptions. It emphasizes the complexity and non-convexity of Deep Learning, noting that optimization techniques alone cannot explain its success. It critiques the field for a tendency towards superficial explanations and appeals to celebrity figures rather than rigorous scientific inquiry. It also examines the use of Bayesian techniques, the role of noise, and the importance of architecture, arguing for a deeper understanding of the underlying processes and for more precise language and theoretical exploration.

Understanding The Exploding and Vanishing Gradients Problem

The "Understanding The Exploding and Vanishing Gradients Problem" article discusses the vanishing and exploding gradients problem in deep neural networks. It explains how the gradients used to update the weights can shrink or grow exponentially, causing learning to stall or become unstable. The article explores why gradients vanish or explode exponentially and how it affects the backpropagation algorithm during training. It also provides strategies to address the vanishing and exploding gradients problem, such as using the ReLU activation function, weight initialization techniques, and gradient clipping.

An overview of gradient descent optimization algorithms

The text provides an overview of gradient descent optimization algorithms commonly used in deep learning. It explains different types of gradient descent methods like batch, stochastic, and mini-batch, highlighting their strengths and challenges. The author also discusses advanced algorithms such as Adagrad, RMSprop, and Adam, which adapt learning rates to improve optimization performance.
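As a quick reminder of what those names refer to, here is a sketch of the update rules as they are usually stated (notation and hyperparameter defaults are mine, not the post's): mini-batch SGD applies one shared learning rate to every parameter, while Adam keeps running estimates of the gradient's first and second moments and scales each coordinate's step accordingly.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    """Mini-batch SGD: one shared learning rate for every parameter."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: per-parameter step sizes from first/second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)             # bias correction for the early steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# toy usage on the quadratic loss L(w) = 0.5 * ||w||^2, so grad = w
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 101):
    w, m, v = adam_step(w, w.copy(), m, v, t)
print(w)  # moves toward the minimum at the origin
```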

An overview of gradient descent optimization algorithms∗

The article provides an overview of gradient descent optimization algorithms, which are often used as black-box optimizers. It outlines the three variants of gradient descent, summarizes the challenges they pose, and then introduces widely used algorithms that address them, including Nesterov accelerated gradient, Adagrad, Adadelta, and RMSprop, explaining how each works along with its strengths and weaknesses.
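For completeness, minimal update rules for the other algorithms named above, again as a sketch in my own notation rather than the paper's code: Nesterov accelerated gradient evaluates the gradient at a look-ahead point, Adagrad divides by the accumulated squared gradients, and RMSprop replaces that growing sum with an exponential moving average so the effective learning rate does not decay to zero.

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.01, mu=0.9):
    """Nesterov accelerated gradient: take the gradient at the look-ahead point."""
    g = grad_fn(w - mu * v)
    v = mu * v + lr * g
    return w - v, v

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients; the effective step shrinks over time."""
    cache = cache + grad**2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

def rmsprop_step(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """RMSprop: exponential moving average of squared gradients instead of a sum."""
    cache = decay * cache + (1 - decay) * grad**2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

# toy usage of the Nesterov step on L(w) = 0.5 * ||w||^2
w, v = np.array([3.0, -1.0]), np.zeros(2)
for _ in range(50):
    w, v = nesterov_step(w, v, grad_fn=lambda x: x)
print(w)
```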

Subcategories