Ludwig - ai/reinforcement

Intrinsically Motivated Discovery of Diverse Patterns in Self-Organizing Systems

bookmark · Added on February 27, 2026 · 5 min read

In many complex dynamical systems, artificial or natural, one can observe self-organization of patterns emerging from local rules.

Exploring Flow-Lenia Universes with a Curiosity-driven AI Scientist: Discovering Diverse Ecosystem Dynamics

bookmark · Added on February 27, 2026 · 1 min read

Automated discovery of diverse ecosystem dynamics in Flow-Lenia using AI-driven exploration. Features interactive visualization of 2000+ discovered evolutionary patterns.

Theory of Diversity (RL)

bookmark · Added on October 27, 2025 · 7 min read

Theory of Diversity (RL) - Powered by Obsidian Publish.

DeepMind x UCL RL Lecture Series - Policy-Gradient and Actor-Critic methods [9/13]

video · Added on July 22, 2025 · 1:38:49 · 42.5K views

Research Scientist Hado van Hasselt covers policy algorithms that can learn policies directly and actor critic algorithms that combine value predictions for mor...

DeepSeek's GRPO (Group Relative Policy Optimization) | Reinforcement Learning for LLMs

video · Added on July 22, 2025 · 23:16 · 18.5K views

In this video, I break down DeepSeek's Group Relative Policy Optimization (GRPO) from first principles, without assuming prior knowledge of Reinforcement Learni...
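As a rough illustration of the "group relative" idea in GRPO: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its group. The sketch below is not taken from the video; the function name, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (hypothetical helper).

    rewards: scalar rewards for several responses sampled from one prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean get a positive advantage and those below get a negative one, so no separately learned value function is needed to form a baseline.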

General agents need world models

bookmark · Added on June 12, 2025 · 1h 34m read

Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of generalizing to multi-step goal-directed tasks must have learned a predictive model of its environment. We show that this model can be extracted from the agent's policy, and that increasing the agent's performance or the complexity of the goals it can achieve requires learning increasingly accurate world models. This has a number of consequences: from developing safe and general agents, to bounding agent capabilities in complex environments, and providing new algorithms for eliciting world models from agents.

The Era of Experience Paper

bookmark · Added on April 22, 2025 · 31 min read

"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?"

bookmark · Added on April 22, 2025 · 3 min read

This isn't a new intuition, but a nice new set of results.

Device Placement Optimization with Reinforcement Learning

bookmark · Added on April 6, 2025 · 31 min read

The past few years have witnessed a growth in size and computational requirements for training and inference with neural networks. Currently, a common approach to address these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of placing parts of the neural models on devices is often made by human experts based on simple heuristics and intuitions. In this paper, we propose a method which learns to optimize device placement for TensorFlow computational graphs. Key to our method is the use of a sequence-to-sequence model to predict which subsets of operations in a TensorFlow graph should run on which of the available devices. The execution time of the predicted placements is then used as the reward signal to optimize the parameters of the sequence-to-sequence model. Our main result is that on Inception-V3 for ImageNet classification, and on RNN LSTM, for language modeling and neural machine translation, our model finds non-trivial device placements that outperform hand-crafted heuristics and traditional algorithmic methods.

Execution-based Code Generation using Deep Reinforcement Learning

bookmark · Added on February 20, 2025 · 1h 1m read

The utilization of programming language (PL) models, pre-trained on large-scale code corpora, as a means of automating software engineering processes has demonstrated considerable potential in streamlining various code generation tasks such as code completion, code translation, and program synthesis. However, current approaches mainly rely on supervised fine-tuning objectives borrowed from text generation, neglecting unique sequence-level characteristics of code, including but not limited to compilability as well as syntactic and functional correctness. To address this limitation, we propose PPOCoder, a new framework for code generation that synergistically combines pre-trained PL models with Proximal Policy Optimization (PPO) which is a widely used deep reinforcement learning technique. By utilizing non-differentiable feedback from code execution and structure alignment, PPOCoder seamlessly integrates external code-specific knowledge into the model optimization process. It's important to note that PPOCoder is a task-agnostic and model-agnostic framework that can be used across different code generation tasks and PLs. Extensive experiments on three code generation tasks demonstrate the effectiveness of our proposed approach compared to SOTA methods, achieving significant improvements in compilation success rates and functional correctness across different PLs.

Self-Rewarding Language Models

bookmark · Added on January 20, 2024 · 1 min read

To achieve superhuman language models, researchers propose self-rewarding language models, which provide their own rewards during training. Unlike current approaches that rely on human preferences, these models use prompts to judge their own outputs, improving both their instruction-following ability and their reward generation. A preliminary study using this approach, specifically fine-tuning Llama 2 70B, demonstrates that it outperforms existing systems on the AlpacaEval 2.0 leaderboard. This work suggests the potential for models that can continually improve along both axes.

Paper page - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

bookmark · Added on January 10, 2024 · 1 min read

The content is a set of instructions on how to cite a specific URL (arxiv.org/abs/2401.01335) in three different types of README.md files, in order to create links from those pages.

Some Core Principles of Large Language Model (LLM) Tuning

bookmark · Added on January 3, 2024 · 29 min read

Large Language Models (LLMs) like GPT-2 and GPT-3 are trained using unsupervised pre-training on billions to trillions of tokens. After pre-training, the models are fine-tuned for specific use cases such as chatbots or content generation. Fine-tuning can be done through supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF). SFT involves minimizing the loss between the model's output and the correct result, while RLHF uses a reward model to optimize the model's performance. InstructGPT is an RLHF-tuned version of GPT-3 that is trained to follow instructions and provide aligned responses. There are also open-source alternatives to GPT models, such as GPT-J and GPT-Neo.
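To make the SFT objective mentioned above concrete, here is a minimal sketch, assuming the loss is the usual token-level cross-entropy (mean negative log-likelihood of the reference tokens). The function name and the toy probabilities are hypothetical and do not come from the linked article.

```python
import math


def sft_loss(reference_token_probs):
    """Mean negative log-likelihood of the reference tokens (hypothetical helper).

    reference_token_probs: the probability the model assigned to each
    correct next token in the target sequence.
    """
    total = sum(math.log(p) for p in reference_token_probs)
    return -total / len(reference_token_probs)


# Toy case: the model gave the two reference tokens probabilities 0.5 and 0.25.
loss = sft_loss([0.5, 0.25])
```

The loss falls toward zero as the model puts more probability on the reference tokens. RLHF replaces this fixed target with a score from a learned reward model.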
