Bookmarks
Intrinsically Motivated Discovery of Diverse Patterns in Self-Organizing Systems
In many complex dynamical systems, artificial or natural, one can observe self-organization of patterns emerging from local rules.
Exploring Flow-Lenia Universes with a Curiosity-driven AI Scientist: Discovering Diverse Ecosystem Dynamics
Automated discovery of diverse ecosystem dynamics in Flow-Lenia using AI-driven exploration. Features interactive visualization of 2000+ discovered evolutionary patterns.
Theory of Diversity (RL)
DeepMind x UCL RL Lecture Series - Policy-Gradient and Actor-Critic methods [9/13]
Research Scientist Hado van Hasselt covers policy algorithms that can learn policies directly and actor critic algorithms that combine value predictions for mor...
DeepSeek's GRPO (Group Relative Policy Optimization) | Reinforcement Learning for LLMs
In this video, I break down DeepSeek's Group Relative Policy Optimization (GRPO) from first principles, without assuming prior knowledge of Reinforcement Learni...
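As a rough illustration of the "group relative" idea in the title, the sketch below normalizes each sampled response's reward against the other responses drawn for the same prompt. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for this example, not details taken from the video.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to its own sampling group.

    rewards: scalar rewards for several responses to the same prompt.
    eps: small constant (an assumption here) to avoid division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, rewarded 1.0 if correct, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average get a positive advantage and the rest a negative one, so no separate learned value function is needed to form a baseline.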
General agents need world models
Are world models a necessary ingredient for flexible, goal-directed
behaviour, or is model-free learning sufficient? We provide a formal answer to
this question, showing that any agent capable of generalizing to multi-step
goal-directed tasks must have learned a predictive model of its environment. We
show that this model can be extracted from the agent's policy, and that
increasing the agent's performance or the complexity of the goals it can achieve
requires learning increasingly accurate world models. This has a number of
consequences: from developing safe and general agents, to bounding agent
capabilities in complex environments, and providing new algorithms for
eliciting world models from agents.
"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?"
This isn't a new intuition, but a nice new set of results.
Device Placement Optimization with Reinforcement Learning
The past few years have witnessed a growth in size and computational
requirements for training and inference with neural networks. Currently, a
common approach to address these requirements is to use a heterogeneous
distributed environment with a mixture of hardware devices such as CPUs and
GPUs. Importantly, the decision of placing parts of the neural models on
devices is often made by human experts based on simple heuristics and
intuitions. In this paper, we propose a method which learns to optimize device
placement for TensorFlow computational graphs. Key to our method is the use of
a sequence-to-sequence model to predict which subsets of operations in a
TensorFlow graph should run on which of the available devices. The execution
time of the predicted placements is then used as the reward signal to optimize
the parameters of the sequence-to-sequence model. Our main result is that on
Inception-V3 for ImageNet classification, and on RNN LSTM, for language
modeling and neural machine translation, our model finds non-trivial device
placements that outperform hand-crafted heuristics and traditional algorithmic
methods.
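The reward loop described in this abstract can be sketched on a toy problem. This is a minimal sketch under loud assumptions: the paper's sequence-to-sequence model is replaced by an independent softmax policy per operation, and `simulated_runtime`, `OP_COSTS`, and `NUM_DEVICES` are hypothetical stand-ins for measuring a real TensorFlow graph. The update is plain REINFORCE with a moving-average baseline.

```python
import math
import random

OP_COSTS = [4.0, 3.0, 2.0, 1.0]  # hypothetical per-operation run times
NUM_DEVICES = 2                  # hypothetical number of devices

def simulated_runtime(placement):
    """Toy stand-in for timing a placement: devices run in parallel."""
    load = [0.0] * NUM_DEVICES
    for cost, device in zip(OP_COSTS, placement):
        load[device] += cost
    return max(load)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    # One row of device logits per operation (simplified policy).
    logits = [[0.0] * NUM_DEVICES for _ in OP_COSTS]
    baseline = None
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        placement = [rng.choices(range(NUM_DEVICES), weights=p)[0] for p in probs]
        reward = -simulated_runtime(placement)  # faster placement, higher reward
        if baseline is None:
            baseline = reward
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        # REINFORCE: gradient of log-softmax is one_hot(chosen) - probs.
        for op, device in enumerate(placement):
            for d in range(NUM_DEVICES):
                grad = (1.0 if d == device else 0.0) - probs[op][d]
                logits[op][d] += lr * advantage * grad
    # Return the greedy placement under the learned logits.
    return [max(range(NUM_DEVICES), key=row.__getitem__) for row in logits]

best = train()
```

Because the reward is only a measured number, nothing about the runtime needs to be differentiable; the policy is adjusted purely from sampled placements and their timings.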
Execution-based Code Generation using Deep Reinforcement Learning
The utilization of programming language (PL) models, pre-trained on
large-scale code corpora, as a means of automating software engineering
processes has demonstrated considerable potential in streamlining various code
generation tasks such as code completion, code translation, and program
synthesis. However, current approaches mainly rely on supervised fine-tuning
objectives borrowed from text generation, neglecting unique sequence-level
characteristics of code, including but not limited to compilability as well as
syntactic and functional correctness. To address this limitation, we propose
PPOCoder, a new framework for code generation that synergistically combines
pre-trained PL models with Proximal Policy Optimization (PPO) which is a widely
used deep reinforcement learning technique. By utilizing non-differentiable
feedback from code execution and structure alignment, PPOCoder seamlessly
integrates external code-specific knowledge into the model optimization
process. It's important to note that PPOCoder is a task-agnostic and
model-agnostic framework that can be used across different code generation
tasks and PLs. Extensive experiments on three code generation tasks demonstrate
the effectiveness of our proposed approach compared to SOTA methods, achieving
significant improvements in compilation success rates and functional
correctness across different PLs.
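To make "non-differentiable feedback from code execution" concrete, here is a small sketch of a compilability check turned into a scalar reward. The function name and the +1.0 / -1.0 values are hypothetical choices for this example and are not PPOCoder's actual reward definition; the check uses Python's built-in `compile`, which parses the source without running it.

```python
def compile_reward(source: str) -> float:
    """Return a scalar reward from a syntax check of generated Python code.

    The reward values are illustrative assumptions, not PPOCoder's.
    """
    try:
        compile(source, "<generated>", "exec")  # parse only, do not execute
    except SyntaxError:
        return -1.0
    return 1.0

good = compile_reward("def add(a, b):\n    return a + b\n")
bad = compile_reward("def add(a, b) return a + b")
```

A signal like this cannot be backpropagated through, which is why it is fed to a policy-gradient method such as PPO as a reward rather than used as a loss.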
Self-Rewarding Language Models
To move toward superhuman language models, the authors propose Self-Rewarding Language Models, in which the model provides its own rewards during training. Unlike current approaches that rely on human preferences, the model is prompted to judge its own responses, which improves both its instruction-following ability and its ability to generate rewards. A preliminary study that fine-tunes Llama 2 70B with this approach outperforms existing systems on the AlpacaEval 2.0 leaderboard. The work suggests that models could continually improve along both axes: instruction following and reward generation.
Paper page - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
The content is a set of instructions on how to cite a specific URL (arxiv.org/abs/2401.01335) in three different types of README.md files, in order to create links from those pages.
Some Core Principles of Large Language Model (LLM) Tuning
Large Language Models (LLMs) like GPT-2 and GPT-3 are trained with unsupervised pre-training on billions to trillions of tokens. After pre-training, the models are fine-tuned for specific use cases such as chatbots or content generation. Fine-tuning can be done through supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF). SFT minimizes the loss between the model's output and the reference output, while RLHF uses a reward model to optimize the model's behavior. InstructGPT is an RLHF-tuned version of GPT-3 that is trained to follow instructions and provide aligned responses. There are also open-source alternatives to the GPT models, such as GPT-J and GPT-Neo.
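The SFT objective mentioned in this summary is commonly implemented as token-level cross-entropy. The sketch below assumes that formulation and uses plain Python lists in place of model tensors; the function name and the tiny three-token vocabulary are illustrative only.

```python
import math

def sft_loss(logits_per_step, target_ids):
    """Mean negative log-likelihood of the reference tokens.

    logits_per_step: one list of vocabulary logits per output position.
    target_ids: the reference token id at each position.
    """
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        m = max(logits)
        # Numerically stable log of the softmax normalizer.
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(target_ids)

# Two output positions over a three-token vocabulary.
loss = sft_loss([[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]], [0, 2])
```

The loss shrinks as the model puts more probability on each reference token, which is the sense in which SFT pulls the model's output toward the correct result.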
Subcategories
- fundamentals (6)
- llm (6)
- policy_methods (3)