/r/reinforcementlearning
Reinforcement learning is a subfield of AI/statistics focused on exploring/understanding complicated environments and learning how to optimally acquire rewards. Examples are AlphaGo, clinical trials & A/B tests, and Atari game playing.
This is for any reinforcement learning related work ranging from purely computational RL in artificial intelligence to the models of RL in neuroscience.
The standard introduction to RL is Sutton & Barto's Reinforcement Learning.
I'm working on a chess RL project where two agents trained with different algorithms play against each other. I'm wondering what the best way to train the agents is. Should I just let them play against each other, train each agent separately against an opponent that takes random actions, or have them train separately with both players in a game using the same algorithm? Any advice would be helpful.
Is it possible to combine Mixture of Experts (MoE) with Reinforcement Learning (RL)? Does it make sense to train an agent that can choose which expert or experts to activate based on the input?
I have a more complex idea in mind: I want to integrate MoE and RL with Low-Rank Adaptation (LoRA). The plan is to have several LoRA modules and for the agent to identify the most suitable modules depending on the input. I aim to apply this approach to various NLP tasks. Does this make sense?
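To make the idea concrete, here is a rough sketch of the gating part I have in mind (all names are hypothetical, and for simplicity it activates a single module):

import torch
import torch.nn as nn

class LoRAGate(nn.Module):
    # Hypothetical sketch: a small policy that scores the available LoRA modules
    # for a given input encoding and samples which one to activate.
    def __init__(self, input_dim, n_lora_modules):
        super().__init__()
        self.scorer = nn.Linear(input_dim, n_lora_modules)

    def forward(self, x):
        dist = torch.distributions.Categorical(logits=self.scorer(x))
        module_idx = dist.sample()                    # the RL action: which LoRA module to use
        return module_idx, dist.log_prob(module_idx)  # log-prob for a policy-gradient update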
I started with this simple gym env to learn about RL. I'm observing the same trend in both the PG loss and the rewards. Shouldn't I expect the loss to decrease while the reward increases?
I calculated the loss over a batch of (state, action, reward) triplets using:
loss = -torch.mean(logits * rewards)
The rewards are calculated with the rewards-to-go formula with discounting, something like:
out = torch.zeros(size=rewards.size(), dtype=torch.float32)
out[-1] = rewards[-1]
for i in range(rewards.size(-1) - 1)[::-1]:
    out[i] = rewards[i] + r * out[i + 1]
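For comparison, my understanding is that the standard REINFORCE loss uses the log-probabilities of the actions actually taken rather than the raw logits; a minimal sketch (variable names are mine, not from my code above):

import torch
import torch.nn.functional as F

def reinforce_loss(logits, actions, rewards_to_go):
    # Hypothetical sketch. logits: (batch, n_actions) raw policy outputs,
    # actions: (batch,) indices of the sampled actions, rewards_to_go: (batch,).
    log_probs = F.log_softmax(logits, dim=-1)
    taken_log_probs = log_probs.gather(1, actions.long().unsqueeze(1)).squeeze(1)
    return -(taken_log_probs * rewards_to_go).mean()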
Does it ever really make sense to use a model based algorithm when we have access to large-scale environment parallelization?
Basically, I am wondering if there are environments that algorithms like DreamerV3 can solve that PPO just can't? For example, the DreamerV3 paper shows that PPO and IMPALA don't solve the most difficult Minecraft tasks, but would PPO solve these tasks eventually, given massive compute?
Are there any reasons to use model based algorithms besides reducing sample complexity?
I'm working on an agent for the miniwob++ gymnasium environment. I've never worked with a non-flat action space like this and was wondering if anyone has guidance on the best way to formulate it, given that different parameters are necessary depending on which action is being taken. I'm also trying to tackle this using PyTorch, if that's relevant.
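One formulation I've been considering is a shared encoder with one head that picks the action type and separate heads for that action's parameters; a rough PyTorch sketch (all names and dimensions below are made up, not taken from miniwob++):

import torch
import torch.nn as nn

class ParamActionPolicy(nn.Module):
    # Hypothetical sketch of a two-level policy: first choose an action type
    # (e.g. click vs. type-text), then choose that action's parameters.
    def __init__(self, obs_dim, n_action_types, n_elements):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.type_head = nn.Linear(128, n_action_types)   # which kind of action
        self.element_head = nn.Linear(128, n_elements)    # which element/parameter it applies to

    def forward(self, obs):
        h = self.encoder(obs)
        type_dist = torch.distributions.Categorical(logits=self.type_head(h))
        elem_dist = torch.distributions.Categorical(logits=self.element_head(h))
        a_type, a_elem = type_dist.sample(), elem_dist.sample()
        log_prob = type_dist.log_prob(a_type) + elem_dist.log_prob(a_elem)
        return (a_type, a_elem), log_prob

Parameters that don't apply to the chosen action type could then simply be ignored by the environment wrapper.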
Hi,
I'm a bit new to masked actions in RL. I have an RL problem where my action space sums to 1 (a resource allocation problem), but as the time step t increases, I have fewer resources I can use, i.e., the first t indices must be equal to zero.
I'm trying to use a Dirichlet distribution and zero out the first indices in the environment, but I keep getting sub-optimal results.
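Concretely, the masking I'm attempting looks roughly like this (a simplified sketch, not my actual RLlib code; it assumes the policy outputs strictly positive concentration parameters, e.g. via softplus):

import torch
from torch.distributions import Dirichlet

def masked_allocation(concentration, valid_mask):
    # Hypothetical sketch: sample a Dirichlet only over the still-valid indices
    # and scatter it back, so masked entries are exactly zero and the rest sum to 1.
    valid_concentration = concentration[valid_mask]
    allocation = torch.zeros_like(concentration)
    allocation[valid_mask] = Dirichlet(valid_concentration).sample()
    return allocation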
My reward function is a convex combination of two terms: one tries to optimize credibility and the other tries to optimize profitability. For two specific scenarios I know what the optimum is, but for all other scenarios I don't.
I'm using RLlib, and I've tried PPO (with an attention-based model) and SAC. Somehow SAC works better than PPO, I'm not sure why, but it's still sub-optimal.
Any ideas on what I'm doing wrong or what I should try?
Thanks in advance!
Hi, I know this is an RL sub, but does anyone have an idea of any small pre-trained LLMs, roughly 5-12 GB in size?
It should be able to just answer very basic stuff and that's about it. If so, please share.
Thanks in advance :))
Let me preface this by saying I'm fairly new to RL. My previous work was with LLMs, where it is very common to rank and stack your model against the universe of models based on how it performs on given benchmarks (and these benchmarks are a huge deal; startups raise serious $$$ just based on their scores).
Recently started training models in MuJoCo environments and I'm trying to figure out if my algorithms are performing somewhat decently. Sure, I can get Ant-v5 to walk using SB3's default PPO and MlpPolicy, but how good is it really?
Is there some benchmark or repo where I can compare my results against the learning curve of other people's algorithms using the default MuJoCo (or any of the other gyms') reward functions? Of course the assumption would be that we are using the same environment and reward function, but given Gymnasium is popular and offers good defaults, I'd imagine there should be a lot of data available.
I've googled around and have only found sparse results. Is there a reason why benchmarks are not as big in RL as they are with LLMs?
Genetic algorithms (GA) have been around for a long time (since roughly the 1960s). They seem both incredibly intuitive and especially useful for black-box problems, but they aren't currently "mainstream". In 2017, OpenAI was very bullish on evolutionary algorithms and cited their benefits: they are parallelizable, robust, and able to deal with long time-scale problems with unclear value/fitness functions. Have there been any more recent updates? Which algorithms are beating out GA?
For low-dimensional problems, Bayesian optimization may have better statistical guarantees/asymptotics. Are there even any guarantees for GA, or are we completely in the dark?
Hi there,
I have a very advanced and difficult problem at work. The scenario is a 2D discrete grid system whose size can vary. In addition, I have multiple different square boxes that can be placed onto this grid. Each box can have different solid colours on its sides. I need to find the placement that maximizes the number of same-colour abutments when two different square boxes are placed next to each other.
I have been looking at the AlphaZero approach, but in my problem there is no specified number of pieces and each box can have different colours. Any suggestions on how to approach this problem?
Hello guys,
I am a beginner in reinforcement learning. I'm currently working on DQN and running into issues with the actions the DQN predicts: it sometimes predicts actions that are not valid in the current state.
One solution is to replace an action that is not valid with action 0 (NOOP), which means doing nothing.
Is there any other, more efficient way to handle this situation?
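Another option I'm considering is masking the Q-values of invalid actions before taking the argmax, so they can never be selected; a rough sketch (function and variable names are mine):

import torch

def masked_greedy_action(q_values, valid_action_mask):
    # Hypothetical sketch. q_values: (batch, n_actions),
    # valid_action_mask: (batch, n_actions) boolean, True for valid actions.
    masked_q = q_values.masked_fill(~valid_action_mask, float("-inf"))
    return masked_q.argmax(dim=1)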
Thank you in advance
From what I’ve seen, PPO generally has been the go-to algorithm for reinforcement learning tasks. How come DreamerV3 isn’t used instead? It seems to be a lot more stable and requires less tuning of hyperparameters.
I'm really not sure whether I'm wrong here, but my understanding is that equivariant and invariant graph neural networks can handle this effectively, except for rotational equivariance.
When I heard this term, I kept wondering whether GAT should add an extra step to become invariant to rotational transformations, so that the output 3D features are invariant regardless of differences in position, for the task of regression prediction on PCQM4Mv2 with 3D graphs.
My pipeline:
Node features = features extracted from the dataset
Edge features = 3D Euclidean distances
Hi all :)
I am currently pursuing my master's and have always focused on RL. I have done now several projects in the field and I am also working on a publication currently with some collaborators on RL in uncertain stochastic environments which can be modeled as graphs. The field of Operations Research came to my attention when I did my undergrad thesis where I framed a stochastic combinatorial problem as an RL problem and solved it using Graph Neural Nets. During that time and since then I have read some very recent publications that try to solve traditional OR problems with modern RL approaches (e.g. https://arxiv.org/abs/2312.15658).
I think this is in general still underexplored and at the same time very interesting. More modern neural net architectures such as GNNs seem to be perfectly equipped to be combined with RL for many OR problems. I would therefore also like to focus on graph machine learning methods (such as GNNs), but I am also open to any other modeling approach. Additionally, there is also an OR gym repository (https://github.com/hubbs5/or-gym) which I would like to explore and use to run some experiments on new approaches.
Hence, what I am looking for are some people who would want to join me and work together to solve some of these problems with more modern RL-based approaches. I have not thought about how much time to invest per week in these projects since I logically still have a lot to do on the side (uni, work, publication). So, if there is interest we could connect and find a good setup that would fit all of us :)
I personally have a lot of experience in designing, building, and pipelining large-scale neural nets and would be very happy to collaborate with people from various backgrounds.
I'd love to hear from you!
I have an agent whose actions must follow a sequential order: A, B, C. However, the environment is unstable and unpredictable. The agent takes actions based on the current situation, completing its sequence (A, B, C). After finishing the sequence, the agent receives a reward based on the work completed. It then waits for the environment and analyses the next situation before deciding on and executing the next set of actions. How can we use RL in this scenario? How can we train a model to have the situational awareness to choose an action? Each action is equally important for achieving a good outcome.
In summary, the environment is unpredictable, but we have to find some hidden pattern in order to take this sequence of actions.
Thank you in advance!
I am new to reinforcement learning. Using torchrl as a library, I have created the PPO code as per the following tutorial and confirmed that InvertedDoublePendulum can be trained off-policy.
Next I wanted to try learning with Ant-v4, so I changed the environment name to Ant-v4 and started training. It did not seem to be learning well, so I set `frame_skip` to 5, `frames_per_batch` to `50_000 // frame_skip`, `total_frames` to `60_000_000 // frame_skip`, and `sub_batch_size` to 2500 to increase the training volume (the other hyperparameters remain the same as in the tutorial).
However, the Ant's rewards hover at a very low level and it does not learn to walk, as shown in the following video. Given that increasing the amount of training doesn't work, I think it's an issue with the hyperparameter settings, but I didn't find any good insights from Google.
What am I missing in my code?
Hi, I'm currently programming an AI that is supposed to learn Tic-Tac-Toe using Q-learning. My problem is that the model learns a bit at the start but then gets worse and doesn't improve. I'm using
old_qvalue + self.alpha * (reward + self.gamma * max_qvalue_nextstate - old_qvalue)
to update the Q-values, with alpha at 0.3 and gamma at 0.9. I also use an epsilon-greedy strategy with a decaying epsilon that starts at 0.9, decreases by 0.0005 per turn, and stops decreasing at 0.1. The opponent is a minimax algorithm. I didn't find any flaws in the code, and ChatGPT didn't either, so I'm wondering what I'm doing wrong. If anyone has any tips, I would appreciate them. The code is unfortunately in German, and I don't have a GitHub account set up right now.
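For context, my update step does roughly the following (a rough English sketch with made-up names; my actual code is structured differently):

def q_update(Q, state, action, reward, next_state, next_actions, alpha=0.3, gamma=0.9):
    # Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    # Q is a dict keyed by (state, action); next_actions are the legal moves in next_state.
    old_q = Q.get((state, action), 0.0)
    max_next = max((Q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    Q[(state, action)] = old_q + alpha * (reward + gamma * max_next - old_q)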
I wrote a Soft Actor-Critic algorithm that I want to use later for autonomous driving with the CARLA simulator. My code manages to solve simple tasks, but when I try it on CARLA I get poor performance, even though I based my rewards on master's theses that achieve better performance. It would be nice if someone could check my code for math or programming errors. You can find my code on GitHub: https://github.com/b-gtr/Soft-Actor-Critic
I am testing PPO on a trivial task with continuous actions (Gaussian distributions), in the actor-critic setup. The actor network is learning quickly the optimal value for the mean, but the std keeps increasing (the optimal solution is deterministic, the higher the std the worse the return). What could be the reason for this to happen? I don't use any bonus for exploration. The std is state-independent.
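For reference, the parameterization I'm using is essentially a state-independent log-std alongside a mean network, roughly like this (a sketch, not my exact code):

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    # Hypothetical sketch: the mean is state-dependent, the std is a free learned parameter.
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())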
import gymnasium as gym
from random import random
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from replay_memory import ReplayMemory
from transition import Transition
ENV = "CartPole-v1"
ENV = "FrozenLake-v1"
device = torch.device(
"cuda"
if torch.cuda.is_available()
else "mps" if torch.backends.mps.is_available() else "cpu"
)
print(f"Using device: {device}")
TAU = 0.005
class MLP(nn.Module):
def __init__(self, input_dim, output_dim, device):
super().__init__()
self.layer_1 = nn.Linear(input_dim, 128).to(device)
self.layer_2 = nn.Linear(128, 128).to(device)
self.layer_3 = nn.Linear(128, output_dim).to(device)
def forward(self, x):
x = F.relu(self.layer_1(x))
x = F.relu(self.layer_2(x))
return self.layer_3(x)
def init_weights(m):
if isinstance(m, nn.Linear):
print('ran xavier')
nn.init.xavier_uniform_(m.weight)
nn.init.constant_(m.bias, 0)
class Agent:
def __init__(self):
# self.env = gym.make(ENV, render_mode="human", is_slippery=False)
self.env = gym.make(ENV, render_mode=None, is_slippery=False)
# self.env = gym.make(ENV, render_mode="human")
self.device = torch.device(
"cuda"
if torch.cuda.is_available()
else "mps" if torch.backends.mps.is_available() else "cpu"
)
state, _ = self.env.reset()
if isinstance(state, int):
state = [state]
self.state_space_dim = len(state)
self.action_space_dim = self.env.action_space.n
self.epsilon = 0.8
self.gamma = 0.99
self.batch_size = 64
self.replay_memory = ReplayMemory(capacity=10000)
self.policy_network = MLP(
self.state_space_dim, self.action_space_dim, device=self.device
).to(self.device)
self.target_network = MLP(
self.state_space_dim, self.action_space_dim, device=self.device
).to(self.device)
self.policy_network.apply(init_weights)
self.target_network.apply(init_weights)
self.target_network.load_state_dict(self.policy_network.state_dict())
self.optimizer = optim.AdamW(
self.policy_network.parameters(), lr=0.001, amsgrad=True
)
def select_action(self, state: torch.Tensor):
if random() < self.epsilon:
return torch.tensor(
[[self.env.action_space.sample()]],
device=self.device,
dtype=torch.long,
)
else:
with torch.no_grad():
return self.policy_network(state).max(1).indices.unsqueeze(0)
def train(self, n_episodes=100):
episode_rewards = []
episode_loss = []
total_steps = 0
for episode in range(n_episodes):
self.epsilon = max(0.05, self.epsilon * 0.99)
state, _ = self.env.reset()
if isinstance(state, int):
state = [state]
total_episode_reward = 0
max_episode_loss = 0
while True:
total_steps += 1
action = self.select_action(
torch.tensor(
state, device=self.device, dtype=torch.float
).unsqueeze(0)
)
# print(f'at state: {state} \n'
# f'q table: {self.target_network(torch.tensor(state, dtype=torch.float32, device=self.device))}')
next_state, reward, terminated, truncated, _ = self.env.step(
int(action.item())
)
if isinstance(next_state, int):
next_state = [next_state]
if terminated and reward == 0:
reward = -10
if terminated and reward == 1:
reward = 100
if state == next_state:
reward -= 0.01
else:
reward += 0.01
total_episode_reward += reward
done = terminated or truncated
if terminated:
next_state = None
self.replay_memory.push(Transition(
torch.tensor(state, dtype=torch.float32,
device=self.device).unsqueeze(0),
torch.tensor([action], dtype=torch.float32,
device=self.device),
torch.tensor(next_state, dtype=torch.float32, device=self.device).unsqueeze(
0) if next_state is not None else None,
torch.tensor([reward], dtype=torch.float32,
device=self.device)
))
state = next_state
if len(self.replay_memory) > self.batch_size:
loss = self.optimize(
self.replay_memory.sample(self.batch_size))
max_episode_loss = max(max_episode_loss, loss)
# Soft update.
# if total_steps % 10 == 0:
# policy_network_state_dict = self.policy_network.state_dict()
# target_network_state_dict = self.target_network.state_dict()
# for key in policy_network_state_dict:
# target_network_state_dict[key] = (
# policy_network_state_dict[key] * TAU
# + (1 - TAU) * target_network_state_dict[key]
# )
# self.target_network.load_state_dict(target_network_state_dict)
if done:
episode_rewards.append(total_episode_reward)
episode_loss.append(max_episode_loss)
print(
f"episode: {episode}\n"
f"epsilon: {self.epsilon}\n"
f"reward: {total_episode_reward}\n"
f"loss: {max_episode_loss}\n"
)
break
# Hard update.
if episode % 25 == 0:
policy_network_state_dict = self.policy_network.state_dict()
self.target_network.load_state_dict(policy_network_state_dict)
return [episode_rewards, episode_loss]
def run(self):
self.env = gym.make(ENV, render_mode="human", is_slippery=False)
# self.env = gym.make(ENV, render_mode="human")
for _ in range(1):
state, _ = self.env.reset()
while True:
if isinstance(state, int):
state = [state]
action = (
self.target_network(
torch.tensor(state, dtype=torch.float32, device=self.device))
.max(0)
.indices
)
next_state, _, terminated, truncated, _ = self.env.step(
action.item()
)
print(f'at state: {state} \n'
f'q table: {self.target_network(torch.tensor(state, dtype=torch.float32, device=self.device))}\n'
f'action: {action}\n'
f'action.item(): {action.item()}\n'
f'next state: {next_state}')
state = next_state
if terminated or truncated:
break
def optimize(self, batch: list[Transition]):
states = torch.cat(
[
transition.state
for transition in batch
]
)
actions = torch.cat(
[
transition.action
for transition in batch
]
)
next_states = [transition.next_state for transition in batch]
rewards = torch.cat(
[
transition.reward
for transition in batch
]
)
# Compute Q(s_t, a), the current estimate of Q-values for all actions in the batch.
# Shape: (batch_size, action_dim)
# Example Q-values for a batch of states:
# [
# [1.2, 2.0], # Q-values for actions in state 1
# [0.2, 1.2], # Q-values for actions in state 2
# [1.3, 1.1], # Q-values for actions in state 3
# ]
q_estimates = self.policy_network(states)
# Extract Q(s_t, a) for the actions actually taken in each state
# Shape: (batch_size, 1)
# Example:
# actions = [[1], [0], [1]] (actions taken in states 1, 2, 3 respectively)
# The gathered values would be:
# [
# [2.0], # Q-value for action 1 in state 1
# [0.2], # Q-value for action 0 in state 2
# [1.1], # Q-value for action 1 in state 3
# ]
state_action_values = q_estimates.gather(
1, actions.unsqueeze(1).long())
# Create a mask to identify which next states are non-final (i.e., states that have a valid next state).
# Shape: (batch_size,)
# Example:
# [
# True, # State 1 has a valid next state
# False, # State 2 is a final state
# True, # State 3 has a valid next state
# ]
non_final_next_state_mask = torch.tensor(
[
next_state is not None
for next_state in next_states
], dtype=torch.bool, device=self.device
)
# Filter and concatenate only the non-final next states into a single tensor for further processing.
# Shape: (filtered_batch_size, state_dim)
# Example filtered states (assuming state_dim = 4):
# [
# [1.2, 1.3, 2.2, 3.0], # Next state for state 1
# [0.2, -3.1, 1.1, 3.0], # Next state for state 3
# ]
non_final_next_states = torch.cat(
[next_state
for next_state in filter(lambda s: s is not None, next_states)]
)
# Compute the target Q-values using the Bellman equation:
# Q_target(s_t, a_t) = r + gamma * max_a Q(s_{t+1}, a').
# state_action_targets = torch.zeros(len(batch), device=self.device)
state_action_targets = torch.zeros(len(batch), device=self.device)
# Mask update such that:
# For non-final states, we compute gamma * max_a Q(s_{t+1}, a') using the target network.
# For final states, this contribution is 0 because there is no valid next state.
with torch.no_grad():
state_action_targets[non_final_next_state_mask] = self.target_network(
non_final_next_states).max(1).values
target = rewards + (self.gamma * state_action_targets)
# Huber loss
criterion = nn.SmoothL1Loss()
loss = criterion(state_action_values,
target.unsqueeze(1))
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_value_(self.policy_network.parameters(), 10)
self.optimizer.step()
return loss.item()
agent = Agent()
rewards, loss = agent.train(n_episodes=200)
print(rewards, loss)
agent.run()
I tried modifying the reward function to generate more diverse samples, but no matter what, I cannot get it to converge. Could it be that I need to one-hot encode the state?
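Concretely, the one-hot encoding I have in mind would look something like this (a sketch; names are mine):

import torch

def one_hot_state(state_index, n_states, device):
    # Encode a discrete FrozenLake state index as a (1, n_states) one-hot vector.
    encoding = torch.zeros(1, n_states, dtype=torch.float32, device=device)
    encoding[0, state_index] = 1.0
    return encoding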
I have a question regarding how the expected return of a deterministic policy is written. I have seen that in some cases the Q-function is used, as shown in expression 5. However, I do not fully understand the steps by which it is obtained, in contrast to the stochastic policy case. What are the steps or reasoning to get expression 5?
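In case it helps frame the question, my current (possibly wrong) understanding, assuming expression 5 is the usual objective written with the Q-function, is:

J(\mu) = \mathbb{E}_{s_0 \sim p_0}\Big[\textstyle\sum_{t \ge 0} \gamma^t r(s_t, \mu(s_t))\Big] = \mathbb{E}_{s_0 \sim p_0}\big[V^{\mu}(s_0)\big] = \mathbb{E}_{s_0 \sim p_0}\big[Q^{\mu}(s_0, \mu(s_0))\big]

where the last step uses the fact that a deterministic policy involves no expectation over actions, so V^{\mu}(s) = Q^{\mu}(s, \mu(s)).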
How feasible is it to write a program that takes BVH data and attempts to imitate it with a MuJoCo humanoid? I assume the agent would have to be trained on a lot of data, and I have already identified the CMU dataset as a commonly used one. Can anyone point me to projects that implement this, or describe how it would be done? Thanks.
Hi everyone!
I'm currently training an ML-Agents PPO agent to escape a sequence of rooms, where each room has a specific time constraint. The agent manages to find its way out and complete each step, but it's taking too long to escape the rooms. I believe the training could be more efficient and faster with better parameter tuning. Below are my observations, screenshots from TensorBoard, and details of my setup.
Here are the parameters I’m using for the PPO agent:
behaviors:
NavigationAgentController:
trainer_type: ppo
hyperparameters:
batch_size: 1024
buffer_size: 102400
learning_rate: 3.0e-4
beta: 0.01
epsilon: 0.2
lambd: 0.95
num_epoch: 5
learning_rate_schedule: constant
beta_schedule: linear
epsilon_schedule: constant
network_settings:
normalize: true
hidden_units: 128
num_layers: 3
vis_encode_type: simple
memory:
sequence_length: 256
memory_size: 256
reward_signals:
extrinsic:
gamma: 0.99
strength: 1
curiosity:
gamma: 0.99
strength: 0.01
learning_rate: 0.0003
network_settings:
encoding_size: 256
num_layers: 4
max_steps: 1000000000000
time_horizon: 64
summary_freq: 20000
keep_checkpoints: 5
checkpoint_interval: 500000
torch_settings:
device: true
[Header("Reward Settings")]
public float fastPlateReward = 1.0f;
public float mediumPlateReward = 0.9f;
public float slowPlateReward = 0.8f;
public float explorationReward = 0.05f;
public float fallPenalty = -0.5f;
public float wallPenalty = -0.1f;
public float groundPenalty = -0.05f;
public float timePenalty = -0.05f;
[Header("Jump Penalty Settings")]
public float jumpPenalty = -0.05f; // Penalty for jumping
public float jumpPenaltyInterval = 1f; // Interval to apply the penalty
[Header("Reward for Escaping the Room")]
public float escapeRoomReward = 5.0f;
private bool rewardGranted = false;
[Header("Door Reward Settings")]
public float doorProximityReward = 0.5f; // Reward per proximity unit
public float reachingDoorReward = 2.0f; // Fixed reward for reaching the door
public float maxRewardDistance = 10.0f; // Maximum distance considered for the reward
public float distanceToReachDoor = 2.0f; // Distance to consider the door as reached
Would love your feedback on this! Thanks in advance for taking the time to help out.
I am a beginner in Reinforcement Learning and am using the Soft Actor-Critic (SAC) method for smart charging optimization of Electric Vehicles. The goal is to optimize charging schedules for multiple EVs across discrete time slots to minimize costs while meeting battery and grid constraints. I have implemented a replay buffer with prioritized sampling and added techniques like priority decay and dynamic sampling to enhance training stability and address potential overfitting. However, I am unsure if overfitting is occurring and how to determine an appropriate stopping criterion based on the gap between training and evaluation rewards. I would appreciate guidance on improving the model’s learning and ensuring better generalization.
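For reference, what I mean by prioritized sampling with priority decay is roughly the following (a simplified sketch, not my actual implementation):

import numpy as np

class PrioritizedReplayBuffer:
    # Simplified sketch: sample transitions with probability proportional to priority**alpha,
    # applying a multiplicative decay to the stored priorities over time.
    def __init__(self, capacity, alpha=0.6, decay=0.999):
        self.capacity, self.alpha, self.decay = capacity, alpha, decay
        self.data, self.priorities = [], []

    def push(self, transition, priority=1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        self.priorities = [p * self.decay for p in self.priorities]  # priority decay
        probs = np.asarray(self.priorities) ** self.alpha
        probs /= probs.sum()
        indices = np.random.choice(len(self.data), size=batch_size, p=probs)
        return [self.data[i] for i in indices], indices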
I have some questions in my head about how reinforcement learning (RL) models handle sequences during training and testing:
List some useful resources as well.
Thank you in advance.