/r/reinforcementlearning
Reinforcement learning is a subfield of AI/statistics focused on exploring/understanding complicated environments and learning how to optimally acquire rewards. Examples are AlphaGo, clinical trials & A/B tests, and Atari game playing.
This is for any reinforcement-learning-related work, ranging from purely computational RL in artificial intelligence to models of RL in neuroscience.
The standard introduction to RL is Sutton & Barto's Reinforcement Learning.
Hi, I'm currently programming an AI that is supposed to learn Tic-Tac-Toe using Q-learning. My problem is that the model learns a bit at the start, but then gets worse and doesn't improve. I'm using
old_qvalue + self.alpha * (reward + self.gamma * max_qvalue_nextstate - old_qvalue)
to update the Q-values, with alpha at 0.3 and gamma at 0.9. I also use an epsilon-greedy strategy with a decaying epsilon that starts at 0.9, is decreased by 0.0005 per turn, and stops decreasing at 0.1. The opponent is a minimax algorithm. I didn't find any flaws in the code and ChatGPT didn't either, so I'm wondering what I'm doing wrong. If anyone has any tips I would appreciate them. The code is unfortunately in German and I don't have a GitHub account set up right now.
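For reference, a minimal sketch of the update rule and decay schedule as described above (variable names are illustrative; the original code is in German and not posted):

```python
import random

ALPHA, GAMMA = 0.3, 0.9
EPS_START, EPS_MIN, EPS_DECAY = 0.9, 0.1, 0.0005

def q_update(q_table, state, action, reward, next_state, next_actions):
    # Tabular Q-learning update; for a terminal next_state pass an empty
    # next_actions so the bootstrap term is zero.
    max_next = max((q_table.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    old_q = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old_q + ALPHA * (reward + GAMMA * max_next - old_q)

def epsilon_greedy(q_table, state, actions, epsilon):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```

One thing worth double-checking in a setup like this: the transition fed into the update should be the board as the learner sees it again after the minimax opponent has replied, and terminal transitions must not bootstrap; getting either of those wrong is a common source of the "learns a bit, then degrades" pattern.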
I wrote a Soft Actor-Critic algorithm that I want to use later for autonomous driving with the CARLA simulator. My code manages to solve simple tasks, but when I try it on CARLA I get poor performance, even though I base my rewards on master's theses, and those theses report better performance. It would be nice if someone could check my code for math or programming errors. You can find my code on GitHub: https://github.com/b-gtr/Soft-Actor-Critic
I am testing PPO on a trivial task with continuous actions (Gaussian distributions), in the actor-critic setup. The actor network quickly learns the optimal value for the mean, but the std keeps increasing (the optimal solution is deterministic; the higher the std, the worse the return). What could be the reason for this? I don't use any bonus for exploration. The std is state-independent.
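For anyone reproducing this, a minimal sketch of the kind of actor head being described (state-independent log-std stored as a free parameter; the names are illustrative, not from the poster's code):

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )
        # State-independent log-std, learned directly as a parameter.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())
```

With this parameterization the only pressure on log_std comes from the PPO surrogate itself, so things worth double-checking are that the log-probabilities in the ratio come from the same distribution at rollout and update time and that the advantages have the expected sign; either being off can push the std upward even without an entropy bonus.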
```python
import gymnasium as gym
from random import random
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from replay_memory import ReplayMemory
from transition import Transition
ENV = "CartPole-v1"
ENV = "FrozenLake-v1"
device = torch.device(
"cuda"
if torch.cuda.is_available()
else "mps" if torch.backends.mps.is_available() else "cpu"
)
print(f"Using device: {device}")
TAU = 0.005
class MLP(nn.Module):
def __init__(self, input_dim, output_dim, device):
super().__init__()
self.layer_1 = nn.Linear(input_dim, 128).to(device)
self.layer_2 = nn.Linear(128, 128).to(device)
self.layer_3 = nn.Linear(128, output_dim).to(device)
def forward(self, x):
x = F.relu(self.layer_1(x))
x = F.relu(self.layer_2(x))
return self.layer_3(x)
def init_weights(m):
if isinstance(m, nn.Linear):
print('ran xavier')
nn.init.xavier_uniform_(m.weight)
nn.init.constant_(m.bias, 0)
class Agent:
def __init__(self):
# self.env = gym.make(ENV, render_mode="human", is_slippery=False)
self.env = gym.make(ENV, render_mode=None, is_slippery=False)
# self.env = gym.make(ENV, render_mode="human")
self.device = torch.device(
"cuda"
if torch.cuda.is_available()
else "mps" if torch.backends.mps.is_available() else "cpu"
)
state, _ = self.env.reset()
if isinstance(state, int):
state = [state]
self.state_space_dim = len(state)
self.action_space_dim = self.env.action_space.n
self.epsilon = 0.8
self.gamma = 0.99
self.batch_size = 64
self.replay_memory = ReplayMemory(capacity=10000)
self.policy_network = MLP(
self.state_space_dim, self.action_space_dim, device=self.device
).to(self.device)
self.target_network = MLP(
self.state_space_dim, self.action_space_dim, device=self.device
).to(self.device)
self.policy_network.apply(init_weights)
self.target_network.apply(init_weights)
self.target_network.load_state_dict(self.policy_network.state_dict())
self.optimizer = optim.AdamW(
self.policy_network.parameters(), lr=0.001, amsgrad=True
)
def select_action(self, state: torch.Tensor):
if random() < self.epsilon:
return torch.tensor(
[[self.env.action_space.sample()]],
device=self.device,
dtype=torch.long,
)
else:
with torch.no_grad():
return self.policy_network(state).max(1).indices.unsqueeze(0)
def train(self, n_episodes=100):
episode_rewards = []
episode_loss = []
total_steps = 0
for episode in range(n_episodes):
self.epsilon = max(0.05, self.epsilon * 0.99)
state, _ = self.env.reset()
if isinstance(state, int):
state = [state]
total_episode_reward = 0
max_episode_loss = 0
while True:
total_steps += 1
action = self.select_action(
torch.tensor(
state, device=self.device, dtype=torch.float
).unsqueeze(0)
)
# print(f'at state: {state} \n'
# f'q table: {self.target_network(torch.tensor(state, dtype=torch.float32, device=self.device))}')
next_state, reward, terminated, truncated, _ = self.env.step(
int(action.item())
)
if isinstance(next_state, int):
next_state = [next_state]
if terminated and reward == 0:
reward = -10
if terminated and reward == 1:
reward = 100
if state == next_state:
reward -= 0.01
else:
reward += 0.01
total_episode_reward += reward
done = terminated or truncated
if terminated:
next_state = None
self.replay_memory.push(Transition(
torch.tensor(state, dtype=torch.float32,
device=self.device).unsqueeze(0),
torch.tensor([action], dtype=torch.float32,
device=self.device),
torch.tensor(next_state, dtype=torch.float32, device=self.device).unsqueeze(
0) if next_state is not None else None,
torch.tensor([reward], dtype=torch.float32,
device=self.device)
))
state = next_state
if len(self.replay_memory) > self.batch_size:
loss = self.optimize(
self.replay_memory.sample(self.batch_size))
max_episode_loss = max(max_episode_loss, loss)
# Soft update.
# if total_steps % 10 == 0:
# policy_network_state_dict = self.policy_network.state_dict()
# target_network_state_dict = self.target_network.state_dict()
# for key in policy_network_state_dict:
# target_network_state_dict[key] = (
# policy_network_state_dict[key] * TAU
# + (1 - TAU) * target_network_state_dict[key]
# )
# self.target_network.load_state_dict(target_network_state_dict)
if done:
episode_rewards.append(total_episode_reward)
episode_loss.append(max_episode_loss)
print(
f"episode: {episode}\n"
f"epsilon: {self.epsilon}\n"
f"reward: {total_episode_reward}\n"
f"loss: {max_episode_loss}\n"
)
break
# Hard update.
if episode % 25 == 0:
policy_network_state_dict = self.policy_network.state_dict()
self.target_network.load_state_dict(policy_network_state_dict)
return [episode_rewards, episode_loss]
def run(self):
self.env = gym.make(ENV, render_mode="human", is_slippery=False)
# self.env = gym.make(ENV, render_mode="human")
for _ in range(1):
state, _ = self.env.reset()
while True:
if isinstance(state, int):
state = [state]
action = (
self.target_network(
torch.tensor(state, dtype=torch.float32, device=self.device))
.max(0)
.indices
)
next_state, _, terminated, truncated, _ = self.env.step(
action.item()
)
print(f'at state: {state} \n'
f'q table: {self.target_network(torch.tensor(state, dtype=torch.float32, device=self.device))}\n'
f'action: {action}\n'
f'action.item(): {action.item()}\n'
f'next state: {next_state}')
state = next_state
if terminated or truncated:
break
def optimize(self, batch: list[Transition]):
states = torch.cat(
[
transition.state
for transition in batch
]
)
actions = torch.cat(
[
transition.action
for transition in batch
]
)
next_states = [transition.next_state for transition in batch]
rewards = torch.cat(
[
transition.reward
for transition in batch
]
)
# Compute Q(s_t, a), the current estimate of Q-values for all actions in the batch.
# Shape: (batch_size, action_dim)
# Example Q-values for a batch of states:
# [
# [1.2, 2.0], # Q-values for actions in state 1
# [0.2, 1.2], # Q-values for actions in state 2
# [1.3, 1.1], # Q-values for actions in state 3
# ]
q_estimates = self.policy_network(states)
# Extract Q(s_t, a) for the actions actually taken in each state
# Shape: (batch_size, 1)
# Example:
# actions = [[1], [0], [1]] (actions taken in states 1, 2, 3 respectively)
# The gathered values would be:
# [
# [2.0], # Q-value for action 1 in state 1
# [0.2], # Q-value for action 0 in state 2
# [1.1], # Q-value for action 1 in state 3
# ]
state_action_values = q_estimates.gather(
1, actions.unsqueeze(1).long())
# Create a mask to identify which next states are non-final (i.e., states that have a valid next state).
# Shape: (batch_size,)
# Example:
# [
# True, # State 1 has a valid next state
# False, # State 2 is a final state
# True, # State 3 has a valid next state
# ]
non_final_next_state_mask = torch.tensor(
[
next_state is not None
for next_state in next_states
], dtype=torch.bool, device=self.device
)
# Filter and concatenate only the non-final next states into a single tensor for further processing.
# Shape: (filtered_batch_size, state_dim)
# Example filtered states (assuming state_dim = 4):
# [
# [1.2, 1.3, 2.2, 3.0], # Next state for state 1
# [0.2, -3.1, 1.1, 3.0], # Next state for state 3
# ]
non_final_next_states = torch.cat(
[next_state
for next_state in filter(lambda s: s is not None, next_states)]
)
# Compute the target Q-values using the Bellman equation:
# Q_target(s_t, a_t) = r + gamma * max_a Q(s_{t+1}, a').
state_action_targets = torch.zeros(len(batch), device=self.device)
# Mask update such that:
# For non-final states, we compute gamma * max_a Q(s_{t+1}, a') using the target network.
# For final states, this contribution is 0 because there is no valid next state.
with torch.no_grad():
state_action_targets[non_final_next_state_mask] = self.target_network(
non_final_next_states).max(1).values
target = rewards + (self.gamma * state_action_targets)
# Huber loss
criterion = nn.SmoothL1Loss()
loss = criterion(state_action_values,
target.unsqueeze(1))
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_value_(self.policy_network.parameters(), 10)
self.optimizer.step()
return loss.item()
agent = Agent()
rewards, loss = agent.train(n_episodes=200)
print(rewards, loss)
agent.run()
```
I tried modifying the reward function to generate more diverse samples, but no matter what I do, I cannot get it to converge. Could it be that I need to one-hot encode the state?
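On the last question: FrozenLake-v1 returns a single integer state index, so feeding it to the MLP as one float gives the network almost nothing to separate states with; one-hot encoding is the usual fix. A minimal sketch (assuming the default 4x4 map, i.e. `env.observation_space.n == 16`):

```python
import torch

def one_hot_state(state: int, n_states: int, device: torch.device) -> torch.Tensor:
    """Encode a discrete state index as a (1, n_states) one-hot float tensor."""
    encoded = torch.zeros(1, n_states, device=device)
    encoded[0, state] = 1.0
    return encoded

# Usage with the code above: build the networks with
#   input_dim = env.observation_space.n
# and replace every torch.tensor(state, ...) with
#   one_hot_state(state, env.observation_space.n, device).
```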
I have a question regarding how the expected return of a deterministic policy is written. I have seen that in some cases they use the Q-function, as shown in expression 5. However, I do not fully understand the steps by which it is obtained, in contrast to the stochastic policy. What are the steps or reasoning to get expression 5?
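A sketch of the usual reasoning, following the deterministic policy gradient literature (I don't have the poster's source, so matching this to its expression 5 is an assumption):

```latex
% Stochastic policy: the return is an expectation over states and over actions.
J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\pi}}\!\left[\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ Q^{\pi}(s, a) \right] \right]

% Deterministic policy: \pi_\theta(\cdot \mid s) puts all its mass on a = \mu_\theta(s),
% so the inner expectation over actions collapses to a single evaluation of Q.
J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ Q^{\mu}\!\big(s, \mu_\theta(s)\big) \right]
```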
How feasible is it to write a program that takes BVH data and attempts to imitate it with a MuJoCo humanoid? I assume the agent would have to be trained on a lot of data, and I have already identified the CMU dataset as a commonly used dataset. Can anyone point me to projects that implement this, or describe how it would be performed? Thanks.
Hi everyone!
I'm currently training an ML-Agents PPO agent to escape a sequence of rooms, where each room has a specific time constraint. The agent manages to find its way out and complete each step, but it's taking too long to escape the rooms. I believe the training could be more efficient and faster with better parameter tuning. Below are my observations, screenshots from TensorBoard, and details of my setup.
Here are the parameters I’m using for the PPO agent:
```yaml
behaviors:
NavigationAgentController:
trainer_type: ppo
hyperparameters:
batch_size: 1024
buffer_size: 102400
learning_rate: 3.0e-4
beta: 0.01
epsilon: 0.2
lambd: 0.95
num_epoch: 5
learning_rate_schedule: constant
beta_schedule: linear
epsilon_schedule: constant
network_settings:
normalize: true
hidden_units: 128
num_layers: 3
vis_encode_type: simple
memory:
sequence_length: 256
memory_size: 256
reward_signals:
extrinsic:
gamma: 0.99
strength: 1
curiosity:
gamma: 0.99
strength: 0.01
learning_rate: 0.0003
network_settings:
encoding_size: 256
num_layers: 4
max_steps: 1000000000000
time_horizon: 64
summary_freq: 20000
keep_checkpoints: 5
checkpoint_interval: 500000
torch_settings:
device: true
```
[Header("Reward Settings")]
public float fastPlateReward = 1.0f;
public float mediumPlateReward = 0.9f;
public float slowPlateReward = 0.8f;
public float explorationReward = 0.05f;
public float fallPenalty = -0.5f;
public float wallPenalty = -0.1f;
public float groundPenalty = -0.05f;
public float timePenalty = -0.05f;
[Header("Jump Penalty Settings")]
public float jumpPenalty = -0.05f; // Penalty for jumping
public float jumpPenaltyInterval = 1f; // Interval to apply the penalty
[Header("Reward for Escaping the Room")]
public float escapeRoomReward = 5.0f;
private bool rewardGranted = false;
[Header("Door Reward Settings")]
public float doorProximityReward = 0.5f; // Reward per proximity unit
public float reachingDoorReward = 2.0f; // Fixed reward for reaching the door
public float maxRewardDistance = 10.0f; // Maximum distance considered for the reward
public float distanceToReachDoor = 2.0f; // Distance to consider the door as reached
```
Would love your feedback on this! Thanks in advance for taking the time to help out.
I am a beginner in Reinforcement Learning and am using the Soft Actor-Critic (SAC) method for smart charging optimization of Electric Vehicles. The goal is to optimize charging schedules for multiple EVs across discrete time slots to minimize costs while meeting battery and grid constraints. I have implemented a replay buffer with prioritized sampling and added techniques like priority decay and dynamic sampling to enhance training stability and address potential overfitting. However, I am unsure if overfitting is occurring and how to determine an appropriate stopping criterion based on the gap between training and evaluation rewards. I would appreciate guidance on improving the model’s learning and ensuring better generalization.
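One simple way to turn the train/eval gap into a stopping criterion (a sketch under my own assumptions, not something specific to SAC or the EV setting): periodically run the current policy on held-out evaluation episodes and stop once evaluation reward has not improved for a while.

```python
class EvalEarlyStopping:
    """Stop training when evaluation reward has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_eval = float("-inf")
        self.bad_evals = 0

    def should_stop(self, eval_reward: float) -> bool:
        if eval_reward > self.best_eval + self.min_delta:
            self.best_eval = eval_reward
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

A training reward that keeps climbing while this evaluation reward stalls or drops is the usual practical signal people read as overfitting to the replayed scenarios.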
I have some questions in my head about how reinforcement learning (RL) models handle sequences during training and testing:
List some useful resources as well.
Thank you in advance.
Hi everyone,
I am working on my understanding of how different parameters impact the rate of convergence and if the algorithm converges to the optimal policy at all. In my experiment I am running a pretty simple grid world - essentially trying to find the end point in a maze. Reward is 0 everywhere except for the maze exit where it is 1. The environment is non-stationary and changes at 3 points indicated by blue dotted lines in the attached figure. I am not currently giving the agent enough time to adapt to those changes but that is not my main concern at the moment.
I am running a Q-learning agent with an epsilon-greedy policy. With a fixed epsilon of 0.25 the agent finds its way out quicker than with the decaying epsilon. However, if I compare the greedy policies based on the Q-values of these agents, by inspection I can see that the fixed-epsilon (0.25) policy is actually not optimal, while the decaying-epsilon agent's is. The agent with the fixed epsilon does indeed find the best route from its starting point; however, in states that aren't on that best path, the policy does not point in the optimal direction.
I would think that with a fixed epsilon the agent would actually explore more over time, so the Q-values of those states should be more fleshed out. So I think I am lacking a bit of intuition here; if anyone could help, that would be much appreciated!
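One diagnostic that may help build that intuition (a sketch, assuming a tabular Q array of shape (n_states, n_actions)): track how often each state is actually visited. Epsilon-greedy only perturbs the current trajectory one step at a time, so states far from the greedy path can stay rarely visited even with epsilon fixed at 0.25, and their Q-values (hence their greedy actions) stay unreliable.

```python
import numpy as np

N_STATES, N_ACTIONS = 25, 4          # e.g. a 5x5 maze with 4 moves; adjust to your grid
visit_counts = np.zeros(N_STATES, dtype=np.int64)

def greedy_policy(q_table: np.ndarray) -> np.ndarray:
    """One greedy action per state, from a Q-table of shape (N_STATES, N_ACTIONS)."""
    return q_table.argmax(axis=1)

# During training, do visit_counts[state] += 1 at every step; states with
# near-zero counts at the end are exactly the ones whose greedy actions you
# should not expect to be optimal, regardless of the epsilon schedule.
```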
I am thinking about whether it is possible to combine the actor-critic algorithm with music generation. I think it would be an interesting topic for my master's thesis. Tomorrow I should tell my advisor which topic I will be working on, and I am not sure if this one is feasible. Help please!
What I am looking for is the following:
Given my requirements, Isaac Lab seemed the perfect option, but unfortunately my hardware is not supported by Isaac Lab. Are there some other projects that specifically implement (dog-like) quadrupeds?
I have been working in Python using JAX for developing frameworks in DRL. I chose JAX for the immense speed-ups it is capable of. However, lately I have been considering moving away from Python and adopting a high-performance language, hoping for even greater speed. I am thinking of going with either C++ or Rust.
Which language would you suggest?
This is my university project. I am training a drone to navigate in an environment while avoiding obstacles. The approach is to use a depth-estimation model to ascertain the depth of the scene and determine the region with a clear path. The drone then learns to bring that region of interest to the centre of its FOV.
The approach uses a policy gradient method with discrete actions. The actions adjust roll, pitch, yaw, and target altitude. The inputs are all the sensor readings and the centroid of the pixels of the region of interest. The action selected by the policy network is then given to a PID controller, which controls the rotor speeds to execute it.
I am linking the GitHub repository containing all the code. I am using software called Webots, but the code would look much the same with any other simulator.
https://github.com/BiradarSiddhant02/autonomous-drone-navigation.git
Below are some images of what exactly is happening with the depth map and the region of interest
The drone tries to bring the red cross towards the center. It receives more reward when it does so.
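For concreteness, a hedged sketch of that kind of centering reward (the function and variable names are illustrative, not taken from the linked repository):

```python
import numpy as np

def centering_reward(centroid_xy: np.ndarray, image_wh: tuple) -> float:
    """Reward that grows as the region-of-interest centroid approaches the image centre."""
    centre = np.asarray(image_wh, dtype=np.float32) / 2.0
    max_dist = float(np.linalg.norm(centre))          # centre-to-corner distance
    dist = float(np.linalg.norm(np.asarray(centroid_xy, dtype=np.float32) - centre))
    return 1.0 - dist / max_dist                      # 1.0 at the centre, 0.0 at a corner

# e.g. centering_reward(np.array([320.0, 240.0]), (640, 480)) == 1.0 at the exact centre
```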
Can someone please explain why the gradient is taken with respect to the approximate value function and not the true value function?
For a set of parameters $w$, the update is $\Delta w = \alpha \, \mathbb{E}\big[ \big(v_\pi(S) - \hat{v}(S, w)\big) \, \nabla_w \hat{v}(S, w) \big]$.
Please check the screenshot for better written equations.
The explanation from [David Silver's Lecture 6, starting at 45:55](https://www.youtube.com/watch?v=UoPei5o4fps&t=2755s) went over my head.
The way I had understood it (before the guy in the audience asked the question, triggering the response) was that the true value function is like a constant. The approximate value function is the one we're trying to figure out; it's variable with respect to the parameters $w$. The partial derivative with respect to the true function is 0, so we only consider $\nabla_w \hat{v}(S, w)$.
But then David Silver goes on to say that there are sophisticated approaches which do consider the gradient wrt both the true and the approximate functions. So my understanding (above) is probably incorrect.
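For what it's worth, writing out the objective makes the step explicit (a sketch assuming the standard mean-squared value error objective from that lecture):

```latex
J(w) = \tfrac{1}{2}\, \mathbb{E}_{\pi}\!\left[ \big( v_\pi(S) - \hat{v}(S, w) \big)^{2} \right]

% v_\pi(S) does not depend on w, so its gradient vanishes and only \hat{v} is differentiated:
\nabla_w J(w) = -\, \mathbb{E}_{\pi}\!\left[ \big( v_\pi(S) - \hat{v}(S, w) \big)\, \nabla_w \hat{v}(S, w) \right]

% giving the stochastic update quoted above:
\Delta w = \alpha \big( v_\pi(S) - \hat{v}(S, w) \big)\, \nabla_w \hat{v}(S, w)
```

In practice $v_\pi(S)$ is replaced by a bootstrapped target such as $R_{t+1} + \gamma \hat{v}(S_{t+1}, w)$, which does contain $w$; the usual "semi-gradient" methods still ignore that dependence, while the more sophisticated approaches Silver alludes to (residual-gradient style methods, as far as I understand) differentiate through the target as well.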
Hi, I am working with contextual bandits, updating the policy in an online fashion, similar to PPO. Essentially, I collect a batch of samples in a buffer and update using PPO loss, the difference being that I get a single reward for the action I selected, i.e., bandit feedback. Given this setup, will the policy improvement bound from the RL literature (ref: Constrained Policy Optimization, Corollary 1, Eq 5) hold for the contextual bandit?
Hey! Beginner here.
From my understanding, standard PPO (not MAPPO) is a decentralized technique if looked at from the multi-agent perspective (each agent treats the other agents' policy changes over time as part of the environment itself). If this logic is followed, then it is safe to assume that when using multiple PPO agents in an environment (cooperative setting), the probability of convergence may not be high, or at least convergence is certainly not guaranteed.
There are existing implementation examples of standard PPO in multi-agent RL frameworks, PettingZoo specifically: PPO implementation on a PettingZoo environment.
My question is: why is PPO being used in such frameworks when there is no mention of non-stationarity in the documentation? Is convergence possible for an approach like multiple PPO agents? I mainly just want to know how realistic it is for agents to cooperate if they all use PPO. Also, I am not considering other techniques that deal with non-stationarity like IPPO; I just want to know about PPO.
I do not know if I phrased what I was thinking about precisely enough for it to make sense, and English is not my first language, so I hope it was understood.
Thank you in advance! :)
Hi, everyone,
Are there any open-sourced datasets for MARL environments with safety constraints?
I am new to this, so please give me some suggestions.
Hello,
I am reading the OpenAI spinning up tutorial Intro to policy gradient
An (optional) proof of this claim can be found `here`_, and it ultimately depends on the EGLP lemma.
In the above text, the link 'here' is not working at all (it just directs me to the same webpage). Do you know the link for this proof?
Thank you very much!
Hi everybody, I am a college student who is currently studying reinforcement learning in game development, and I have a question: are RL agents ready to replace traditional game AI agents? I personally do not think so, after doing some research. For example, I read that agents tend to find the most rewarding situation instead of the most optimal solution or what the game developers intended. I did read a study in which semantic-segmented frames used as input let an agent beat Super Mario Bros levels in less training time than an agent without those frames as input. What do you think? Is reinforcement learning ready to replace traditional game AI?
The latest Python version supported by CARLA is 3.8.0, which is too old for PyTorch GPU acceleration. I have tried wrapping my CARLA code in a server, but it's too slow. Any advice?
I am using a custom environment where the state is represented as (x1, x2), actions are (delta_x1, delta_x2), and the next state is (x1 + delta_x1, x2 + delta_x2), with a reward. During training the actor often goes to the boundaries of the state space. I know many people have faced this same problem, like in DDPG when the actor always takes the same action. What was the problem in your implementation, and how did you solve it? Any other help is much appreciated. Thanks in advance.
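A pattern that is often worth comparing against when the actor saturates at the boundaries (a sketch under my own assumptions, not taken from your implementation): squash the actor output with tanh and rescale it into the valid action box, so the network can never ask for out-of-range deltas.

```python
import torch
import torch.nn as nn

class BoundedActor(nn.Module):
    """Deterministic actor whose output is squashed into [low, high] per dimension."""

    def __init__(self, state_dim: int, action_dim: int, low: torch.Tensor, high: torch.Tensor):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )
        self.register_buffer("scale", (high - low) / 2.0)
        self.register_buffer("offset", (high + low) / 2.0)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # tanh output in [-1, 1] is mapped affinely into [low, high].
        return self.net(state) * self.scale + self.offset
```

A commonly cited cause of boundary-seeking behaviour in DDPG-style methods is the critic extrapolating ever larger Q-values outside the visited action range, which keeps pushing the deterministic actor toward the extremes; bounding the actor and inspecting the critic's targets are usually the first things to check.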
Hello RL World!
I'm a huge fan of StarCraft: Brood War (I'm from South Korea) and have been since it first came out in the late 90s when I was just a kid. Fast-forward 24 years: after getting my bachelor's in CS, I've worked mostly on distributed systems / databases for 10 years in the backend world at various companies. And here I am, still watching Brood War professional leagues.
I came across AlphaGo 9 years back (boy, time flies) in Korea and got interested in AI at that time, but Go wasn't really my thing, so the interest faded away, until AlphaStar came out to conquer StarCraft II. As far as I can see, though, there isn't an AI system for Brood War that is human-like in terms of APM and trained to challenge the Brood War legends (Flash, Bisu, Stork, etc.), so I want to at least understand the challenges and why one hasn't yet surfaced to take them on. Is it the cost of training the model? Challenges with the Brood War APIs?
I've been a backend engineer for the past 10 years, but I'm new to RL, so I just grabbed the book Grokking Deep Reinforcement Learning (Morales) from Amazon and started reading (is this a good start?).
I am trying to implement the above algorithm for the Cartpole environment. Many of the implementation details are missing from the RL book, so I wanted to ask about them.
* What if the rewards are in the range of -100 to 100? How do you handle preprocessing the rewards?
* We can clip the rewards to [-1, 1], but then there is no longer any difference between rewards of -1 and -100 (before preprocessing), as both become -1 afterwards.
* We can normalize rewards, but how? With a running mean and std? Since this is not a Monte Carlo method, we don't have all the rewards and state-action pairs before updating values, so where do we compute the mean and std for normalization? (See the sketch after this list.)
* Do we have to use a replay buffer?
* Do we have to normalize the td error before using it for the loss of policy pi?
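On the running-statistics question above (third bullet), a minimal sketch using Welford's online algorithm; this is my own assumption about how to do it, not something the book prescribes, and many implementations normalize returns or advantages instead of raw rewards:

```python
class RunningRewardNormalizer:
    """Online mean/std via Welford's algorithm; normalizes rewards as they arrive."""

    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0        # sum of squared deviations from the running mean
        self.eps = eps

    def update(self, reward: float) -> None:
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    def normalize(self, reward: float) -> float:
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5
        return (reward - self.mean) / (std + self.eps)

# Usage per step: normalizer.update(r); r_scaled = normalizer.normalize(r)
```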
Is there any paper for the actor-critic algorithm, just like the 2013 DQN paper by DeepMind?
Also, after running the code below, I'm not getting the expected results; the sum of rewards is not increasing at all...
(I'm a beginner trying to get into RL, please help.)
here's the code for it
```python
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation as anim
from dataclasses import dataclass
from itertools import count
from collections import deque
import random
import typing as tp
import torch
from torch import nn, Tensor
SEED:int = 42
@dataclass
class config:
num_steps:int = 500_000
num_steps_per_episode:int = 500
num_episodes:int = num_steps//num_steps_per_episode # 1000
num_warmup_steps:int = num_steps_per_episode*7 # 3500
gamma:float = 0.99
batch_size:int = 32
lr:float = 1e-4
weight_decay:float = 0.0
device:torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype:torch.dtype = torch.float32 # if "cpu" in device.type else torch.bfloat16
generator:torch.Generator = torch.Generator(device=device)
generator.manual_seed(SEED+3)
class PolicyNetwork(nn.Module):
def __init__(self, state_dim:int, action_dim:int):
super().__init__()
assert action_dim > 1
last_dim = 1 if action_dim == 2 else action_dim
self.fc1 = nn.Linear(state_dim, 128)
self.relu1 = nn.ReLU()
self.fc2 = nn.Linear(128, 64)
self.relu2 = nn.ReLU()
self.fc3 = nn.Linear(64, last_dim)
self.softmax_or_sigmoid = nn.Sigmoid() if last_dim == 1 else nn.Softmax(dim=-1)
def forward(self, state):
x = self.relu1(self.fc1(state))
x = self.relu2(self.fc2(x))
logits = self.fc3(x)
return self.softmax_or_sigmoid(logits)
# Define the Value Network
class ValueNetwork(nn.Module):
def __init__(self, state_dim:int):
super().__init__()
self.fc1 = nn.Linear(state_dim, 128)
self.relu1 = nn.ReLU()
self.fc2 = nn.Linear(128, 64)
self.relu2 = nn.ReLU()
self.fc3 = nn.Linear(64, 1)
def forward(self, state):
x = self.relu1(self.fc1(state))
x = self.relu2(self.fc2(x))
value = self.fc3(x)
return value # (B, 1)
@torch.no_grad()
def sample_prob_action_from_pi(pi:PolicyNetwork, state:Tensor):
left_proba:Tensor = pi(state)
# If `left_proba` is high, then `action` will most likely be `False` or 0, which means left
action = (torch.rand(size=(1, 1), device=config.device, generator=config.generator) > left_proba).int().item()
return int(action)
@torch.compiler.disable(recursive=True)
def sample_from_buffer(replay_buffer:deque):
batched_samples = random.sample(replay_buffer, config.batch_size) # Frames stored in uint8 [0, 255]
instances = list(zip(*batched_samples))
current_states, actions, rewards, next_states, dones = [
torch.as_tensor(np.asarray(inst), device=config.device, dtype=torch.float32) for inst in instances
]
return current_states, actions, rewards, next_states, dones
@torch.compile
def train_step():
# Sample from replay buffer
current_states, actions, rewards, next_states, dones = sample_from_buffer(replay_buffer)
# Value Loss and Update weights
zero_if_terminal_else_one = 1.0 - dones
td_error:Tensor = (
(rewards + config.gamma*value_fn(next_states).squeeze(1).detach()*zero_if_terminal_else_one) -
value_fn(current_states).squeeze(1)
) # (B,); the next-state value is detached so the TD target is treated as a constant in the value loss
value_loss = 0.5 * td_error.pow(2).mean() # (,)
value_loss.backward()
vopt.step()
vopt.zero_grad()
# Policy Loss and Update weights
td_error = ((td_error - td_error.mean()) / (td_error.std() + 1e-8)).detach() # (B,) # CHATGPT told me to normalize the td_error
y_target:Tensor = 1.0 - actions # (B,)
left_probas:Tensor = pi_fn(current_states).squeeze(1) # (B,)
pi_loss = -torch.mean(
(torch.log(left_probas) * y_target + torch.log(1.0 - left_probas) * (1.0 - y_target))*td_error,
dim=0
)
pi_loss.backward()
popt.step()
popt.zero_grad()
def main():
print(f"Training Starts...\nWARMING UP TILL ~{config.num_warmup_steps//config.num_steps_per_episode} episodes...")
num_steps_over = 0; sum_rewards_list = []
for episode_num in range(config.num_episodes):
state, info = env.reset()
sum_rewards = 0.0
for tstep in count(0):
num_steps_over += 1
# Sample action from policy
if num_steps_over < config.num_warmup_steps:
action = env.action_space.sample()
else:
action = sample_prob_action_from_pi(pi_fn, torch.as_tensor(state, device=config.device, dtype=torch.float32).unsqueeze(0))
next_state, reward, done, truncated, info = env.step(action)
replay_buffer.append((state, action, reward, next_state, done))
# Train the networks
if num_steps_over >= config.num_warmup_steps:
train_step()
sum_rewards += reward
if done or truncated:
break
# Update state
state = next_state
# LOGGING
print(f"Episode {episode_num+1}/{config.num_episodes} | Sum of rewards: {sum_rewards:.2f}")
sum_rewards_list.append(sum_rewards)
print("Training is over after", num_steps_over)
return sum_rewards_list
if __name__ == "__main__":
random.seed(SEED)
np.random.seed(SEED+1)
torch.manual_seed(SEED+2)
torch.use_deterministic_algorithms(mode=True, warn_only=True)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
env = gym.make("CartPole-v1", render_mode="rgb_array")
pi_fn = PolicyNetwork(env.observation_space.shape[0], env.action_space.n)
pi_fn.to(config.device)
print(pi_fn, end=f"| Number of parameters: {sum(p.numel() for p in pi_fn.parameters())}\n\n")
value_fn = ValueNetwork(env.observation_space.shape[0])
value_fn.to(config.device)
print(value_fn, end=f"| Number of parameters: {sum(p.numel() for p in value_fn.parameters())}\n\n")
vopt = torch.optim.AdamW(value_fn.parameters(), lr=config.lr, weight_decay=config.weight_decay, fused=True)
popt = torch.optim.AdamW(pi_fn.parameters(), lr=config.lr, weight_decay=config.weight_decay, fused=True)
vopt.zero_grad(), popt.zero_grad()
replay_buffer = deque(maxlen=5000)
sum_rewards_list = main()
plt.plot(sum_rewards_list)
plt.yticks(np.arange(0, 501, 50))
plt.xlabel("Episode")
plt.ylabel("Sum of rewards")
plt.title("Sum of rewards per episode")
plt.show()
```
I'm trying to train an agent on my Unity puzzle game project, the game works like this;
You need to send the color matching the current bus. You can only play a character whose path is not blocked. You have 5 slots to make room for the characters behind or for wrong plays.
What I've tried so far:
I've been working on it for about a month with no success so far.
I started with vector observations: tile colors, states, current bus color, etc. But it didn't work; it was too complicated. I've simplified the observation state and the setup every time I've failed. At one point I gave the agent only 1s and 0s for the pieces it should learn to play; only the values of 1 can be played, because I'm checking the playable status and whether the color matches. I also use an action mask. I couldn't train it even on a simple setup like this; it was a battle and a frustration. I've even simplified it to the point that I end the episode with a negative reward whenever it makes a mistake, so the agent only has to choose the correct piece and doesn't have to play the whole level or think about strategy. It played well on the trained levels, but it overfit and memorized them. On the test levels it couldn't handle even the simple ones.
I started to look more deeply into how I should approach it and studied the match-3 example from the Unity ML-Agents examples. I learned that for grid-like structures I need to use a CNN, so I've created a custom sensor and am now passing visual observations: 40 layers of information on a 20x20 grid (11 color layers + 11 bus color layers + a can-move layer + a cannot-move layer, etc.). I've tried the simple visual encoder and the match3 one, and I still couldn't get any training out of it.
My question is: is it hard to train this kind of puzzle game with RL? The Unity examples include much more complicated gameplay, and those agents learn quickly even with less help given to the agent. Or am I doing something wrong in my core approach?
This is the config I'm using at the moment, but I've tried so many things with it; I've changed and tried almost every approach here:
```
behaviors:
AIAgentBehavior:
trainer_type: ppo
hyperparameters:
batch_size: 256
buffer_size: 2560 # buffer_size = batch_size * 8
learning_rate: 0.0003
beta: 0.005
epsilon: 0.2
lambd: 0.95
num_epoch: 3
shared_critic: False
learning_rate_schedule: linear
beta_schedule: linear
epsilon_schedule: linear
network_settings:
normalize: True
hidden_units: 256
num_layers: 3
vis_encode_type: match3
# conv_layers:
# - filters: 32
# kernel_size: 3
# stride: 1
# - filters: 64
# kernel_size: 3
# stride: 1
# - filters: 128
# kernel_size: 3
# stride: 1
deterministic: False
reward_signals:
extrinsic:
gamma: 0.99
strength: 1.0
# network_settings:
# normalize: True
# hidden_units: 256
# num_layers: 3
# # memory: None
# deterministic: False
# init_path: None
keep_checkpoints: 5
checkpoint_interval: 50000
max_steps: 200000
time_horizon: 32
summary_freq: 1000
threaded: False
```