46,703 Subscribers

Why is my Q_Learning Algorithm not learning properly?

Hi, I'm currently programming an AI that is supposed to learn Tic Tac Toe using Q-Learning. My Problem is that the model is learning a bit at the start but then gets worse and doesn't get better. I'm using

old_qvalue + self.alpha * (reward + self.gamma * max_qvalue_nextstate - old_qvalue)

to update the QValues, with alpha at 0.3 and gamma at 0.9. I also use the Epsilon Greedy strategy with a decaying Epsilon which starts at 0.9 and is decreased by 0.0005 per turn and stops decreasing at 0.1. The Opponent is a Minimax Algorithm. I didn't find any flaws in the Code and Chat GPT also didn't and I'm wondering what I'm doing wrong. If anyone has any Tips I would appreciate them. The Code is unfortunately in German and I don't have a Github Account set up right now.

3 Comments

2024/11/30
15:19 UTC

SAC (Soft Actor Critc) cannot solve some tasks

I write an soft actor critic algorithm that I want to use later for autonomous driving with carla simulator. My Code Manager to solve simple Tasks, but when I try it on carla, I receive poor performance even tough I base my rewards on Master theses. The Master theses receive better performance. It would be nice if someone could check my code for math or programming errors. You can find my code on github with: https://github.com/b-gtr/Soft-Actor-Critic

0 Comments

2024/11/30
12:19 UTC

Why is std increasing when I train PPO?

I am testing PPO on a trivial task with continuous actions (Gaussian distributions), in the actor-critic setup. The actor network is learning quickly the optimal value for the mean, but the std keeps increasing (the optimal solution is deterministic, the higher the std the worse the return). What could be the reason for this to happen? I don't use any bonus for exploration. The std is state-independent.

9 Comments

2024/11/30
10:29 UTC

I cannot get this dqn to converge on grid world

import gymnasium as gym
import random
from random import random
from random import random
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from replay_memory import ReplayMemory
from transition import Transition


ENV = "CartPole-v1"
ENV = "FrozenLake-v1"

device = torch.device(
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)
print(f"Using device: {device}")

TAU = 0.005


class MLP(nn.Module):
    def __init__(self, input_dim, output_dim, device):
        super().__init__()
        self.layer_1 = nn.Linear(input_dim, 128).to(device)
        self.layer_2 = nn.Linear(128, 128).to(device)
        self.layer_3 = nn.Linear(128, output_dim).to(device)

    def forward(self, x):
        x = F.relu(self.layer_1(x))
        x = F.relu(self.layer_2(x))
        return self.layer_3(x)


def init_weights(m):
    if isinstance(m, nn.Linear):
        print('ran xavier')
        nn.init.xavier_uniform_(m.weight)
        nn.init.constant_(m.bias, 0)


class Agent:
    def __init__(self):
        # self.env = gym.make(ENV, render_mode="human", is_slippery=False)
        self.env = gym.make(ENV, render_mode=None, is_slippery=False)
        # self.env = gym.make(ENV, render_mode="human")
        self.device = torch.device(
            "cuda"
            if torch.cuda.is_available()
            else "mps" if torch.backends.mps.is_available() else "cpu"
        )

        state, _ = self.env.reset()
        if isinstance(state, int):
            state = [state]
        self.state_space_dim = len(state)
        self.action_space_dim = self.env.action_space.n
        self.epsilon = 0.8
        self.gamma = 0.99
        self.batch_size = 64
        self.replay_memory = ReplayMemory(capacity=10000)

        self.policy_network = MLP(
            self.state_space_dim, self.action_space_dim, device=self.device
        ).to(self.device)
        self.target_network = MLP(
            self.state_space_dim, self.action_space_dim, device=self.device
        ).to(self.device)
        self.policy_network.apply(init_weights)
        self.target_network.apply(init_weights)

        self.target_network.load_state_dict(self.policy_network.state_dict())
        self.optimizer = optim.AdamW(
            self.policy_network.parameters(), lr=0.001, amsgrad=True
        )

    def select_action(self, state: torch.Tensor):
        if random() < self.epsilon:
            return torch.tensor(
                [[self.env.action_space.sample()]],
                device=self.device,
                dtype=torch.long,
            )
        else:
            with torch.no_grad():
                return self.policy_network(state).max(1).indices.unsqueeze(0)

    def train(self, n_episodes=100):
        episode_rewards = []
        episode_loss = []
        total_steps = 0
        for episode in range(n_episodes):
            self.epsilon = max(0.05, self.epsilon * 0.99)
            state, _ = self.env.reset()
            if isinstance(state, int):
                state = [state]

            total_episode_reward = 0
            max_episode_loss = 0
            while True:
                total_steps += 1
                action = self.select_action(
                    torch.tensor(
                        state, device=self.device, dtype=torch.float
                    ).unsqueeze(0)
                )
                # print(f'at state: {state} \n'
                #       f'q table: {self.target_network(torch.tensor(state, dtype=torch.float32, device=self.device))}')
                next_state, reward, terminated, truncated, _ = self.env.step(
                    int(action.item())
                )
                if isinstance(next_state, int):
                    next_state = [next_state]
                
                if terminated and reward == 0:
                    reward = -10
                if terminated and reward == 1:
                    reward = 100
                if state == next_state:
                    reward -= 0.01
                else:
                    reward += 0.01

                total_episode_reward += reward
                done = terminated or truncated
                if terminated:
                    next_state = None
                self.replay_memory.push(Transition(
                    torch.tensor(state, dtype=torch.float32,
                                 device=self.device).unsqueeze(0),
                    torch.tensor([action], dtype=torch.float32,
                                 device=self.device),
                    torch.tensor(next_state, dtype=torch.float32, device=self.device).unsqueeze(
                        0) if next_state is not None else None,
                    torch.tensor([reward], dtype=torch.float32,
                                 device=self.device)
                ))
                state = next_state

                if len(self.replay_memory) > self.batch_size:
                    loss = self.optimize(
                        self.replay_memory.sample(self.batch_size))
                    max_episode_loss = max(max_episode_loss, loss)

                # Soft update.
                # if total_steps % 10 == 0:
                #     policy_network_state_dict = self.policy_network.state_dict()
                #     target_network_state_dict = self.target_network.state_dict()
                #     for key in policy_network_state_dict:
                #         target_network_state_dict[key] = (
                #             policy_network_state_dict[key] * TAU
                #             + (1 - TAU) * target_network_state_dict[key]
                #         )
                #     self.target_network.load_state_dict(target_network_state_dict)

                if done:
                    episode_rewards.append(total_episode_reward)
                    episode_loss.append(max_episode_loss)
                    print(
                        f"episode: {episode}\n"
                        f"epsilon: {self.epsilon}\n"
                        f"reward: {total_episode_reward}\n"
                        f"loss: {max_episode_loss}\n"
                    )
                    break

            # Hard update.
            if episode % 25 == 0:
                policy_network_state_dict = self.policy_network.state_dict()
                self.target_network.load_state_dict(policy_network_state_dict)

        return [episode_rewards, episode_loss]

    def run(self):
        self.env = gym.make(ENV, render_mode="human", is_slippery=False)
        # self.env = gym.make(ENV, render_mode="human")
        for _ in range(1):
            state, _ = self.env.reset()
            while True:
                if isinstance(state, int):
                    state = [state]
                action = (
                    self.target_network(
                        torch.tensor(state, dtype=torch.float32, device=self.device))
                    .max(0)
                    .indices
                )

                next_state, _, terminated, truncated, _ = self.env.step(
                    action.item()
                )
                print(f'at state: {state} \n'
                      f'q table: {self.target_network(torch.tensor(state, dtype=torch.float32, device=self.device))}\n'
                      f'action: {action}\n'
                      f'action.item(): {action.item()}\n'
                      f'next state: {next_state}')

                state = next_state
                if terminated or truncated:
                    break

    def optimize(self, batch: list[Transition]):
        states = torch.cat(
            [
                transition.state
                for transition in batch
            ]
        )
        actions = torch.cat(
            [
                transition.action
                for transition in batch
            ]
        )
        next_states = [transition.next_state for transition in batch]
        rewards = torch.cat(
            [
                transition.reward
                for transition in batch
            ]
        )
        # Compute Q(s_t, a), the current estimate of Q-values for all actions in the batch.
        # Shape: (batch_size, action_dim)
        # Example Q-values for a batch of states:
        # [
        #    [1.2, 2.0],  # Q-values for actions in state 1
        #    [0.2, 1.2],  # Q-values for actions in state 2
        #    [1.3, 1.1],  # Q-values for actions in state 3
        # ]
        q_estimates = self.policy_network(states)

        # Extract Q(s_t, a) for the actions actually taken in each state
        # Shape: (batch_size, 1)
        # Example:
        # actions = [[1], [0], [1]] (actions taken in states 1, 2, 3 respectively)
        # The gathered values would be:
        # [
        #    [2.0],  # Q-value for action 1 in state 1
        #    [0.2],  # Q-value for action 0 in state 2
        #    [1.1],  # Q-value for action 1 in state 3
        # ]

        state_action_values = q_estimates.gather(
            1, actions.unsqueeze(1).long())

        # Create a mask to identify which next states are non-final (i.e., states that have a valid next state).
        # Shape: (batch_size,)
        # Example:
        # [
        #   True,  # State 1 has a valid next state
        #   False, # State 2 is a final state
        #   True,  # State 3 has a valid next state
        # ]
        non_final_next_state_mask = torch.tensor(
            [
                next_state is not None
                for next_state in next_states
            ], dtype=torch.bool, device=self.device
        )

        # Filter and concatenate only the non-final next states into a single tensor for further processing.
        # Shape: (filtered_batch_size, state_dim)
        # Example filtered states (assuming state_dim = 4):
        # [
        #   [1.2, 1.3, 2.2, 3.0],   # Next state for state 1
        #   [0.2, -3.1, 1.1, 3.0],  # Next state for state 3
        # ]
        non_final_next_states = torch.cat(
            [next_state
             for next_state in filter(lambda s: s is not None, next_states)]
        )

        # Compute the target Q-values using the Bellman equation:
        # Q_target(s_t, a_t) = r + gamma * max_a Q(s_{t+1}, a').
        # state_action_targets = torch.zeros(len(batch), device=self.device)

        state_action_targets = torch.zeros(len(batch), device=self.device)

        # Mask update such that:
        # For non-final states, we compute gamma * max_a Q(s_{t+1}, a') using the target network.
        # For final states, this contribution is 0 because there is no valid next state.
        with torch.no_grad():
            state_action_targets[non_final_next_state_mask] = self.target_network(
                non_final_next_states).max(1).values
            target = rewards + (self.gamma * state_action_targets)

        # Huber loss
        criterion = nn.SmoothL1Loss()
        loss = criterion(state_action_values,
                         target.unsqueeze(1))
        self.optimizer.zero_grad()
        loss.backward()

        torch.nn.utils.clip_grad_value_(self.policy_network.parameters(), 10)
        self.optimizer.step()
        return loss.item()


agent = Agent()
rewards, loss = agent.train(n_episodes=200)
print(rewards, loss)
agent.run()

I tried modifying the reward functions to generate more diverse samples but no matter what I cannot get it to converge. Could it be that I need to one hot encode the state?

5 Comments

2024/11/30
07:09 UTC

Expected return formula of deterministic policy

I have a question regarding how the expected return of a deterministic policy in written. I have seen that in some cases the use the Q-Function as it is shown in expression 5. However, I do not fully understand the steps how it is obtained contrary to the stochastic policy. What are the steps or reasoning to get the expression 5 ?

https://preview.redd.it/q6ykgkjzbw3e1.png?width=711&format=png&auto=webp&s=b4ad5f9bbd75430a83d6b240395f52431fed3486

0 Comments

2024/11/29
19:58 UTC

Mujoco motion imitation

How feasible is it to write a program that takes bvh data and attempts to imitate it with a mujoco humanoid. I assume the agent would have to be trained on a lot of data, and i have already identified the CMU dataset as a commonly used dataset. Can anyone point me to projects that implement this, or describe how it would be performed? Thanks.

1 Comment

2024/11/29
19:06 UTC

Need Help Fine-Tuning ML-Agents PPO Training (TensorBoard Insights)

Hi everyone!

I’m currently training an ML-Agents PPO agent to escape a sequence of rooms, where each room has a specific time constraint. The agent manages to find its way out and complete each step , but it’s taking too long to escape the rooms. I believe the training could be more efficient and faster with better parameter tuning. Below are my observations, screenshots from TensorBoard, and details of my setup.

Screenshots

Environment Metrics: Image 1
Losses: Image 2
Policy Metrics: Image 3

Observations

The cumulative reward is improving steadily, but I suspect it could be faster.
Losses (Curiosity, Policy, Value) are decreasing but with fluctuations—should I tweak learning rates or buffer sizes?
The policy entropy drops significantly early on—could this indicate insufficient exploration over time?

Training Setup

Here are the parameters I’m using for the PPO agent:

Trainer Configuration

behaviors:
  NavigationAgentController:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 102400
      learning_rate: 3.0e-4
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 5
      learning_rate_schedule: constant
      beta_schedule: linear
      epsilon_schedule: constant
    network_settings:
      normalize: true
      hidden_units: 128
      num_layers: 3
      vis_encode_type: simple
      memory:
        sequence_length: 256
        memory_size: 256
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1
      curiosity:
        gamma: 0.99
        strength: 0.01
        learning_rate: 0.0003
        network_settings:
          encoding_size: 256
          num_layers: 4
    max_steps: 1000000000000
    time_horizon: 64
    summary_freq: 20000
    keep_checkpoints: 5
    checkpoint_interval: 500000
    torch_settings:
      device: true

Reward Structure

[Header("Reward Settings")]
public float fastPlateReward = 1.0f;
public float mediumPlateReward = 0.9f;
public float slowPlateReward = 0.8f;
public float explorationReward = 0.05f;
public float fallPenalty = -0.5f;
public float wallPenalty = -0.1f;
public float groundPenalty = -0.05f;
public float timePenalty = -0.05f;

[Header("Jump Penalty Settings")]
public float jumpPenalty = -0.05f; // Penalty for jumping
public float jumpPenaltyInterval = 1f; // Interval to apply the penalty

[Header("Reward for Escaping the Room")]
public float escapeRoomReward = 5.0f;
private bool rewardGranted = false;

[Header("Door Reward Settings")]
public float doorProximityReward = 0.5f; // Reward per proximity unit
public float reachingDoorReward = 2.0f; // Fixed reward for reaching the door
public float maxRewardDistance = 10.0f; // Maximum distance considered for the reward
public float distanceToReachDoor = 2.0f; // Distance to consider the door as reached

TensorBoard Metrics

Entropy: Decreasing rapidly early on.
Curiosity Forward Loss: Drops to near-zero quickly—should I increase curiosity: strength?
Policy Loss: Fluctuations at mid-training. Do I need a higher buffer size?
Extrinsic Reward: Steadily improving but slow.

What I’m Looking For

Are there any obvious bottlenecks in these TensorBoard metrics?
Should I adjust the reward signals or hyperparameters to make training faster or more robust?
Any recommendations for improving exploration and policy stability?

Would love your feedback on this! Thanks in advance for taking the time to help out.

1 Comment

2024/11/29
15:31 UTC

How to know if SAC method is overfitting

https://preview.redd.it/r1tizdsxbt3e1.png?width=1000&format=png&auto=webp&s=0ba0880704571a2868c12a02d3780bb3244d34aa

I am a beginner in Reinforcement Learning and am using the Soft Actor-Critic (SAC) method for smart charging optimization of Electric Vehicles. The goal is to optimize charging schedules for multiple EVs across discrete time slots to minimize costs while meeting battery and grid constraints. I have implemented a replay buffer with prioritized sampling and added techniques like priority decay and dynamic sampling to enhance training stability and address potential overfitting. However, I am unsure if overfitting is occurring and how to determine an appropriate stopping criterion based on the gap between training and evaluation rewards. I would appreciate guidance on improving the model’s learning and ensuring better generalization.

1 Comment

2024/11/29
10:02 UTC

How Reinforcement Learning Models Handle Sequential Information During Training and Testing

I have some questions in my head about how reinforcement learning (RL) models handle sequences during training and testing:

Learning Sequences During Training:
- When we pass one state at a time to the model, how does it learn the sequence of states needed to reach a goal?
- How does the model understand the connection between different states in the sequence?
Passing Sequential Information:
- How do we give the model information about the sequence of states so it can remember past actions or events while making decisions?
Using Sequences During Testing:
- When testing, how does the model use the sequence of past states and actions to make predictions?
- Does it rely on what it learned during training, or does it have a way to remember and process past information?

List some useful resources as well.

Thank u in advance

1 Comment

2024/11/29
09:47 UTC

Epsilon Greedy Convergence to Optimal Policy In Q-Learning

Hi everyone,

I am working on my understanding of how different parameters impact the rate of convergence and if the algorithm converges to the optimal policy at all. In my experiment I am running a pretty simple grid world - essentially trying to find the end point in a maze. Reward is 0 everywhere except for the maze exit where it is 1. The environment is non-stationary and changes at 3 points indicated by blue dotted lines in the attached figure. I am not currently giving the agent enough time to adapt to those changes but that is not my main concern at the moment.

I am running a Q-Learning agent with an epsilon greedy policy. With a fixed epsilon of 0.25 the agent finds its way out quicker than the decaying epsilon. However, if I compare the greedy policies based on the Q values of these agents, by inspection I can see that the fixed epsilon of 0.25 is actually not optimal, while the decaying epsilon's agent is. The agent with the fixed epsilon does indeed find the best route from its starting point, however, what I see is that in states that aren't on that best path the policy does not point in the optimal direction.

I would think that with a fixed epsilon the agent would actually explore more over time so these states should be more flushed out in their Q-values. So I think I am lacking a bit of intuition here if anyone could help that would be much appreciated!

https://preview.redd.it/cu21nn6wpq3e1.png?width=640&format=png&auto=webp&s=dea0446dd51e85dcb5107afa4ab86e11c7581780

4 Comments

2024/11/29
01:14 UTC

RL for music generation

I am thinking about whether is possible to combine the actor-critic algorithm with music generation. I think it would be interesting for my master's thesis. Tomorrow I should tell my advisor which topic I will be working on and I am not sure if this one is feasible. Help plsss

22 Comments

2024/11/28
21:55 UTC

Easy-to-set-up environments to simulate quadrupeds while being as realistic as possible

What I am looking for is the following:

Easy to install
Has a Python API and is easy to use (like gym environments)
Has cameras and other sensors information

Given my requirements, Isaac Lab seemed the perfect option, but unfortunately my hardware is not supported by Isaac Lab. Are there some other projects that specifically implement (dog-like) quadrupeds?

4 Comments

2024/11/28
21:29 UTC

C++ vs Rust for DRL

I have been working in Python using JAX for developing frameworks in DRL. I chose JAX due to the immense speed-ups it is capable of. However, lately I have been considering to move away from Python and adopt a high performance language, hoping for even greater speed. I am thinking to either go with C++ or Rust.

Which language would you suggest?

6 Comments

2024/11/28
18:34 UTC

Should I start with MineRL or MineDojo for learning AI and reinforcement learning?

I’m new to reinforcement learning (RL) and want to start experimenting in a Minecraft-based environment. I’ve come across two main platforms: MineRL and MineDojo, but I’m not sure which one is better for a beginner like me.

6 Comments

2024/11/28
17:02 UTC

Am I doing something wrong? Reward is hitting a cap

This is my University project. I am training a drone to navigate in an environment while avoiding obstacles. The approach is to use a depth-estimation model to ascertain the depth of the scene and determine the region with a clear path. The drone then learns to converge the ares of interest into the centre of its FOV.

The approach is to use a policy gradient method with discrete actions. The actions are to adjust roll, pitch, yaw and target altitude. input are all the sensor readings and the centeroid of pixels of the region of interest. The action selected by the policy network are then given to a PID controller to control the rotor speed to execute the action.

I am linking the github repository containing all the code. I am using a software called webots but regardless the code is just as similar to any other software.

https://github.com/BiradarSiddhant02/autonomous-drone-navigation.git

Below are some images of what exactly is happening with the depth map and the region of interest

https://preview.redd.it/3fto1axp7n3e1.png?width=640&format=png&auto=webp&s=bb5c7b29fce957d5f64d5f7f576cb952f03707b6

The drone tries to bring the red cross towards the center. It receives more reward when it does so.

0 Comments

2024/11/28
13:19 UTC

Why is the gradient wrt the approximate value function and not the true value function

Can someone please explain why the gradient is wrt the approximate value function and not the true value function.

For a set of parameters w, Del_w = alpha. E[v_pi(S) - v_hat(S, w)) Del_w v_hat(S,w)

Please check the screenshot for better written equations.

The explanation from David Silver's lecture 6 starting 45:55](https://www.youtube.com/watch?v=UoPei5o4fps&t=2755s) went over my head.

The way I had understood (before the guy in the audience asked the question, triggering the response) was that - the true value function is like a constant. The approximate value function is the one we're trying to figure out - it's variable wrt the parameters w. The partial derivative wrt the true function is 0. So we only consider the delta_w v_hat.

But then David Silver goes on to say that there are sophisticated approaches which do consider the gradient wrt both the true and the approximate functions. So my understanding (above) is probably incorrect.

12 Comments

2024/11/28
06:42 UTC

test

3 Comments

2024/11/28
01:55 UTC

test

1 Comment

2024/11/28
01:52 UTC

Policy Performance Bound for Contextual Bandits?

Hi, I am working with contextual bandits, updating the policy in an online fashion, similar to PPO. Essentially, I collect a batch of samples in a buffer and update using PPO loss, the difference being that I get a single reward for the action I selected, i.e., bandit feedback. Given this setup, will the policy improvement bound from the RL literature (ref: Constrained Policy Optimization, Corollary 1, Eq 5) hold for the contextual bandit?

0 Comments

2024/11/28
00:09 UTC

Proximal Policy Optimization for Multi-Agent RL (PettingZoo)

Hey! Beginner here.

From my understanding, standard PPO (not MAPPO) is a decetralized technique if looked at from the multi-agent perspective (agents assume that other agent's policy changes over time are part of the environment itself). If this logic is followed, then, it is safe to assume that when using multiple PPO agents in an environment (cooperative setting), the probabilities of convergence may not be high, or at least it is certainly not guaranteed.

There are existing implementation examples of standard PPO in multi-agent-based RL frameworks, PettingZoo in specific. PPO implementation on a PettingZoo environment.

My question is, why is PPO being used in such frameworks but there is no mention of non-stationarity in the documentation? Is convergence possible for an approach like multiple PPO agents? I mainly just wanted to know about how realistic is it for agents to cooperate if they all use PPO. Also, I am not considering other techniques that deal with non-staitonarity like IPPO, just want to know about PPO.

I do not know if I phrased what I was thinking about in a precise enough way for it to make sense + english is not my first language, so I hope it was understood.

Thank you in advance! :)

4 Comments

2024/11/27
19:16 UTC

Safe offline MARL datasets

Hi, everyone,

Is there any open sourced datasets for MARL envs with safety constraints?

6 Comments

2024/11/27
13:27 UTC

How to learn RL?

I am a new guy ，give me some suggestion，please

13 Comments

2024/11/27
06:12 UTC

The link in this OpenAI spinning up tutorial is missing

Hello,

I am reading the OpenAI spinning up tutorial Intro to policy gradient

An (optional) proof of this claim can be found `here`_, and it ultimately depends on the EGLP lemma.

In the above text, the link 'here' is not working at all (it just directs me to the same webpage). Do you know the link for this proof?

Thank you very much!

2 Comments

2024/11/27
03:14 UTC

Is RL ready to replace traditional AI?

Hi everybody, I am a college student who Is currently studying reinforced learning in game development. I have a current thought. for one, Is RL agents ready to replace traditional AI agents? I personally do not think so, after doing research. for example, I read that agents would find the most rewarding situation instead of the most optimal solution or what the game developers intended. I did read in a study that semantic-segmented frames could be used as input for a agent could beat Super Mario Bros levels in less training time than an agent without these frames as input. what do you think? Is reinforced learning ready to replace traditional AI?

View Poll

9 Comments

2024/11/26
16:32 UTC

Has someone been able to use GPU-enabled PyTorch in training an RL model using the Carla Simulator?

The latest supported Python version by Carla is 3.8.0 which is too old for PyTorch GPU acceleration. I have tried wrapping up my Carla code in a server but its too slow. Any advice?

1 Comment

2024/11/26
12:22 UTC

DDPG actor always taking same action during evaluation.

I am using a custom environment. Where state is representes as ( x1, x2) actions are (delta_x1, delta_x2) next state is (x1+delta_x1, x2+ delta_x2) . There is reward. During training also the actor many times goes to the boundaries of the state space. I know many people have faced this same problem, likei in DDPG the actor always takes same action. What was the problem for your implementation and how u solved it? Also any other help is much appreciated. Thanks in advance.

3 Comments

2024/11/26
09:30 UTC

Starcraft Broodwar

Hello RL World!

I'm a huge fan of Starcraft Broodwar (from South Korea) since it first came out in late 90s when I was just a kid. Fast-forward 24 years, after getting my bachelors in CS, I've worked mostly on distributed systems / database for 10 years in the backend world in various companies. And here I am, still watching Broodwar professionals leagues.

I came across AlphaGo 9 years back (boy time flies) in Korea and got interested in AI back at that time, but Go wasn't my thing of interest, so the interest faded away, until AlphaStar came out to conquer Starcraft II. Now as I see though, I don't see much of an AI system in Broodwar that is human-like in terms of APM that is trained to challenge the Broodwar legends (like Flash, Bisu, Stork etc), so I want to at least learn the challenges of why it hasn't yet came to the surface to challenge these legends. Is it the cost of training the model? Challenges on Broodwar APIs?

I've been a Backend engineer for the past 10 years, but I'm currently new to RL so I just grabbed the book "Grokking the Deep Reinforcement Learning (Morales)" from Amazon and started reading (is this a good start)?

6 Comments

2024/11/25
22:06 UTC

One-Step Actor-Critic Algorithm (RL book) not working as expected for the Cartpole Environment

actor critic algorithm

I am trying to implement the above algorithm for the Cartpole environment. Many of the implementation details are missing from the RL book, so I wanted to ask about them.

* What if the rewards are in the range of -100 and 100, how do you handle preprocessing rewards?

* We can clip the rewards between -1 and 1 but there will no longer be a difference between rewards -1 and -100 (before being preprocessed) as after preprocessing both will be -1

* We can normalize rewards, but how, running mean and std? As this is not a Monte Carlo method to get all the rewards and state-action pairs before updating values, where can we compute mean and std to normalize...

* Do we have to use a replay buffer?

* Do we have to normalize the td error before using it for the loss of policy pi?

Is there any paper for actor-critic algorithm just like we have the 2013 paper by deepmind for DQN?

Also after running the below code, I'm not getting the expected results, sum of rewards is like is not at all increasing...

(I'm a beginner trying to get it RL please help)

here's the code for it

```python

import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation as anim
from dataclasses import dataclass
from itertools import count
from collections import deque
import random
import typing as tp

import torch
from torch import nn, Tensor


SEED:int = 42


u/dataclass
class config:
    num_steps:int = 500_000
    num_steps_per_episode:int = 500
    num_episodes:int = num_steps//num_steps_per_episode # 1000
    num_warmup_steps:int = num_steps_per_episode*7 # 3500
    gamma:float = 0.99
    
    batch_size:int = 32
    lr:float = 1e-4
    weight_decay:float = 0.0

    device:torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    dtype:torch.dtype = torch.float32 # if "cpu" in device.type else torch.bfloat16

    generator:torch.Generator = torch.Generator(device=device)
    generator.manual_seed(SEED+3)


class PolicyNetwork(nn.Module):
    def __init__(self, state_dim:int, action_dim:int):
        super().__init__()
        assert action_dim > 1
        last_dim = 1 if action_dim == 2 else action_dim
        self.fc1 = nn.Linear(state_dim, 128)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(128, 64)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(64, last_dim)
        self.softmax_or_sigmoid = nn.Sigmoid() if last_dim == 1 else nn.Softmax(dim=-1)

    def forward(self, state):
        x = self.relu1(self.fc1(state))
        x = self.relu2(self.fc2(x))
        logits = self.fc3(x)
        return self.softmax_or_sigmoid(logits)


# Define the Value Network
class ValueNetwork(nn.Module):
    def __init__(self, state_dim:int):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(128, 64)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(64, 1)

    def forward(self, state):
        x = self.relu1(self.fc1(state))
        x = self.relu2(self.fc2(x))
        value = self.fc3(x)
        return value # (B, 1)
    

u/torch.no_grad()
def sample_prob_action_from_pi(pi:PolicyNetwork, state:Tensor):
    left_proba:Tensor = pi(state)
    # If `left_proba` is high, then `action` will most likely be `False` or 0, which means left
    action = (torch.rand(size=(1, 1), device=config.device, generator=config.generator) > left_proba).int().item()
    return int(action)


u/torch.compiler.disable(recursive=True)
def sample_from_buffer(replay_buffer:deque):
    batched_samples = random.sample(replay_buffer, config.batch_size) # Frames stored in uint8 [0, 255]
    instances = list(zip(*batched_samples))
    current_states, actions, rewards, next_states, dones = [
        torch.as_tensor(np.asarray(inst), device=config.device, dtype=torch.float32) for inst in instances
    ]
    return current_states, actions, rewards, next_states, dones


u/torch.compile
def train_step():
    # Sample from replay buffer
    current_states, actions, rewards, next_states, dones = sample_from_buffer(replay_buffer)
    
    # Value Loss and Update weights
    zero_if_terminal_else_one = 1.0 - dones
    td_error:Tensor = (
        (rewards + config.gamma*value_fn(next_states).squeeze(1)*zero_if_terminal_else_one) -
        value_fn(current_states).squeeze(1)
    ) # (B,)
    value_loss = 0.5 * td_error.pow(2).mean() # (,)
    value_loss.backward()
    vopt.step()
    vopt.zero_grad()

    # Policy Loss and Update weights
    td_error = ((td_error - td_error.mean()) / (td_error.std() + 1e-8)).detach() # (B,) # CHATGPT told me to normalize the td_error
    y_target:Tensor = 1.0 - actions # (B,)
    left_probas:Tensor = pi_fn(current_states).squeeze(1) # (B,)
    pi_loss = -torch.mean(
        (torch.log(left_probas) * y_target + torch.log(1.0 - left_probas) * (1.0 - y_target))*td_error,
        dim=0
    )
    pi_loss.backward()
    popt.step()
    popt.zero_grad()


def main():
    print(f"Training Starts...\nWARMING UP TILL ~{config.num_warmup_steps//config.num_steps_per_episode} episodes...")
    num_steps_over = 0; sum_rewards_list = []
    for episode_num in range(config.num_episodes):
        state, info = env.reset()
        sum_rewards = 0.0
        for tstep in count(0):
            num_steps_over += 1
            
            # Sample action from policy
            if num_steps_over < config.num_warmup_steps:
                action = env.action_space.sample()
            else:
                action = sample_prob_action_from_pi(pi_fn, torch.as_tensor(state, device=config.device, dtype=torch.float32).unsqueeze(0))
            next_state, reward, done, truncated, info = env.step(action)
            replay_buffer.append((state, action, reward, next_state, done))

            # Train the networks
            if num_steps_over >= config.num_warmup_steps:
                train_step()

            sum_rewards += reward
            if done or truncated:
                break
            
            # Update state
            state = next_state

        # LOGGING
        print(f"Episode {episode_num+1}/{config.num_episodes} | Sum of rewards: {sum_rewards:.2f}")
        sum_rewards_list.append(sum_rewards)

    print("Training is over after", num_steps_over)
    return sum_rewards_list


if __name__ == "__main__":
    random.seed(SEED)
    np.random.seed(SEED+1)
    torch.manual_seed(SEED+2)
    torch.use_deterministic_algorithms(mode=True, warn_only=True)
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    env = gym.make("CartPole-v1", render_mode="rgb_array")

    pi_fn = PolicyNetwork(env.observation_space.shape[0], env.action_space.n)
    pi_fn.to(config.device)
    print(pi_fn, end=f"| Number of parameters: {sum(p.numel() for p in pi_fn.parameters())}\n\n")

    value_fn = ValueNetwork(env.observation_space.shape[0])
    value_fn.to(config.device)
    print(value_fn, end=f"| Number of parameters: {sum(p.numel() for p in value_fn.parameters())}\n\n")

    vopt = torch.optim.AdamW(value_fn.parameters(), lr=config.lr, weight_decay=config.weight_decay, fused=True)
    popt = torch.optim.AdamW(pi_fn.parameters(), lr=config.lr, weight_decay=config.weight_decay, fused=True)
    vopt.zero_grad(), popt.zero_grad()

    replay_buffer = deque(maxlen=5000)

    sum_rewards_list = main()

    plt.plot(sum_rewards_list)
    plt.yticks(np.arange(0, 501, 50))
    plt.xlabel("Episode")
    plt.ylabel("Sum of rewards")
    plt.title("Sum of rewards per episode")
    plt.show()

```

5 Comments

2024/11/25
16:24 UTC

Unity MLAgents struggle to train on a simple puzzle game

https://preview.redd.it/uva3kh8zp03e1.png?width=677&format=png&auto=webp&s=9f838885a8d433c50e324c68c681a006855448ad

I'm trying to train an agent on my Unity puzzle game project, the game works like this;

You need to send the color matching the currrent bus. You can only play the character whose path is not blocked. You've 5 slots to make a room for behind characters or wrong plays.

What I've tried so far;

I've been working on it about a month and no success so far.

I've started with vector observations and put tile colors, states, current bus color etc. But it didn't work. It's too complicated. I've simplified the observation state and setup by every time I've failed. At one point, I've given the agent only 1s and 0s which are the pieces it should learn to play, only the 1 values can be played because I'm checking the playable status and if color matches. I also use action mask. I couldn't train it on simple setup like this, it was a battle and frustration. I've even simplified to the point that I end episodes when it make mistake negative reward and end episode. I want it to choose the correct piece and not cared about play the level and do strategy. But it played well on trained levels but it overfit, memorized them. On the test level, even simple ones couldn't do it correctly.

I've started to look up deeply how should I approach it and look at match-3 example from Unity MLAgents examples. I've learned that for grid like structures I need to use CNN and I've created custom sensor and now putting visual observations like putting 40 layers of information on a 20x20 grid. 11 colors layer + 11 bus color layers + can move layer + cannot move layer etc. I've tried simple visual encode and match3 one, still I couldn't do some training on it.

My question is; is it hard to train this kind of puzzle game on RL ? Because on Unity examples there're so many complicated gameplays and it learns quickly even with giving less help to agent. Or am I doing something wrong in the core approach ?

this is the config I'm using atm but I've tried so many things on it, I've changed and tried almost every approach here;

```

behaviors:
  AIAgentBehavior:
    trainer_type: ppo
    hyperparameters:
      batch_size: 256
      buffer_size: 2560 # buffer_size = batch_size * 8
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      shared_critic: False
      learning_rate_schedule: linear
      beta_schedule: linear
      epsilon_schedule: linear
    network_settings:
      normalize: True
      hidden_units: 256
      num_layers: 3
      vis_encode_type: match3
      # conv_layers:
      #   - filters: 32
      #     kernel_size: 3
      #     stride: 1
      #   - filters: 64
      #     kernel_size: 3
      #     stride: 1
      #   - filters: 128
      #     kernel_size: 3
      #     stride: 1
      deterministic: False
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
        # network_settings:
        #   normalize: True
        #   hidden_units: 256
        #   num_layers: 3
        #   # memory: None
        #   deterministic: False
    # init_path: None
    keep_checkpoints: 5
    checkpoint_interval: 50000
    max_steps: 200000
    time_horizon: 32
    summary_freq: 1000
    threaded: False

```

3 Comments

2024/11/25
09:57 UTC