/r/reinforcementlearning

Reinforcement learning is a subfield of AI/statistics focused on exploring/understanding complicated environments and learning how to optimally acquire rewards. Examples are AlphaGo, clinical trials & A/B tests, and Atari game playing.

This is for any reinforcement learning related work ranging from purely computational RL in artificial intelligence to the models of RL in neuroscience.

The standard introduction to RL is Sutton & Barto's Reinforcement Learning: An Introduction.

/r/reinforcementlearning

46,885 Subscribers

1

Best way to train agents in multi agent environment

I'm working on a chess RL project where two agents trained with different algorithms play against each other. I'm wondering what the best way to train the agents is. Should I just let them play against each other, train each agent separately against random opponent moves, or have each train separately in self-play where both sides of the game use the same algorithm? Any advice would be helpful.
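
As a rough illustration of the first option, a self-play loop can have the two learners play each other directly, each storing and learning from its own transitions. This is only a sketch: `white`, `black`, their `act`/`store`/`update` methods, and a chess env with a gymnasium-style API that alternates turns per step are all assumptions.

def self_play_episode(env, white, black):
    # Both learners play one game against each other; each stores its own transitions.
    obs, _ = env.reset()
    players = [white, black]
    turn, done = 0, False
    while not done:
        current = players[turn % 2]
        action = current.act(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        current.store(obs, action, reward, next_obs, terminated)
        obs, done, turn = next_obs, terminated or truncated, turn + 1
    for agent in players:
        agent.update()   # each algorithm runs its own learning step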

0 Comments
2024/12/04
11:29 UTC

1

MoE RL

Is it possible to combine Mixture of Experts (MoE) with Reinforcement Learning (RL)? Does it make sense to train an agent that can choose which expert or experts to activate based on the input?

I have a more complex idea in mind: I want to integrate MoE and RL with Low-Rank Adaptation (LoRA). The plan is to have several LoRA modules and for the agent to identify the most suitable modules depending on the input. I aim to apply this approach to various NLP tasks. Does this make sense?
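
As a rough illustration of the second idea, the agent can be framed as a gating policy whose discrete action is the index of the LoRA module (or expert) to activate for the current input, trained with the downstream task score as reward. A sketch only, with made-up names rather than any existing API:

import torch
import torch.nn as nn

class LoRAGatingPolicy(nn.Module):
    # Maps an input embedding to a distribution over K LoRA modules;
    # the sampled index is the RL action, rewarded by the downstream task score.
    def __init__(self, embed_dim, num_modules):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, num_modules),
        )

    def forward(self, x):
        logits = self.net(x)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                    # which LoRA module to activate
        return action, dist.log_prob(action)      # log-prob for the policy-gradient update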

0 Comments
2024/12/04
11:26 UTC

3

DDPGII based on Symphony S2

3 Comments
2024/12/04
06:19 UTC

2

[CartPole-v1] Vanilla PG loss increases while total rewards also increases..

Started with this simple gym env to learn about RL. I'm observing the same upward trend in both the PG loss and the rewards. Shouldn't I expect the loss to decrease while the reward increases?

I calculated the loss over a batch of (state, action, reward) triplets using:

loss = -torch.mean(logits * rewards)

The rewards are calculated using the rewards-to-go formula with discounting, something like:

# `rewards` holds the per-step rewards of one episode; `r` is the discount factor.
out = torch.zeros(size=rewards.size(), dtype=torch.float32)
out[-1] = rewards[-1]
for i in reversed(range(rewards.size(-1) - 1)):
    out[i] = rewards[i] + r * out[i + 1]
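
(For comparison, the textbook vanilla policy-gradient loss is usually written with the log-probability of the sampled action rather than raw logits; a minimal sketch with assumed names, not the code above:)

import torch

def pg_loss(dist, actions, rewards_to_go):
    # dist: torch.distributions.Categorical built from the policy logits
    # actions: sampled actions; rewards_to_go: discounted returns per step
    log_probs = dist.log_prob(actions)              # log pi(a_t | s_t)
    return -(log_probs * rewards_to_go).mean()      # minimize negative expected return
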
1 Comment
2024/12/04
02:41 UTC

10

Are there reasons to use model based RL beyond sample complexity?

Does it ever really make sense to use a model based algorithm when we have access to large-scale environment parallelization?

Basically, I am wondering if there are environments that algorithms like DreamerV3 can solve that PPO just can't? For example, the DreamerV3 paper shows that PPO and IMPALA don't solve the most difficult Minecraft tasks, but would PPO solve these tasks eventually, given massive compute?

Are there any reasons to use model based algorithms besides reducing sample complexity?

7 Comments
2024/12/04
01:39 UTC

8

How to deal with complex, nested action spaces.

I'm working on an agent for the miniwob++ gymnasium environment. I've never worked with a non-flat action space like this and was wondering if anyone has guidance on the best way to formulate it, given that different parameters are necessary depending on which action is being taken. I'm specifically trying to tackle this using PyTorch, if that's relevant.
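
One common formulation for this kind of parameterized action space is a shared torso with one head that picks the action type and separate heads for that type's parameters, where only the heads relevant to the chosen action contribute to the log-probability. A rough PyTorch sketch (the head names and sizes are made up, not miniwob++ specifics):

import torch
import torch.nn as nn

class ParameterizedActionPolicy(nn.Module):
    def __init__(self, obs_dim, n_action_types, n_elements):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.type_head = nn.Linear(256, n_action_types)    # e.g. CLICK vs TYPE
        self.element_head = nn.Linear(256, n_elements)      # which element to act on

    def forward(self, obs):
        h = self.body(obs)
        type_dist = torch.distributions.Categorical(logits=self.type_head(h))
        elem_dist = torch.distributions.Categorical(logits=self.element_head(h))
        a_type, a_elem = type_dist.sample(), elem_dist.sample()
        # Joint log-prob of the composite action (type + its parameter)
        log_prob = type_dist.log_prob(a_type) + elem_dist.log_prob(a_elem)
        return (a_type, a_elem), log_prob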

2 Comments
2024/12/03
21:59 UTC

5

Dirichlet masked action space

Hi,

I'm a bit new to masked actions in RL. I have an RL problem where my action space sums to 1 (a resource allocation problem), but as the time step t increases I have fewer resources I can use, i.e., the first t indices must be equal to zero.

I'm trying to use a Dirichlet distribution and zero out the first indices in the environment, but I keep getting sub-optimal results.

My reward function is a convex combination of two terms - one tries to optimize credibility and the second tries to optimize profitability. For two specific scenarios I know the optimum, but in all other scenarios I don't.

I'm using RLlib, and I've tried PPO (with an attention-based model) and SAC. Somehow SAC works better than PPO, I'm not sure why, but it is still sub-optimal.

Any ideas on what I'm doing wrong or what I should do?

Thanks in advance!
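
To make the masking idea concrete, one option is to parameterize the Dirichlet only over the still-valid indices and scatter the sample back into the full allocation vector, instead of zeroing components after sampling. A rough sketch with assumed shapes (not RLlib-specific code):

import torch

def masked_dirichlet_sample(concentration, valid_mask):
    # concentration: (n,) positive parameters from the policy network
    # valid_mask: (n,) boolean, False for the first t "used up" indices
    valid_conc = concentration[valid_mask]
    sample = torch.distributions.Dirichlet(valid_conc).rsample()
    out = torch.zeros_like(concentration)
    out[valid_mask] = sample          # allocation still sums to 1 over the valid slots
    return out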

0 Comments
2024/12/03
19:52 UTC

2

small sized LLM models

Hi, I know it's an RL sub, but does anyone have any ideas for small pre-trained LLMs, around 5-12 GB in size?
They should be able to answer very basic stuff and that's about it. If so, please share.
Thanks in advance :))

3 Comments
2024/12/03
11:12 UTC

2

Where can I find reliable benchmarks for RL algorithms across Gym environments?

Let me preface this by saying I'm fairly new to RL. My previous work was with LLMs, where it is very common to rank and stack your model against the universe of models based on how it performs on a given benchmark (and these benchmarks are a huge deal, startups raise serious $$$ just based on their scores).

Recently started training models in MuJoCo environments and I'm trying to figure out if my algorithms are performing somewhat decently. Sure, I can get Ant-v5 to walk using SB3's default PPO and MlpPolicy, but how good is it really?

Is there some benchmark or repo where I can compare my results against the learning curve of other people's algorithms using the default MuJoCo (or any of the other gyms') reward functions? Of course the assumption would be that we are using the same environment and reward function, but given Gymnasium is popular and offers good defaults, I'd imagine there should be a lot of data available.

I've googled around and have only found sparse results. Is there a reason why benchmarks are not as big in RL as they are with LLMs?

2 Comments
2024/12/03
00:15 UTC

66

Is there less attention towards genetic algos now? If so, why

Genetic algorithms (GA) have been around for a long time (roughly the 1960s). They seem both incredibly intuitive and especially useful for blackbox problems, but they aren't currently "mainstream". In 2017, OpenAI was very bullish on evolutionary algos and cited their benefits: they are parallelizable, robust, and able to deal with long time-scale problems with unclear value/fitness functions. Have there been any more recent updates? What algos are beating out GA?

For low-dimensional problems, Bayesian optimization may have better statistical guarantees/asymptotics. Are there even any guarantees for GA, or are we completely in the dark?
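
For concreteness, the 2017 OpenAI work used an evolution-strategies style update (Gaussian parameter perturbations, fitness-weighted averaging) rather than a classical GA with crossover; a minimal sketch of that kind of blackbox update, with illustrative names:

import numpy as np

def es_step(theta, fitness_fn, pop_size=50, sigma=0.1, lr=0.01, rng=np.random):
    # Sample Gaussian perturbations of the parameters, evaluate each candidate,
    # and move theta toward the fitness-weighted average perturbation.
    eps = rng.randn(pop_size, theta.size)
    returns = np.array([fitness_fn(theta + sigma * e) for e in eps])
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    return theta + lr / (pop_size * sigma) * (advantages @ eps)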

10 Comments
2024/12/02
23:41 UTC

2

RL Problem - 2D Grid with Endless Pieces (of varying colours)

Hi there,

I've got a very advanced and difficult problem at work. I have a 2D discrete grid system whose size can vary. In addition, I have multiple different square boxes that can be placed onto this grid. Each box can have different solid colours on its sides. I need to find the placement that maximizes the abutment of same-coloured sides when two different square boxes are placed next to each other.

I have been looking at the AlphaZero approach, but in my case there is no specified number of pieces and each box can have different colours. Any suggestions on how to approach this problem?
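
To pin down the objective, the score of a placement can be computed by counting matching-colour shared edges over the grid; a small sketch assuming each occupied cell stores its box's side colours in a dict keyed by direction (purely illustrative):

def matching_edges(grid):
    # grid[r][c] is None or a dict like {"N": "red", "S": "blue", "E": "red", "W": "green"}
    score = 0
    for r, row in enumerate(grid):
        for c, box in enumerate(row):
            if box is None:
                continue
            right = grid[r][c + 1] if c + 1 < len(row) else None
            below = grid[r + 1][c] if r + 1 < len(grid) else None
            if right is not None and box["E"] == right["W"]:
                score += 1   # matching colours on a shared vertical edge
            if below is not None and box["S"] == below["N"]:
                score += 1   # matching colours on a shared horizontal edge
    return score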

3 Comments
2024/12/02
21:22 UTC

1

Problem: the action I get from the DQN is sometimes out of bounds.

Hello guys,

I am a beginner in reinforcement learning. I am currently working on DQN and having issues with the actions it predicts. My problem is as follows:

  • Atari games have a general action space of 16 actions, but each game has its own subset of actions from this general space.
  • Suppose I am exploring 5 different games, each with a different number of valid actions.
  • During evaluation, one game has only 3 valid actions, but my DQN returns actions outside this valid space.

One solution is to replace an action that is not valid with action 0 (NOOP), meaning do nothing.

Is there any other way to handle this situation efficiently?

Thank you in advance
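
For what it's worth, a common alternative to remapping invalid actions is to mask them before the argmax, so the network can never select them in the first place. A minimal sketch (the list of valid action indices is assumed to come from the game's own action set):

import torch

def masked_greedy_action(q_values, valid_actions):
    # q_values: (n_actions,) output of the DQN for the current state
    # valid_actions: list of indices that are legal in this particular game
    mask = torch.full_like(q_values, float("-inf"))
    mask[valid_actions] = 0.0
    return int(torch.argmax(q_values + mask).item())  # invalid actions stay at -inf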

6 Comments
2024/12/02
04:00 UTC

35

Why is there less hype around DreamerV3 than PPO?

From what I’ve seen, PPO generally has been the go-to algorithm for reinforcement learning tasks. How come DreamerV3 isn’t used instead? It seems to be a lot more stable and requires less tuning of hyperparameters.

29 Comments
2024/12/02
01:22 UTC

2

Is graph attention invariant to rotational transformations?

I'm really not sure whether I'm wrong here, but in terms of equivariance and invariance, graph neural networks can handle this effectively, except for rotational equivariance.
When I heard this, I wondered whether GAT needs extra steps to become invariant to rotational transformations, so that the output 3D features are invariant regardless of how the positions differ, for the task of regression prediction on PCQM4Mv2 with 3D graphs.

My pipeline:
Node features = features extracted from the dataset
Edge features = 3D Euclidean features
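
If the edge features are pairwise Euclidean distances (rather than raw coordinates or displacement vectors), they are invariant to rotations and translations by construction; a tiny sketch of that choice, with illustrative names:

import torch

def edge_distances(pos, edge_index):
    # pos: (n_nodes, 3) node coordinates; edge_index: (2, n_edges) source/target indices
    src, dst = edge_index
    # Scalar distance per edge is unchanged by rotating or translating the whole graph.
    return torch.norm(pos[src] - pos[dst], dim=-1, keepdim=True)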

1 Comment
2024/12/01
23:39 UTC

7

Looking for collaborators - RL for Operations Research problems

Hi all :)

I am currently pursuing my master's and have always focused on RL. I have done now several projects in the field and I am also working on a publication currently with some collaborators on RL in uncertain stochastic environments which can be modeled as graphs. The field of Operations Research came to my attention when I did my undergrad thesis where I framed a stochastic combinatorial problem as an RL problem and solved it using Graph Neural Nets. During that time and since then I have read some very recent publications that try to solve traditional OR problems with modern RL approaches (e.g. https://arxiv.org/abs/2312.15658).

I think this is in general still underexplored and at the same time very interesting. More modern neural net architectures such as GNNs seem to be perfectly equipped to be combined with RL for many OR problems. I would therefore also like to focus on graph machine learning methods (such as GNNs), but I am also open to any other modeling approach. Additionally, there is also an OR gym repository (https://github.com/hubbs5/or-gym) which I would like to explore and use to do some experiments on new approaches.

Hence, what I am looking for are some people who would want to join me and work together to solve some of these problems with more modern RL-based approaches. I have not thought about how much time to invest per week in these projects since I logically still have a lot to do on the side (uni, work, publication). So, if there is interest we could connect and find a good setup that would fit all of us :)

I personally have a lot of experience in designing, building and pipelining large-scale neural nets and would be very happy to collaborate with people from various backgrounds.

Love hearing from you!

5 Comments
2024/12/01
22:48 UTC

13

Sequential Action in RL

I have an agent whose actions must follow a sequential order: A, B, C. However, the environment is unstable and unpredictable. The agent takes actions based on the current situation, completing its sequence (A, B, C). After finishing the sequence, the agent receives a reward based on completion of the work. It then waits, analyses the next situation in the environment, and decides on and executes the next set of actions. How can we use RL in this scenario? How can we train the model to have the situational awareness to take an action? Each action is equally important for achieving a good result.

In summary, the environment is unpredictable, but we have to find some hidden pattern in order to take this sequence of actions.

Thank you in advance!

4 Comments
2024/12/01
19:47 UTC

2

What are the hyper-parameters when training Ant in an off-policy PPO?

I am new to reinforcement learning. Using torchrl as a library, I have created the PPO code as per the following tutorial and confirmed that InvertedDoublePendulum can be trained off-policy.

Next I wanted to try learning with Ant-v4, so I changed the environment name to Ant-v4 and started training. It did not seem to be learning well, so I set `frame_skip` to 5, `frames_per_batch` to `50_000 // frame_skip`, `total_frames` to `60_000_000 // frame_skip`, and `sub_batch_size` to 2500 to increase the training volume (the other hyperparameters remain the same as in the tutorial).
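
(In code form, the changes described above look roughly like this, reusing the tutorial-style variable names:)

frame_skip = 5
frames_per_batch = 50_000 // frame_skip       # 10_000 frames collected per batch
total_frames = 60_000_000 // frame_skip       # 12_000_000 frames in total
sub_batch_size = 2500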

However, Ant's rewards hover at a very low level and it does not learn to walk, as shown in the following video. Given that increasing the amount of training doesn't work, I think it's an issue with the hyperparameter settings, but I didn't get any good insights from Google.

What am I missing in my code?

https://reddit.com/link/1h45ovi/video/ewie7pmk994e1/player

6 Comments
2024/12/01
15:28 UTC

11

Why is my Q_Learning Algorithm not learning properly?

Hi, I'm currently programming an AI that is supposed to learn Tic-Tac-Toe using Q-learning. My problem is that the model learns a bit at the start but then gets worse and doesn't improve. I'm using

old_qvalue + self.alpha * (reward + self.gamma * max_qvalue_nextstate - old_qvalue)

to update the Q-values, with alpha at 0.3 and gamma at 0.9. I also use an epsilon-greedy strategy with a decaying epsilon, which starts at 0.9, decreases by 0.0005 per turn, and stops decreasing at 0.1. The opponent is a minimax algorithm. I didn't find any flaws in the code and ChatGPT also didn't, and I'm wondering what I'm doing wrong. If anyone has any tips I would appreciate them. The code is unfortunately in German and I don't have a GitHub account set up right now.
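
For reference, the full tabular update that expression usually sits inside looks something like this generic sketch (illustrative names, not the poster's German code):

from collections import defaultdict

q_table = defaultdict(float)   # keyed by (state, action)

def q_update(state, action, reward, next_state, valid_next_actions, alpha=0.3, gamma=0.9):
    # Standard Q-learning target: bootstrap from the best action in the next state.
    old_q = q_table[(state, action)]
    max_next = max((q_table[(next_state, a)] for a in valid_next_actions), default=0.0)
    q_table[(state, action)] = old_q + alpha * (reward + gamma * max_next - old_q)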

3 Comments
2024/11/30
15:19 UTC

3

SAC (Soft Actor-Critic) cannot solve some tasks

I wrote a Soft Actor-Critic algorithm that I want to use later for autonomous driving with the CARLA simulator. My code manages to solve simple tasks, but when I try it on CARLA I get poor performance, even though I base my rewards on master's theses, which report better performance. It would be nice if someone could check my code for math or programming errors. You can find my code on GitHub: https://github.com/b-gtr/Soft-Actor-Critic

0 Comments
2024/11/30
12:19 UTC

10

Why is std increasing when I train PPO?

I am testing PPO on a trivial task with continuous actions (Gaussian distributions), in the actor-critic setup. The actor network is learning quickly the optimal value for the mean, but the std keeps increasing (the optimal solution is deterministic, the higher the std the worse the return). What could be the reason for this to happen? I don't use any bonus for exploration. The std is state-independent.
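
(For context, the setup described, a network output for the mean plus a state-independent learnable std, usually looks something like the following sketch; the names are made up:)

import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # one std per action dim, shared across states

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = self.log_std.exp()
        return torch.distributions.Normal(mean, std)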

10 Comments
2024/11/30
10:29 UTC

3

I cannot get this dqn to converge on grid world

import gymnasium as gym
from random import random  # used below for epsilon-greedy action selection
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from replay_memory import ReplayMemory
from transition import Transition


ENV = "CartPole-v1"
ENV = "FrozenLake-v1"

device = torch.device(
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)
print(f"Using device: {device}")

TAU = 0.005


class MLP(nn.Module):
    def __init__(self, input_dim, output_dim, device):
        super().__init__()
        self.layer_1 = nn.Linear(input_dim, 128).to(device)
        self.layer_2 = nn.Linear(128, 128).to(device)
        self.layer_3 = nn.Linear(128, output_dim).to(device)

    def forward(self, x):
        x = F.relu(self.layer_1(x))
        x = F.relu(self.layer_2(x))
        return self.layer_3(x)


def init_weights(m):
    if isinstance(m, nn.Linear):
        print('ran xavier')
        nn.init.xavier_uniform_(m.weight)
        nn.init.constant_(m.bias, 0)


class Agent:
    def __init__(self):
        # self.env = gym.make(ENV, render_mode="human", is_slippery=False)
        self.env = gym.make(ENV, render_mode=None, is_slippery=False)
        # self.env = gym.make(ENV, render_mode="human")
        self.device = torch.device(
            "cuda"
            if torch.cuda.is_available()
            else "mps" if torch.backends.mps.is_available() else "cpu"
        )

        state, _ = self.env.reset()
        if isinstance(state, int):
            state = [state]
        self.state_space_dim = len(state)
        self.action_space_dim = self.env.action_space.n
        self.epsilon = 0.8
        self.gamma = 0.99
        self.batch_size = 64
        self.replay_memory = ReplayMemory(capacity=10000)

        self.policy_network = MLP(
            self.state_space_dim, self.action_space_dim, device=self.device
        ).to(self.device)
        self.target_network = MLP(
            self.state_space_dim, self.action_space_dim, device=self.device
        ).to(self.device)
        self.policy_network.apply(init_weights)
        self.target_network.apply(init_weights)

        self.target_network.load_state_dict(self.policy_network.state_dict())
        self.optimizer = optim.AdamW(
            self.policy_network.parameters(), lr=0.001, amsgrad=True
        )

    def select_action(self, state: torch.Tensor):
        if random() < self.epsilon:
            return torch.tensor(
                [[self.env.action_space.sample()]],
                device=self.device,
                dtype=torch.long,
            )
        else:
            with torch.no_grad():
                return self.policy_network(state).max(1).indices.unsqueeze(0)

    def train(self, n_episodes=100):
        episode_rewards = []
        episode_loss = []
        total_steps = 0
        for episode in range(n_episodes):
            self.epsilon = max(0.05, self.epsilon * 0.99)
            state, _ = self.env.reset()
            if isinstance(state, int):
                state = [state]

            total_episode_reward = 0
            max_episode_loss = 0
            while True:
                total_steps += 1
                action = self.select_action(
                    torch.tensor(
                        state, device=self.device, dtype=torch.float
                    ).unsqueeze(0)
                )
                # print(f'at state: {state} \n'
                #       f'q table: {self.target_network(torch.tensor(state, dtype=torch.float32, device=self.device))}')
                next_state, reward, terminated, truncated, _ = self.env.step(
                    int(action.item())
                )
                if isinstance(next_state, int):
                    next_state = [next_state]
                
                if terminated and reward == 0:
                    reward = -10
                if terminated and reward == 1:
                    reward = 100
                if state == next_state:
                    reward -= 0.01
                else:
                    reward += 0.01

                total_episode_reward += reward
                done = terminated or truncated
                if terminated:
                    next_state = None
                self.replay_memory.push(Transition(
                    torch.tensor(state, dtype=torch.float32,
                                 device=self.device).unsqueeze(0),
                    torch.tensor([action], dtype=torch.float32,
                                 device=self.device),
                    torch.tensor(next_state, dtype=torch.float32, device=self.device).unsqueeze(
                        0) if next_state is not None else None,
                    torch.tensor([reward], dtype=torch.float32,
                                 device=self.device)
                ))
                state = next_state

                if len(self.replay_memory) > self.batch_size:
                    loss = self.optimize(
                        self.replay_memory.sample(self.batch_size))
                    max_episode_loss = max(max_episode_loss, loss)

                # Soft update.
                # if total_steps % 10 == 0:
                #     policy_network_state_dict = self.policy_network.state_dict()
                #     target_network_state_dict = self.target_network.state_dict()
                #     for key in policy_network_state_dict:
                #         target_network_state_dict[key] = (
                #             policy_network_state_dict[key] * TAU
                #             + (1 - TAU) * target_network_state_dict[key]
                #         )
                #     self.target_network.load_state_dict(target_network_state_dict)

                if done:
                    episode_rewards.append(total_episode_reward)
                    episode_loss.append(max_episode_loss)
                    print(
                        f"episode: {episode}\n"
                        f"epsilon: {self.epsilon}\n"
                        f"reward: {total_episode_reward}\n"
                        f"loss: {max_episode_loss}\n"
                    )
                    break

            # Hard update.
            if episode % 25 == 0:
                policy_network_state_dict = self.policy_network.state_dict()
                self.target_network.load_state_dict(policy_network_state_dict)

        return [episode_rewards, episode_loss]

    def run(self):
        self.env = gym.make(ENV, render_mode="human", is_slippery=False)
        # self.env = gym.make(ENV, render_mode="human")
        for _ in range(1):
            state, _ = self.env.reset()
            while True:
                if isinstance(state, int):
                    state = [state]
                action = (
                    self.target_network(
                        torch.tensor(state, dtype=torch.float32, device=self.device))
                    .max(0)
                    .indices
                )

                next_state, _, terminated, truncated, _ = self.env.step(
                    action.item()
                )
                print(f'at state: {state} \n'
                      f'q table: {self.target_network(torch.tensor(state, dtype=torch.float32, device=self.device))}\n'
                      f'action: {action}\n'
                      f'action.item(): {action.item()}\n'
                      f'next state: {next_state}')

                state = next_state
                if terminated or truncated:
                    break

    def optimize(self, batch: list[Transition]):
        states = torch.cat(
            [
                transition.state
                for transition in batch
            ]
        )
        actions = torch.cat(
            [
                transition.action
                for transition in batch
            ]
        )
        next_states = [transition.next_state for transition in batch]
        rewards = torch.cat(
            [
                transition.reward
                for transition in batch
            ]
        )
        # Compute Q(s_t, a), the current estimate of Q-values for all actions in the batch.
        # Shape: (batch_size, action_dim)
        # Example Q-values for a batch of states:
        # [
        #    [1.2, 2.0],  # Q-values for actions in state 1
        #    [0.2, 1.2],  # Q-values for actions in state 2
        #    [1.3, 1.1],  # Q-values for actions in state 3
        # ]
        q_estimates = self.policy_network(states)

        # Extract Q(s_t, a) for the actions actually taken in each state
        # Shape: (batch_size, 1)
        # Example:
        # actions = [[1], [0], [1]] (actions taken in states 1, 2, 3 respectively)
        # The gathered values would be:
        # [
        #    [2.0],  # Q-value for action 1 in state 1
        #    [0.2],  # Q-value for action 0 in state 2
        #    [1.1],  # Q-value for action 1 in state 3
        # ]

        state_action_values = q_estimates.gather(
            1, actions.unsqueeze(1).long())

        # Create a mask to identify which next states are non-final (i.e., states that have a valid next state).
        # Shape: (batch_size,)
        # Example:
        # [
        #   True,  # State 1 has a valid next state
        #   False, # State 2 is a final state
        #   True,  # State 3 has a valid next state
        # ]
        non_final_next_state_mask = torch.tensor(
            [
                next_state is not None
                for next_state in next_states
            ], dtype=torch.bool, device=self.device
        )

        # Filter and concatenate only the non-final next states into a single tensor for further processing.
        # Shape: (filtered_batch_size, state_dim)
        # Example filtered states (assuming state_dim = 4):
        # [
        #   [1.2, 1.3, 2.2, 3.0],   # Next state for state 1
        #   [0.2, -3.1, 1.1, 3.0],  # Next state for state 3
        # ]
        non_final_next_states = torch.cat(
            [next_state
             for next_state in filter(lambda s: s is not None, next_states)]
        )

        # Compute the target Q-values using the Bellman equation:
        # Q_target(s_t, a_t) = r + gamma * max_a Q(s_{t+1}, a').
        # state_action_targets = torch.zeros(len(batch), device=self.device)

        state_action_targets = torch.zeros(len(batch), device=self.device)

        # Mask update such that:
        # For non-final states, we compute gamma * max_a Q(s_{t+1}, a') using the target network.
        # For final states, this contribution is 0 because there is no valid next state.
        with torch.no_grad():
            state_action_targets[non_final_next_state_mask] = self.target_network(
                non_final_next_states).max(1).values
            target = rewards + (self.gamma * state_action_targets)

        # Huber loss
        criterion = nn.SmoothL1Loss()
        loss = criterion(state_action_values,
                         target.unsqueeze(1))
        self.optimizer.zero_grad()
        loss.backward()

        torch.nn.utils.clip_grad_value_(self.policy_network.parameters(), 10)
        self.optimizer.step()
        return loss.item()


agent = Agent()
rewards, loss = agent.train(n_episodes=200)
print(rewards, loss)
agent.run()

I tried modifying the reward function to generate more diverse samples, but no matter what I do I cannot get it to converge. Could it be that I need to one-hot encode the state?
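
For what it's worth, a minimal sketch of one-hot encoding the discrete FrozenLake state before feeding it to the MLP (illustrative only; `n_states` would come from `env.observation_space.n`):

import torch

def one_hot_state(state: int, n_states: int, device) -> torch.Tensor:
    # FrozenLake returns an integer cell index; encoding it as a one-hot vector
    # keeps the MLP from treating nearby indices as numerically "close".
    x = torch.zeros(n_states, dtype=torch.float32, device=device)
    x[state] = 1.0
    return x.unsqueeze(0)  # shape (1, n_states) to match the batch convention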

9 Comments
2024/11/30
07:09 UTC

2

Expected return formula of deterministic policy

I have a question regarding how the expected return of a deterministic policy is written. I have seen that in some cases they use the Q-function, as shown in expression 5. However, I do not fully understand the steps by which it is obtained, in contrast to the stochastic policy. What are the steps or reasoning to get expression 5?
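
I can't see the linked expression, but the form usually given in the deterministic policy gradient literature, and the reasoning behind it, are roughly:

J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta(\cdot \mid s)}\big[ Q^{\pi}(s, a) \big] \qquad \text{(stochastic policy)}

J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\big[ Q^{\mu}(s, \mu_\theta(s)) \big] \qquad \text{(deterministic policy)}

The step from the first to the second is that a deterministic policy places all probability mass on the single action a = \mu_\theta(s), so the inner expectation over actions collapses and Q is simply evaluated at that action.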

https://preview.redd.it/q6ykgkjzbw3e1.png?width=711&format=png&auto=webp&s=b4ad5f9bbd75430a83d6b240395f52431fed3486

0 Comments
2024/11/29
19:58 UTC

2

Mujoco motion imitation

How feasible is it to write a program that takes BVH data and attempts to imitate it with a MuJoCo humanoid? I assume the agent would have to be trained on a lot of data, and I have already identified the CMU dataset as a commonly used dataset. Can anyone point me to projects that implement this, or describe how it would be performed? Thanks.

1 Comment
2024/11/29
19:06 UTC

1

Need Help Fine-Tuning ML-Agents PPO Training (TensorBoard Insights)

Hi everyone!

I’m currently training an ML-Agents PPO agent to escape a sequence of rooms, where each room has a specific time constraint. The agent manages to find its way out and complete each step, but it’s taking too long to escape the rooms. I believe the training could be more efficient and faster with better parameter tuning. Below are my observations, screenshots from TensorBoard, and details of my setup.

Screenshots

Observations

  1. The cumulative reward is improving steadily, but I suspect it could be faster.
  2. Losses (Curiosity, Policy, Value) are decreasing but with fluctuations—should I tweak learning rates or buffer sizes?
  3. The policy entropy drops significantly early on—could this indicate insufficient exploration over time?

Training Setup

Here are the parameters I’m using for the PPO agent:

Trainer Configuration

behaviors:
  NavigationAgentController:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 102400
      learning_rate: 3.0e-4
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 5
      learning_rate_schedule: constant
      beta_schedule: linear
      epsilon_schedule: constant
    network_settings:
      normalize: true
      hidden_units: 128
      num_layers: 3
      vis_encode_type: simple
      memory:
        sequence_length: 256
        memory_size: 256
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1
      curiosity:
        gamma: 0.99
        strength: 0.01
        learning_rate: 0.0003
        network_settings:
          encoding_size: 256
          num_layers: 4
    max_steps: 1000000000000
    time_horizon: 64
    summary_freq: 20000
    keep_checkpoints: 5
    checkpoint_interval: 500000
    torch_settings:
      device: true

Reward Structure

[Header("Reward Settings")]
public float fastPlateReward = 1.0f;
public float mediumPlateReward = 0.9f;
public float slowPlateReward = 0.8f;
public float explorationReward = 0.05f;
public float fallPenalty = -0.5f;
public float wallPenalty = -0.1f;
public float groundPenalty = -0.05f;
public float timePenalty = -0.05f;

[Header("Jump Penalty Settings")]
public float jumpPenalty = -0.05f; // Penalty for jumping
public float jumpPenaltyInterval = 1f; // Interval to apply the penalty

[Header("Reward for Escaping the Room")]
public float escapeRoomReward = 5.0f;
private bool rewardGranted = false;

[Header("Door Reward Settings")]
public float doorProximityReward = 0.5f; // Reward per proximity unit
public float reachingDoorReward = 2.0f; // Fixed reward for reaching the door
public float maxRewardDistance = 10.0f; // Maximum distance considered for the reward
public float distanceToReachDoor = 2.0f; // Distance to consider the door as reached

TensorBoard Metrics

  • Entropy: Decreasing rapidly early on.
  • Curiosity Forward Loss: Drops to near-zero quickly—should I increase curiosity: strength?
  • Policy Loss: Fluctuations at mid-training. Do I need a higher buffer size?
  • Extrinsic Reward: Steadily improving but slow.

What I’m Looking For

  1. Are there any obvious bottlenecks in these TensorBoard metrics?
  2. Should I adjust the reward signals or hyperparameters to make training faster or more robust?
  3. Any recommendations for improving exploration and policy stability?

Would love your feedback on this! Thanks in advance for taking the time to help out.

1 Comment
2024/11/29
15:31 UTC

12

How to know if SAC method is overfitting

https://preview.redd.it/r1tizdsxbt3e1.png?width=1000&format=png&auto=webp&s=0ba0880704571a2868c12a02d3780bb3244d34aa

I am a beginner in Reinforcement Learning and am using the Soft Actor-Critic (SAC) method for smart charging optimization of Electric Vehicles. The goal is to optimize charging schedules for multiple EVs across discrete time slots to minimize costs while meeting battery and grid constraints. I have implemented a replay buffer with prioritized sampling and added techniques like priority decay and dynamic sampling to enhance training stability and address potential overfitting. However, I am unsure if overfitting is occurring and how to determine an appropriate stopping criterion based on the gap between training and evaluation rewards. I would appreciate guidance on improving the model’s learning and ensuring better generalization.

1 Comment
2024/11/29
10:02 UTC

1

How Reinforcement Learning Models Handle Sequential Information During Training and Testing

I have some questions in my head about how reinforcement learning (RL) models handle sequences during training and testing:

  1. Learning Sequences During Training:
    • When we pass one state at a time to the model, how does it learn the sequence of states needed to reach a goal?
    • How does the model understand the connection between different states in the sequence?
  2. Passing Sequential Information:
    • How do we give the model information about the sequence of states so it can remember past actions or events while making decisions?
  3. Using Sequences During Testing:
    • When testing, how does the model use the sequence of past states and actions to make predictions?
    • Does it rely on what it learned during training, or does it have a way to remember and process past information?

List some useful resources as well.

Thank you in advance
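
One common answer to questions 2 and 3 is to give the policy memory explicitly, either by stacking recent observations or by using a recurrent network whose hidden state is carried across steps during both training and testing; a minimal recurrent sketch (all names illustrative):

import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    # A GRU carries a hidden state across time steps, so the action at step t
    # can depend on the whole history of observations seen so far.
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, h=None):
        out, h = self.gru(obs_seq, h)       # obs_seq: (batch, time, obs_dim)
        return self.head(out), h            # per-step action logits, carried hidden state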

1 Comment
2024/11/29
09:47 UTC
