SAC/TD3 issue #64

dhruvkm2402 · 2023-03-29T14:18:39Z


Hi @Toni-SM ,
I was trying to analyze TD3 and SAC performance for my scenario in Omniverse Isaac Gym. However, with both TD3 and SAC, after a few steps the policy outputs only a single action value, which is 1. Are there additional changes I need to make? PPO works as expected, depending on the reward formulation.
Here is my code, which I adapted from the Multi-Agent example.

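# Imports needed by this snippet (module paths are an assumption based on the skrl release current at the time of this post)
import torch
import torch.nn as nn

from skrl.models.torch import Model, GaussianMixin, DeterministicMixin
from skrl.memories.torch import RandomMemory
from skrl.agents.torch.sac import SAC, SAC_DEFAULT_CONFIG
from skrl.trainers.torch import SequentialTrainer
from skrl.envs.torch import wrap_env
from skrl.utils.omniverse_isaacgym_utils import get_env_instance


class StochasticActor(GaussianMixin, Model):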
    def __init__(self, observation_space, action_space, device, clip_actions=False,
                 clip_log_std=True, min_log_std=-20, max_log_std=2):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions, clip_log_std, min_log_std, max_log_std)

        self.net = nn.Sequential(nn.Linear(self.num_observations, 32),
                                 nn.ELU(),
                                 nn.Linear(32, 32),
                                 nn.ELU(),
                                 nn.Linear(32, self.num_actions))
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), self.log_std_parameter, {}


# The models (stochastic and deterministic) for the agent are defined using mixins.
# - StochasticActor (above): takes the environment's observation/state as input and returns an action
# - Critic (below): takes the state and action as input and returns a value to guide the policy


class Critic(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=False):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self, clip_actions)

        self.net = nn.Sequential(nn.Linear(self.num_observations + self.num_actions, 32),
                                 nn.ELU(),
                                 nn.Linear(32, 32),
                                 nn.ELU(),
                                 nn.Linear(32, 1))

    def compute(self, inputs, role):
        return self.net(torch.cat([inputs["states"], inputs["taken_actions"]], dim=1)), {}


# Load and wrap the Omniverse Isaac Gym environment
# Instantiate VecEnvBase and set up the task
headless = False  # set headless to False for rendering
env = get_env_instance(headless=headless) 

from omniisaacgymenvs.utils.config_utils.sim_config import SimConfig
from Hunter_Task import HunterTask, TASK_CFG

TASK_CFG["headless"] = headless
TASK_CFG["task"]["env"]["numEnvs"] = 1
TASK_CFG["task"]["env"]["controlSpace"] = "joint"  # "joint" or "cartesian"

sim_config = SimConfig(TASK_CFG)
task = HunterTask(name="Hunter", sim_config=sim_config, env=env)
env.set_task(task=task, sim_params=sim_config.get_physics_params(), backend="torch", init_sim=True)
#env.render()

# wrap the environment
env = wrap_env(env, "omniverse-isaacgym")
 

device = env.device


# Instantiate a RandomMemory as the experience replay buffer (any memory can be used for this)
memory_sac = RandomMemory(memory_size=8000, num_envs=1, device=device, replacement=True)



# Instantiate the agent's models (function approximators).
# SAC requires 5 models, visit its documentation for more details
# https://skrl.readthedocs.io/en/latest/modules/skrl.agents.sac.html#spaces-and-models
models_sac = {}
models_sac["policy"] = StochasticActor(env.observation_space, env.action_space, device, clip_actions=True)
models_sac["critic_1"] = Critic(env.observation_space, env.action_space, device)
models_sac["critic_2"] = Critic(env.observation_space, env.action_space, device)
models_sac["target_critic_1"] = Critic(env.observation_space, env.action_space, device)
models_sac["target_critic_2"] = Critic(env.observation_space, env.action_space, device)

# Initialize the models' parameters (weights and biases) using a Gaussian distribution
for model in models_sac.values():
    model.init_parameters(method_name="normal_", mean=0.0, std=0.1)

# Configure and instantiate the agent.
# Only modify some of the default configuration, visit its documentation to see all the options
# https://skrl.readthedocs.io/en/latest/modules/skrl.agents.sac.html#configuration-and-hyperparameters
cfg_sac = SAC_DEFAULT_CONFIG.copy()
cfg_sac["gradient_steps"] = 1
cfg_sac["batch_size"] = 512
cfg_sac["random_timesteps"] = 0
cfg_sac["learning_starts"] = 0
cfg_sac["learn_entropy"] = True
# log to TensorBoard every 25 timesteps and write checkpoints every 1000 timesteps
cfg_sac["experiment"]["write_interval"] = 25
cfg_sac["experiment"]["checkpoint_interval"] = 1000

agent = SAC(models=models_sac,
            memory=memory_sac,
            cfg=cfg_sac,
            observation_space=env.observation_space,
            action_space=env.action_space,
            device=device)


# Configure and instantiate the RL trainer
cfg_trainer = {"timesteps": 50000, "headless": False}
trainer = SequentialTrainer(cfg=cfg_trainer, env=env, agents=agent, agents_scope=[])

# start training
trainer.train()

HumbleLee · 2023-03-30T09:56:07Z


This looks similar to my problem. You can take a look at my question; it may be helpful for you.


Toni-SM · 2023-03-30T10:10:27Z

Maintainer

Hi @dhruvkm2402 and @HumbleLee

Some practices for keeping the actions produced by the policy/actor within bounds include:

Adjust the initial log_std value and its limits to avoid sampling actions far from the mean action (for stochastic policies).

Apply the hyperbolic tangent (tanh) function to the network output to squash it into the bounded range (-1, 1):

self.net = nn.Sequential(nn.Linear(self.num_observations, 32),
                         nn.ELU(),
                         nn.Linear(32, 32),
                         nn.ELU(),
                         nn.Linear(32, self.num_actions),
                         nn.Tanh())

or

return torch.tanh(self.net(inputs["states"])), ......
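
To see why both suggestions matter, here is a minimal, library-free sketch. It assumes actions bounded to [-1, 1] and approximates action clipping with a plain clamp. The helper names (sample_action, clip, saturated_fraction) are made up for this illustration and are not part of skrl.

```python
import math
import random


def sample_action(mean, log_std, rng):
    """Draw one action from a Gaussian policy N(mean, exp(log_std)^2)."""
    return rng.gauss(mean, math.exp(log_std))


def clip(x, low=-1.0, high=1.0):
    """Hard clamp to the assumed action bounds [-1, 1]."""
    return max(low, min(high, x))


def saturated_fraction(mean, log_std, n=10_000, seed=0):
    """Fraction of clipped samples that land exactly on a bound (-1 or 1)."""
    rng = random.Random(seed)
    hits = sum(abs(clip(sample_action(mean, log_std, rng))) == 1.0 for _ in range(n))
    return hits / n


# An unbounded network output of 3.0 lies far outside [-1, 1]:
# almost every clipped sample equals 1, even with a small std.
frac_unbounded = saturated_fraction(mean=3.0, log_std=-1.0)

# Passing the same raw output through tanh keeps the mean strictly inside (-1, 1).
squashed_mean = math.tanh(3.0)

# With the mean at 0, a lower log_std keeps samples away from the bounds.
frac_wide = saturated_fraction(mean=0.0, log_std=0.0)     # std = 1.0
frac_narrow = saturated_fraction(mean=0.0, log_std=-2.0)  # std is about 0.135

print(frac_unbounded, squashed_mean, frac_wide, frac_narrow)
```

Under these assumptions, a log_std initialized to zero (std = 1) already pushes roughly a third of the samples onto the bounds when the mean is 0, and an unbounded mean makes the clipped action collapse to a single value.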

