New Step API with terminated, truncated bools instead of done #2752

Merged
52 commits, merged Jul 9, 2022
Changes from 1 commit

Commits (52)
e6b0a40
New Step API with terminated, truncated bools instead of done
arjun-kg Apr 14, 2022
6618da5
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Apr 20, 2022
a0c4475
Setting return_two_dones=False as default
arjun-kg Apr 20, 2022
2aabc30
update warnings
arjun-kg Apr 20, 2022
1babe4e
pytest - ignore deprecation warnings
arjun-kg Apr 21, 2022
c9c6add
Only ignore step api deprecation warnings
arjun-kg Apr 21, 2022
c5fe53c
fix duplicate wrapping bug in vector envs
arjun-kg Apr 21, 2022
f88927d
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Apr 22, 2022
7c1e9c7
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Apr 25, 2022
6af7182
edit docstrings, comments, warnings
arjun-kg Apr 25, 2022
22c1cc7
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg May 3, 2022
68ef969
step compatibility for wrappers, vectors
arjun-kg May 4, 2022
f06343b
reset tests back to old api
arjun-kg May 4, 2022
794737b
fix circular import
arjun-kg May 4, 2022
f89e5da
merge tests with master
arjun-kg May 4, 2022
8b518bb
existing code, tests work
arjun-kg May 5, 2022
9a2a9af
fix compat at registration, tests
arjun-kg May 5, 2022
29eafe5
docstrings, tests passing
arjun-kg May 5, 2022
63fc044
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg May 28, 2022
97f36d3
dealing with conflicts
arjun-kg May 28, 2022
63d3d19
update wrapper class to use step compatibility
arjun-kg May 28, 2022
492c6e1
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Jun 2, 2022
9ce03cb
add warning for play
arjun-kg Jun 2, 2022
f93295f
add todo
arjun-kg Jun 2, 2022
1940494
replace 'closing' with 'final'
arjun-kg Jun 2, 2022
f12b5fb
fix pre-commit
arjun-kg Jun 2, 2022
aa5a071
remove previously missed `done` references
arjun-kg Jun 3, 2022
e135b9e
fix step compat in atari wrapper reset
arjun-kg Jun 3, 2022
2bb742a
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Jun 7, 2022
1f11077
fix tests with step returning np.bool_
arjun-kg Jun 7, 2022
e861fbc
remove warning for using new api
arjun-kg Jun 7, 2022
fe04e7c
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Jun 8, 2022
8e56f45
pre-commit fixes
arjun-kg Jun 8, 2022
4491d9a
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Jun 9, 2022
be947e3
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Jun 20, 2022
5e8f085
new API does not include 'TimeLimit.truncated' in info
arjun-kg Jun 20, 2022
cdb3516
fix checks, tests
arjun-kg Jun 20, 2022
8cc2074
vector info mask - fix wrong underscore
arjun-kg Jun 20, 2022
2f83d55
dont remove from info
arjun-kg Jun 21, 2022
57e839c
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Jun 21, 2022
b1660cf
edit definitions
arjun-kg Jun 21, 2022
ea10e7a
remove whitespaces :/
arjun-kg Jun 21, 2022
bffa257
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Jul 3, 2022
d7dff2c
update tests
arjun-kg Jul 3, 2022
b2c10a4
fix pattern
arjun-kg Jul 3, 2022
6553bed
restructure warnings
arjun-kg Jul 4, 2022
50d367e
fix incorrect warning
arjun-kg Jul 4, 2022
d71836f
fix incorrect warnings (properly)
arjun-kg Jul 4, 2022
78a507e
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Jul 4, 2022
a747625
add warning to env checker
arjun-kg Jul 5, 2022
28c7b36
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Jul 5, 2022
d65d21b
Merge branch 'master' of https://github.com/openai/gym into done_term…
arjun-kg Jul 9, 2022
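Several commits above concern how truncation is surfaced (e.g. "new API does not include 'TimeLimit.truncated' in info", "dont remove from info"). A minimal sketch of the intended division of labor, assuming the `TimeLimit` wrapper is what reports truncation once `max_episode_steps` is exhausted while the environment itself only reports termination; the `id` and `entry_point` below are placeholders:

```python
from gym.envs.registration import register

# Placeholder id and entry_point, purely for illustration.
register(
    id="MyMountainCar-v0",
    entry_point="my_package.envs:MyMountainCarEnv",
    max_episode_steps=999,  # time-limit truncation handled by the TimeLimit wrapper
)
```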
2 changes: 1 addition & 1 deletion README.md
@@ -27,7 +27,7 @@ observation, info = env.reset(seed=42, return_info=True)

for _ in range(1000):
action = env.action_space.sample()
observation, reward, done, info = env.step(action)
observation, reward, terminated, truncated, info = env.step(action)

if done:
observation, info = env.reset(return_info=True)
39 changes: 31 additions & 8 deletions gym/core.py
@@ -61,12 +61,17 @@ def np_random(self, value: RandomNumberGenerator):
self._np_random = value

@abstractmethod
def step(self, action: ActType) -> Tuple[ObsType, float, bool, dict]:
def step(
self, action: ActType
) -> Union[
Tuple[ObsType, float, bool, bool, dict], Tuple[ObsType, float, bool, dict]
Contributor (review comment):
I don't know if I like this approach to backwards compatibility. If this is the official state of (for example) 0.24.0, then you can't reliably write an algorithm that will work for all valid 0.24.0 environments. I think we should just say that an environment should have the signature of (ObsType, float, bool, bool, dict), and then provide a wrapper-like compatibility layer that can convert an old-style environment to a new-style environment.
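A minimal sketch of the wrapper-like compatibility layer suggested here, assuming old-style environments flag time-limit endings via `info["TimeLimit.truncated"]`; the class name is illustrative, not the PR's actual implementation:

```python
import gym


class OldToNewStepWrapper(gym.Wrapper):
    """Convert an old-style (done) environment to the new-style
    (terminated, truncated) step return. The name and the reliance on
    info["TimeLimit.truncated"] are illustrative assumptions only."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Old-style envs typically flag time-limit endings via info;
        # treat every other done=True as a genuine termination.
        truncated = bool(info.get("TimeLimit.truncated", False))
        terminated = done and not truncated
        return obs, reward, terminated, truncated, info
```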

]:
"""Run one timestep of the environment's dynamics. When end of
episode is reached, you are responsible for calling :meth:`reset`
to reset this environment's state.

Accepts an action and returns a tuple (observation, reward, done, info).
Accepts an action and returns either a tuple (observation, reward, terminated, truncated, info) or a tuple
(observation, reward, done, info). The latter is deprecated and will be removed in future versions.

Args:
action (object): an action provided by the agent
@@ -76,13 +81,17 @@ def step(self, action: ActType) -> Tuple[ObsType, float, bool, dict]:
Returns:
observation (object): agent's observation of the current environment. This will be an element of the environment's :attr:`observation_space`. This may, for instance, be a numpy array containing the positions and velocities of certain objects.
reward (float) : amount of reward returned after previous action
done (bool): whether the episode has ended, in which case further :meth:`step` calls will return undefined results. A done signal may be emitted for different reasons: Maybe the task underlying the environment was solved successfully, a certain timelimit was exceeded, or the physics simulation has entered an invalid state. ``info`` may contain additional information regarding the reason for a ``done`` signal.
terminated (bool): whether the episode has ended due to a termination, in which case further step() calls will return undefined results
Contributor (review comment):
I would rephrase "termination" to something like "reaching a terminal state", or otherwise to indicate that it's about the intrinsic properties of the environment

truncated (bool): whether the episode has ended due to a truncation, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain:

- metrics that describe the agent's performance or
- state variables that are hidden from observations or
- information that distinguishes truncation and termination or
- individual reward terms that are combined to produce the total reward

(deprecated)
done (bool): whether the episode has ended due to any reason, in which case further step() calls will return undefined results
"""
raise NotImplementedError
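A minimal interaction loop under the new five-item return, assuming the environment (or a compatibility wrapper around it) already emits `(obs, reward, terminated, truncated, info)` and that `reset` follows the `return_info=True` signature used in the README above:

```python
import gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42, return_info=True)

for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    # Either flag ends the episode; only the reason differs.
    if terminated or truncated:
        obs, info = env.reset(return_info=True)

env.close()
```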

@@ -290,7 +299,11 @@ def metadata(self) -> dict:
def metadata(self, value):
self._metadata = value

def step(self, action: ActType) -> Tuple[ObsType, float, bool, dict]:
def step(
self, action: ActType
) -> Union[
Tuple[ObsType, float, bool, bool, dict], Tuple[ObsType, float, bool, dict]
]:
return self.env.step(action)

def reset(self, **kwargs) -> Union[ObsType, tuple[ObsType, dict]]:
@@ -325,8 +338,13 @@ def reset(self, **kwargs):
return self.observation(self.env.reset(**kwargs))

def step(self, action):
observation, reward, done, info = self.env.step(action)
return self.observation(observation), reward, done, info
step_returns = self.env.step(action)
if len(step_returns) == 5:
observation, reward, terminated, truncated, info = step_returns
return self.observation(observation), reward, terminated, truncated, info
else:
observation, reward, done, info = step_returns
return self.observation(observation), reward, done, info
arjun-kg marked this conversation as resolved.

@abstractmethod
def observation(self, observation):
@@ -338,8 +356,13 @@ def reset(self, **kwargs):
return self.env.reset(**kwargs)

def step(self, action):
observation, reward, done, info = self.env.step(action)
return observation, self.reward(reward), done, info
step_returns = self.env.step(action)
if len(step_returns) == 5:
observation, reward, terminated, truncated, info = step_returns
return observation, self.reward(reward), terminated, truncated, info
else:
observation, reward, done, info = step_returns
return observation, self.reward(reward), done, info
arjun-kg marked this conversation as resolved.

@abstractmethod
def reward(self, reward):
14 changes: 7 additions & 7 deletions gym/envs/box2d/bipedal_walker.py
@@ -581,13 +581,13 @@ def step(self, action: np.ndarray):
reward -= 0.00035 * MOTORS_TORQUE * np.clip(np.abs(a), 0, 1)
# normalized to about -50.0 using heuristic, more optimal agent should spend less

done = False
terminated = False
if self.game_over or pos[0] < 0:
reward = -100
done = True
terminated = True
if pos[0] > (TERRAIN_LENGTH - TERRAIN_GRASS) * TERRAIN_STEP:
done = True
return np.array(state, dtype=np.float32), reward, done, {}
terminated = True
return np.array(state, dtype=np.float32), reward, terminated, False, {}

def render(self, mode: str = "human"):
import pygame
@@ -757,9 +757,9 @@ def __init__(self):
SUPPORT_KNEE_ANGLE = +0.1
supporting_knee_angle = SUPPORT_KNEE_ANGLE
while True:
s, r, done, info = env.step(a)
s, r, terminated, truncated, info = env.step(a)
total_reward += r
if steps % 20 == 0 or done:
if steps % 20 == 0 or terminated or truncated:
print("\naction " + str([f"{x:+0.2f}" for x in a]))
print(f"step {steps} total_reward {total_reward:+0.2f}")
print("hull " + str([f"{x:+0.2f}" for x in s[0:4]]))
@@ -823,5 +823,5 @@ def __init__(self):
a = np.clip(0.5 * a, -1.0, 1.0)

env.render()
if done:
if terminated or truncated:
break
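The demo loop above only needs to know that the episode ended, so it tests `terminated or truncated`; training code is where the two flags diverge. A minimal sketch, assuming a one-step TD update, of why bootstrapping should key off `terminated` alone:

```python
# gamma-discounted one-step target; next_value is the critic's estimate
# for the successor state.
def td_target(reward: float, next_value: float, terminated: bool, gamma: float = 0.99) -> float:
    # Bootstrap unless a true terminal state was reached; a time-limit
    # truncation alone should not zero out the future return.
    return reward + gamma * next_value * (1.0 - float(terminated))
```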
14 changes: 7 additions & 7 deletions gym/envs/box2d/car_racing.py
@@ -415,7 +415,7 @@ def step(self, action):
self.state = self.render("state_pixels")

step_reward = 0
done = False
terminated = False
if action is not None: # First step without action, called from reset()
self.reward -= 0.1
# We actually don't want to count fuel spent, we want car to be faster.
@@ -424,13 +424,13 @@
step_reward = self.reward - self.prev_reward
self.prev_reward = self.reward
if self.tile_visited_count == len(self.track) or self.new_lap:
done = True
terminated = True
x, y = self.car.hull.position
if abs(x) > PLAYFIELD or abs(y) > PLAYFIELD:
done = True
terminated = True
step_reward = -100

return self.state, step_reward, done, {}
return self.state, step_reward, terminated, False, {}

def render(self, mode="human"):
import pygame
@@ -660,13 +660,13 @@ def register_input():
restart = False
while True:
register_input()
s, r, done, info = env.step(a)
s, r, terminated, truncated, info = env.step(a)
total_reward += r
if steps % 200 == 0 or done:
if steps % 200 == 0 or terminated or truncated:
print("\naction " + str([f"{x:+0.2f}" for x in a]))
print(f"step {steps} total_reward {total_reward:+0.2f}")
steps += 1
isopen = env.render()
if done or restart or isopen == False:
if terminated or truncated or restart or isopen == False:
break
env.close()
14 changes: 7 additions & 7 deletions gym/envs/box2d/lunar_lander.py
@@ -473,14 +473,14 @@ def step(self, action):
) # less fuel spent is better, about -30 for heuristic landing
reward -= s_power * 0.03

done = False
terminated = False
if self.game_over or abs(state[0]) >= 1.0:
done = True
terminated = True
reward = -100
if not self.lander.awake:
done = True
terminated = True
reward = +100
return np.array(state, dtype=np.float32), reward, done, {}
return np.array(state, dtype=np.float32), reward, terminated, False, {}

def render(self, mode="human"):
import pygame
@@ -654,19 +654,19 @@ def demo_heuristic_lander(env, seed=None, render=False):
s = env.reset(seed=seed)
while True:
a = heuristic(env, s)
s, r, done, info = env.step(a)
s, r, terminated, truncated, info = env.step(a)
total_reward += r

if render:
still_open = env.render()
if still_open == False:
break

if steps % 20 == 0 or done:
if steps % 20 == 0 or terminated or truncated:
print("observations:", " ".join([f"{x:+0.2f}" for x in s]))
print(f"step {steps} total_reward {total_reward:+0.2f}")
steps += 1
if done:
if terminated or truncated:
break
if render:
env.close()
14 changes: 7 additions & 7 deletions gym/envs/classic_control/acrobot.py
@@ -82,12 +82,12 @@ class AcrobotEnv(core.Env):
Each parameter in the underlying state (`theta1`, `theta2`, and the two angular velocities) is initialized
uniformly between -0.1 and 0.1. This means both links are pointing downwards with some initial stochasticity.

### Episode Termination
### Episode End

The episode terminates if one of the following occurs:
1. The free end reaches the target height, which is constructed as:
The episode ends if one of the following occurs:
1. Termination: The free end reaches the target height, which is constructed as:
`-cos(theta1) - cos(theta2 + theta1) > 1.0`
2. Episode length is greater than 500 (200 for v0)
2. Truncation: Episode length is greater than 500 (200 for v0)

### Arguments

@@ -206,9 +206,9 @@ def step(self, a):
ns[2] = bound(ns[2], -self.MAX_VEL_1, self.MAX_VEL_1)
ns[3] = bound(ns[3], -self.MAX_VEL_2, self.MAX_VEL_2)
self.state = ns
terminal = self._terminal()
reward = -1.0 if not terminal else 0.0
return (self._get_ob(), reward, terminal, {})
terminated = self._terminal()
reward = -1.0 if not terminated else 0.0
return (self._get_ob(), reward, terminated, False, {})

def _get_ob(self):
s = self.state
33 changes: 17 additions & 16 deletions gym/envs/classic_control/cartpole.py
@@ -56,12 +56,13 @@ class CartPoleEnv(gym.Env[np.ndarray, Union[int, np.ndarray]]):

All observations are assigned a uniformly random value in `(-0.05, 0.05)`

### Episode Termination
### Episode End

The episode terminates if any one of the following occurs:
1. Pole Angle is greater than ±12°
2. Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)
3. Episode length is greater than 500 (200 for v0)
The episode ends if any one of the following occurs:

1. Termination: Pole Angle is greater than ±12°
2. Termination: Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)
3. Truncation: Episode length is greater than 500 (200 for v0)

### Arguments

@@ -109,7 +110,7 @@ def __init__(self):
self.isopen = True
self.state = None

self.steps_beyond_done = None
self.steps_beyond_terminated = None

def step(self, action):
err_msg = f"{action!r} ({type(action)}) invalid"
@@ -143,31 +144,31 @@ def step(self, action):

self.state = (x, x_dot, theta, theta_dot)

done = bool(
terminated = bool(
x < -self.x_threshold
or x > self.x_threshold
or theta < -self.theta_threshold_radians
or theta > self.theta_threshold_radians
)

if not done:
if not terminated:
reward = 1.0
elif self.steps_beyond_done is None:
elif self.steps_beyond_terminated is None:
# Pole just fell!
self.steps_beyond_done = 0
self.steps_beyond_terminated = 0
reward = 1.0
else:
if self.steps_beyond_done == 0:
if self.steps_beyond_terminated == 0:
logger.warn(
"You are calling 'step()' even though this "
"environment has already returned done = True. You "
"should always call 'reset()' once you receive 'done = "
"environment has already returned terminated = True. You "
"should always call 'reset()' once you receive 'terminated = "
"True' -- any further steps are undefined behavior."
)
self.steps_beyond_done += 1
self.steps_beyond_terminated += 1
reward = 0.0

return np.array(self.state, dtype=np.float32), reward, done, {}
return np.array(self.state, dtype=np.float32), reward, terminated, False, {}

def reset(
self,
@@ -178,7 +179,7 @@
):
super().reset(seed=seed)
self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,))
self.steps_beyond_done = None
self.steps_beyond_terminated = None
if not return_info:
return np.array(self.state, dtype=np.float32)
else:
16 changes: 9 additions & 7 deletions gym/envs/classic_control/continuous_mountain_car.py
@@ -76,11 +76,11 @@ class Continuous_MountainCarEnv(gym.Env):

The position of the car is assigned a uniform random value in `[-0.6 , -0.4]`. The starting velocity of the car is always assigned to 0.

### Episode Termination
### Episode End

The episode terminates if either of the following happens:
1. The position of the car is greater than or equal to 0.45 (the goal position on top of the right hill)
2. The length of the episode is 999.
The episode ends if either of the following happens:
1. Termination: The position of the car is greater than or equal to 0.45 (the goal position on top of the right hill)
2. Truncation: The length of the episode is 999.

### Arguments

@@ -145,15 +145,17 @@ def step(self, action: np.ndarray):
velocity = 0

# Convert a possible numpy bool to a Python bool.
done = bool(position >= self.goal_position and velocity >= self.goal_velocity)
terminated = bool(
position >= self.goal_position and velocity >= self.goal_velocity
)

reward = 0
if done:
if terminated:
reward = 100.0
reward -= math.pow(action[0], 2) * 0.1

self.state = np.array([position, velocity], dtype=np.float32)
return self.state, reward, done, {}
return self.state, reward, terminated, False, {}

def reset(
self,
14 changes: 8 additions & 6 deletions gym/envs/classic_control/mountain_car.py
@@ -72,11 +72,11 @@ class MountainCarEnv(gym.Env):

The position of the car is assigned a uniform random value in *[-0.6 , -0.4]*. The starting velocity of the car is always assigned to 0.

### Episode Termination
### Episode End

The episode terminates if either of the following happens:
1. The position of the car is greater than or equal to 0.5 (the goal position on top of the right hill)
2. The length of the episode is 200.
The episode ends if either of the following happens:
1. Termination: The position of the car is greater than or equal to 0.5 (the goal position on top of the right hill)
2. Truncation: The length of the episode is 200.


### Arguments
@@ -125,11 +125,13 @@ def step(self, action: int):
if position == self.min_position and velocity < 0:
velocity = 0

done = bool(position >= self.goal_position and velocity >= self.goal_velocity)
terminated = bool(
position >= self.goal_position and velocity >= self.goal_velocity
)
reward = -1.0

self.state = (position, velocity)
return np.array(self.state, dtype=np.float32), reward, done, {}
return np.array(self.state, dtype=np.float32), reward, terminated, False, {}

def reset(
self,