
Commit

Merge branch 'master' into new_hanabi
benblack769 committed May 10, 2020
2 parents 09b191d + 8de254c commit 66b2f49
Showing 85 changed files with 2,215 additions and 2,629 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -3,4 +3,7 @@ __pycache__/
*.swp
.DS_Store
.vscode/
saved_observations/
saved_observations/
build/
dist/
PettingZoo.egg-info/
3 changes: 3 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,3 @@
recursive-include pettingzoo *
global-exclude __pycache__
global-exclude *.pyc
16 changes: 6 additions & 10 deletions README.md
@@ -14,9 +14,9 @@ PettingZoo includes the following sets of games:
* mpe: A set of simple nongraphical communication tasks, originally from https://github.com/openai/multiagent-particle-envs
* sisl: 3 cooperative environments, originally from https://github.com/sisl/MADRL

To install a set of games, use `pip3 install pettingzoo[atari]` (or whichever set of games you want).
To install, use `pip install pettingzoo`

We support Python 3.5, 3.6, 3.7 and 3.8
We support Python 3.6, 3.7 and 3.8


## Initializing Environments
@@ -155,7 +155,7 @@ from pettingzoo.utils import random_demo
random_demo(env)
```

### Observation Saver
### Observation Saving

If the agents in a game make observations that are images, the observations can be saved to an image file. This function takes in the environment, along with a specified agent. If no agent is specified, the currently selected agent for the environment is chosen. If `all_agents` is passed in as True, then the observations of all agents in the environment are saved. By default the images are saved to the current working directory, in a folder matching the environment name. The saved image will match the name of the observing agent. If `save_dir` is passed in, a new folder is created where the images will be saved.
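
A minimal usage sketch, assuming the helper described above is `save_observation` from `pettingzoo.utils` and that it accepts `all_agents` and `save_dir` keyword arguments:

```
# Sketch only: the helper's exact name and keyword arguments are assumed here.
from pettingzoo.utils import save_observation
from pettingzoo.gamma import prison_v0  # an environment with image observations

env = prison_v0.env()
env.reset()

# Save the currently selected agent's observation to ./<env name>/
save_observation(env)

# Save every agent's observation into a custom folder
save_observation(env, all_agents=True, save_dir="saved_observations")
```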

@@ -180,14 +180,10 @@ Our cooperative games have leaderboards for best total (summed over all agents)
The following environments are under active development:

* atari/* (Ben)
* classic/checkers (Ben)
* classic/go (Luis)
* classic/backgammon (Caroline)
* classic/checkers (Caroline)
* classic/hanabi (Clemens)
* classic/shogi (Caroline)
* gamma/prospector (Yashas)
* magent/* (Mario)
* robotics/* (Yiling)
* classic/backgammon (Caroline)

Development has not yet started on:

* classic/shogi (python-shogi)
24 changes: 15 additions & 9 deletions docs/gamma.md
@@ -42,10 +42,10 @@ Move the left paddle using the 'W' and 'S' keys. Move the right paddle using 'UP

```
cooperative_pong.env(ball_speed=18, left_paddle_speed=25,
right_paddle_speed=25, is_cake_paddle=True, max_frames=900, bounce_randomness=False)
right_paddle_speed=25, cake_paddle=True, max_frames=900, bounce_randomness=False)
```

The speed of the ball (`ball_speed` )is held constant throughout the game, while the initial direction of the ball is randomized when `reset()` method is called. The speed of left and right paddles are controlled by `left_paddle_speed` and `right_paddle_speed` respectively. If `is_cake_paddle` is `True`, the right paddle has the shape of a 4-tiered wedding cake. `done` of all agents are set to `True` after `max_frames` number of frames elapse. If `bounce_randomness` is `True`, each collision of the ball with the paddles adds a small random angle to the direction of the ball, while the speed of the ball remains unchanged.
The speed of the ball (`ball_speed`) is held constant throughout the game, while the initial direction of the ball is randomized when the `reset()` method is called. The speeds of the left and right paddles are controlled by `left_paddle_speed` and `right_paddle_speed` respectively. If `cake_paddle` is `True`, the right paddle has the shape of a 4-tiered wedding cake. `done` of all agents is set to `True` after `max_frames` frames elapse. If `bounce_randomness` is `True`, each collision of the ball with the paddles adds a small random angle to the direction of the ball, while the speed of the ball remains unchanged.
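
For illustration, a construction sketch with a few non-default arguments; the import path is an assumption based on the import pattern used for other environments in these docs:

```
# Sketch only: the module name cooperative_pong is assumed from the call shown above.
from pettingzoo.gamma import cooperative_pong

env = cooperative_pong.env(ball_speed=18, cake_paddle=True, bounce_randomness=True)
env.reset()  # the ball's initial direction is randomized on reset
```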

Leaderboard:

@@ -67,7 +67,7 @@ Leaderboard:

*AEC diagram*

Zombies walk from the top border of the screen down to the bottom border in unpredictable paths. The agents you control are knights and archers (default 2 knights and 2 archers) that are initially positioned at the bottom border of the screen. Each agent can rotate clockwise or counter-clockwise and move forward or backward. Each agent can also attack to kill zombies. When a knight attacks, it swings a mace in an arc in front of its current heading direction. When an archer attacks, it fires an arrow in a straight line in the direction of the archer's heading. The game ends when all agents die (collide with a zombie) or a zombie reaches the bottom screen border. An agent gets a reward when it kills a zombie. Each agent observes the environment as a square region around itself, with its own body in the center of the square. The observation is represented as a 512x512 image around the agent.
Zombies walk from the top border of the screen down to the bottom border in unpredictable paths. The agents you control are knights and archers (default 2 knights and 2 archers) that are initially positioned at the bottom border of the screen. Each agent can rotate clockwise or counter-clockwise and move forward or backward. Each agent can also attack to kill zombies. When a knight attacks, it swings a mace in an arc in front of its current heading direction. When an archer attacks, it fires an arrow in a straight line in the direction of the archer's heading. The game ends when all agents die (collide with a zombie) or a zombie reaches the bottom screen border. A knight is rewarded 1 point when its mace hits and kills a zombie. An archer is rewarded 1 point when one of its arrows hits and kills a zombie. Each agent observes the environment as a square region around itself, with its own body in the center of the square. The observation is represented as a 512x512 pixel image around the agent, or in other words, a 16x16 agent-sized space around the agent.

Manual Control:

@@ -80,7 +80,7 @@ Press 'M' key to spawn a new knight.

```
knights_archers_zombies.env(spawn_rate=20, knights=2, archers=2,
killable_knights=True, killable_archers=True, line_death=True, pad_observation=True, max_frames=900)
killable_knights=True, killable_archers=True, black_death=True, line_death=True, pad_observation=True, max_frames=900)
```

*about arguments*
@@ -96,7 +96,9 @@ killable_knights: if set to False, knight agents cannot be killed by zombies.
killable_archers: if set to False, archer agents cannot be killed by zombies.
line_death:
black_death: if set to True, agents who die will observe only black. If False, dead agents do not have reward, done, info or observations and are removed from agent list.
line_death: if set to False, agents do not die when they touch the top or bottom border. If True, agents die as soon as they touch the top or bottom border.
pad_observation: if agents are near the edge of the environment, their observation cannot form a 40x40 grid. If this is set to True, the observation is padded with black.
```
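
For illustration, a construction sketch using a few of the arguments above; the import path is an assumption following the `pettingzoo.gamma` pattern used elsewhere in these docs:

```
# Sketch only: the module name knights_archers_zombies is assumed from the call shown above.
from pettingzoo.gamma import knights_archers_zombies

env = knights_archers_zombies.env(
    knights=2,
    archers=2,
    black_death=False,     # dead agents are removed from the agent list instead of observing black
    line_death=True,       # agents die as soon as they touch the top or bottom border
    pad_observation=True,  # pad edge observations with black
    max_frames=900,
)
env.reset()
```
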
@@ -122,7 +124,7 @@ Leaderboard:
*AEC diagram*

This is a simple physics based cooperative game where the goal is to move the ball to the left wall of the game border by activating any of the twenty vertically moving pistons. Pistons can only see themselves, and the two pistons next to them.
Thus, pistons must learn highly coordinated emergent behavior to achieve an optimal policy for the environment. Each agent get's a reward that is a combination of how much the ball moved left overall, and how much the ball moved left if it was close to the piston (i.e. movement it contributed to). Balancing the ratio between these appears to be critical to learning this environment, and as such is an environment parameter. If the ball moves to the left, a positive global reward is applied. If the ball moves to the right then a negative global reward is applied. Additionally, pistons that are within a radius of the ball are given a local reward.
Thus, pistons must learn highly coordinated emergent behavior to achieve an optimal policy for the environment. Each agent gets a reward that is a combination of how much the ball moved left overall, and how much the ball moved left if it was close to the piston (i.e. movement it contributed to). Balancing the ratio between these appears to be critical to learning this environment, and as such is an environment parameter. The local reward applied is 0.5 times the change in the ball's x-position. Additionally, the global reward is the change in x-position divided by the starting position, times 100. For each piston, the reward is 0.02 * local_reward + 0.08 * global_reward. The local reward is applied to pistons surrounding the ball, while the global reward is provided to all pistons.
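
As a reading aid, the per-piston reward described above can be sketched in plain Python (coefficients copied from the text; this is not the environment's actual implementation):

```
def piston_reward(ball_dx, ball_start_x, near_ball):
    """Sketch of the reward rule described above, not the environment source.

    ball_dx: change in the ball's x-position this step
    ball_start_x: the ball's starting x-position
    near_ball: whether this piston is within the local-reward radius of the ball
    """
    local_reward = 0.5 * ball_dx
    global_reward = 100 * ball_dx / ball_start_x
    if near_ball:
        return 0.02 * local_reward + 0.08 * global_reward
    return 0.08 * global_reward
```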

Pistonball uses the Chipmunk physics engine, so the physics are about as realistic as in Angry Birds.

@@ -167,9 +169,9 @@ Continuous Leaderboard:

### Prison

| Actions | Agents | Manual Control | Action Shape | Action Values | Observation Shape | Observation Values | Num States |
|---------|--------|----------------|--------------|---------------|-------------------|--------------------|------------|
| Either | 8 | Yes | (1,) | [0, 2] | (100, 300, 3) | (0, 255) | ? |
| Actions | Agents | Manual Control | Action Shape | Action Values | Observation Shape | Observation Values | Num States |
|---------|--------|----------------|--------------|---------------|----------------------|------------------------|------------|
| Either | 8 | Yes | (1,) | [0, 2] | (100, 300, 3) or (1,)| (0, 255) or (-300, 300)| ? |

`from pettingzoo.gamma import prison_v0`

@@ -181,6 +183,10 @@ Continuous Leaderboard:

In prison, 8 aliens locked in identical prison cells are controlled by the user. They cannot communicate with each other in any way, and can only pace in their cells. Every time an alien touches one end of its cell and then the other, it gets a reward of 1. Due to the fully independent nature of these agents and the simplicity of the task, this is an environment primarily intended for debugging purposes - it is essentially multiple individual, purely single-agent tasks. To make this debugging tool compatible with as many methods as possible, it can accept both discrete and continuous actions, and the observation can be automatically turned into a number representing the position of the alien from the left of its cell instead of the normal graphical output.
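
A toy sketch of the pacing reward rule described above (names and structure are illustrative assumptions, not the environment's code):

```
def pacing_reward(position, cell_left, cell_right, last_end_touched):
    """Return (reward, new_last_end_touched) for one alien's step.

    Illustrates 'touch one end of the cell and then the other -> reward of 1'."""
    if position <= cell_left and last_end_touched != "left":
        return (1 if last_end_touched == "right" else 0), "left"
    if position >= cell_right and last_end_touched != "right":
        return (1 if last_end_touched == "left" else 0), "right"
    return 0, last_end_touched
```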

Manual Control:

Select different aliens with 'W', 'A', 'S' or 'D'. Move the selected alien left with 'J' and right with 'K'.

Arguments:

```
2 changes: 1 addition & 1 deletion docs/mpe.md
@@ -159,7 +159,7 @@ max_frames: number of frames (a step for each agent) until game terminates

*AEC diagram*

In this environment, there are 2 good agents (Alice and Bob) and 1 adversary (Eve). Alice must sent a private 1 bit message to Bob over a public channel. Alice and Bob are rewarded if Bob reconstructs the message, but are negatively rewarded if Eve reconstruct the message. Eve is rewarded based on how well it can reconstruct the signal. Alice and Bob have a private key (randomly generated at beginning of each episode), which they must learn to use to encrypt the message.
In this environment, there are 2 good agents (Alice and Bob) and 1 adversary (Eve). Alice must send a private 1 bit message to Bob over a public channel. Alice and Bob are rewarded +2 if Bob reconstructs the message, but are rewarded -2 if Eve reconstructs the message (these add to 0 if both teams reconstruct the bit). Eve is rewarded -2 if it cannot reconstruct the signal, and 0 if it can. Alice and Bob have a private key (randomly generated at the beginning of each episode), which they must learn to use to encrypt the message.
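
The reward structure described above can be summarized in a small sketch (values taken from the text; not the environment's implementation):

```
def crypto_rewards(bob_reconstructed, eve_reconstructed):
    """Per-episode rewards for Alice, Bob and Eve as described above (sketch only)."""
    alice_bob = (2 if bob_reconstructed else 0) + (-2 if eve_reconstructed else 0)
    eve = 0 if eve_reconstructed else -2
    return {"alice": alice_bob, "bob": alice_bob, "eve": eve}

# Both teams reconstruct the bit -> Alice/Bob net to 0, Eve gets 0.
assert crypto_rewards(True, True) == {"alice": 0, "bob": 0, "eve": 0}
```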


Alice observation space: `[message, private_key]`
7 changes: 6 additions & 1 deletion docs/sisl.md
@@ -39,7 +39,7 @@ Please additionally cite:

*AEC diagram*

A package is placed on top of (by default) 3 pairs of robot legs which you control. The robots must learn to move the package as far as possible to the right. Each walker gets a reward of 1 for moving the package forward, and a reward of -100 for dropping the package. Each walker exerts force on two joints in their two legs, giving a continuous action space represented as a 4 element vector. Each walker observes via a 32 element vector, containing simulated noisy lidar data about the environment and information about neighboring walkers. The environment runs for 500 frames by default.
A package is placed on top of (by default) 3 pairs of robot legs which you control. The robots must learn to move the package as far as possible to the right. Each walker is rewarded based on the change in the package's distance summed with 130 times the change in the walker's position. A walker is given a reward of -100 if it falls, and a reward of -100 for each fallen walker in the environment. If the global reward mechanic is chosen, the mean of all rewards is given to each agent. Each walker exerts force on two joints in its two legs, giving a continuous action space represented as a 4 element vector. Each walker observes via a 32 element vector containing simulated noisy lidar data about the environment and information about neighboring walkers. The environment runs for 500 frames by default.
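
Read literally, the per-walker ('local') reward described above can be sketched as follows (signs, coefficients and penalty handling are taken from the text and may not match the source exactly):

```
def walker_reward(d_package_x, d_walker_x, fell, n_fallen_walkers):
    """Sketch of the local reward described above, not the environment source."""
    reward = d_package_x + 130.0 * d_walker_x
    if fell:
        reward -= 100.0
    reward -= 100.0 * n_fallen_walkers  # penalty for each fallen walker in the environment
    return reward

# With reward_mech='global', each agent would instead receive the mean of all walkers' rewards.
```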

```
multiwalker.env(n_walkers=3, position_noise=1e-3, angle_noise=1e-3, reward_mech='local',
@@ -94,6 +94,11 @@ Add Gupta et al and DDPG paper results too

By default there are 30 blue evaders and 8 red pursuer agents, in a 16 x 16 grid with an obstacle in the center, shown in white. The evaders move randomly, and the pursuers are controlled. Every time the pursuers fully surround an evader, each of the surrounding agents receives a reward of 5, and the evader is removed from the environment. Pursuers also receive a reward of 0.01 every time they touch an evader. The pursuers have a discrete action space of up, down, left, right and stay. Each pursuer observes a 7 x 7 grid centered around itself, depicted by the orange boxes surrounding the red pursuer agents. The environment runs for 500 frames by default. Observation shape takes the full form of `(3, obs_range, obs_range)`.
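
For illustration, a construction sketch tying the observation shape to the `obs_range` argument; the import path is an assumption based on this being the SISL set:

```
# Sketch only: the module path pettingzoo.sisl.pursuit is assumed.
from pettingzoo.sisl import pursuit

env = pursuit.env(n_evaders=30, n_pursuers=8, obs_range=7)
env.reset()
# Each pursuer's observation has shape (3, obs_range, obs_range), i.e. (3, 7, 7) here.
```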

Manual Control:

Select different pursuers with 'J' and 'K'. The selected pursuer can be moved with the arrow keys.


```
pursuit.env(max_frames=500, xs=16, ys=16, reward_mech='local', n_evaders=30, n_pursuers=8,
obs_range=7, layer_norm=10, n_catch=2, random_opponents=False, max_opponents=10,
5 changes: 0 additions & 5 deletions pettingzoo/__init__.py
@@ -1,7 +1,2 @@
from pettingzoo.utils.env import AECEnv
import pettingzoo.utils
import pettingzoo.gamma
import pettingzoo.sisl
import pettingzoo.classic
import pettingzoo.tests
import pettingzoo.magent
2 changes: 1 addition & 1 deletion pettingzoo/classic/checkers/checkers.py
@@ -14,7 +14,7 @@ class env(AECEnv):
metadata = {'render.modes': ['human']}

def __init__(self):
super(env, self).__init__()
super().__init__()

self.ch = CheckersRules()
self.num_agents = 2
85 changes: 35 additions & 50 deletions pettingzoo/classic/chess/chess_env.py
@@ -6,14 +6,24 @@
import warnings
from pettingzoo.utils.agent_selector import agent_selector
from pettingzoo.utils.env_logger import EnvLogger
from pettingzoo.utils import wrappers


class env(AECEnv):
def env():
env = raw_env()
env = wrappers.TerminateIllegalWrapper(env, illegal_reward=-1)
env = wrappers.AssertOutOfBoundsWrapper(env)
env = wrappers.NaNRandomWrapper(env)
env = wrappers.OrderEnforcingWrapper(env)
return env


class raw_env(AECEnv):

metadata = {'render.modes': ['human', 'ascii']}

def __init__(self):
super(env, self).__init__()
super().__init__()

self.board = chess.Board()

@@ -25,20 +35,15 @@ def __init__(self):
self.action_spaces = {name: spaces.Discrete(8 * 8 * 73) for name in self.agents}
self.observation_spaces = {name: spaces.Box(low=0, high=1, shape=(8, 8, 20), dtype=np.float32) for name in self.agents}

# self.rewards = None
# self.dones = None
# self.infos = None
#
# self.agent_selection = None
self.rewards = None
self.dones = None
self.infos = {name: {} for name in self.agents}

self.has_reset = False
self.has_rendered = False
self.agent_selection = None

self.num_agents = len(self.agents)

def observe(self, agent):
if not self.has_reset:
EnvLogger.error_observe_before_reset()
return chess_utils.get_observation(self.board, self.agents.index(agent))

def reset(self, observe=True):
@@ -66,49 +71,32 @@ def set_game_result(self, result_val):
self.infos[name] = {'legal_moves': []}

def step(self, action, observe=True):
if not self.has_reset:
EnvLogger.error_step_before_reset()
backup_policy = "game terminating with current player losing"
act_space = self.action_spaces[self.agent_selection]
if np.isnan(action).any():
EnvLogger.warn_action_is_NaN(backup_policy)
if not act_space.contains(action):
EnvLogger.warn_action_out_of_bound(action, act_space, backup_policy)

current_agent = self.agent_selection
current_index = self.agents.index(current_agent)
self.agent_selection = next_agent = self._agent_selector.next()

old_legal_moves = self.infos[current_agent]['legal_moves']
chosen_move = chess_utils.action_to_move(self.board, action, current_index)
assert chosen_move in self.board.legal_moves
self.board.push(chosen_move)

if action not in old_legal_moves:
EnvLogger.warn_on_illegal_move()
player_loses_val = -1 if current_index == 0 else 1
self.set_game_result(player_loses_val)
self.rewards[next_agent] = 0
else:
chosen_move = chess_utils.action_to_move(self.board, action, current_index)
assert chosen_move in self.board.legal_moves
self.board.push(chosen_move)

next_legal_moves = chess_utils.legal_moves(self.board)
next_legal_moves = chess_utils.legal_moves(self.board)

is_stale_or_checkmate = not any(next_legal_moves)
is_stale_or_checkmate = not any(next_legal_moves)

# claim draw is set to be true to allign with normal tournament rules
is_repetition = self.board.is_repetition(3)
is_50_move_rule = self.board.can_claim_fifty_moves()
is_claimable_draw = is_repetition or is_50_move_rule
game_over = is_claimable_draw or is_stale_or_checkmate
# claim draw is set to be true to allign with normal tournament rules
is_repetition = self.board.is_repetition(3)
is_50_move_rule = self.board.can_claim_fifty_moves()
is_claimable_draw = is_repetition or is_50_move_rule
game_over = is_claimable_draw or is_stale_or_checkmate

if game_over:
result = self.board.result(claim_draw=True)
result_val = chess_utils.result_to_int(result)
self.set_game_result(result_val)
else:
self.infos[current_agent] = {'legal_moves': []}
self.infos[next_agent] = {'legal_moves': next_legal_moves}
assert len(self.infos[next_agent]['legal_moves'])
if game_over:
result = self.board.result(claim_draw=True)
result_val = chess_utils.result_to_int(result)
self.set_game_result(result_val)
else:
self.infos[current_agent] = {'legal_moves': []}
self.infos[next_agent] = {'legal_moves': next_legal_moves}
assert len(self.infos[next_agent]['legal_moves'])

if observe:
next_observation = self.observe(next_agent)
@@ -117,10 +105,7 @@ def render(self, mode='human'):
return next_observation

def render(self, mode='human'):
self.has_rendered = True
print(self.board)

def close(self):
if not self.has_rendered:
EnvLogger.warn_close_unrendered_env()
self.has_rendered = False
pass
