Update SB3 tutorial (action masking, tests) #1017

Merged Jul 11, 2023 · 38 commits

Changes from 1 commit

Commits
cf2bef7
Update SB3 tutorial to have __main__ (error on macOS)
elliottower Jul 7, 2023
86916e6
Add SB3 tests for tutorials
elliottower Jul 7, 2023
f3e0cc9
Add action masking tutorial, fix typos/update documentation
elliottower Jul 7, 2023
d0784be
Add try catch for test sb3 action mask (pytest -v shouldn't require sb3)
elliottower Jul 7, 2023
87c46ef
Clean up documentation
elliottower Jul 8, 2023
baf59fa
Fix requirements.txt to specify pettingzoo[classic]
elliottower Jul 8, 2023
db8331a
Add try catch for render action mask
elliottower Jul 8, 2023
05b3dcc
Add try catch for render action mask
elliottower Jul 8, 2023
ecb96bf
Add try catch for other render files
elliottower Jul 8, 2023
d887db2
Fix code which doesn't work due to modules (tutorials not included)
elliottower Jul 8, 2023
085ed0a
Switch userwarnings to print statements and exit (so it doesn't fail)
elliottower Jul 8, 2023
18eca55
Add butterfly requirement to sb3 tutorial
elliottower Jul 8, 2023
429cbd8
Switch default timesteps to be more reasonable (10,000)
elliottower Jul 8, 2023
c9f0024
Switch default timesteps to be lower (2048), just so CI runs faster
elliottower Jul 8, 2023
a64022d
Switch num cpus to 2 by default (GitHub Actions runners only get 2 cores)
elliottower Jul 8, 2023
ee08317
Fix print statements logic
elliottower Jul 8, 2023
bd83f30
Update tutorials to evaluate, add KAZ example, test hyperparameters
elliottower Jul 9, 2023
0185af8
Update code to check more in depth statistics like winrate and total …
elliottower Jul 10, 2023
c977d89
Pre-commit
elliottower Jul 10, 2023
0625296
Un-comment training code for KAZ
elliottower Jul 10, 2023
459cc86
Update hyperparameters and fix pistonball crashing issue
elliottower Jul 10, 2023
9546c9c
Add hyperparameter notes
elliottower Jul 10, 2023
8cfd867
Add multiwalker tutorial for MLP example
elliottower Jul 10, 2023
41e26fc
Fix typo in docs
elliottower Jul 10, 2023
6af9e18
Polish up documentation and add sphinx warnings/notes
elliottower Jul 10, 2023
a454362
Try to fix missing module error from test file
elliottower Jul 10, 2023
5c75d4c
Update test_sb3_action_mask.py
elliottower Jul 10, 2023
fd23175
Add importorskip to each test, choose better hyperparameters
elliottower Jul 10, 2023
cefc86d
Move pytest importorskip calls
elliottower Jul 10, 2023
142b155
Disable most of the tests on test_sb3_action_mask.py
elliottower Jul 10, 2023
1a2d2ef
Split CI tests into separate actions (so they don't take 2 hours)
elliottower Jul 10, 2023
35addaa
Add separate requirements files for different sb3 tutorials
elliottower Jul 10, 2023
996274e
Fix workflow for tutorials to always install from root dir
elliottower Jul 10, 2023
c4834b5
Un-skip the rest of the action mask tests, as the longest one is pist…
elliottower Jul 10, 2023
1dfe96b
Remove pistonball env.close() line to avoid SuperSuit issue
elliottower Jul 10, 2023
5f65af0
Change multiwalker to waterworld (actually trains), remove pistonball…
elliottower Jul 11, 2023
7403e02
Add pymunk dependency to sisl waterworld (modulenotfound error)
elliottower Jul 11, 2023
a92b2a3
Add pymunk req
elliottower Jul 11, 2023
Add multiwalker tutorial for MLP example
elliottower committed Jul 10, 2023

Verified: this commit was signed with the committer’s verified signature (Elliot Tower).
commit 8cfd867292d4ec629e029869220b0ed7919ba754
13 changes: 10 additions & 3 deletions docs/tutorials/sb3/index.md
@@ -6,11 +6,17 @@ title: "Stable-Baselines3"

These tutorials show you how to use the [SB3](https://stable-baselines3.readthedocs.io/en/master/) library to train agents in PettingZoo environments.

* [PPO for Pistonball](/tutorials/sb3/pistonball/): _Train a PPO model in vectorized Parallel environment_
For environments with visual observations, we use a [CNN](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#stable_baselines3.ppo.CnnPolicy) policy and perform pre-processing steps such as frame-stacking, color reduction, and resizing using [SuperSuit](/api/wrappers/supersuit_wrappers/).

* [PPO for Knights-Archers-Zombies](/tutorials/sb3/kaz/) _Train a PPO model in a vectorized AEC environment_
* [PPO for Pistonball](/tutorials/sb3/pistonball/): _Train agents using PPO in a vectorized Parallel environment_

* [Action Masked PPO for Chess](/tutorials/sb3/connect_four/): _Train an action masked PPO model in an AEC environment_
* [PPO for Knights-Archers-Zombies](/tutorials/sb3/kaz/): _Train agents using PPO in a vectorized AEC environment_

For non-visual environments, we use [Actor Critic](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#stable_baselines3.ppo.MlpPolicy) or [Maskable Actor Critic](https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html#maskableppo-policies) policies and do not perform any pre-processing steps.

* [PPO for Multiwalker](/tutorials/sb3/multiwalker/): _Train agents using PPO in a vectorized Parallel environment_

* [Action Masked PPO for Connect Four](/tutorials/sb3/connect_four/): _Train an agent using Action Masked PPO in an AEC environment_


## Stable-Baselines Overview
@@ -33,5 +39,6 @@ Note: SB3 does not officially support PettingZoo, as it is designed for single-a

pistonball
kaz
multiwalker
connect_four
```
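
The visual pre-processing pipeline described above can be sketched as follows. This is a minimal illustration, not part of this diff; it assumes the Pistonball environment, and wrapper version suffixes may differ across SuperSuit releases:

```python
import supersuit as ss
from pettingzoo.butterfly import pistonball_v6

# Visual observations: reduce color channels, shrink frames, and stack them
# so the CNN policy can see short-term motion.
env = pistonball_v6.parallel_env()
env = ss.color_reduction_v0(env, mode="B")  # keep only the blue channel
env = ss.resize_v1(env, x_size=84, y_size=84)
env = ss.frame_stack_v1(env, 3)

# Convert to an SB3-compatible vectorized environment.
env = ss.pettingzoo_env_to_vec_env_v1(env)
env = ss.concat_vec_envs_v1(env, 8, num_cpus=2, base_class="stable_baselines3")
```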
29 changes: 29 additions & 0 deletions docs/tutorials/sb3/multiwalker.md
@@ -0,0 +1,29 @@
---
title: "SB3: PPO for Multiwalker (Parallel)"
---

# SB3: PPO for Multiwalker

This tutorial shows how to train a [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) (PPO) model on the [Multiwalker](https://pettingzoo.farama.org/environments/sisl/multiwalker/) environment ([Parallel](https://pettingzoo.farama.org/api/parallel/)).

Note: this environment uses a continuous 1-dimensional vector observation space, so we use an MLP policy rather than a CNN.

After training and evaluation, this script will launch a demo game with human rendering. Trained models are saved and loaded from disk (see SB3's [documentation](https://stable-baselines3.readthedocs.io/en/master/guide/save_format.html) for more information).


## Environment Setup
To follow this tutorial, you will need to install the dependencies shown below. It is recommended to use a newly-created virtual environment to avoid dependency conflicts.
```{eval-rst}
.. literalinclude:: ../../../tutorials/SB3/requirements.txt
:language: text
```

## Code
The following code should run without any issues. The comments are designed to help you understand how to use PettingZoo with SB3. If you have any questions, please feel free to ask in the [Discord server](https://discord.gg/nhvKkYa6qX).

### Training and Evaluation

```{eval-rst}
.. literalinclude:: ../../../tutorials/SB3/sb3_multiwalker_vector.py
:language: python
```
2 changes: 1 addition & 1 deletion docs/tutorials/sb3/pistonball.md
@@ -4,7 +4,7 @@ title: "SB3: PPO for Pistonball (Parallel)"

# SB3: PPO for Pistonball

This tutorial shows how to train a [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) (PPO) model on the [Pistonball](https://pettingzoo.farama.org/environments/butterfly/pistonball/) environment ([parallel](https://pettingzoo.farama.org/api/parallel/)).
This tutorial shows how to train a [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) (PPO) model on the [Pistonball](https://pettingzoo.farama.org/environments/butterfly/pistonball/) environment ([Parallel](https://pettingzoo.farama.org/api/parallel/)).

After training and evaluation, this script will launch a demo game with human rendering. Trained models are saved and loaded from disk (see SB3's [documentation](https://stable-baselines3.readthedocs.io/en/master/guide/save_format.html) for more information).

2 changes: 1 addition & 1 deletion tutorials/SB3/sb3_connect_four_action_mask.py
@@ -1,4 +1,4 @@
"""Uses Stable-Baselines3 to train agents to play Connect Four using invalid action masking.
"""Uses Stable-Baselines3 to train agents in the Connect Four environment using invalid action masking.

For information about invalid action masking in PettingZoo, see https://pettingzoo.farama.org/api/aec/#action-masking
For more information about invalid action masking in SB3, see https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html
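
As a minimal sketch of where the mask comes from (not part of this diff; the full tutorial wraps the environment for MaskablePPO), a PettingZoo classic observation carries the legal-move mask alongside the board:

```python
from pettingzoo.classic import connect_four_v3

env = connect_four_v3.env()
env.reset(seed=42)

# Classic envs return a dict observation holding both the board and the mask.
obs, reward, termination, truncation, info = env.last()
observation, action_mask = obs["observation"], obs["action_mask"]  # 1 = legal column

# With a trained sb3_contrib MaskablePPO model, the mask is passed at predict time:
# act = model.predict(observation, action_masks=action_mask, deterministic=True)[0]
```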
2 changes: 1 addition & 1 deletion tutorials/SB3/sb3_kaz_vector.py
@@ -1,4 +1,4 @@
"""Uses Stable-Baselines3 to train agents to play Knights-Archers-Zombies using SuperSuit vector envs.
"""Uses Stable-Baselines3 to train agents in the Knights-Archers-Zombies environment using SuperSuit vector envs.

This environment requires using SuperSuit's Black Death wrapper to handle agent death.
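
A minimal sketch of applying the Black Death wrapper mentioned above (not part of this diff; assumes the KAZ parallel env, and the wrapper version may differ in your SuperSuit release):

```python
import supersuit as ss
from pettingzoo.butterfly import knights_archers_zombies_v10

env = knights_archers_zombies_v10.parallel_env()
# Black Death: dead agents return zeroed observations instead of being removed,
# keeping the agent count fixed as SB3's vectorized training requires.
env = ss.black_death_v3(env)
```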

109 changes: 109 additions & 0 deletions tutorials/SB3/sb3_multiwalker_vector.py
@@ -0,0 +1,109 @@
"""Uses Stable-Baselines3 to train agents to play the Multiwalker environment using SuperSuit vector envs.

For more information, see https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html

Author: Elliot (https://github.com/elliottower)
"""
from __future__ import annotations

import glob
import os
import time

import supersuit as ss
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy

from pettingzoo.sisl import multiwalker_v9


def train_butterfly_supersuit(
    env_fn, steps: int = 10_000, seed: int | None = 0, **env_kwargs
):
    # Train a single model to control every agent in a Parallel environment.
    env = env_fn.parallel_env(**env_kwargs)

    env.reset(seed=seed)

    print(f"Starting training on {str(env.metadata['name'])}.")

    # Convert the PettingZoo Parallel env to an SB3 VecEnv, then run 8 copies
    # of it concurrently (num_cpus=2 matches GitHub Actions' 2-core runners).
    env = ss.pettingzoo_env_to_vec_env_v1(env)
    env = ss.concat_vec_envs_v1(env, 8, num_cpus=2, base_class="stable_baselines3")

    # Note: Multiwalker's observation space is a 1-dimensional vector rather
    # than an image, therefore we use an MLP policy rather than a CNN
    model = PPO(
        MlpPolicy,
        env,
        verbose=3,
        learning_rate=1e-3,
        batch_size=256,
    )

    model.learn(total_timesteps=steps)

    model.save(f"{env.unwrapped.metadata.get('name')}_{time.strftime('%Y%m%d-%H%M%S')}")

    print("Model has been saved.")

    print(f"Finished training on {str(env.unwrapped.metadata['name'])}.")

    env.close()


def eval(env_fn, num_games: int = 100, render_mode: str | None = None, **env_kwargs):
    # Evaluate a trained agent vs a random agent
    env = env_fn.env(render_mode=render_mode, **env_kwargs)

    print(
        f"\nStarting evaluation on {str(env.metadata['name'])} (num_games={num_games}, render_mode={render_mode})"
    )

    try:
        latest_policy = max(
            glob.glob(f"{env.metadata['name']}*.zip"), key=os.path.getctime
        )
    except ValueError:
        print("Policy not found.")
        exit(0)

    model = PPO.load(latest_policy)

    rewards = {agent: 0 for agent in env.possible_agents}

    # Note: we train using the Parallel API but evaluate using the AEC API.
    # SB3 models are designed for single-agent settings; we get around this
    # by using the same model for every agent.
    for i in range(num_games):
        env.reset(seed=i)

        for agent in env.agent_iter():
            obs, reward, termination, truncation, info = env.last()

            if termination or truncation:
                # Accumulate final rewards for all remaining agents, then end the game.
                for a in env.agents:
                    rewards[a] += env.rewards[a]
                break
            else:
                act = model.predict(obs, deterministic=True)[0]

            env.step(act)
    env.close()

    avg_reward = sum(rewards.values()) / len(rewards.values())
    print(f"Avg reward: {avg_reward}")
    return avg_reward


if __name__ == "__main__":
    env_fn = multiwalker_v9

    env_kwargs = {}

    # Train a model (takes ~3 minutes on a laptop CPU)
    # Note: the stochastic environment makes training difficult; hyperparameters
    # have not been fully tuned for this example
    train_butterfly_supersuit(env_fn, steps=49_152 * 4, seed=0, **env_kwargs)

    # Evaluate 10 games (takes ~10 seconds on a laptop CPU)
    eval(env_fn, num_games=10, render_mode=None, **env_kwargs)

    # Watch 2 games (takes ~10 seconds on a laptop CPU)
    eval(env_fn, num_games=2, render_mode="human", **env_kwargs)
2 changes: 1 addition & 1 deletion tutorials/SB3/sb3_pistonball_vector.py
@@ -1,4 +1,4 @@
"""Uses Stable-Baselines3 to train agents to play PettingZoo Butterfly (cooprative) environments using SuperSuit vector envs.
"""Uses Stable-Baselines3 to train agents in the Pistonball environment using SuperSuit vector envs.

For more information, see https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html

2 changes: 1 addition & 1 deletion tutorials/SB3/test_sb3_action_mask.py
@@ -1,4 +1,4 @@
"""Test file to ensure that action masking code works for all PettingZoo classic environments (except rps, which has no action mask)."""
"""Tests that action masking code works properly with all PettingZoo classic environments."""

try:
import pytest
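
The importorskip pattern referenced in the commit log can be sketched as follows (a hypothetical minimal test, not this file's actual contents):

```python
import pytest


def test_connect_four_action_mask():
    # Skip gracefully when optional dependencies are missing,
    # so a plain `pytest -v` does not require SB3 to be installed.
    pytest.importorskip("stable_baselines3")
    pytest.importorskip("sb3_contrib")

    from sb3_contrib import MaskablePPO  # imported only after the skip checks

    assert MaskablePPO is not None
```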