Skip to content

Clean baseline implementation of PPO using an episodic TransformerXL memory

License

Notifications You must be signed in to change notification settings

MarcoMeter/episodic-transformer-memory-ppo

Repository files navigation

TransformerXL as Episodic Memory in Proximal Policy Optimization

This repository features a PyTorch based implementation of PPO using TransformerXL (TrXL). Its intention is to provide a clean baseline/reference implementation on how to successfully employ memory-based agents using Transformers and PPO.

Features

  • Episodic Transformer Memory
    • TransformerXL (TrXL)
    • Gated TransformerXL (GTrXL)
  • Environments
    • Proof-of-concept Memory Task (PocMemoryEnv)
    • CartPole
      • Masked velocity
    • Minigrid Memory
      • Visual Observation Space 3x84x84
      • Egocentric Agent View Size 3x3 (default 7x7)
      • Action Space: forward, rotate left, rotate right
    • MemoryGym
      • Mortar Mayhem
      • Mystery Path
      • Searing Spotlights (WIP)
  • Tensorboard
  • Enjoy (watch a trained agent play)

Citing this work

@article{pleines2023trxlppo,
  title = {TransformerXL as Episodic Memory in Proximal Policy Optimization},
  author = {Pleines, Marco and Pallasch, Matthias and Zimmer, Frank and Preuss, Mike},
  journal= {Github Repository},
  year = {2023},
  url = {https://github.com/MarcoMeter/episodic-transformer-memory-ppo}
}

Contents

Installation

Install PyTorch 1.12.1 depending on your platform. We recommend the usage of Anaconda.

Create Anaconda environment:

conda create -n transformer-ppo python=3.7 --yes
conda activate transformer-ppo

CPU:

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cpuonly -c pytorch

CUDA:

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge

Install the remaining requirements and you are good to go:

pip install -r requirements.txt

Train a model

The training is launched via train.py. --config specifies the path to the yaml config file featuring hyperparameters. The --run-id is used to distinguish training runs. After training, the trained model will be saved to ./models/$run-id$.nn.

python train.py --config configs/minigrid.yaml --run-id=my-trxl-training

Enjoy a model

To watch an agent exploit its trained model, execute enjoy.py. Some pre-trained models can be found in: ./models/. The to-be-enjoyed model is specified using the --model flag.

python enjoy.py --model=models/mortar_mayhem_grid_trxl.nn

Episodic Transformer Memory Concept

transformer-xl-model

Hyperparameters

Episodic Transformer Memory

Hyperparameter Description
num_blocks Number of transformer blocks
embed_dim Embedding size of every layer inside a transformer block
num_heads Number of heads used in the transformer's multi-head attention mechanism
memory_length Length of the sliding episodic memory window
positional_encoding Relative and learned positional encodings can be used
layer_norm Whether to apply layer normalization before or after every transformer component. Pre layer normalization refers to the identity map re-ordering.
gtrxl Whether to use Gated TransformerXL
gtrxl_bias Initial value for GTrXL's bias weight

General

gamma Discount factor
lamda Regularization parameter used when calculating the Generalized Advantage Estimation (GAE)
updates Number of cycles that the entire PPO algorithm is being executed
n_workers Number of environments that are used to sample training data
worker_steps Number of steps an agent samples data in each environment (batch_size = n_workers * worker_steps)
epochs Number of times that the whole batch of data is used for optimization using PPO
n_mini_batch Number of mini batches that are trained throughout one epoch
value_loss_coefficient Multiplier of the value function loss to constrain it
hidden_layer_size Number of hidden units in each linear hidden layer
max_grad_norm Gradients are clipped by the specified max norm

Schedules

These schedules can be used to polynomially decay the learning rate, the entropy bonus coefficient and the clip range.

learning_rate_schedule The learning rate used by the AdamW optimizer
beta_schedule Beta is the entropy bonus coefficient that is used to encourage exploration
clip_range_schedule Strength of clipping losses done by the PPO algorithm

Add Environment

Follow these steps to train another environment:

  1. Implement a wrapper of your desired environment. It needs the properties observation_space, action_space and max_episode_steps. The needed functions are render(), reset() and step.
  2. Extend the create_env() function in utils.py by adding another if-statement that queries the environment's "type"
  3. Adjust the "type" and "name" key inside the environment's yaml config

Note that only environments with visual or vector observations are supported. Concerning the environment's action space, it can be either discrte or multi-discrete.

Tensorboard

During training, tensorboard summaries are saved to summaries/run-id/timestamp.

Run tensorboad --logdir=summaries to watch the training statistics in your browser using the URL http://localhost:6006/.

Results

Every experiment is repeated on 5 random seeds. Each model checkpoint is evaluated on 50 unknown environment seeds, which are repeated 5 times. Hence, one data point aggregates 1250 (5x5x50) episodes. Rliable is used to retrieve the interquartile mean and the bootstrapped confidence interval. The training is conducted using the more sophisticated DRL framework neroRL. The clean GRU-PPO baseline can be found here.

Mystery Path Grid (Goal & Origin Hidden)

mpg_off_results

TrXL and GTrXL have identical performance. See Issue #7.

Mortar Mayhem Grid (10 commands)

mmg_10_results