A relatively simple implementation of the Vanilla Policy Gradient (VPG) algorithm from https://spinningup.openai.com/en/latest/algorithms/vpg.html, trained on Gymnasium's Lunar Lander environment.
In Lunar Lander, an agent tries to land a spacecraft inside the goal area (marked by the flags) but is penalized for every timestep it fires its thrusters and for crashing into the moon's surface. To add some randomness, the spacecraft starts each episode already moving in a random direction.
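For reference, a minimal sketch of instantiating the environment with Gymnasium (the environment id may be "LunarLander-v2" or "LunarLander-v3" depending on the Gymnasium version, and the box2d extra must be installed):

```python
# Minimal sketch of creating and stepping the Lunar Lander environment.
import gymnasium as gym

env = gym.make("LunarLander-v2")
obs, info = env.reset(seed=0)       # 8-dim observation: position, velocity, angle, leg contacts
action = env.action_space.sample()  # 4 discrete actions: no-op, left, main, right engine
obs, reward, terminated, truncated, info = env.step(action)
env.close()
```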
Uses a few "tricks" that are not in the original VPG implementation but are commonly used in PPO (see the sketch after this list):
- Entropy loss to promote exploration
- Learning rate annealing
- Orthogonal initialization of weights
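A minimal, self-contained sketch of what those three tricks look like in PyTorch; the network sizes, coefficients, and dummy batch below are placeholders, not the repo's actual values:

```python
# Illustrative sketch (not the repo's code): orthogonal weight init,
# an entropy bonus in the policy loss, and linear learning-rate annealing.
import torch
import torch.nn as nn
from torch.distributions import Categorical

def layer_init(layer, std=2**0.5, bias_const=0.0):
    # Orthogonal initialization, as used in many PPO implementations.
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer

policy = nn.Sequential(
    layer_init(nn.Linear(8, 64)), nn.Tanh(),
    layer_init(nn.Linear(64, 64)), nn.Tanh(),
    layer_init(nn.Linear(64, 4), std=0.01),  # small init for the action head
)

initial_lr, ent_coef, total_updates = 3e-4, 0.01, 1000
optimizer = torch.optim.Adam(policy.parameters(), lr=initial_lr)

for update in range(total_updates):
    # Linear learning-rate annealing from the initial lr down to 0.
    frac = 1.0 - update / total_updates
    for group in optimizer.param_groups:
        group["lr"] = frac * initial_lr

    # Dummy batch standing in for collected trajectories.
    obs = torch.randn(32, 8)
    actions = torch.randint(0, 4, (32,))
    advantages = torch.randn(32)

    dist = Categorical(logits=policy(obs))
    pg_loss = -(dist.log_prob(actions) * advantages).mean()
    # Entropy bonus: subtracting it from the loss encourages a
    # higher-entropy (more exploratory) policy.
    loss = pg_loss - ent_coef * dist.entropy().mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```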
The model in vpg.pth was trained for 2,000,000 steps, but the average reward stopped improving past ~400,000 steps,
which is probably due to a known issue with VPG: getting trapped in local optima.
Install the dependencies:
pip install poetry
poetry shell
poetry install
Run a training loop:
python lunarlander_train.py
Hyperparameters are managed by Hydra in /config/vpg.cfg
but can be overridden on the command line:
python lunarlander_train.py epochs=20
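As a rough sketch of how a Hydra-managed entry point consumes the config and accepts such overrides (the config directory and field names like epochs and lr here are assumptions, not the actual contents of /config/vpg.cfg):

```python
# Hypothetical Hydra entry point; field names are illustrative only.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="config", config_name="vpg", version_base=None)
def main(cfg: DictConfig) -> None:
    # Any config value can be overridden on the command line,
    # e.g. `python lunarlander_train.py epochs=20`.
    print(f"training for {cfg.epochs} epochs with lr={cfg.lr}")

if __name__ == "__main__":
    main()
```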
Watch a saved model interact with the environment:
python lunarlander_eval.py vpg.pth
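For illustration, a hypothetical version of such an evaluation loop, assuming vpg.pth holds a policy state_dict matching the network layout from the sketch above (the real lunarlander_eval.py and checkpoint format may differ):

```python
# Hypothetical evaluation sketch: load a saved policy and render one episode.
import sys
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Assumed network layout; must match whatever produced the checkpoint.
policy = nn.Sequential(
    nn.Linear(8, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 4),
)
policy.load_state_dict(torch.load(sys.argv[1], map_location="cpu"))
policy.eval()

env = gym.make("LunarLander-v2", render_mode="human")
obs, _ = env.reset()
done = False
while not done:
    with torch.no_grad():
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
    action = Categorical(logits=logits).sample().item()
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()
```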