A relatively simple implementation of the Vanilla Policy Gradient (VPG) algorithm from https://spinningup.openai.com/en/latest/algorithms/vpg.html, trained on Gymnasium's Lunar Lander environment.
In Lunar Lander, an agent tries to land a spacecraft inside the goal area (marked by the flags) but is penalized for every timestep it fires its thrusters and for crashing into the moon's surface. To add some randomness, the spacecraft starts each episode already moving in a random direction.
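For reference, a minimal sketch of instantiating the environment with Gymnasium (the environment id may be "LunarLander-v2" or "LunarLander-v3" depending on the Gymnasium version, and the box2d extra must be installed):

```python
# Minimal sketch of creating and stepping the Lunar Lander environment.
import gymnasium as gym

env = gym.make("LunarLander-v2")
obs, info = env.reset(seed=0)       # 8-dim observation: position, velocity, angle, leg contacts
action = env.action_space.sample()  # 4 discrete actions: no-op, left, main, right engine
obs, reward, terminated, truncated, info = env.step(action)
env.close()
```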
Uses a few "tricks" that are not in the original VPG implementation but are commonly used in PPO (see the sketch after this list):
- Entropy loss to promote exploration
- Learning rate annealing
- Orthogonal initialization of weights
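A minimal, self-contained sketch of what those three tricks look like in PyTorch; the network sizes, coefficients, and dummy batch below are placeholders, not the repo's actual values:

```python
# Illustrative sketch (not the repo's code): orthogonal weight init,
# an entropy bonus in the policy loss, and linear learning-rate annealing.
import torch
import torch.nn as nn
from torch.distributions import Categorical

def layer_init(layer, std=2**0.5, bias_const=0.0):
    # Orthogonal initialization, as used in many PPO implementations.
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer

policy = nn.Sequential(
    layer_init(nn.Linear(8, 64)), nn.Tanh(),
    layer_init(nn.Linear(64, 64)), nn.Tanh(),
    layer_init(nn.Linear(64, 4), std=0.01),  # small init for the action head
)

initial_lr, ent_coef, total_updates = 3e-4, 0.01, 1000
optimizer = torch.optim.Adam(policy.parameters(), lr=initial_lr)

for update in range(total_updates):
    # Linear learning-rate annealing from the initial lr down to 0.
    frac = 1.0 - update / total_updates
    for group in optimizer.param_groups:
        group["lr"] = frac * initial_lr

    # Dummy batch standing in for collected trajectories.
    obs = torch.randn(32, 8)
    actions = torch.randint(0, 4, (32,))
    advantages = torch.randn(32)

    dist = Categorical(logits=policy(obs))
    pg_loss = -(dist.log_prob(actions) * advantages).mean()
    # Entropy bonus: subtracting it from the loss encourages a
    # higher-entropy (more exploratory) policy.
    loss = pg_loss - ent_coef * dist.entropy().mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```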
The model in vpg.pth was trained for 2,000,000 steps, but the average reward stopped improving past ~400,000 steps,
which is probably due to a known issue with VPG: getting trapped in local optima.
Install the dependencies:
pip install poetry
poetry shell
poetry install
Run a training loop:
python lunarlander_train.py
Hyperparameters are managed by Hydra in /config/vpg.cfg
but can be overridden on the command line:
python lunarlander_train.py epochs=20
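As a rough sketch of how a Hydra-managed entry point consumes the config and accepts such overrides (the config directory and field names like epochs and lr here are assumptions, not the actual contents of /config/vpg.cfg):

```python
# Hypothetical Hydra entry point; field names are illustrative only.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="config", config_name="vpg", version_base=None)
def main(cfg: DictConfig) -> None:
    # Any config value can be overridden on the command line,
    # e.g. `python lunarlander_train.py epochs=20`.
    print(f"training for {cfg.epochs} epochs with lr={cfg.lr}")

if __name__ == "__main__":
    main()
```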
Watch a saved model interact with the environment:
python lunarlander_eval.py vpg.pth
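For illustration, a hypothetical version of such an evaluation loop, assuming vpg.pth holds a policy state_dict matching the network layout from the sketch above (the real lunarlander_eval.py and checkpoint format may differ):

```python
# Hypothetical evaluation sketch: load a saved policy and render one episode.
import sys
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Assumed network layout; must match whatever produced the checkpoint.
policy = nn.Sequential(
    nn.Linear(8, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 4),
)
policy.load_state_dict(torch.load(sys.argv[1], map_location="cpu"))
policy.eval()

env = gym.make("LunarLander-v2", render_mode="human")
obs, _ = env.reset()
done = False
while not done:
    with torch.no_grad():
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
    action = Categorical(logits=logits).sample().item()
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()
```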