https://stable-baselines.readthedocs.io/en/master/modules/policies.html
- MLP (Multi-layer perceptron)
- MlpPolicy
- Basic implementation, 2 hidden layers of 64 units
- MlpLstmPolicy
- LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!
- The Mario problem probably doesn't need long-term dependencies
- MlpLnLstmPolicy
- Same as MlpLstmPolicy but the LSTM is layer-normalized ("Ln" = layer normalization, not input normalization)
- CNN (Convolutional neural network)
- CNN policies (CnnPolicy, CnnLstmPolicy, CnnLnLstmPolicy) are for image observations only (see the sketch below)
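A minimal sketch of picking one of these policies when constructing a model (stable-baselines 2.x API; CartPole here is only a placeholder, not our Mario environment):

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# Placeholder env; swap in our wrapped Mario env
env = DummyVecEnv([lambda: gym.make('CartPole-v1')])

# MlpPolicy: the default 2x64 fully-connected network
model = PPO2('MlpPolicy', env, verbose=1)

# For image observations (raw Mario frames) we would pass 'CnnPolicy' instead:
# model = PPO2('CnnPolicy', env, verbose=1)

model.learn(total_timesteps=10000)
```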
We can customize the policy by setting parameters of the policy class (a small sketch follows the list below)
https://stable-baselines.readthedocs.io/en/master/guide/custom_policy.html
Ones we probably care about:
- n_env - (int) The number of environments to run
- n_steps - (int) The number of steps to run for each environment
- n_batch - (int) The number of batches to run (n_envs * n_steps)
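The n_env / n_steps / n_batch values are filled in by the model, so the customization we are more likely to do ourselves is the network architecture. A sketch based on the custom_policy guide, passing policy_kwargs instead of subclassing (the net_arch values are just illustrative, not a recommendation):

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make('CartPole-v1')])  # placeholder env

# net_arch: two shared layers of 128 units, then separate value (vf)
# and policy (pi) heads of 64 units each (example values only)
policy_kwargs = dict(net_arch=[128, 128, dict(vf=[64], pi=[64])])

model = PPO2('MlpPolicy', env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=10000)
```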
PPO hyper-parameters explained: https://medium.com/aureliantactics/ppo-hyperparameters-and-ranges-6fc2d29bccbe (a small sketch of setting them follows this list)
- learning_rate
- noptepochs - number of epochs when optimizing the surrogate loss
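A minimal sketch of setting these on the PPO2 constructor; the values shown are just the library defaults written out explicitly:

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make('CartPole-v1')])  # placeholder env

model = PPO2('MlpPolicy', env,
             learning_rate=2.5e-4,  # optimizer step size (PPO2 default)
             n_steps=128,           # steps per environment per update (default)
             noptepochs=4,          # epochs over each collected batch (default)
             verbose=1)
model.learn(total_timesteps=10000)
```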
There is a project called rl-zoo (rl-baselines-zoo) that provides pre-trained agents. It uses Optuna to find the best hyper-parameters for the agents, so we might want to use Optuna too (a rough tuning sketch is below).
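A rough sketch of what an Optuna search over PPO hyper-parameters could look like. The search space, training budget, and CartPole env are simplified placeholders, not rl-zoo's actual setup:

```python
import gym
import optuna
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.evaluation import evaluate_policy

def objective(trial):
    # Example search space; a real run would tune more parameters and budgets
    learning_rate = trial.suggest_loguniform('learning_rate', 1e-5, 1e-3)
    noptepochs = trial.suggest_int('noptepochs', 3, 10)

    env = DummyVecEnv([lambda: gym.make('CartPole-v1')])  # placeholder env
    model = PPO2('MlpPolicy', env, learning_rate=learning_rate,
                 noptepochs=noptepochs, verbose=0)
    model.learn(total_timesteps=20000)

    # Score the trial by mean evaluation reward
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=10)
    return mean_reward

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print(study.best_params)
```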