Add support for pretraining [feature request] #27
As mentioned in the design choices (see hill-a/stable-baselines#576), everything related to imitation learning (including GAIL and pretraining using behavior cloning) will be done outside of SB3 (most likely in this repo: https://github.com/HumanCompatibleAI/imitation by @AdamGleave et al.). Otherwise, you can check the repo https://github.com/joonaspu/video-game-behavioural-cloning by @Miffyli et al., where pretraining is done using PyTorch. We may add an example, though (and maybe include it in the zoo), as it is simple to implement in some cases.
@skervim we would be happy if you could provide such an example ;) (maybe as a Colab notebook)
With SB3, I think this should indeed be offloaded to users. SB's pretrain function was promising, but it was somewhat limiting. With SB3 we could provide interfaces to obtain a policy of the right shape for a given environment; the user can then take this policy, do their own imitation learning (e.g. supervised learning on a dataset of demonstrations), and load the resulting parameters back into the policy.
This is already the case, no?
Fair point, it is not hidden per se; one just needs to know what to access to obtain this policy. Some example code for this in the docs should do the trick :)
I'm not completely sure I am following. In the case of behavioral cloning, are you two suggesting something like the following?

```python
"""
Example code for behavioral cloning
"""
import gym

from stable_baselines3 import PPO

# Initialize environment and agent
env = gym.make("MountainCarContinuous-v0")
ppo = PPO("MlpPolicy", env)

# Extract the initial (untrained) policy
policy = ppo.policy

# Perform behavioral cloning with external code
# (external_supervised_learning and external_dataset are placeholders for user code and data)
pretrained_policy = external_supervised_learning(policy, external_dataset)

# Insert the pretrained policy back into the agent
ppo.policy = pretrained_policy

# Continue training with PPO
ppo.learn(total_timesteps=int(1e6))
```
Yes. In practice, because
FYI, my use case is that I have a custom environment and would like to pretrain an SB3 PPO agent in a simple behavioral cloning fashion, using an expert dataset that I have created for that environment. Then I would like to continue training the pretrained agent. I would gladly provide an example, as suggested by @araffin, but I'm not completely sure what it should look like. Is @AdamGleave's https://github.com/HumanCompatibleAI/imitation going to support SB3 soon? In that case, should the part that performs behavioral cloning with external code (`pretrained_policy = external_supervised_learning(policy, external_dataset)`) be implemented there, with an example then added to the SB3 documentation? Which parts are needed for such an implementation?
Am I missing anything? I would like to contribute back to the repository and work on this task, but I think I would need some hints on how to start and could benefit from the guidance of those who have already worked on this problem.
@AdamGleave is busy with the NeurIPS deadline, so it is better to just create a stand-alone example as a Colab notebook here (SB3 branch).
Usually people have their own format, but yes, the dataset creation code from SB2 can be reused (it does not depend on TF at all).
Yes, but this will normally be contained in the training loop (the SB2 code can be simplified, as we don't support GAIL).
Your 2nd and 3rd points can be merged into one, I think.
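A sketch of the dataset side, assuming continuous actions. The `.npz` keys below (`obs`, `actions`, `rewards`, `episode_returns`, `episode_starts`) follow what SB2's `generate_expert_traj` saved, so existing SB2 datasets should load the same way. SB2's `ExpertDataset` handled the train/validation split (`train_fraction`) and batching through its own `DataLoader` class; here the standard PyTorch `TensorDataset`, `random_split` and `DataLoader` stand in for it. The function names are illustrative, not part of SB3, and the data is random filler.

```python
import os
import tempfile

import numpy as np
import torch as th
from torch.utils.data import DataLoader, TensorDataset, random_split


def save_expert_dataset(path, obs, actions, rewards, episode_starts):
    """Save expert transitions with the same .npz keys as SB2's generate_expert_traj."""
    rewards = np.asarray(rewards, dtype=np.float32)
    episode_starts = np.asarray(episode_starts, dtype=bool)
    # Each True in episode_starts opens a new episode; sum rewards per episode
    episode_ids = np.cumsum(episode_starts) - 1
    episode_returns = np.bincount(episode_ids, weights=rewards)
    np.savez(path, obs=np.asarray(obs), actions=np.asarray(actions),
             rewards=rewards, episode_returns=episode_returns,
             episode_starts=episode_starts)


def make_loaders(path, train_fraction=0.8, batch_size=64, seed=0):
    """Load the archive and split (obs, action) pairs into train/val loaders."""
    with np.load(path) as data:
        obs = th.as_tensor(data["obs"], dtype=th.float32)
        actions = th.as_tensor(data["actions"], dtype=th.float32)
    dataset = TensorDataset(obs, actions)
    n_train = int(len(dataset) * train_fraction)
    train_set, val_set = random_split(
        dataset, [n_train, len(dataset) - n_train],
        generator=th.Generator().manual_seed(seed))
    return (DataLoader(train_set, batch_size=batch_size, shuffle=True),
            DataLoader(val_set, batch_size=batch_size))


# Usage with random stand-in data: two episodes of 500 steps each
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "expert_data.npz")
starts = np.zeros(1000, dtype=bool)
starts[[0, 500]] = True
save_expert_dataset(path, rng.normal(size=(1000, 2)), rng.uniform(-1, 1, size=(1000, 1)),
                    rewards=np.ones(1000), episode_starts=starts)
train_loader, val_loader = make_loaders(path)
with np.load(path) as data:
    print(data["episode_returns"])  # [500. 500.]
for batch_obs, batch_actions in train_loader:
    pass  # a behavioral-cloning update on (batch_obs, batch_actions) goes here
```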
One last thing: it is not documented yet, but policies can now be saved and loaded without a model ;). EDIT:
Alright, thanks for the clarifications.
@araffin: Glad that I could contribute, and happy to have learned something new from your improvements to the notebook :)
I want to ask something related to this. Instead of generating "expert data" after the teacher has been trained, how can I directly save the teacher's trajectories during training as the "expert data" and use that data to train my student?
I downloaded the notebook and ran it on an RTX 2070 GPU with CUDA 10.1 on Ubuntu 18.04. The whole notebook works fine except for the last cell, which evaluates the policy and raises the following error. Any hints?
The easiest way to do this would be to record states and actions in the environment, e.g. with a wrapper that keeps track of states and actions and saves them to a file once `done` is encountered.
I have no idea what could cause that, sorry :/
Thanks.
Ah, np. It seems to come from PyTorch's side.
First: I'm very happy to see the new PyTorch SB3 version! Great job!
My question is whether pretraining support is planned for SB3 (like for SB: https://stable-baselines.readthedocs.io/en/master/guide/pretrain.html). I couldn't find it mentioned in the roadmap.
In my opinion it is a very valuable feature!