List of potential improvements #38

Closed · maximecb opened this issue Dec 3, 2018 · 22 comments

@maximecb (Contributor) commented Dec 3, 2018

I'm opening an issue to keep track of potential improvements to a MiniGrid v2.0. These are things I would do differently if I had to do it again. I'm not sure if these will get merged in the current version of MiniGrid because there is a big risk of breaking existing code, and this is particularly sensitive since the people using this code are using it for research purposes, where replicability is important. What may eventually happen is that I create a MiniGrid2 repository.

  1. Separate out the gridworld from the OpenAI Gym interface. As pointed out in Multi-agent extension? #37, it would make sense to have a gridworld class separate from the OpenAI Gym environment class. This would help us support multi-agent setups. It might also be slightly cleaner. We are already part of the way there with the current Grid class.

  2. The agent should be treated like any other gridworld object. This again goes in line with multi-agent support. I think it would also be cleaner.

  3. Observations should be encoded using a one-hot scheme rather than integers corresponding to each object type and color. The observations may not actually be any bigger in terms of bytes if we use a NumPy array of bools (bits), and this would likely be easier for a neural network to decode; a minimal sketch follows this list. (Won't be done, results inconclusive)

  4. By default, observations should be tensors, so that the code works out of the box with most RL frameworks. Mission strings should be provided as part of the info dict. The unfortunate truth is that OpenAI Gym has very poor support for any other kind of observation encoding.

  5. Doors and locked doors should probably have the same object type. We could take advantage of a one-hot scheme here and have a "door" bit as well as a "locked" flag bit. Possibly, each object should have a method to produce its own encoding in terms of bits. (Done)

  6. Some people have had difficulty installing PyQT. It might be nice to explore alternative rendering options. We could potentially generate graphics directly using 2D NumPy tensors, but we would still need some library to open display windows, and I don't know what the simplest Python option for that is. The goal is for this package to work everywhere with as few issues as possible.

  7. Render the current step, max steps, and reward obtained in env.render('human').

  8. Rename max_steps to max_episode_steps, which is more in keeping with OpenAI Gym conventions.
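
A minimal sketch of the one-hot idea in item 3, assuming a 7x7 view and an illustrative number of codes; this is not existing MiniGrid API:

```python
import numpy as np

num_types = 11                                        # illustrative count of codes
idx_grid = np.random.randint(num_types, size=(7, 7))  # integer-coded view

# (7, 7, num_types) boolean one-hot version; stored as bits, this need
# not take more space than the integer encoding
onehot = np.eye(num_types, dtype=bool)[idx_grid]
```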

Other suggestions welcome.

@d3sm0 (Contributor) commented Dec 4, 2018

How about stochastic transitions and maybe stochastic rewards? And maybe a better test suite?

@maximecb (Contributor, Author) commented Dec 4, 2018

I think stochastic transitions and rewards are already possible using the random number generation methods.
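
For instance, something along these lines should work; StickyEmptyEnv and p are made up for illustration, and this assumes the self.np_random that seeded Gym envs carry:

```python
from gym_minigrid.envs import EmptyEnv

class StickyEmptyEnv(EmptyEnv):
    """With probability p, repeat the previous action instead of the chosen one."""

    def __init__(self, p=0.25, **kwargs):
        super().__init__(**kwargs)
        self.p = p
        self.last_action = None

    def step(self, action):
        # stochastic transition: sometimes ignore the agent's choice
        if self.last_action is not None and self.np_random.uniform() < self.p:
            action = self.last_action
        self.last_action = action
        return super().step(action)
```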

For a better test suite, I'm open to it, but would need more concrete suggestions.

@d3sm0 (Contributor) commented Dec 4, 2018

I'll think more about it. I've never written one, but I'd be happy to take care of it.

A colleague of mine was asking about the possibility of creating envs from a text file. This was done for a long time before Gym, and it could be useful for specific research.
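
Something like this, maybe; the character mapping ('W' for wall, 'G' for goal) is just a guess at a format, though Grid, Wall, and Goal are existing MiniGrid classes:

```python
from gym_minigrid.minigrid import Grid, Wall, Goal

CHAR_TO_OBJ = {'W': Wall, 'G': Goal}  # hypothetical mapping

def grid_from_text(text):
    rows = text.strip().splitlines()
    grid = Grid(len(rows[0]), len(rows))
    for y, row in enumerate(rows):
        for x, ch in enumerate(row):
            obj = CHAR_TO_OBJ.get(ch)
            if obj is not None:
                grid.set(x, y, obj())
    return grid
```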

I would also like to have a benchmark, or a leaderboard if you want, over a subset of all the envs, like the top 10 that are most difficult.

@maximecb (Contributor, Author) commented Dec 4, 2018

It would make sense to open issues for all 3 of these things to discuss what's involved: for tests, what tests make sense to have; for creating envs from text files, the format needs to be discussed. For the leaderboard, I don't know exactly how that would work. One issue is that the maximum reward you can get is 1.0. Maybe if it was averaged over 100 runs?

@d3sm0 (Contributor) commented Dec 6, 2018

Maybe let's gather a list of potential improvements/changes and desiderata for the 2.0, and with that we can split up the issues and start working? At least, that's what I understood this thread to be about. If not, sorry, I'll open an issue for each of those.

Besides, there are two more things I was wondering about: we could render some stats like the current step, reward, and action, and maybe the value of the state. Especially the latter would be super useful for debugging methods.

@maximecb (Contributor, Author) commented Dec 6, 2018

Some of these improvements we can integrate in the current MiniGrid because they are not at risk of breaking existing code. We could already improve the tests, for instance. Creating envs from a text file too, possibly. The things that are the most risky are changing the APIs that people call when interacting with MiniGrid and changing the observation format.

We could definitely render the current state, reward and action. The value of the state I'm not sure about, because that's dependent on the RL implementation you're using.

@abaisero (Contributor) commented Dec 7, 2018

Hello, just throwing in my 2c since I've taken interest in this project recently:

> Separate out the gridworld from the OpenAI Gym interface. As pointed out in #37, it would make sense to have a gridworld class separate from the OpenAI Gym environment class. This would help us support multi-agent setups. It might also be slightly cleaner. We are already part of the way there with the current Grid class.

While OpenAI Gym is technically designed with single-agent environments in mind, there is nothing stopping people from using the same library/interface to handle multi-agent environments. E.g. env.step receives a tuple of actions and returns a tuple of observations. At least, this is how I'm planning to handle this scenario.
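
A rough sketch of what I mean; MultiAgentGridEnv and its internals are hypothetical placeholders, not anything in MiniGrid:

```python
class MultiAgentGridEnv:
    """Sketch: step() takes one action per agent and returns per-agent tuples."""

    def __init__(self, num_agents=2):
        self.num_agents = num_agents

    def _step_agent(self, i, action):
        # placeholder for per-agent transition logic
        obs, reward = None, 0.0
        return obs, reward

    def step(self, actions):
        assert len(actions) == self.num_agents
        results = [self._step_agent(i, a) for i, a in enumerate(actions)]
        obss, rewards = zip(*results)
        done = False  # placeholder termination condition
        return obss, rewards, done, {}
```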

> The agent should be treated like any other gridworld object. This again goes in line with multi-agent support. I think it would also be cleaner.

+1

> Observations should be encoded using a one-hot scheme rather than integers corresponding to each object type and color. The observations may not actually be any bigger in terms of bytes if we use a NumPy array of bools (bits). This would likely be easier for a neural network to decode.

I don't think this is necessary. Yes, one should obviously use one-hot encodings (categorical indices are not features!), but this can happen outside of the environment, e.g. at the input of whatever model you want to use them with (with PyTorch, one can just use torch.nn.Embedding.from_pretrained). On the other hand, having the index version makes some things easier: to debug or inspect, one can print the observation map knowing what each index represents.
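
For example, a frozen identity-matrix embedding turns integer indices into one-hot features at the model input (the number of object types here is illustrative):

```python
import torch

num_types = 11  # illustrative count of object-type indices
onehot = torch.nn.Embedding.from_pretrained(torch.eye(num_types), freeze=True)

obs = torch.randint(num_types, (7, 7))  # integer-coded observation grid
features = onehot(obs)                  # shape (7, 7, num_types), one-hot
```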

> Doors and locked doors should probably have the same object type. We could take advantage of a one-hot scheme here and have a "door" bit as well as a "locked" flag bit. Possibly, each object should have a method to produce its own encoding in terms of bits.

Absolutely agree that they should be the same object type. What I'm not sure about is: why isn't the state of the object completely contained in the third channel? E.g. for doors, 0 = open, 1 = closed, 2 = locked. This could be used to contain the state of other types of objects too. That leaves some open questions, e.g. would each object have its own "state" encoding? For now only doors have this type of state, but I don't see a reason why this channel shouldn't be used more broadly.
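
A sketch of what I mean for doors; OBJECT_TO_IDX and COLOR_TO_IDX exist in gym_minigrid.minigrid, but this helper and its status convention (0 = open, 1 = closed, 2 = locked) are just illustrative:

```python
from gym_minigrid.minigrid import OBJECT_TO_IDX, COLOR_TO_IDX

def encode_door(color, is_open, is_locked):
    """Encode a door as (type index, color index, state index)."""
    if is_open:
        state = 0
    elif is_locked:
        state = 2
    else:
        state = 1
    return (OBJECT_TO_IDX['door'], COLOR_TO_IDX[color], state)

print(encode_door('red', is_open=False, is_locked=True))  # -> (type_idx, color_idx, 2)
```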

@d3sm0 (Contributor) commented Dec 8, 2018

> It would make sense to open issues for all 3 of these things to discuss what's involved: for tests, what tests make sense to have; for creating envs from text files, the format needs to be discussed. For the leaderboard, I don't know exactly how that would work. One issue is that the maximum reward you can get is 1.0. Maybe if it was averaged over 100 runs?

For the leaderboard I'm thinking more of a page with the final scores for the top 10 hardest environments and the learning curves of the top 3 algorithms: PPO, A2C, DQN. It should serve as a reference if someone wants to benchmark their algorithm on these environments against the basic versions. We could pick the OpenAI/baselines implementations and just run MiniGrid, or implement DQN in the linked repo in PyTorch.

About tests: if we are going to break the env in the upcoming weeks, I would rather work on them in the 2.0.

About text files, I'm going to open an issue :)

@d3sm0 (Contributor) commented Dec 8, 2018

> Some of these improvements we can integrate in the current MiniGrid because they are not at risk of breaking existing code. We could already improve the tests, for instance. Creating envs from a text file too, possibly. The things that are the most risky are changing the APIs that people call when interacting with MiniGrid and changing the observation format.

> We could definitely render the current state, reward and action. The value of the state I'm not sure about, because that's dependent on the RL implementation you're using.

Sounds fair. How about a debug mode with an optional string or variable to render?

@maximecb (Contributor, Author) commented Dec 9, 2018

> While OpenAI Gym is technically designed with single-agent environments in mind, there is nothing stopping people from using the same library/interface to handle multi-agent environments. E.g. env.step receives a tuple of actions and returns a tuple of observations. At least, this is how I'm planning to handle this scenario.

@abaisero That's a fair point.

> I don't think this is necessary. Yes, one should obviously use one-hot encodings (categorical indices are not features!), but this can happen outside of the environment, e.g. at the input of whatever model you want to use them with (with PyTorch, one can just use torch.nn.Embedding.from_pretrained). On the other hand, having the index version makes some things easier: to debug or inspect, one can print the observation map knowing what each index represents.

We do have a Grid.decode function to translate back from observations, though. I'd be curious to see how the training performance differs with a one-hot encoding. Would be happy if someone benchmarked it.

> For the leaderboard I'm thinking more of a page with the final scores for the top 10 hardest environments and the learning curves of the top 3 algorithms: PPO, A2C, DQN. It should serve as a reference if someone wants to benchmark their algorithm on these environments against the basic versions.

@d3sm0 If you have the time to do this kind of study it would be awesome.

> We could pick the OpenAI/baselines implementations and just run MiniGrid, or implement DQN in the linked repo in PyTorch.

Would be happy to have sample code using OpenAI baselines in this repo, to serve as a reference, though we would then ideally want to maintain this code afterwards.

> About tests: if we are going to break the env in the upcoming weeks, I would rather work on them in the 2.0.

After thinking about this some more, I think we may actually want to make the changes in this repo. The reason being, no code is ever perfect; we will likely need to make yet more changes in the future. Creating a new repo would just be kicking the can down the road, and people might be confused if there were multiple repos named MiniGrid. We will try to be careful not to break too many things at once, and avoid unnecessary changes to outward-facing APIs.

@d3sm0 (Contributor) commented Dec 9, 2018

> > While OpenAI Gym is technically designed with single-agent environments in mind, there is nothing stopping people from using the same library/interface to handle multi-agent environments. E.g. env.step receives a tuple of actions and returns a tuple of observations. At least, this is how I'm planning to handle this scenario.

> @abaisero That's a fair point.

> > I don't think this is necessary. Yes, one should obviously use one-hot encodings (categorical indices are not features!), but this can happen outside of the environment, e.g. at the input of whatever model you want to use them with (with PyTorch, one can just use torch.nn.Embedding.from_pretrained). On the other hand, having the index version makes some things easier: to debug or inspect, one can print the observation map knowing what each index represents.

I do agree that the one-hot encoding should be left to the agent because it is an architectural choice, but as we have seen with the FullyObsWrapper, the numbers inside the encoding can change the learning results. Maybe a wrapper, then?
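
Something along these lines, maybe; this assumes the standard dict observation with an integer 'image' field, and OneHotObsWrapper is hypothetical, not an existing MiniGrid wrapper (a real version should also update observation_space):

```python
import gym
import numpy as np

class OneHotObsWrapper(gym.core.ObservationWrapper):
    """One-hot encode each integer channel of the 'image' observation."""

    def __init__(self, env, num_classes=11):  # num_classes is illustrative
        super().__init__(env)
        self.num_classes = num_classes

    def observation(self, obs):
        img = obs['image']  # (W, H, C) integer channels
        # one-hot each channel, then stack the planes along the last axis
        planes = [np.eye(self.num_classes, dtype=np.uint8)[img[..., c]]
                  for c in range(img.shape[-1])]
        return dict(obs, image=np.concatenate(planes, axis=-1))
```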

> We do have a Grid.decode function to translate back from observations, though. I'd be curious to see how the training performance differs with a one-hot encoding. Would be happy if someone benchmarked it.

> > For the leaderboard I'm thinking more of a page with the final scores for the top 10 hardest environments and the learning curves of the top 3 algorithms: PPO, A2C, DQN. It should serve as a reference if someone wants to benchmark their algorithm on these environments against the basic versions.

> @d3sm0 If you have the time to do this kind of study it would be awesome.

Indeed I do, because I'm using this for my own research and for the repr challenge. But the algorithms are my own, which is not ideal. Baselines has changed a lot and will keep changing. Eventually we would just need to update the results from time to time, without maintaining the codebase, and run a script that collects the results and renders an HTML file with them.

> > We could pick the OpenAI/baselines implementations and just run MiniGrid, or implement DQN in the linked repo in PyTorch.

> Would be happy to have sample code using OpenAI baselines in this repo, to serve as a reference, though we would then ideally want to maintain this code afterwards.

> > About tests: if we are going to break the env in the upcoming weeks, I would rather work on them in the 2.0.

> After thinking about this some more, I think we may actually want to make the changes in this repo. The reason being, no code is ever perfect; we will likely need to make yet more changes in the future. Creating a new repo would just be kicking the can down the road, and people might be confused if there were multiple repos named MiniGrid. We will try to be careful not to break too many things at once, and avoid unnecessary changes to outward-facing APIs.

Fair. Maybe we can just branch this, so whoever is using it stays stable on master, and we iterate in the branch, starting from the proposed core changes.

@maximecb (Contributor, Author) commented Dec 9, 2018

> I do agree that the one-hot encoding should be left to the agent because it is an architectural choice, but as we have seen with the FullyObsWrapper, the numbers inside the encoding can change the learning results. Maybe a wrapper, then?

I like to make things as easy and as optimal out of the box as possible for newcomers. I would like to benchmark the training performance of a one-hot embedding vs the class indices we have now, and decide based on that. If the one-hot scheme trains noticeably faster, we might want to make it the default.

> Indeed I do, because I'm using this for my own research and for the repr challenge. But the algorithms are my own, which is not ideal. Baselines has changed a lot and will keep changing. Eventually we would just need to update the results from time to time, without maintaining the codebase, and run a script that collects the results and renders an HTML file with them.

If you write such a script I will merge it. What is the repr challenge?

@d3sm0 (Contributor) commented Dec 10, 2018

> > I do agree that the one-hot encoding should be left to the agent because it is an architectural choice, but as we have seen with the FullyObsWrapper, the numbers inside the encoding can change the learning results. Maybe a wrapper, then?

> I like to make things as easy and as optimal out of the box as possible for newcomers. I would like to benchmark the training performance of a one-hot embedding vs the class indices we have now, and decide based on that. If the one-hot scheme trains noticeably faster, we might want to make it the default.

Fair. I think this can be related to #39 and #26. I'm not confident that neural networks learn the same pattern whether the input is one-hot or real-valued. Eventually they should.

> > Indeed I do, because I'm using this for my own research and for the repr challenge. But the algorithms are my own, which is not ideal. Baselines has changed a lot and will keep changing. Eventually we would just need to update the results from time to time, without maintaining the codebase, and run a script that collects the results and renders an HTML file with them.

> If you write such a script I will merge it. What is the repr challenge?

This is the reproducibility challenge (repr): https://www.cs.mcgill.ca/~jpineau/ICLR2019-ReproducibilityChallenge.html

Alright, I'll make the script. I'll open an issue for this, since it's not clear what the script should do.

@maximecb (Contributor, Author) commented Dec 14, 2018

@d3sm0 If you'd like to benchmark the one-hot encoding, I quickly hacked it together in the maximecb-onehot branch. I'd like to try this on non-trivial environments (e.g. harder than the empty ones) and measure the time it takes to reach some success rate, as well as the max success rate.

@maximecb (Contributor, Author)

I've done some benchmarks of my own on the BabyAI GoTo and PutNextLocal environments, and the results are inconclusive for the one-hot change.

@d3sm0 (Contributor) commented Dec 17, 2018

@maximecb Sweet, thanks! What do you mean by inconclusive? What are you looking at? Only the cumulative reward?

@maximecb (Contributor, Author)

@d3sm0 I mean that on some levels the success rate was very slightly higher, and on others it was worse. Success rate is measured as the percentage of episodes ending with a positive reward.

@d3sm0 (Contributor) commented Dec 18, 2018

@maximecb I see. If you change the representation of the state, in principle the function approximator should behave differently, so maybe it's worth looking at the moments of the gradients of the neural network, the smoothness of the losses you are considering, and the cumulative reward over time during training.

I might run some experiments too over the weekend.

@maximecb (Contributor, Author) commented Jan 8, 2019

I've pushed a commit that unifies Door and LockedDoor. This simplifies some code.

@TheodoreGalanos

Hello! I was wondering if it is possible to assign a custom, arbitrary shape to the grid world. I'm not sure if that is available already (apologies if it is!), but it would make for a nice addition. I plan to use RL for assessing occupant satisfaction, and custom shapes would allow different layouts to be tested; this would make MiniGrid a nice tool for my use case, although I'm not sure how useful it'd be for others.

Kind regards,
Theodore.

@maximecb (Contributor, Author) commented Jan 29, 2019

@TheodoreGalanos Hi Theodore. Right now you can have square grids of any size. I assume you mean support for environments that are rectangular, but not necessarily square? I will try to get that working today.

@maximecb (Contributor, Author)

@TheodoreGalanos it's done. See the RedBlueDoors environment for an example: https://github.com/maximecb/gym-minigrid/blob/master/gym_minigrid/envs/redbluedoors.py
