
Skipping observation in multi agent env #6757

Closed
nicofirst1 opened this issue Jan 9, 2020 · 7 comments
Labels
question Just a question :)

Comments


nicofirst1 commented Jan 9, 2020

Describe your feature request

I am working on an implementation of the werewolf game using the RLlib wrapper for gym multi-agent envs. In this game there are wolves and villagers.

The game is divided into a night phase and a day phase.
During the day every agent can perform an action, while during the night only the wolves can.
In particular, night observations should not be visible to villager agents.
I have an observation field which specifies the current phase, and I would like to filter out the night observations for the villagers.
Is there a way to implement this easily?

What I have tried

I tried modifying the _process_observations function by adding a line after line 403. Using a custom Preprocessor I am able to return None when the current observation should be discarded (given an agent id). Then, if the processed observation is None, the step is simply skipped with:

if prep_obs is None:
    continue

I don't know if this implementation is conceptually correct or if there is another way to do it.
Please let me know.
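For reference, this is roughly what the custom Preprocessor hack looks like (a sketch only: the class name, phase encoding and flattening below are placeholders, and returning None only "works" together with the continue patch mentioned above):

import numpy as np
from ray.rllib.models.preprocessors import Preprocessor

NIGHT_PHASES = (2, 3)  # placeholder encoding of the night phases


class NightSkipPreprocessor(Preprocessor):
    """Sketch: flatten day observations, return None for night observations
    so the patched _process_observations can skip them."""

    def _init_shape(self, obs_space, options):
        # Fixed flat size expected by the villager policy (32 in this env).
        return (32,)

    def transform(self, observation):
        # The observation is assumed to be a dict with "day", "phase",
        # "status_map" and "targets" entries.
        if observation["phase"] in NIGHT_PHASES:
            return None  # only meaningful with the `continue` patch above
        return np.concatenate([
            np.array([observation["day"], observation["phase"]], dtype=np.float32),
            np.asarray(observation["status_map"], dtype=np.float32),
            np.asarray(observation["targets"], dtype=np.float32).ravel(),
        ])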

Edit 1

Applying the previous method yields:
ValueError: The environment terminated for all agents, but we still don't have a last observation for agent villager_2 (policy vill_p). Please ensure that you include the last observations of all live agents when setting '__all__' done to True. Alternatively, set no_done_at_end=True to allow this.
In here.

@nicofirst1 nicofirst1 added the enhancement Request for new feature and/or capability label Jan 9, 2020

ericl commented Jan 9, 2020

I think you should be able to model this using the multi-agent API without any changes to rllib.

In your MultiAgentEnv class

  1. during day phase: obs dict with all agent ids as keys is emitted. All agents return actions.
  2. during night phase: obs dict with only wolf agent ids is emitted. Only wolves return actions.
  3. termination: include an empty obs for all agent ids when setting all done

Does that work?
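Something like this, perhaps (sketch only: the agent ids, phase handling and dummy game logic are placeholders, not the actual rl-werewolf code):

import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class WerewolfSketchEnv(MultiAgentEnv):
    """Day obs go to every agent, night obs only to the wolves,
    and a last obs goes to all agents when '__all__' is done."""

    def __init__(self, config=None):
        self.wolves = ["werewolf_0", "werewolf_1"]
        self.villagers = ["villager_2", "villager_3", "villager_4"]
        self.agents = self.wolves + self.villagers
        self.phase, self.t = "day", 0

    def _obs(self, agent_id):
        return np.zeros(32, dtype=np.float32)  # dummy fixed-size observation

    def reset(self):
        self.phase, self.t = "day", 0
        return {aid: self._obs(aid) for aid in self.agents}

    def step(self, action_dict):
        self.t += 1
        self.phase = "night" if self.phase == "day" else "day"  # dummy phase logic
        done = self.t >= 20  # dummy terminal condition

        if done:
            # 3. termination: emit a last obs (and done) for *every* agent,
            # otherwise RLlib raises the "no last observation" ValueError.
            acting = self.agents
        elif self.phase == "night":
            # 2. night phase: only the wolves receive an obs and act next.
            acting = self.wolves
        else:
            # 1. day phase: every agent receives an obs and acts next.
            acting = self.agents

        obs = {aid: self._obs(aid) for aid in acting}
        rewards = {aid: 0.0 for aid in acting}  # only for agents that get an obs
        dones = {aid: done for aid in acting}
        dones["__all__"] = done
        return obs, rewards, dones, {}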

@ericl ericl added question Just a question :) and removed enhancement Request for new feature and/or capability labels Jan 9, 2020

nicofirst1 commented Jan 10, 2020

Thank you for the fast reply.
I didn't quite get what you are suggesting, but it should be one of the following.

1) Emitting empty obs for villagers during night time

In this case the observation dictionary keeps a constant number of elements (agent ids).
However, I get the following error:
ValueError: Cannot feed value of shape (3, 0) for Tensor 'vill_p/observation:0', which has shape '(?, 32)'
since I am trying to feed an empty list instead of one with the required size (32). I should add that I am using a custom Preprocessor class, which is the one setting the initial size, but I think the predefined one would lead to the same error.

Edit 1: Using default preprocessor

Using the default preprocessor yields the following error:
ValueError: ('Observation outside expected value range', Dict(day:Discrete(1000), phase:Discrete(4), status_map:MultiBinary(5), targets:Box(5, 5)), {})
which is kind of obvious, since an empty dict is different from the full one.

2) Not emitting observation for villagers during night time

In this case the observation dict is dynamic, i.e. the number of agent ids changes between steps.
For this one I get:
ValueError: Key set for obs and rewards must be the same: dict_keys(['werewolf_0', 'werewolf_1']) vs dict_keys(['werewolf_0', 'werewolf_1', 'villager_2', 'villager_3', 'villager_4'])
coming from the base env

Let me know if I misunderstood you in some way.


nicofirst1 commented Jan 10, 2020

I managed to fix the
ValueError: Key set for obs and rewards must be the same: dict_keys(['werewolf_0', 'werewolf_1']) vs dict_keys(['werewolf_0', 'werewolf_1', 'villager_2', 'villager_3', 'villager_4'])
in solution 2 by not returning rewards for villagers during the night phase.

At the moment I am getting a shape error:

File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
    return fn(*args)
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
    target_list, run_metadata)
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [120] vs. [6]
	 [[{{node vill_p_1/tower_1/gradients_1/vill_p_1/tower_1/add_7_grad/BroadcastGradientArgs}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/code.py", line 91, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/giulia/Desktop/rl-werewolf/src/tests/simple_policy.py", line 73, in <module>
    trainer.train()
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 447, in train
    raise e
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 433, in train
    result = Trainable.train(self)
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/ray/tune/trainable.py", line 176, in train
    result = self._train()
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/ray/rllib/agents/trainer_template.py", line 129, in _train
    fetches = self.optimizer.step()
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/ray/rllib/optimizers/multi_gpu_optimizer.py", line 204, in step
    self.per_device_batch_size)
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/ray/rllib/optimizers/multi_gpu_impl.py", line 260, in optimize
    return sess.run(fetches, feed_dict=feed_dict)
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 960, in run
    run_metadata_ptr)
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1183, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
    run_metadata)
  File "/usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [120] vs. [6]
	 [[node vill_p_1/tower_1/gradients_1/vill_p_1/tower_1/add_7_grad/BroadcastGradientArgs (defined at /usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/ray/rllib/agents/ppo/ppo_policy.py:211) ]]

Where 6 is the number of players.

Changing the number of players to 8 yields the same error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [120] vs. [8] [[node vill_p_1/tower_1/gradients_1/vill_p_1/tower_1/add_9_grad/BroadcastGradientArgs (defined at usr/local/anaconda3/envs/ww/lib/python3.6/site-packages/ray/rllib/agents/ppo/ppo_policy.py:211) ]]


ericl commented Jan 10, 2020

In this case the observation dictionary stays constant in the number of elements (agent ids).
However I get the following error

I mean omitting the key for the player entirely. For example: during day: {"player1": obs1a, "werewolf1": obs1b}. During night: just {"werewolf1": obs2b}.

Key set for obs and rewards must be the same.

Yeah, you can't emit rewards if there are no obs. The reward must be delayed to the next step (whenever an obs shows up).

Edit: Ah, I see this is resolved.
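For reference, a minimal sketch of that reward deferral inside the env (the helper and buffer below are hypothetical bookkeeping, not RLlib API): buffer the reward of any agent that does not receive an obs this step and pay it out the next time that agent does observe.

from collections import defaultdict

# Per-agent buffer of rewards earned while the agent had no observation.
pending_rewards = defaultdict(float)


def emit_step(obs, step_rewards):
    """Return (obs, rewards) where rewards only contains agents present in obs;
    rewards for absent agents are buffered until their next observation."""
    rewards = {}
    for agent_id, r in step_rewards.items():
        if agent_id in obs:
            rewards[agent_id] = r + pending_rewards.pop(agent_id, 0.0)
        else:
            pending_rewards[agent_id] += r  # defer until the agent observes again
    # Agents that observe but earned nothing this step still need a reward entry.
    for agent_id in obs:
        if agent_id not in rewards:
            rewards[agent_id] = pending_rewards.pop(agent_id, 0.0)
    return obs, rewards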


ericl commented Jan 10, 2020

Not sure what's going on with the gradient error (probably some incorrect shape emitted as an observation). Is it possible to post a script to run?

@nicofirst1

Sorry for the late reply.
I managed to solve the problem by running:

analysis = tune.run(
    "PG",
    local_dir=Params.RAY_DIR,
    config=configs,
    trial_name_creator=trial_name_creator,
)
    

rather than:

trainer = PGTrainer(configs, PolicyWw)
for i in tqdm(range(20)):
    trainer.train()

@nicofirst1

Moreover, the second solution seems to work for the issue, so we can consider it closed.

@ericl ericl closed this as completed Jan 12, 2020