Sample: Using Dask with ESPResSo #4781
Conversation
I think it looks really nice and is a great addition to the samples, as it will help with high-throughput simulation studies. I have some questions scattered throughout the review, but I also wanted to ask one here:
How does it deal with having more jobs than can be run at one time? For example, if I open 5 workers on a Slurm cluster, can I keep passing jobs to these 5 workers, or do they close after a simulation is finished? I didn't see any closing in the script, so I assume it is the former, but how does that work exactly?
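To illustrate what I mean, here is a rough sketch of the usage I have in mind (the scheduler address and the toy function are placeholders, not taken from the sample):

```python
import dask.distributed

def simulate(volume_fraction):
    # stand-in for a full ESPResSo run
    return volume_fraction ** 2

# placeholder scheduler address
client = dask.distributed.Client("tcp://192.0.2.1:8786")

# with 5 workers, these 42 tasks would be queued; does each worker
# pick up the next task as soon as it finishes the previous one?
futures = [client.submit(simulate, 0.1 + 0.01 * i) for i in range(42)]
results = client.gather(futures)
```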
```python
VOLUME_FRACTIONS = np.arange(0.1, 0.52, 0.01)
...
client = dask.distributed.Client(sys.argv[1])
```
Is the argument theoretically supposed to be either a `Cluster` instance or `None`, or is it something different altogether?
I made it clear that this is a scheduler address, that a LocalCluster does not work, and that clusters with remote workers probably will.
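To illustrate the distinction (the address is a placeholder, not from the sample):

```python
import dask.distributed

# works: connect to a separately started scheduler by address;
# its workers may run on remote nodes
client = dask.distributed.Client("tcp://192.0.2.1:8786")

# not suitable for this sample: an in-process LocalCluster
# client = dask.distributed.Client(dask.distributed.LocalCluster())
```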
Answering the general question: the workers stay alive and can be re-used until they are explicitly shut down. ESPResSo globals are kept out of the worker by running ESPResSo in a sub-process, i.e., in an independent Python instance. This makes the serialization of input and output via pickle and base64 necessary, so they can be safely passed via stdin and stdout.
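A minimal sketch of that round trip, with placeholder names for the helpers and the worker script (the sample's actual identifiers may differ):

```python
import base64
import pickle
import subprocess
import sys

def encode(obj):
    """Pickle an object and wrap it in base64 so it survives text I/O."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")

def decode(data):
    """Inverse of encode()."""
    return pickle.loads(base64.b64decode(data))

def run_simulation(params):
    """Run one simulation in an independent Python instance.

    'worker_script.py' is a placeholder: it would decode sys.argv[1],
    run the ESPResSo simulation, and print the encoded result to stdout.
    """
    completed = subprocess.run(
        [sys.executable, "worker_script.py", encode(params)],
        capture_output=True, text=True, check=True)
    return decode(completed.stdout.strip())
```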
I also added some docstrings and comments throughout the sample.
Anything still open here?
I was asked by @jngrad to run this solution inside of our RL workflow in order to correctly assess whether it resolves the issues raised during our meetings. This will, however, take a little bit of time, as we need to restructure the SwarmRL code so that it fits this structure. I think the code here works and is well written, but whether it will solve the issues with our distributed deployment is still an open question.
Co-authored-by: Rudolf Weeber <[email protected]>
LGTM
This was produced as a side project while learning Dask, but it might be useful for others.