
Sample: Using Dask with ESPResSo #4781

Merged · 3 commits · Oct 31, 2023
Conversation

RudolfWeeber (Contributor):

This was produced as a side project while learning Dask, but it might be useful for others.

@SamTov (Contributor) left a comment:
I think it looks really nice and is a great addition to the samples, as it will help with really high-throughput simulation studies. I have some questions scattered throughout the review, but I also wanted to ask one here:

How does it deal with having more jobs than can be run at one time? For example, if I open 5 workers on a Slurm cluster, can I keep passing jobs to these 5 workers, or do they close after a simulation is finished? I didn't see any closing in the script, so I assume it is the former, but how does that work exactly?

Files with resolved review threads:
samples/high_throughput_with_dask/dask_espresso.py
samples/high_throughput_with_dask/dump_test_output.py
samples/high_throughput_with_dask/echo.py
import sys

import dask.distributed
import numpy as np

# Volume fractions to scan in the parameter study
VOLUME_FRACTIONS = np.arange(0.1, 0.52, 0.01)

# Connect to a running Dask scheduler; its address is the first command-line argument
client = dask.distributed.Client(sys.argv[1])
@SamTov (Contributor):
Is the argument theoretically supposed to be either a Cluster instance or None, or is it something different altogether?

@RudolfWeeber (Contributor Author):
I made it clear that this is a scheduler address, that LocalCluster does not work, and that clusters with remote workers probably will.
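
For illustration, a minimal sketch of connecting by scheduler address, assuming a scheduler was started separately (the address below is a placeholder):

import dask.distributed

# Assumption: a scheduler is already running out-of-process and printed
# an address of this form when it started.
SCHEDULER_ADDRESS = "tcp://127.0.0.1:8786"

# The sample expects such an address on the command line; passing it to
# Client() attaches to the already-running cluster.
client = dask.distributed.Client(SCHEDULER_ADDRESS)

# Per the discussion above, a dask.distributed.LocalCluster does not work
# with this sample, whereas clusters with remote workers probably will.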

@RudolfWeeber (Contributor Author):
Answering the general question: the workers stay alive and can be reused until they are explicitly shut down. ESPResSo globals are kept out of the worker by running ESPResSo in a sub-process, i.e., in an independent Python instance. This makes it necessary to serialize input and output via pickle and base64, so they can be safely passed via stdin and stdout.
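
A minimal sketch of the pickle-and-base64 round trip described above; the helper names are illustrative, not necessarily those used in dask_espresso.py:

import base64
import pickle

def encode(data):
    # Pickle an arbitrary Python object and wrap it in base64, yielding
    # printable ASCII that passes safely through stdin/stdout.
    return base64.b64encode(pickle.dumps(data)).decode("ascii")

def decode(string):
    # Inverse of encode(): strip the base64 layer and unpickle.
    return pickle.loads(base64.b64decode(string))

# Round trip, e.g. for parameters sent to an ESPResSo sub-process
# (the parameter names are made up for this example):
params = {"volume_fraction": 0.25, "n_steps": 1000}
assert decode(encode(params)) == params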

@RudolfWeeber (Contributor Author):
I also added some docstrings and comments throughout the sample.

@RudolfWeeber (Contributor Author):
Anything still open here?

@SamTov (Contributor) commented on Sep 26, 2023:

> Anything still open here?

I was asked by @jngrad to run this solution inside our RL workflow in order to correctly assess whether it resolves the issues raised during our meetings. This will, however, take a little time, as we need to restructure the SwarmRL code so that it fits this structure. I think the code here works and is well written, but whether it will solve the issues with our distributed deployment is still an open question.

@jngrad (Member) left a comment:
LGTM

@jngrad added the Documentation and automerge (Merge with kodiak) labels on Oct 31, 2023.
@jngrad added this to the ESPResSo 4.3.0 milestone on Oct 31, 2023.
@kodiakhq bot merged commit b70d9e8 into espressomd:python on Oct 31, 2023.
5 checks passed.