tpv ranking function discussion #1142

Closed
cat-bro opened this issue Jan 16, 2023 · 1 comment

cat-bro commented Jan 16, 2023

The ranking function defined in default_tools.yml.j2 queries the influx API and receives the core utilisation (as a percentage) for each destination under consideration. If it is choosing between slurm and pulsar-mel2, it will choose the one with the lowest proportion of cores currently in use. This has worked pretty well, but there are some issues:

  • In the event that a job's datasets take a few minutes to get to a pulsar destination, the job will not start and the cores will not be used until the datasets arrive, so there is a lag between a job being assigned and the cores being utilised. This can lead to destinations being oversubscribed when others are available.
  • The ranking considers the relative proportion of cores in use rather than the absolute number available, which makes a small destination like pulsar-mel2 (40 cores) more likely to be oversubscribed.
  • The API does not make memory data available. As long as jobs always have 3.8 GB of memory per core this is OK, but we are moving away from this by using the shared database, and there is no longer any expectation of a fixed ratio between cores and mem. Core allocation info alone will no longer be enough to choose between destinations.
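
To illustrate the second point, here is a minimal sketch of how ranking by utilisation and ranking by free cores can disagree. Only the 40 cores for pulsar-mel2 comes from this thread; the slurm total, the usage figures, and all class and function names are made up for illustration and are not the code in default_tools.yml.j2.

```python
from dataclasses import dataclass


@dataclass
class DestinationStats:
    name: str
    total_cores: int
    used_cores: int

    @property
    def utilisation(self) -> float:
        # Fraction of cores in use (what a percentage-based ranking compares).
        return self.used_cores / self.total_cores

    @property
    def free_cores(self) -> int:
        # Absolute number of cores not currently in use.
        return self.total_cores - self.used_cores


def rank_by_utilisation(destinations):
    # Lowest proportion of cores in use first.
    return sorted(destinations, key=lambda d: d.utilisation)


def rank_by_free_cores(destinations):
    # Largest absolute number of free cores first.
    return sorted(destinations, key=lambda d: d.free_cores, reverse=True)


# Hypothetical figures: slurm is half full, pulsar-mel2 is a quarter full.
candidates = [
    DestinationStats("slurm", total_cores=400, used_cores=200),
    DestinationStats("pulsar-mel2", total_cores=40, used_cores=10),
]

by_utilisation = [d.name for d in rank_by_utilisation(candidates)]
by_free_cores = [d.name for d in rank_by_free_cores(candidates)]
```

With these numbers the utilisation ranking prefers pulsar-mel2 (25% used, 30 cores free) over slurm (50% used, 200 cores free), while the free-core ranking prefers slurm.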

Simon's intention was to try using Galaxy database data for ranking instead of data from the stats API. Every job has cores and mem data available in the destination_params field. This would make the allocation data more accurate and enable us to take into account both the cores and the mem allocated to a destination. It could also put pressure on the database, though Simon's initial tests of the query suggested that it wouldn't.
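
As a rough sketch of the aggregation this would need, the function below sums cores and mem per destination from already-fetched job rows. It is not Simon's query: the row shape (destination id plus a destination_params dict) and the key names "cores" and "mem" are assumptions, and the database access itself is left out.

```python
from collections import defaultdict


def allocation_by_destination(jobs):
    """Sum allocated cores and mem per destination.

    `jobs` is an iterable of (destination_id, destination_params) pairs for
    jobs that currently hold resources. The "cores" and "mem" keys are
    assumed names; missing values count as zero.
    """
    totals = defaultdict(lambda: {"cores": 0.0, "mem": 0.0})
    for destination_id, params in jobs:
        totals[destination_id]["cores"] += float(params.get("cores", 0))
        totals[destination_id]["mem"] += float(params.get("mem", 0))
    return dict(totals)


# Hypothetical rows, as they might look after decoding destination_params.
example_jobs = [
    ("slurm", {"cores": 4, "mem": 15.2}),
    ("slurm", {"cores": 8, "mem": 64}),
    ("pulsar-mel2", {"cores": 2, "mem": 7.6}),
]

allocated = allocation_by_destination(example_jobs)
```

Because the totals come from jobs as soon as they are assigned, they would also cover jobs whose datasets are still in transit to a pulsar destination.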

So there are several possibilities for updating this including:

  • tpv queries database during ranking and no API calls are needed
  • stats API endpoint includes data from the galaxy db
  • add memory usage and queued jobs to the stats API info (from slurm) and use this in the ranking function
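
Whichever source the numbers come from, a ranking that accounts for both cores and mem could look something like the sketch below. It scores each destination by how many more jobs of the requested size would fit, limited by whichever resource runs out first. This is one possible scoring rule, not an agreed design; the capacities, allocations, and function names are hypothetical.

```python
def jobs_that_fit(job_cores, job_mem, capacity, allocated):
    # How many jobs of this size fit in the remaining space, limited by
    # whichever of cores or mem is scarcer.
    free_cores = capacity["cores"] - allocated["cores"]
    free_mem = capacity["mem"] - allocated["mem"]
    return min(free_cores / job_cores, free_mem / job_mem)


def rank_destinations(job_cores, job_mem, capacities, allocations):
    # Score every destination, drop those that cannot fit the job at all,
    # and return the rest with the most headroom first.
    scored = [
        (jobs_that_fit(job_cores, job_mem, capacities[name], allocations[name]), name)
        for name in capacities
    ]
    return [name for score, name in sorted(scored, reverse=True) if score >= 1]


# Hypothetical capacities (mem in GB) and current allocations.
capacities = {
    "slurm": {"cores": 400, "mem": 1520},
    "pulsar-mel2": {"cores": 40, "mem": 152},
}
allocations = {
    "slurm": {"cores": 200, "mem": 1500},
    "pulsar-mel2": {"cores": 10, "mem": 38},
}

# slurm has plenty of free cores but only 20 GB of free mem.
big_mem_job = rank_destinations(4, 32, capacities, allocations)
small_job = rank_destinations(4, 4, capacities, allocations)
```

In this example a cores-only ranking would still see 200 free cores on slurm, even though a 32 GB job cannot fit there.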

cat-bro commented Jan 16, 2023

Things that we can get from slurm data but not from the galaxy db are:
(1) The breakdown of space available on each worker node for a destination
(2) Whether all of the worker nodes are running
