The ranking function defined in default_tools.yml.j2 queries the influx API and receives info on the core utilisation (percentage) for each destination under consideration. If it is choosing between slurm and pulsar-mel2 it will choose the one with the lowest proportion of cores currently in use. This has worked pretty well, but there are some issues:
If a job's datasets take a few minutes to reach a pulsar destination, the job cannot start and its cores sit idle until the datasets arrive, so there is a lag between a job being assigned and its cores being utilised. This can lead to one destination being oversubscribed while others still have capacity.
The ranking considers the relative proportion of cores available rather than the absolute number, which makes it more likely for a small destination like pulsar-mel2 (40 cores) to be oversubscribed.
The API does not make memory data available. As long as jobs always get 3.8 GB of memory per core this is OK, but we are moving away from this with the shared database, and there is no longer any expectation of a fixed ratio between cores and mem. Core allocation info alone will no longer be enough to choose between destinations.
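The current influx-based ranking described above can be sketched roughly as follows; `get_core_utilisation` and the sample utilisation figures are hypothetical stand-ins for the actual influx query in default_tools.yml.j2:

```python
def get_core_utilisation(dest_id):
    # Stand-in for the influx stats API call; returns the percentage of
    # cores currently in use at a destination. Figures are made up.
    sample = {"slurm": 62.5, "pulsar-mel2": 40.0}
    return sample[dest_id]

def rank(candidate_destinations):
    # Prefer the destination with the lowest proportion of cores in use.
    return sorted(candidate_destinations, key=get_core_utilisation)

print(rank(["slurm", "pulsar-mel2"])[0])  # pulsar-mel2: lower utilisation
```

Note this score never sees absolute capacity, which is exactly why a 40-core destination at a low percentage can still be easy to oversubscribe.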
Simon's intention was to try using galaxy database data for ranking instead of data from the stats API. Every job has cores and mem data available in its destination_params field. This would make the allocation data more accurate and let us take into account both the cores and the mem allocated to each destination. It could also put pressure on the database, though Simon's initial tests of the query suggested that it wouldn't.
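A minimal sketch of the aggregation that a galaxy db query would perform, assuming each queued/running job row carries cores and mem in destination_params (the field names and sample rows here are illustrative, not the exact galaxy schema):

```python
from collections import defaultdict

# Illustrative job rows; in practice these would come from a query over
# queued/running jobs in the galaxy db. Field names are assumptions.
jobs = [
    {"destination_id": "slurm", "destination_params": {"cores": 4, "mem": 16}},
    {"destination_id": "slurm", "destination_params": {"cores": 2, "mem": 8}},
    {"destination_id": "pulsar-mel2", "destination_params": {"cores": 8, "mem": 32}},
]

def allocation_by_destination(jobs):
    # Sum the cores and mem currently allocated at each destination.
    totals = defaultdict(lambda: {"cores": 0, "mem": 0})
    for job in jobs:
        params = job["destination_params"]
        totals[job["destination_id"]]["cores"] += params["cores"]
        totals[job["destination_id"]]["mem"] += params["mem"]
    return dict(totals)

print(allocation_by_destination(jobs))
```

Because the sums come from what galaxy itself has assigned, jobs whose datasets are still in transit to pulsar count immediately, which addresses the assignment-to-utilisation lag above.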
So there are several possibilities for updating this, including:
tpv queries database during ranking and no API calls are needed
stats API endpoint includes data from the galaxy db
add memory and queued-job data (from slurm) to the stats API info and use this in the ranking function
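Whichever of these options is taken, a ranking that accounts for both resources could score each destination by its most-constrained resource. A sketch, with made-up capacities and allocations (not real destination figures):

```python
def fraction_used(allocated, capacity):
    return allocated / capacity

def score(dest):
    # A destination is only as free as its scarcest resource, so rank by
    # the larger of the core and mem fractions in use (lower is better).
    return max(
        fraction_used(dest["cores_allocated"], dest["cores_total"]),
        fraction_used(dest["mem_allocated"], dest["mem_total"]),
    )

destinations = [
    {"id": "slurm", "cores_total": 200, "cores_allocated": 100,
     "mem_total": 760, "mem_allocated": 700},
    {"id": "pulsar-mel2", "cores_total": 40, "cores_allocated": 30,
     "mem_total": 152, "mem_allocated": 60},
]

# slurm is only half-full on cores but nearly out of memory, so the
# memory-aware score prefers pulsar-mel2; a cores-only ranking would not.
print(min(destinations, key=score)["id"])
```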
Things that we can get from slurm data but not from the galaxy db are:
(1) The breakdown of space available on each worker node for a destination
(2) Whether all of the worker nodes are running
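Both points matter because a destination can have plenty of free cores in total yet no single running node able to fit a large job. A small illustration with made-up per-node data (in practice this breakdown would come from slurm, e.g. sinfo, not the galaxy db):

```python
# Made-up per-node free-core counts for one destination.
nodes = [
    {"name": "node1", "free_cores": 6, "state": "idle"},
    {"name": "node2", "free_cores": 6, "state": "mixed"},
    {"name": "node3", "free_cores": 4, "state": "down"},
]

def can_fit(nodes, cores_needed):
    # A job fits only if a single running node has enough free cores;
    # the total across nodes (16 here, 12 excluding the down node) is
    # not what matters, and down nodes must be excluded entirely.
    up = [n for n in nodes if n["state"] != "down"]
    return any(n["free_cores"] >= cores_needed for n in up)

print(can_fit(nodes, 8))   # False: no single up node has 8 free cores
print(can_fit(nodes, 6))   # True
```

A ranking driven purely by galaxy db allocation totals would see 12 free cores and happily assign the 8-core job.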