tpv ranking function discussion #1142

cat-bro · 2023-01-16T04:41:22Z

The ranking function defined in default_tools.yml.j2 queries the influx API and receives the core utilisation (as a percentage) for each destination under consideration. If it is choosing between slurm and pulsar-mel2, it will choose the one with the lowest proportion of cores currently in use. This has worked pretty well, but there are some issues:

If a job's datasets take a few minutes to reach a pulsar destination, the job will not start and its cores will not be in use until the datasets arrive, so there is a lag between a job being assigned and its cores being utilised. This can lead to a destination being oversubscribed while others are available.
The ranking considers the relative amount of capacity available rather than the absolute amount, which makes it more likely for a small destination like pulsar-mel2 (40 cores) to be oversubscribed.
The API does not make memory data available. As long as jobs always have 3.8 GB of memory per core this is OK, but we are moving away from this by using the shared database, and there is no longer any expectation of a fixed ratio between cores and memory. Core allocation info alone will no longer be enough to choose between destinations.
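
To make the first two issues concrete, here is a minimal sketch in plain Python. It does not use the tpv API or the actual rank function from default_tools.yml.j2; the `Destination` class, the `pending_cores` field, and all numbers other than pulsar-mel2's 40 cores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Destination:
    # Hypothetical stand-in for a destination, not a tpv class.
    name: str
    total_cores: int
    used_cores: int         # cores that started jobs are using right now
    pending_cores: int = 0  # cores requested by jobs assigned but not yet started


def rank_by_utilisation(destinations):
    """Lowest proportion of cores in use first (the behaviour described above)."""
    return sorted(destinations, key=lambda d: d.used_cores / d.total_cores)


def rank_by_free_cores(destinations):
    """Most free cores first, in absolute terms, counting pending jobs as used."""
    def free(d):
        return d.total_cores - d.used_cores - d.pending_cores
    return sorted(destinations, key=free, reverse=True)


# Illustrative numbers only: slurm's size and all usage figures are assumptions.
destinations = [
    Destination("slurm", total_cores=400, used_cores=200),
    Destination("pulsar-mel2", total_cores=40, used_cores=16, pending_cores=16),
]

by_utilisation = [d.name for d in rank_by_utilisation(destinations)]
by_free_cores = [d.name for d in rank_by_free_cores(destinations)]
print(by_utilisation)
print(by_free_cores)
```

With these numbers the utilisation ranking prefers pulsar-mel2 (40% in use against 50%), even though only 8 of its cores are really uncommitted once waiting jobs are counted, while slurm has 200 free.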

Simon's intention was to try using Galaxy database data for ranking instead of data from the stats API. Every job has cores and mem data available in the destination_params field. This would make the allocation data more accurate and let us take into account both the cores and the memory allocated to a destination. It could also put pressure on the database, though Simon's initial tests of the query suggested that it wouldn't.
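
As a rough sketch of what ranking on database data could look like, the Python below sums cores and memory per destination from job rows and scores each destination by whichever resource is closer to full. It is not Simon's query: the row layout, the `cores` and `mem` key names inside destination_params, the set of states counted as active, and the capacity figures are all assumptions for illustration.

```python
import json
from collections import defaultdict

# Hypothetical rows as a query might return them:
# (destination_id, job state, destination_params as JSON text)
rows = [
    ("slurm", "running", '{"cores": 8, "mem": 32}'),
    ("slurm", "queued", '{"cores": 4, "mem": 64}'),
    ("pulsar-mel2", "running", '{"cores": 16, "mem": 60}'),
    ("pulsar-mel2", "ok", '{"cores": 2, "mem": 8}'),  # finished job, ignored
]

# Assumption: jobs in these states hold an allocation on their destination.
ACTIVE_STATES = {"queued", "running"}


def allocated_by_destination(rows):
    """Sum cores and memory over active jobs, grouped by destination."""
    totals = defaultdict(lambda: {"cores": 0.0, "mem": 0.0})
    for destination_id, state, params_json in rows:
        if state not in ACTIVE_STATES:
            continue
        params = json.loads(params_json)
        totals[destination_id]["cores"] += float(params.get("cores", 0))
        totals[destination_id]["mem"] += float(params.get("mem", 0))
    return dict(totals)


def load(allocated, capacity):
    """Fraction allocated of the scarcer resource (cores or memory)."""
    return max(
        allocated["cores"] / capacity["cores"],
        allocated["mem"] / capacity["mem"],
    )


# Made-up capacities; only pulsar-mel2's 40 cores comes from the discussion.
capacities = {
    "slurm": {"cores": 400, "mem": 1520},
    "pulsar-mel2": {"cores": 40, "mem": 152},
}

allocated = allocated_by_destination(rows)
loads = {name: load(allocated[name], capacities[name]) for name in allocated}
best = min(loads, key=loads.get)
print(allocated)
print(best)
```

Because queued jobs are counted as soon as they are assigned, this approach would also avoid the lag between assignment and cores showing up as utilised.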

So there are several possibilities for updating this, including:

tpv queries the database during ranking and no API calls are needed
the stats API endpoint includes data from the Galaxy database
add memory usage and queued jobs to the stats API info (from slurm) and use these in the ranking function

cat-bro · 2023-01-16T05:03:07Z

Things that we can get from slurm data but not from the Galaxy database are:
(1) The breakdown of space available on each worker node for a destination
(2) Whether all of the worker nodes are running

cat-bro mentioned this issue Jan 30, 2023

Update tpv on production to version 2.2.0 #1158

Merged

cat-bro mentioned this issue May 28, 2024

replace production tpv api call with database query #1978

Merged

cat-bro closed this as completed Oct 7, 2024
