Harmonize gres across clusters #115

satyaog · 2024-03-27T06:47:45Z

No description provided.

bouthilx · 2024-03-28T18:03:25Z

What is the filelock for?

satyaog · 2024-04-02T13:41:29Z

It was part of an attempt to fix the tests that fail intermittently. I thought that with multi-process there might be a problem where a port sometimes got assigned twice, to two different instances. But those were only trials, and none of them remain in this branch.

bouthilx · 2024-04-02T19:27:33Z

Oops, the ticket only mentioned the problem for the gpu:name:count format. There is also a mismatch in how names are displayed between node_gpu_mapping and the Prometheus data.

Here is an example of the names that can be found over the past year:

[
    'NVIDIA A100-SXM4-40GB', 'gpu:v100:8', 'Quadro RTX 8000',
    'Tesla V100-SXM2-16GB', 'Tesla V100-SXM2-32GB',
    'NVIDIA A100-SXM4-80GB', 'Tesla V100-SXM2-32GB-LS', 'gpu:v100l:4',
    'gpu:p100:4', 'NVIDIA A100 80GB PCIe', 'gpu:p100l:4', 'gpu:p100:2',
    'gpu:t4:4', 'gpu:v100:6', 'a100', 'NVIDIA RTX A6000', '3g.40gb',
    '2g.20gb', '4g.40gb'
]

Ideally, we would want v100 to be replaced by a more precise description such as Tesla V100-SXM2-16GB. In Cedar's case, the mapping will differ from node to node, because there are GPUs with different amounts of RAM.

There are also '3g.40gb', '2g.20gb', '4g.40gb', which are problematic because they don't say which type of GPU the MIG slice is on. That should perhaps be a separate ticket. 🤔
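To illustrate the kind of per-node replacement described above, here is a minimal sketch. It is not the code from this PR: the function name `harmonize_gpu_type`, the shape of `NODE_GPU_MAPPING`, and the cluster and node names in it are all hypothetical placeholders, and the real node_gpu_mapping config may be structured differently.

```python
import re

# Hypothetical per-cluster map: node-name pattern -> {short gres name: full GPU name}.
# The cluster name, node patterns and pairings are illustrative placeholders,
# not real cluster data.
NODE_GPU_MAPPING = {
    "example_cluster": {
        r"node-a\d+": {"v100": "Tesla V100-SXM2-16GB"},
        r"node-b\d+": {"v100": "Tesla V100-SXM2-32GB"},
    },
}

# Matches the Slurm-style "gpu:name" or "gpu:name:count" form.
GRES_RE = re.compile(r"^gpu:(?P<name>[^:]+)(?::(?P<count>\d+))?$")


def harmonize_gpu_type(cluster, node, gpu_type):
    """Return a precise GPU name for (cluster, node) when one is known.

    Falls back to the original string when no mapping applies.
    """
    match = GRES_RE.match(gpu_type)
    # Use the short name from "gpu:name:count", or the raw value otherwise.
    short = match.group("name") if match else gpu_type
    for pattern, names in NODE_GPU_MAPPING.get(cluster, {}).items():
        if re.fullmatch(pattern, node) and short in names:
            return names[short]
    return gpu_type
```

With this shape, the same `gpu:v100:8` string resolves to a different full name depending on the node, and names that already look like Prometheus names pass through unchanged. MIG slices such as '3g.40gb' are left untouched by this sketch, since nothing in the string says which GPU they come from.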

satyaog · 2024-04-02T19:52:49Z

Ah, I understand better now, thanks! I had looked at the dev database, but it only had info on graham. I completely forgot to ask again whether it was possible to get an up-to-date version of the test db to really verify the values.

config/sarc-dev.json

satyaog · 2024-04-03T21:43:07Z

Most of the logic should be good, but for the details I think I will need an updated dev db to list the GPU data we have there. I'll let you decide whether this should be merged before that.

config/sarc-dev.json

satyaog · 2024-04-10T15:31:07Z

I'm getting very weird results with the updated sarc-bc dev database. For the mila cluster (I haven't verified the other clusters, as there are way too many nodes), there seems to be a cross-over in the GPU types assigned to nodes, which doesn't seem to match the cluster architecture:

mila	4g.40gb			cn-a001	
mila	NVIDIA A100-SXM4-80GB	cn-a001	
mila	Quadro RTX 8000		cn-a001	
mila	Quadro RTX 8000		cn-c015	cn-c023	cn-c025	cn-c028	cn-g021	cn-g022	cn-g023	cn-g024	
mila	Tesla V100-SXM2-32GB	cn-a004
mila	Tesla V100-SXM2-32GB	cn-b001	cn-g004	
mila	Tesla V100-SXM2-32GB	cn-b001	cn-d001	cn-g017	cn-g018	cn-g022	cn-g023	cn-g024	cn-g026	
mila	Tesla V100-SXM2-32GB-LS	cn-b001

The aggregation I ran was:

db.getCollection('jobs').aggregate(
  [
    {
      $match: {
        'allocated.gpu_type': { $ne: null }
      }
    },
    {
      $group: {
        _id: {
          cluster: '$cluster_name',
          gres: '$allocated.gpu_type',
          nodes: '$nodes'
        }
      }
    },
    {
      $sort: {
        '_id.cluster': 1,
        '_id.gres': 1,
        '_id.nodes': 1
      }
    }
  ],
  { maxTimeMS: 60000, allowDiskUse: true }
);

Did I write my query wrong, or am I missing something?
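One way to make the cross-over easier to inspect is to invert the listing into a node-to-GPU-types view and keep only the nodes that show up with more than one type. The sketch below does that in plain Python on a subset of the rows above; the helper names are hypothetical, and it only restates the query result rather than explaining where the mismatch comes from.

```python
from collections import defaultdict

# (cluster, gpu_type, nodes) rows copied from a subset of the listing above.
ROWS = [
    ("mila", "4g.40gb", ["cn-a001"]),
    ("mila", "NVIDIA A100-SXM4-80GB", ["cn-a001"]),
    ("mila", "Quadro RTX 8000", ["cn-a001"]),
    ("mila", "Tesla V100-SXM2-32GB", ["cn-a004"]),
    ("mila", "Tesla V100-SXM2-32GB", ["cn-b001", "cn-g004"]),
    ("mila", "Tesla V100-SXM2-32GB-LS", ["cn-b001"]),
]


def gpu_types_per_node(rows):
    """Map each (cluster, node) to the set of GPU types reported for it."""
    types = defaultdict(set)
    for cluster, gpu_type, nodes in rows:
        for node in nodes:
            types[(cluster, node)].add(gpu_type)
    return types


def conflicting_nodes(rows):
    """Keep only the nodes that appear with more than one GPU type."""
    return {
        key: sorted(gpu_types)
        for key, gpu_types in gpu_types_per_node(rows).items()
        if len(gpu_types) > 1
    }


for (cluster, node), gpu_types in sorted(conflicting_nodes(ROWS).items()):
    print(cluster, node, gpu_types)
```

On these rows, cn-a001 and cn-b001 are flagged while cn-a004 and cn-g004 are not. Note that the aggregation groups on the whole `nodes` array, so a multi-node job contributes a single row whose GPU type is then attributed to every node in its list.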

satyaog force-pushed the feature/uniform_gres branch 10 times, most recently from e2ca33f to a199b6b Compare March 28, 2024 13:27

satyaog force-pushed the feature/uniform_gres branch from a199b6b to f853ed6 Compare April 2, 2024 13:27

satyaog marked this pull request as ready for review April 2, 2024 13:36

Harmonize gres across clusters

262d37c

satyaog force-pushed the feature/uniform_gres branch from f853ed6 to 262d37c Compare April 2, 2024 13:42

satyaog force-pushed the feature/uniform_gres branch from c648c98 to 11dbc63 Compare April 3, 2024 20:09

satyaog force-pushed the feature/uniform_gres branch 2 times, most recently from cc1118b to f4a0030 Compare April 3, 2024 20:26

Add per cluster's node gpu maps

8d26d49

satyaog force-pushed the feature/uniform_gres branch from f4a0030 to 8d26d49 Compare April 3, 2024 21:41

Merge branch 'master' into feature/uniform_gres

1ff1ad1

Finalize the config

c845106

satyaog force-pushed the feature/uniform_gres branch from 10706cb to c845106 Compare April 10, 2024 15:50

Merge branch 'master' into feature/uniform_gres

d23d7ab

satyaog force-pushed the feature/uniform_gres branch from 74b3392 to d23d7ab Compare April 10, 2024 15:56

satyaog and others added 4 commits April 10, 2024 16:14

Finalize config

aaae508

Merge branch 'master' into feature/uniform_gres

5cd9f42

Merge branch 'master' into feature/uniform_gres

1d380a4

Merge branch 'master' into feature/uniform_gres

d336ac8
