Azure: update fetch_azure to support two H100 families. #2844

Merged
merged 2 commits into from
Dec 7, 2023
50 changes: 30 additions & 20 deletions sky/clouds/service_catalog/data_fetchers/fetch_azure.py
@@ -36,6 +36,34 @@

 SINGLE_THREADED = False

+# Family name to SkyPilot GPU name mapping.
+#
+# When adding a new accelerator:
+# - The instance type is typically already fetched, but we need to find the
+#   family name and add it to this mapping.
+# - To inspect family names returned by Azure API, check the dataframes in
+#   get_all_regions_instance_types_df().
+FAMILY_NAME_TO_SKYPILOT_GPU_NAME = {
+    'standardNCFamily': 'K80',
+    'standardNCSv2Family': 'P100',
+    'standardNCSv3Family': 'V100',
+    'standardNCPromoFamily': 'K80',
+    'StandardNCASv3_T4Family': 'T4',
+    'standardNDSv2Family': 'V100-32GB',
+    'StandardNCADSA100v4Family': 'A100-80GB',
+    'standardNDAMSv4_A100Family': 'A100-80GB',
+    'StandardNDASv4_A100Family': 'A100',
+    'standardNVFamily': 'M60',
+    'standardNVSv2Family': 'M60',
+    'standardNVSv3Family': 'M60',
+    'standardNVPromoFamily': 'M60',
+    'standardNVSv4Family': 'Radeon MI25',
+    'standardNDSFamily': 'P40',
+    'StandardNVADSA10v5Family': 'A10',
+    'StandardNCadsH100v5Family': 'H100',
+    'standardNDSH100v5Family': 'H100',
Collaborator

Weirdly, I don't see this family in `az vm list-skus --all --resource-type virtualMachines -l southcentralus | grep v5`. How do we get this family?

Member Author (@concretevitamin, Dec 7, 2023)

Running this on my end got

...
    "family": "StandardNCadsH100v5Family",
    "name": "Standard_NC40ads_H100_v5",
    "size": "NC40ads_H100_v5",
    "family": "StandardNCadsH100v5Family",
    "name": "Standard_NC80adis_H100_v5",
    "size": "NC80adis_H100_v5",
...

My `az account subscription list` shows subscriptions *7 and *a.

Collaborator

Oh, I was wondering where we get the family `standardNDSH100v5Family`. It seems it is not included in the output you sent either?

Member Author

It was obtained by inspecting an intermediate dataframe. Also seen in:

» az vm list-skus --all --resource-type virtualMachines  | grep -i h100

i.e., with the location removed.

Collaborator

Oops, sorry. I forgot about the location argument. After removing it, I can see the family. Thanks for the explanation.
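
For readers following this thread, here is a minimal sketch of how the `az vm list-skus` output could be scanned for family names offline. It assumes the command's output has been saved as a JSON array of objects with `family` and `name` fields, as the excerpt above suggests. The helper name `families_matching` and the sample string are hypothetical and are not part of `fetch_azure.py`.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Sample records copied from the excerpt in this thread. Real output
# contains many more fields and entries (assumption: it is a JSON array).
SAMPLE = '''[
  {"family": "StandardNCadsH100v5Family",
   "name": "Standard_NC40ads_H100_v5", "size": "NC40ads_H100_v5"},
  {"family": "StandardNCadsH100v5Family",
   "name": "Standard_NC80adis_H100_v5", "size": "NC80adis_H100_v5"}
]'''


def families_matching(skus_json: str, needle: str) -> Dict[str, List[str]]:
    """Group SKU names by family, keeping names that contain `needle`.

    The match ignores case, similar in spirit to `grep -i`.
    """
    grouped = defaultdict(list)
    for sku in json.loads(skus_json):
        if needle.lower() in sku['name'].lower():
            grouped[sku['family']].append(sku['name'])
    return dict(grouped)


print(families_matching(SAMPLE, 'h100'))
```

Each key of the returned dictionary is a candidate family name for the mapping. As noted in the thread, a location filter (`-l`) can hide families that are not offered in that region.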

+}
+

 def get_regions() -> List[str]:
     """Get all available regions."""
@@ -78,7 +106,7 @@ def get_pricing_url(region: Optional[str] = None) -> str:
 def get_pricing_df(region: Optional[str] = None) -> pd.DataFrame:
     all_items = []
     url = get_pricing_url(region)
-    print(f'Getting pricing for {region}')
+    print(f'Getting pricing for {region}, url: {url}')
     page = 0
     while url is not None:
         page += 1
@@ -125,29 +153,11 @@ def get_sku_df(region_set: Set[str]) -> pd.DataFrame:


 def get_gpu_name(family: str) -> Optional[str]:
-    gpu_data = {
-        'standardNCFamily': 'K80',
-        'standardNCSv2Family': 'P100',
-        'standardNCSv3Family': 'V100',
-        'standardNCPromoFamily': 'K80',
-        'StandardNCASv3_T4Family': 'T4',
-        'standardNDSv2Family': 'V100-32GB',
-        'StandardNCADSA100v4Family': 'A100-80GB',
-        'standardNDAMSv4_A100Family': 'A100-80GB',
-        'StandardNDASv4_A100Family': 'A100',
-        'standardNVFamily': 'M60',
-        'standardNVSv2Family': 'M60',
-        'standardNVSv3Family': 'M60',
-        'standardNVPromoFamily': 'M60',
-        'standardNVSv4Family': 'Radeon MI25',
-        'standardNDSFamily': 'P40',
-        'StandardNVADSA10v5Family': 'A10',
-    }
     # NP-series offer Xilinx U250 FPGAs which are not GPUs,
     # so we do not include them here.
     # https://docs.microsoft.com/en-us/azure/virtual-machines/np-series
     family = family.replace(' ', '')
-    return gpu_data.get(family)
+    return FAMILY_NAME_TO_SKYPILOT_GPU_NAME.get(family)


 def get_all_regions_instance_types_df(region_set: Set[str]):
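
The lookup that `get_gpu_name` performs after this change can be sketched in isolation as follows. This is an illustration rather than the module itself: `FAMILY_TO_GPU` is a hypothetical three-entry excerpt of the mapping in the diff, and `gpu_name_for_family` is a hypothetical stand-in for `get_gpu_name`.

```python
from typing import Optional

# Hypothetical excerpt of FAMILY_NAME_TO_SKYPILOT_GPU_NAME from the diff.
FAMILY_TO_GPU = {
    'standardNCSv3Family': 'V100',
    'StandardNCadsH100v5Family': 'H100',
    'standardNDSH100v5Family': 'H100',
}


def gpu_name_for_family(family: str) -> Optional[str]:
    """Return the GPU name for an Azure family, or None if unmapped."""
    # As in the diff, spaces are stripped before the lookup.
    # The dictionary keys are matched case-sensitively.
    return FAMILY_TO_GPU.get(family.replace(' ', ''))


print(gpu_name_for_family('Standard NCadsH100v5 Family'))  # H100
print(gpu_name_for_family('standardNDSH100v5Family'))      # H100
print(gpu_name_for_family('standardNCadsH100v5Family'))    # None
```

The last call returns `None` because only spaces are normalized, not letter case, so a new entry has to be spelled the way the Azure API reports the family.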
10 changes: 8 additions & 2 deletions sky/utils/accelerator_registry.py
@@ -6,17 +6,22 @@
 # NOTE: Must include accelerators supported for local clusters.
 #
 # 1. What if a name is in this list, but not in any catalog?
+#
 # The name will be canonicalized, but the accelerator will not be supported.
 # Optimizer will print an error message.
+#
 # 2. What if a name is not in this list, but in a catalog?
+#
 # The list is simply an optimization to short-circuit the search in the catalog.
 # If the name is not found in the list, it will be searched in the catalog
 # with its case being ignored. If a match is found, the name will be
 # canonicalized to that in the catalog. Note that this lookup can be an
 # expensive operation, as it requires reading the catalog or making external
 # API calls (such as for Kubernetes). Thus it is desirable to keep this list
 # up-to-date with commonly used accelerators.
+
 # 3. (For SkyPilot dev) What to do if I want to add a new accelerator?
+#
 # Append its case-sensitive canonical name to this list. The name must match
 # `AcceleratorName` in the service catalog, or what we define in
 # `onprem_utils.get_local_cluster_accelerators`.
@@ -42,6 +47,7 @@
     'Radeon MI25',
     'P4',
     'L4',
+    'H100',
Collaborator

GKE seems to use `H100-80GB` as the name. Not sure whether we want to align with that.
Pro: the name stays the same across GKE and other clouds' native GPU names.
Con: it complicates the name. H100 only has 80GB and 188GB versions, so we may only want `H100` and `H100-188GB`.

Member Author

Yeah, it seems we've been using H100 for AWS and Lambda already (added in catalog files, not in this registry). Should be fine to use it for now.

]


@@ -72,11 +78,11 @@ def canonicalize_accelerator_name(accelerator: str) -> str:
     if len(names) == 1:
         return names[0]

-    # Do not print an error meessage here. Optimizer will handle it.
+    # Do not print an error message here. Optimizer will handle it.
     if len(names) == 0:
         return accelerator

-    # Currenlty unreachable.
+    # Currently unreachable.
     # This can happen if catalogs have the same accelerator with
     # different names (e.g., A10g and A10G).
     assert len(names) > 1
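
The two-stage lookup described in the registry comments (check the list first, then fall back to a case-insensitive catalog search) might look roughly like the sketch below. It is a simplification under stated assumptions: `REGISTRY` and `CATALOG_NAMES` are hypothetical in-memory stand-ins for the registry list and the service catalog, the registry match is assumed to ignore case, and the `ValueError` for ambiguous matches is illustrative, since the diff does not show what follows the final `assert`.

```python
from typing import List

# Hypothetical stand-ins; the real catalog lookup reads catalog files
# or makes external API calls.
REGISTRY: List[str] = ['A100', 'H100', 'L4']
CATALOG_NAMES: List[str] = ['A100', 'H100', 'L4', 'A10G']


def canonicalize(accelerator: str) -> str:
    """Map a user-supplied accelerator name to its canonical spelling."""
    lowered = accelerator.lower()

    # Fast path: the registry short-circuits the expensive catalog search.
    for name in REGISTRY:
        if name.lower() == lowered:
            return name

    # Slow path: case-insensitive search over catalog names.
    names = [n for n in CATALOG_NAMES if n.lower() == lowered]
    if len(names) == 1:
        return names[0]
    if len(names) == 0:
        # Unknown name: return it unchanged and let the caller report it.
        return accelerator
    # Several catalog spellings match (e.g., A10g and A10G).
    raise ValueError(f'Ambiguous accelerator name: {accelerator!r}')


print(canonicalize('h100'))   # H100, found in the registry
print(canonicalize('a10g'))   # A10G, found only in the catalog
print(canonicalize('tpu-x'))  # tpu-x, returned unchanged
```

Under these assumptions, adding `'H100'` to the registry does not change the result for `h100`. It only moves the match from the slow catalog path to the fast path.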