LEAP prod hub unable to start Large profile notebooks #2237

Closed
pnasrat opened this issue Feb 21, 2023 · 14 comments

@pnasrat
Contributor

pnasrat commented Feb 21, 2023

Context

Filed through support.

Spawning a Large profile server fails with a ValueError: Expected option gpu-image for profile large, not found in posted form

@pnasrat
Contributor Author

pnasrat commented Feb 21, 2023

[W 2023-02-21 17:07:50.443 JupyterHub base:1039] 2 consecutive spawns failed.  Hub will exit if failure count reaches 5 before succeeding
[E 2023-02-21 17:07:50.443 JupyterHub gen:630] Exception in Future <Task finished name='Task-439647' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.11/site-packages/jupyterhub/handlers/base.py:963> exception=ValueError('Expected option gpu-image for profile large, not found in posted form')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/site-packages/tornado/gen.py", line 625, in error_callback
        future.result()
      File "/usr/local/lib/python3.11/site-packages/jupyterhub/handlers/base.py", line 970, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.11/site-packages/jupyterhub/user.py", line 851, in spawn
        raise e
      File "/usr/local/lib/python3.11/site-packages/jupyterhub/user.py", line 748, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/jovyan/.local/lib/python3.11/site-packages/jupyterhub_configurator/mixins.py", line 45, in start
        return await super().start(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/kubespawner/spawner.py", line 2645, in _start
        await self.load_user_options()
      File "/usr/local/lib/python3.11/site-packages/kubespawner/spawner.py", line 3103, in load_user_options
        await self._load_profile(selected_profile, selected_profile_user_options)
      File "/usr/local/lib/python3.11/site-packages/kubespawner/spawner.py", line 3034, in _load_profile
        raise ValueError(
    ValueError: Expected option gpu-image for profile large, not found in posted form
    
[E 2023-02-21 17:07:50.449 JupyterHub pages:373] Previous spawn for pnasrat failed: Expected option gpu-image for profile large, not found in posted form

@pnasrat pnasrat self-assigned this Feb 21, 2023
@pnasrat
Contributor Author

pnasrat commented Feb 21, 2023

kubectl get nodes -n prod
NAME                                       STATUS   ROLES    AGE     VERSION
gke-leap-cluster-core-pool-1cc6bf7d-5fnp   Ready    <none>   130d    v1.24.5-gke.600
gke-leap-cluster-core-pool-1cc6bf7d-5fwt   Ready    <none>   62d     v1.24.5-gke.600
gke-leap-cluster-core-pool-1cc6bf7d-7wg2   Ready    <none>   31d     v1.24.5-gke.600
gke-leap-cluster-nb-huge-cf099c13-pplg     Ready    <none>   68m     v1.24.5-gke.600
gke-leap-cluster-nb-medium-b9c8ba20-45v5   Ready    <none>   3h28m   v1.24.5-gke.600
gke-leap-cluster-nb-medium-b9c8ba20-c6gs   Ready    <none>   137m    v1.24.5-gke.600
gke-leap-cluster-nb-medium-b9c8ba20-xdtg   Ready    <none>   136m    v1.24.5-gke.600

@pnasrat
Contributor Author

pnasrat commented Feb 21, 2023

The traceback looks to be in profile handling, specifically for profile_options.

The form posts:

small-image: pangeo
profile-option-medium-image: pangeo
profile: large
profile-option-large-image: pangeo
profile-option-huge-image: pangeo
profile-option-large-gpu-image: tensorflow

See https://github.com/jupyterhub/kubespawner/blob/main/kubespawner/spawner.py#L3034

        if profile.get('profile_options'):
            # each option specified here *must* have a value in our POST, as we
            # render our HTML such that there's always something selected.

            # We only honor options that are defined in the selected profile *and*
            # are in the form data posted. This prevents users who may be authorized
            # to only use one profile from being able to access options set for other
            # profiles
            for user_selected_option_name in selected_profile_user_options.keys():
                if (
                    user_selected_option_name
                    not in profile.get('profile_options').keys()
                ):
                    raise ValueError(
                        f'Expected option {user_selected_option_name} for profile {slug}, not found in posted form'
                    )
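A minimal sketch of that check in isolation may help reproduce the error without a hub. It assumes the Large profile defines a single image option and that the options attributed to it are image and gpu-image; check_posted_options is a hypothetical helper name, not a kubespawner function.

```python
def check_posted_options(profile, slug, posted_options):
    """Mirror the quoted loop: every posted option must be defined on the profile."""
    defined = profile.get("profile_options")
    if not defined:
        return
    for name in posted_options:
        if name not in defined:
            raise ValueError(
                f"Expected option {name} for profile {slug}, not found in posted form"
            )


# Assumption: the Large profile defines only an "image" option.
large_profile = {"display_name": "Large", "profile_options": {"image": {}}}

# Options the hub ends up attributing to the large profile.
posted = {"image": "pangeo", "gpu-image": "tensorflow"}

try:
    check_posted_options(large_profile, "large", posted)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Note that the message reads as if the option were missing from the form, while the loop actually fires when the form carries an option that the selected profile does not define.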

@pnasrat
Contributor Author

pnasrat commented Feb 21, 2023

It doesn't look like the kubespawner side of this has changed lately, but I suspect something is not working correctly when both of these are posted:

profile-option-large-image
profile-option-large-gpu-image

@pnasrat
Contributor Author

pnasrat commented Feb 21, 2023

It looks like there was potentially a change to how profile_options are configured, in commit f20aa17 by @yuvipanda

@pnasrat
Contributor Author

pnasrat commented Feb 21, 2023

That might be a red herring - trying to understand the flow

The form post will yield

{'profile': 'large',
 'profile-option-large-image': 'pangeo',
 'profile-option-large-gpu-image': 'tensorflow'}

This then gets parsed into the options per profile, i.e. profile large ends up with both image and gpu-image.

@pnasrat
Contributor Author

pnasrat commented Feb 21, 2023

I suspect that Large + GPU is supposed to be a separate profile and not an option on large

@pnasrat
Contributor Author

pnasrat commented Feb 21, 2023

Looking for hubs with profile lists with GPU

ack -- '\+ GPU' 
m2lines/common.values.yaml
129:        - display_name: Large + GPU

leap/common.values.yaml
156:        - display_name: Large + GPU

2i2c-aws-us/researchdelight.values.yaml
79:        - display_name: "Large + GPU"

I tried it on researchdelight and it worked. However, the display name for Large there means the slug is profile-item-large-m5-2xlarge

 grep Large 2i2c-aws-us/researchdelight.values.yaml leap/common.values.yaml m2lines/common.values.yaml 
2i2c-aws-us/researchdelight.values.yaml:        - display_name: "Large: m5.2xlarge"
2i2c-aws-us/researchdelight.values.yaml:        - display_name: "Large + GPU"
leap/common.values.yaml:        - display_name: Large
leap/common.values.yaml:        - display_name: Large + GPU
m2lines/common.values.yaml:        - display_name: Large
m2lines/common.values.yaml:        - display_name: Large + GPU
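To see why the display names matter, here is a rough sketch of slug generation and a prefix check. approx_slug is only an approximation written for illustration (kubespawner's real slug generation may differ, and the profile-item- part quoted above is left out), and shadows is a hypothetical helper.

```python
import re


def approx_slug(display_name):
    """Approximate a slug: lowercase, runs of other characters become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", display_name.lower()).strip("-")


def shadows(slug, other):
    """True if options of profile `other` would also match `slug`'s prefix."""
    return other != slug and other.startswith(f"{slug}-")


leap = [approx_slug(name) for name in ("Large", "Large + GPU")]
researchdelight = [
    approx_slug(name) for name in ("Large: m5.2xlarge", "Large + GPU")
]

# large vs large-gpu
leap_collides = shadows(leap[0], leap[1])
# large-m5-2xlarge vs large-gpu
researchdelight_collides = shadows(researchdelight[0], researchdelight[1])

print(leap, leap_collides)
print(researchdelight, researchdelight_collides)
```

Under these assumptions the leap and m2lines names give large and large-gpu, where one slug is a prefix of the other, while researchdelight gives large-m5-2xlarge and large-gpu, which would explain why it worked there.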

@pnasrat
Contributor Author

pnasrat commented Feb 21, 2023

I suspect either setting slug or changing the display name might work around this. PR to test. I'm also pulling down minikube to test kubespawner's logic (based on the unit tests of the spawner using a mock spawner)

@pnasrat
Contributor Author

pnasrat commented Feb 21, 2023

Deleted my prior kubespawner comment, as I had not wired up the test data properly.

@consideRatio
Contributor

Thanks for the excellent overview of your investigation!

In the spawn page, after pressing start on the large option, I looked at the form data that was sent:

[image: the sent form data]

So, all options are passed, but only the ones for "large" are relevant. When kubespawner considers these options, it fails because it treats profile-option-large-gpu-image as a problem: it reads it as an option belonging to the large profile rather than to the large-gpu profile.

I assume the issue is that kubespawner fails to tell the two parts of <profile-name-slug-form>-<profile-option-slug-form> apart, because the separator can be part of the slug name as well.

I've confirmed this via this code segment, which only looks at a prefix. The large profile fails to start because its profile name is a prefix of large-gpu.

https://github.com/jupyterhub/kubespawner/blob/main/kubespawner/spawner.py#L2981-L2991
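The prefix behaviour can be sketched as follows. This is not the linked kubespawner code, only a small stand-in that assumes form keys have the shape profile-option-<profile slug>-<option name>; options_for_profile is a hypothetical name.

```python
def options_for_profile(form, profile_slug):
    """Collect options whose form key starts with the profile's prefix."""
    prefix = f"profile-option-{profile_slug}-"
    return {
        key[len(prefix):]: value
        for key, value in form.items()
        if key.startswith(prefix)
    }


form = {
    "profile": "large",
    "profile-option-large-image": "pangeo",
    "profile-option-large-gpu-image": "tensorflow",
}

large_options = options_for_profile(form, form["profile"])
print(large_options)
```

Because profile-option-large-gpu-image also starts with profile-option-large-, the large profile picks up a gpu-image option that it never defined.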

I think the change that won't disrupt users is to add a slug (the name of the profile in a computer-friendly format) so that Large + GPU doesn't get a slug that starts with large, for example just gpu.

@consideRatio
Contributor

I opened jupyterhub/kubespawner#702, but I don't plan to resolve it myself now that it's documented well enough.

For us right now, the workaround should be to add a slug for the Large + GPU profile list entry so that it's not generated as large-gpu, but declared explicitly as gpu or gpu-large.
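As a rough illustration of the workaround, the sketch below writes the two entries as Python dicts with explicit slugs and checks which form keys each profile's prefix would match. The entries are trimmed placeholders rather than the real LEAP config, the explicit large slug is an assumption made to keep the example self-contained, and posted_keys and keys_matching are hypothetical helpers.

```python
# Trimmed, hypothetical profile_list entries; only the fields needed here.
profile_list = [
    {"display_name": "Large", "slug": "large", "profile_options": {"image": {}}},
    # Explicit slug, so the generated "large-gpu" is never used.
    {"display_name": "Large + GPU", "slug": "gpu", "profile_options": {"image": {}}},
]


def posted_keys(profiles):
    """Form keys the spawn page would post, one per profile option."""
    return [
        f"profile-option-{profile['slug']}-{option}"
        for profile in profiles
        for option in profile["profile_options"]
    ]


def keys_matching(profiles, slug):
    """Form keys that a prefix match would attribute to `slug`."""
    prefix = f"profile-option-{slug}-"
    return [key for key in posted_keys(profiles) if key.startswith(prefix)]


large_keys = keys_matching(profile_list, "large")
gpu_keys = keys_matching(profile_list, "gpu")
print(large_keys, gpu_keys)
```

With gpu as the slug, neither profile's prefix matches the other's keys, so a prefix-based lookup no longer mixes the two.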

Note that we should not try to combine the GPU profile_list entry with the Large server entry, as I know that we want to start using n2-highmem servers for non-gpu servers, but they can't have GPUs etc.

pnasrat added a commit that referenced this issue Feb 22, 2023
LEAP Work around for slug handling and user options in kubespawner

Fixes: #2237
@pnasrat
Contributor Author

pnasrat commented Feb 22, 2023

The deployer failed on the LEAP staging health check, but on the staging hub itself the change is there and the hub is healthy. Will check manually.

@pnasrat
Contributor Author

pnasrat commented Feb 22, 2023

Restarting the workflow's failed jobs got the health check to pass. If this recurs, we might need to revisit the retry strategy so that slow bringup doesn't cause flaky failures.
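One possible shape for such a retry strategy is a bounded retry with exponential backoff. This is only a sketch: the deployer's actual health check is not shown in this thread, so wait_until_healthy, check, and the delay values are all hypothetical.

```python
import time


def wait_until_healthy(check, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call check() up to `attempts` times, doubling the wait between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False


# Stand-in for a hub that becomes healthy on the third probe; the sleep
# function is replaced by a recorder so the example runs instantly.
outcomes = iter([False, False, True])
delays = []
healthy = wait_until_healthy(lambda: next(outcomes), sleep=delays.append)
print(healthy, delays)
```

Passing the sleep function in as a parameter keeps the helper easy to test without real waiting.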
