feat: Specify image in server request #452

Merged
merged 46 commits into from
Nov 24, 2020

olevski
Member

@olevski olevski commented Nov 3, 2020

Closes #304

It took a while because I realized that some of the code I wrote was already present where we check for the gitlab image. So I spent a bit more time cleaning this up to avoid repeated or similar code.

The logic implemented here is as follows:

  • look for image in the payload from the request to create a new user interactive server
    • if image is found then:
      • check whether the image name matches the regex for a dockerhub public image, a google container registry (gcr) public image, or a gitlab image
      • if any of those match and the image is confirmed to exist, then use that image
      • in the case of a gitlab image, if the image is part of renku's gitlab, check whether the image is public and add an image pull secret accordingly
      • in the case that the image requested cannot be found respond with a 404
    • if image is not found (what we had before this PR):
      • look at the commit for the current project, try to find an image that matches the current repo and commit
      • if such an image is not found use the default renku image specified in the environment variable
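
The decision logic listed above can be sketched roughly as follows. This is only an illustration, not the code from the PR: `select_image`, `ImageNotFound`, and the `image_exists` callback are hypothetical names, and the registry-specific checks and the image pull secret handling are left out.

```python
from typing import Callable, Optional, Tuple


class ImageNotFound(Exception):
    """Hypothetical error for a requested image that does not exist (maps to a 404)."""


def select_image(
    payload: dict,
    commit_image: str,
    default_image: str,
    image_exists: Callable[[str], bool],
) -> Tuple[str, bool]:
    """Return (image, default_image_used) for a server creation request."""
    requested_image: Optional[str] = payload.get("image", None)
    if requested_image is not None:
        # An explicitly requested image must exist, otherwise the request fails.
        if not image_exists(requested_image):
            raise ImageNotFound(requested_image)
        return requested_image, False
    # No image in the payload: behave as before this PR.
    if image_exists(commit_image):
        return commit_image, False
    # The image tied to the commit is missing, so fall back to the default.
    return default_image, True
```

Passing the existence check in as a callback keeps the branching testable without contacting any registry.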

Also, in each of the above cases the following annotations are added to the user pod that is launched:

  • renku.io/default_image_used, True or False depending on whether the image tied to the current commit could not be found and the default image was used instead
  • renku.io/image, the name of the image used, e.g. renku/renkulab-py:3.7-renku0.10.4-0.6.3

This is currently deployed at https://tasko.dev.renku.ch/. I found the easiest way to test is to use telepresence and replace None with the image you would like to test in requested_image = payload.get("image", None), on line 85 of api/notebooks.py.

@olevski olevski requested a review from a team as a code owner November 3, 2020 22:37
@lorenzo-cavazzi
Member

At first sight, the code looks great. I'm going to test this soon.

Should we add a few more tests to cover the newly created functions, including gcr_public_image_exists, dockerhub_public_image_exists, etc.?
Not sure about possible rate limit problems when invoking 3rd party services like dockerhub and gcr though 🤔

Contributor

@ableuler ableuler left a comment

This looks really good and seems to give us what we need! 🎉
The only bigger refactoring that I would propose is to try to make the logic for checking the existence of an image a bit simpler and more generic. I believe that this should be possible by doing something along these lines:

  • if image does not include host, prepend registry-1.docker.io/
  • parse image into host, repository, tag
  • get request to https://{host}/v2/{repository}/manifests/{tag}, if 200, image exists, all good, return
  • if 401, get token URL and service from the Www-Authenticate header of the 401 response
  • do an access token request specifying the pull scope for the given repository (for RenkuLab's gitlab, include the user's oauth token as Authorization: basic {gitlab_oauth_token})
  • use the acquired access token for another get request to https://{host}/v2/{repository}/manifests/{tag}, if 200, image exists, all good, return

I haven't tested this through myself, but I'm pretty sure that this (or some small variation of it) should work. This would allow for all publicly available image registries and reduce the need for all the logical branching in the code.
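
A minimal sketch of the flow proposed above, under several assumptions: the helper names (`parse_image`, `parse_www_authenticate`, `image_exists`) are made up for illustration, the HTTP client is injected as a `get` callable with the same call shape as `requests.get`, the host detection follows the usual Docker naming convention, and the extra Authorization header with the user's gitlab token is omitted. It is not the code that ended up in the PR.

```python
import re
from typing import Dict, Tuple

DEFAULT_HOST = "registry-1.docker.io"
ACCEPT = {"Accept": "application/vnd.docker.distribution.manifest.v2+json"}


def parse_image(image: str) -> Tuple[str, str, str]:
    """Split an image name into (host, repository, tag or digest)."""
    first, _, rest = image.partition("/")
    # Assumption: the first component is a host if it looks like one.
    if rest and ("." in first or ":" in first or first == "localhost"):
        host, remainder = first, rest
    else:
        host, remainder = DEFAULT_HOST, image
    if "@" in remainder:  # digest reference, e.g. name@sha256:...
        repository, _, reference = remainder.partition("@")
    elif ":" in remainder:
        repository, _, reference = remainder.rpartition(":")
    else:
        repository, reference = remainder, "latest"
    if host == DEFAULT_HOST and "/" not in repository:
        repository = "library/" + repository  # Docker Hub official images
    return host, repository, reference


def parse_www_authenticate(header: str) -> Dict[str, str]:
    """Extract key="value" pairs (realm, service, ...) from the challenge."""
    return dict(re.findall(r'(\w+)="([^"]*)"', header))


def image_exists(image: str, get) -> bool:
    """Check the manifest endpoint, retrying once with an access token on 401."""
    host, repository, reference = parse_image(image)
    url = f"https://{host}/v2/{repository}/manifests/{reference}"
    res = get(url, headers=ACCEPT)
    if res.status_code == 200:
        return True
    if res.status_code != 401:
        return False
    challenge = parse_www_authenticate(res.headers.get("Www-Authenticate", ""))
    if "realm" not in challenge:
        return False
    params = {"scope": f"repository:{repository}:pull"}
    if "service" in challenge:
        params["service"] = challenge["service"]
    token_res = get(challenge["realm"], params=params)
    if token_res.status_code != 200:
        return False
    token = token_res.json().get("token")
    res = get(url, headers={**ACCEPT, "Authorization": f"Bearer {token}"})
    return res.status_code == 200
```

With the requests library installed, a call would look like `image_exists("renku/renkulab-py:3.7-renku0.10.4-0.6.3", requests.get)`. Injecting `get` also makes it possible to unit test the flow with a fake client, which sidesteps rate limits on third-party registries.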

@olevski
Member Author

olevski commented Nov 4, 2020

@lorenzo-cavazzi I can add the tests you suggest, that is a good idea.

@ableuler if a response from the API (either for docker or gitlab) without a token tells you where to go to authenticate then that is great. This is what I had issues with - I did not try to see what is in the response when you do not have a token. The most trouble I had is figuring out the URL to authenticate for renku's gitlab (which may also change with different deployments) and also even for a managed gitlab. So I will try what you propose. I originally did want to have a more unified approach but gave up mostly because I was not sure what is the endpoint where you should authenticate.

@ableuler
Contributor

ableuler commented Nov 4, 2020

because I was not sure what is the endpoint where you should authenticate.

Yes, that's not specified since the token endpoint is not part of the registry itself. The v2 registry documentation only specifies how the client must be informed about where to authenticate: https://github.com/docker/distribution/blob/c192a281f8ac6f2a351fe729c8a56108f8edb377/docs/spec/auth/token.md#how-to-authenticate

@rokroskar
Member

Thanks @olevski!

I have yet to test it out but I'm a bit confused by this:

The most trouble I had is figuring out the URL to authenticate for renku's gitlab (which may also change with different

You have the GITLAB_URL environment variable and you have the user's oauth token so you can use the gitlab registry api.

Two more points:

  • "if such an image is not found use the default renku image specified in the environment variable" --> I would just fail if no valid image can be found. It's super confusing for users to get an environment with an image they don't expect. The UI should give them the option to launch with the default image but then this will be specified in the UI request with image in the payload.

  • why do we need the extra annotations? Why would you need to know that a default image was used? The renku.io/image annotation is redundant with the image set for the container that you can get out of the pod manifest anyway. I can definitely see the usefulness of giving this information to the UI in the response, but it doesn't need to be a pod annotation.

@lorenzo-cavazzi
Member

@rokroskar

  • "if such an image is not found use the default renku image specified in the environment variable" --> I would just fail if no valid image can be found. It's super confusing for users to get an environment with an image they don't expect. The UI should give them the option to launch with the default image but then this will be specified in the UI request with image in the payload.

This applies only when no custom image is specified, so the behavior is the same as before but the information about the default image being used is saved and the UI can give feedback to the user when that actually happens.
It always fails when the image is provided and it's not accessible.

The second one is a good point, we didn't consider the information is already there and I thought an annotation was the easiest solution.

@olevski
Member Author

olevski commented Nov 4, 2020

So I think I either have some really weird behaviour with telepresence or someone else is trying to run telepresence at the same time as me right now. If it is the latter then sorry for reinstalling the whole deployment a few times. I did not think of this. Also I just deployed some new code that addresses Andreas' comments. I have not had a chance to test it out fully yet. But checking for a public image is now much simpler as Andreas suggested. I did not know about the Www-Authenticate in the header.

@ableuler
Contributor

ableuler commented Nov 5, 2020

@olevski sometimes when telepresence doesn't exit properly, you have to clean up manually:

  • Check that the local flask server isn't running anymore, I usually do this by checking if something is using the specific port, for example lsof -i tcp:8000.
  • Delete the deployment created by telepresence using kubectl
  • Scale the original deployment back up from 0 to the desired number of replicas.

After that the application should be in a clean state again and you should be able to start another telepresence session normally.

@olevski
Member Author

olevski commented Nov 5, 2020

Ok so I made additional changes to the code to address comments you all posted:

  • @ableuler
    • As you suggested checking whether a public image exists is now done with less logic and one single function. Suggesting to use www-authenticate in the header was a lifesaver.
    • I also tested with a sha256 instead of a tag and things work
  • @rokroskar
    • I am not sure if you want to change how we operate with regard to the default image. Even without this PR (and as @lorenzo-cavazzi mentioned) if the image that is tied to the current commit does not exist renku will not fail but will use a default image. I have retained this in the new code here. When the user requests a specific image (i.e. the API gets something from the UI for the image parameter) then we fail if that image does not exist. Let me know if you agree with this approach.
    • I removed the annotations from the pod and edited the function that lists the pod parameters for the servers endpoint so that it returns a field called image, which is the image name, and another field called default_image_used. So there are no more unneeded annotations; I pull this data from the pod manifest as you suggested.
  • @lorenzo-cavazzi
    • I added tests for the added code that parses a user specified image name and also for the functions that check if the image exists.
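
As a rough illustration of reading these two fields from a pod manifest instead of annotations: the function name `image_fields` is hypothetical, and the sketch assumes that the notebook container is the first container in the pod and that default_image_used can be derived by comparing against the configured default image. The thread does not say how the PR computes it.

```python
def image_fields(pod: dict, default_image: str) -> dict:
    """Build the image-related fields of a server listing from a pod manifest."""
    containers = pod.get("spec", {}).get("containers", [])
    # Assumption: the user's notebook container is listed first.
    image = containers[0].get("image") if containers else None
    return {
        "image": image,
        # Assumption: the default image was used if the names are identical.
        "default_image_used": image is not None and image == default_image,
    }
```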

Let me know what you think.

@rokroskar rokroskar changed the title enhancement: Specify image in server request feat: Specify image in server request Nov 6, 2020
@rokroskar
Member

Even without this PR (and as @lorenzo-cavazzi mentioned) if the image that is tied to the current commit does not exist renku will not fail but will use a default image.

Right, this is the behavior I'm talking about changing. Maybe it's better to do it in a separate PR, but atm it can happen that a user's environment will be created with some default image, even though they think they asked for something else. This situation is super confusing. It's true that this is normally handled by the UI, but we've had cases where there was some issue with the image late in the process which resulted in the default image being used and things got super confusing. So my point is just that I'd rather be conservative and fail early in this case.

@lorenzo-cavazzi
Member

If that is a desirable change, I'd rather do it now than later since we are adding variables both here and in the UI to handle that specific case better.
I personally think the default image is confusing, and it may not be necessary anymore now that we check for the image existence in the UI -- the user already knows beforehand if something is wrong.
We could give it a try and, if it turns out that the default image was used frequently and it was actually useful, we could re-introduce it later.

@olevski
Member Author

olevski commented Nov 14, 2020

Ok this is super weird, the tests I added pass on python 3.8.5 (which is the version in my local environment) but fail on python 3.7, which is what is used for the tests in GitHub Actions. I switched to 3.7 and I can replicate the failed tests, so I can figure out what the problem is. We should also maybe specify the python version in the Pipfile so that we avoid similar issues in the future.

@olevski
Member Author

olevski commented Nov 14, 2020

Hi guys, sorry for the delays. I fixed the tests that I mentioned were failing earlier. They were failing because the unittest.mock API is just a bit different between python 3.7 and 3.8.
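
The thread does not say which part of the unittest.mock API caused the failures. One documented difference is that the `args` and `kwargs` properties of `call_args` were only added in python 3.8, so tests that read `mock.call_args.kwargs` break on 3.7. The sketch below uses a made-up `create_server` mock and shows a way of inspecting a call that works on both versions.

```python
from unittest import mock

# Hypothetical stand-in for the backend call that creates a server.
create_server = mock.Mock()
create_server(name="my-server", image="renku/renkulab-py:3.7-renku0.10.4-0.6.3")

# Portable across python 3.7 and 3.8+: call_args unpacks as (args, kwargs).
args, kwargs = create_server.call_args
assert args == ()
assert kwargs["image"] == "renku/renkulab-py:3.7-renku0.10.4-0.6.3"

# On python 3.8+ only, the same data is also exposed as
# create_server.call_args.args and create_server.call_args.kwargs.
```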

In addition to this I addressed all the outstanding comments and did some tests to confirm everything works. When someone requests a nested gitlab image now things work (as long as the image exists).

This is currently deployed at https://tasko.dev.renku.ch/.

Lastly, one thing I did not touch (I think we agreed on tackling this in a separate PR) is what is returned when repeated POST or GET requests are made to the endpoint that creates the servers and the server tied to that commit already exists. Currently the server that was created is returned, but if one changes the image in the POST request, the response is still the original server, even though it has a different image than the one in the POST request. I propose that we tackle this in a separate PR.

One last thing that I wanted to mention is that I added tests for the logic of requesting a non-existent image, the case where you should fall back to the default or the case where a specific image is requested. These are only unit tests but they still test that the expected image name is found in the request to create the server on the backend.

@olevski
Member Author

olevski commented Nov 16, 2020

If someone is testing right now, I have to deploy an older version to test something else at https://tasko.dev.renku.ch/. Will post back when the version tied to this PR is back on.

@olevski
Member Author

olevski commented Nov 16, 2020

ok the version that matches this PR is back on https://tasko.dev.renku.ch/

@lorenzo-cavazzi
Member

I am trying to work with this PR while developing the UI counterpart. All seems to work fine for a while, then I start getting 404 errors when trying to create new environments.

The response to POST /servers is a 404 with a text like this:

Cannot find project New-project---all-good for user: lorenzo.cavazzi.tech.

The 2 relevant entries logged in the notebook pod are

[2020-11-17 15:52:30,485] DEBUG in notebooks: Request to create server: New-project---all-good-5b3c3d1d with options: {'namespace': 'lorenzo.cavazzi.tech', 'project': 'New-project---all-good', 'commit_sha': 'eb86b2c9296a062354606c4c0a124db62086e666', 'branch': 'master', 'serverOptions': {'cpu_request': 0.1, 'defaultUrl': '/lab', 'gpu_request': 0, 'lfs_auto_fetch': False, 'mem_request': '1G'}} for user: {'kind': 'user', 'name': 'lorenzo.cavazzi.tech', 'admin': False, 'groups': [], 'server': None, 'pending': None, 'created': '2020-10-22T17:07:55.600737Z', 'last_activity': '2020-11-17T15:51:40.525623Z', 'servers': None}
[2020-11-17 15:52:30,767] ERROR in gitlab_: Cannot get project: lorenzo.cavazzi.tech/New-project---all-good for user: {'kind': 'user', 'name': 'lorenzo.cavazzi.tech', 'admin': False, 'groups': [], 'server': None, 'pending': None, 'created': '2020-10-22T17:07:55.600737Z', 'last_activity': '2020-11-17T15:51:40.525623Z', 'servers': None}, error: 401: invalid_token

If I try to log out and log in again, everything works fine.

[2020-11-17 15:59:01,800] DEBUG in notebooks: Request to create server: New-project---all-good-5b3c3d1d with options: {'namespace': 'lorenzo.cavazzi.tech', 'project': 'New-project---all-good', 'commit_sha': 'eb86b2c9296a062354606c4c0a124db62086e666', 'branch': 'master', 'serverOptions': {'cpu_request': 0.1, 'defaultUrl': '/lab', 'gpu_request': 0, 'lfs_auto_fetch': False, 'mem_request': '1G'}} for user: {'kind': 'user', 'name': 'lorenzo.cavazzi.tech', 'admin': False, 'groups': [], 'server': None, 'pending': None, 'created': '2020-10-22T17:07:55.600737Z', 'last_activity': '2020-11-17T15:58:17.705863Z', 'servers': None}
[2020-11-17 15:59:03,052] DEBUG in notebooks: Creating server New-project---all-good-5b3c3d1d with {'namespace': 'lorenzo.cavazzi.tech', 'project': 'New-project---all-good', 'branch': 'master', 'commit_sha': 'eb86b2c9296a062354606c4c0a124db62086e666', 'project_id': 5177, 'notebook': None, 'image': 'registry.dev.renku.ch/lorenzo.cavazzi.tech/new-project---all-good:eb86b2c', 'git_clone_image': 'lorenzocavazzitech/git-clone:0.8.3-334f16b', 'server_options': {'cpu_request': 0.1, 'defaultUrl': '/lab', 'gpu_request': 0, 'lfs_auto_fetch': False, 'mem_request': '1G'}}
[2020-11-17 15:59:04,304] DEBUG in notebooks: spawn initialized for New-project---all-good-5b3c3d1d

It seems that renku-notebooks considers the credentials expired but they aren't -- I can browse private repositories in the UI without problems. Maybe the error is misleading and the issue is with the JupyterHub credentials that we use to create a server but don't use to get the servers list through GET /servers.
It's a bit hard to replicate. I think it will happen if you interact with the environments from the UI as a logged-in user, then you close the interface and you open it again later (after a few hours?).
Not sure if this helps, but I think that in the past I was getting a 404 on GET /servers when JupyterHub credentials were expired, triggering a re-login in the UI. Is it possible that this behavior has changed? It would explain the problem, but solving it would be hard.

P.S. I've used a private project, but it's the same with public projects.

@olevski
Member Author

olevski commented Nov 17, 2020

@lorenzo-cavazzi I cannot replicate the behaviour you describe exactly. I think I did get something similar though because for a while even though I was logged in as tasko.olevski renku thought I was logged in as my other account olevski90 and I would not be able to create environments or even see the active ones because somehow the requests sent were for olevski90 and not tasko.olevski. This happened even though my profile said I was logged in as tasko.olevski. After I logged out and logged back in things worked normally though. I have not been able to get anything weird ever since.

This is also really weird because I did not touch the authentication code at all in this PR.

@ableuler
Contributor

Jupyterhub gets its gitlab oauth token "directly" from gitlab (without the gateway being involved). This can lead to weird login situations where the gateway has a gitlab oauth token for user A and Jupyterhub has one for user B. Especially in our dev setup where different RenkuLab instances rely on the same gitlab AND we often juggle with multiple users, this is likely to happen. I am pretty confident that this is unrelated to the code changes proposed in this PR.

Contributor

@ableuler ableuler left a comment

I think we're there 🎉 .
Only 3 one-liner suggestions.

renku_notebooks/api/notebooks.py (outdated, resolved)
renku_notebooks/api/notebooks.py (outdated, resolved)
renku_notebooks/api/notebooks.py (resolved)
Contributor

@ableuler ableuler left a comment

LGTM!

@ableuler ableuler requested a review from rokroskar November 23, 2020 20:09
@olevski olevski merged commit 95d4f92 into master Nov 24, 2020
@olevski olevski deleted the specify-image-in-server-request branch November 24, 2020 10:36
Successfully merging this pull request may close these issues.

allow default image to be specified in server start request
4 participants