
Test latest (1.22) and oldest supported and testable (1.18) k8s version #1392

Merged

Conversation

consideRatio
Member

@consideRatio consideRatio commented Sep 23, 2021

Instead of testing against a fixed k8s version, currently k8s 1.19, we can test against the latest k3s version - which is currently k8s 1.22.

This PR was updated to also test against the oldest supported version, k8s 1.17, but that failed due to k3s 1.17 having a bug that made the nodePorts we configure and use in our tests fail. Due to that, I've settled for a compromise where we test k8s 1.18 instead. Running the tests against all versions in between worked well, as I had assumed it would, but I verified it explicitly.

@consideRatio consideRatio changed the title Test against k8s 1.22 Test against latest k3s (k8s 1.21 currently) Sep 23, 2021
@consideRatio consideRatio changed the title Test against latest k3s (k8s 1.21 currently) Test against latest k3s (k8s 1.22 currently) Sep 24, 2021
@consideRatio consideRatio requested a review from minrk September 24, 2021 12:09
@yuvipanda
Collaborator

What do you think of running tests against the version of k8s mybinder.org is running?

@consideRatio
Member Author

Ah, mybinder.org is a federation and not all members have the same k8s version - so should we test specifically for prod on GKE then? In my experience, the most important k8s version to test is the newest one, since it can deprecate various APIs etc.

  • I like not needing to maintain this test version manually
  • I like to test against the latest available k3s version for two reasons: a) if a bug is reported, I know it should work with the very newest k8s versions and I don't need to consider those as explanations, and b) most issues come from using a too-new version rather than a too-old one.

@manics
Member

manics commented Sep 25, 2021

Since this is critical production code, how about we test two k8s versions: the oldest version in the mybinder federation and the latest? Problems due to a new bit of code not working with an older version of k8s might be rare, but when they do happen they require a lot of manual investigation, whereas CI would immediately identify the issue.

@yuvipanda
Collaborator

I like @manics's suggestion!

@consideRatio
Member Author

Okay, let's add another test on k8s 1.17, the lowest version supported by z2jh. I suggest we don't double the number of tests but instead run only the helm chart test twice. Agree?

@betatim
Member

betatim commented Sep 27, 2021

What is the problem with running the full set of tests?

@consideRatio
Member Author

@betatim

  • iteration time for PRs - push two commits and you have 2x all the jobs already running, and they are likely to start queuing up
  • diagnostic noise - if you run the same tests without changing anything of relevance, it just adds noise
  • CI systems' intermittent failures are amplified, and BinderHub has intermittency issues that have been unresolved for years
  • merge -> publish delay - if we run tests on merge as well, we may very well end up queuing the publish job too

I am pushing back because I've felt that the CI system's slowness has held me back, and I've put in a lot of work to make it faster. Doubling the number of jobs we run is quite far from being worth it in my evaluation of the pros and cons.


Hmmm... looking back at this PR, I tried to bump the version so we test against a more modern one. If doubling the number of jobs is a controversial topic, which it is to me, I'd like a separate PR to argue for its value, so this PR can stay as simple as "bump from an outdated k8s version to the latest in our tests".

@betatim
Member

betatim commented Sep 28, 2021

Exclusively testing against a more modern version makes it harder to stay compatible with older versions. Making changes that require a bleeding edge version of k8s sounds like a fast track to alienating users who rely on something like GKE. At least in my experience a service like GKE is a bit behind the latest release of kubernetes. So I would define "outdated version" based on how many people use it. So I would target a version similar to what you get by default in GKE or similar offerings from AWS, Azure, etc. And then use that targeted version for our tests. We currently don't promise how many releases of kubernetes we officially support, so we don't need to go super old.

In my experience developers will test and try bleeding edge features all by themselves, because it allows you to do new things and it is exciting. The task of checking for backwards compatibility is tedious and often gets forgotten. Hence if we can only have one version I'd use an "old" one. If we can have two, I'd use an "old" one and a bleeding edge one to get early warning of deprecations and so on.

For me the value of the CI jobs is that they test things that I think are irrelevant to my change. When making changes I run the tests I think are relevant locally; the ones I think are irrelevant (like older versions of dependencies or Python) I rarely test, because I think they "shouldn't change anything". Once in a while the CI will tell me that I was wrong. From that point of view I like it when the CI does the work.

Some CI systems have a setting that will cancel builds for "outdated commits" in a PR. So when you push a new commit to a PR, tests on the now "old" commit will stop, making the queue for jobs on relevant commits shorter. Can we do something like this with the current CI system?

Which PRs are limited by CI run time? My impression is that reviewing PRs takes a lot longer than the time for the CI to run. And even if the CI hasn't completed yet, you can start reviewing a PR.

I don't understand the argument around tests on merge to the main branch. Who cares if the package appears on PyPI in 10min, 20min or 2h? If we can make it take 15s that is a different topic, but if it is several minutes or an hour, I will go and do something else and come back to it later.

@yuvipanda
Collaborator

I mostly wanted to chime in and say I've really appreciated all the work that @consideRatio has put into our CI systems - making them fast, accurate and very helpful. Fast CI is extremely helpful in not breaking you out of "flow state", and I appreciate that too. So often I will "wait for CI to complete, it will just be 30 minutes", but then the break out of flow state is enough to basically stop me from working on it for the day.

As this PR stands, if I understand it correctly, it'll start testing changes against k8s 1.17 - the lowest version supported by z2jh. This changes the status quo, where it currently tests against 1.19 - a more recent version, and perhaps also the version that the GKE cluster on mybinder.org runs?

From @betatim's message, I love this one sentence:

For me the value of the CI jobs is that they test things that I think are irrelevant to my change

Being able to merge to this repo and confidently deploy is a testament to our CI process! Keeping it like that would be great. Hence my suggestion of testing just on the "current default GKE", as it is most likely to catch the issues that affect most of our users.

Kubernetes is also fairly aggressive about deprecation - 1.17 has been end of life for a while, and 1.19 will reach EOL a few months from now (https://kubernetes.io/releases/). As such, I don't think we should target 'lowest version supported by z2jh' - the k8s community doesn't recommend anyone run 1.17, and we shouldn't either :D

So maybe we can just run the tests on push on 'oldest k8s version that is not EOL'? Perhaps that strikes a decent balance between the backwards compatibility situation that @betatim is advocating for and developer velocity.

@yuvipanda
Collaborator

And for velocity - @consideRatio do you know if adding more runners will help us move faster here? Do the tests stay in queue?

@consideRatio
Member Author

consideRatio commented Sep 30, 2021

Thank you for investing time to consider this @yuvipanda and @betatim!

I realize I'm quite strongly opinionated that whether we test one or several k8s versions, one of them should be the latest k8s version we can test against. That way we can catch outdated k8s api-versions (#1396), for example via helm template --validate. Over some years of working with Helm charts, I can't recollect a single test failure related to a specific k8s version that wasn't also failing for either newer or older versions, unless it was an upstream bug that was fixed in ~1 day. Due to this, I value tests against the latest version most, the lowest known supported version second most, and intermediary versions the least.
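
As a rough sketch of the kind of check meant here (illustrative only - the step name, release name, and chart path are assumptions, not this repository's actual workflow):

    # Sketch of a validation step, assuming a k3s cluster was started earlier in the job
    # and that helm/kubectl are already pointed at it.
    - name: Validate rendered manifests against the live cluster
      run: |
        # --validate asks the Kubernetes API server to check the rendered resources,
        # so templates using api-versions removed in newer k8s (e.g. 1.22) fail here.
        helm template binderhub-test ./helm-chart/binderhub --validate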

I'm open to the idea of testing against the lowest known supported k8s version as well; I really think it serves a purpose. It can catch situations where we bump a k8s api-version, for example from some v1beta1 to v1, without retaining backward compatibility. I'm very open to having a test to catch these issues, but I'm pushing back against having three tests rather than one.

Because, as @yuvipanda captured well, one can get "out of flow", I push back on the idea of doubling all tests against k8s, and I only feel comfortable extending the scope of this PR to include a single additional test where I think it can make sense - for the helm chart test. I'm open to the following ideas:

  1. To merge either
    1. Original PR - 3 jobs
      We merge this PR as originally made, where the tests against v1.19 are replaced with the latest available k8s version we can test against, currently k8s 1.22 via k3s.
    2. Semi extended PR - 3+1 jobs
      We merge this PR as adjusted, where we test main, auth, and helm against the latest k3s (k8s 1.22) and the latest helm CLI, and add one helm test against the oldest known supported k8s and helm CLI versions.
  2. To follow up with a separate discussion on having 3*2 jobs
    I'd like to have this discussion separately. Since I have objections to doing this, and since it's somewhat unrelated to bumping the k8s version of the one test we have, I'd like for it to not block this PR.
    My assumption is that there is no point in testing a k8s version again if it's already been tested in the helm test. If someone champions having 3*2 tests, I'd be willing to invest time to verify whether my assumption holds up and give that as feedback to such a PR.

As this PR stands, if I understand it correctly, it'll start testing changes against k8s 1.17 - the lowest version supported by z2jh.

As this PR stands it will now test against k8s 1.22 (set dynamically via latest k3s) instead of k8s 1.19 (fixed), but it will also run one specific test against k8s 1.17 - the one where the entire BinderHub Helm chart is installed (aka the helm test).
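
A minimal sketch of what the resulting job matrix could look like (the action name and inputs are based on the jupyterhub/action-k3s-helm action referenced later in this thread, but the exact values below are assumptions rather than the real workflow file):

    # Illustrative matrix only - not the actual workflow in this repository.
    jobs:
      test:
        runs-on: ubuntu-20.04
        strategy:
          matrix:
            include:
              - test: main
                k3s-channel: latest    # resolves to k8s 1.22 at the time of writing
              - test: auth
                k3s-channel: latest
              - test: helm
                k3s-channel: latest
              - test: helm
                k3s-channel: v1.17     # the single old-version job (later bumped to v1.18)
        steps:
          - uses: actions/checkout@v2
          - uses: jupyterhub/action-k3s-helm@v1
            with:
              k3s-channel: ${{ matrix.k3s-channel }}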

@consideRatio consideRatio changed the title Test against latest k3s (k8s 1.22 currently) Test against latest k3s (k8s 1.22 currently), and one test for k8s 1.17 Sep 30, 2021
@consideRatio
Member Author

As a concrete example of the "out of flow" interruptions that would increase by having 3*2 tests instead of 3+1 tests, where we only have a single k8s 1.17 test running for the helm test but not for the main and auth tests:

I just updated #1381 to only add some pre-commit related config, and I then switched my focus to something else. After some minutes I got an intermittent error from the main test; it showed up as a GitHub notification I needed to investigate. Anyhow, I then concluded "ah, it's just an unrelated intermittent issue". Having done this, I wanted to press re-run so anyone considering a merge would see that the tests have succeeded and not end up investigating the tests themselves.


I opened #1399 to capture the details from that failure so it can be made more reliable somehow.

@betatim
Member

betatim commented Oct 1, 2021

So maybe we can just run the tests on push on 'oldest k8s version that is not EOL'?

Sounds like a good balance. Like I said, if we only test one version then I'd use a version that is (weighted by users of deployments) somewhere in the middle. Not super old (few hubs are still running on old k8s) but also not super new (few hubs run on the latest k8s). Right now the mybinder.org GKE cluster is on 1.19, and the default version GKE suggests to me if I create a new cluster today is 1.20. So something like that sounds like a good place to be.

@consideRatio
Member Author

@betatim it seems to me that we are both very opinionated about this. Can you clarify whether you are open to one of the ideas I outlined in #1392? If not, can you be quite specific about what you suggest?

@betatim
Member

betatim commented Oct 1, 2021

Unfortunately I can't tell what all the possible proposals are as there are too many with too many variations. This PR right now proposes to always use the latest version of k8s and run one additional test with k8s 1.17.

I think you oppose running multiple sets of tests because they take too long to complete. That is fine with me, though I'd be happy to have more tests to test bonus things.

We disagree on which version to use for this one set of tests. I propose using 1.19 or 1.20 because it means we test against a version a lot of people are using in production.

Two bonus things to test in addition to the main version could be:

  1. the oldest version of k8s we promise to be compatible with
  2. the latest version of k8s to get a glimpse of upcoming features and deprecations

@MridulS
Contributor

MridulS commented Oct 1, 2021

Just to chime in a little bit: does it make sense to run the extra tests for different k8s versions (lowest supported to latest k8s release) in batch mode (every 5/10 merged PRs) or on a cron schedule (weekly/monthly)? This way individual flow states wouldn't be disrupted, since testing of PRs would stay quick, and we could still catch issues with the oldest/latest k8s versions, albeit a bit delayed. I guess these breakages would be rare, so the CI would only rarely send "fail" notifications.

@yuvipanda
Collaborator

I like @MridulS' suggestion!

@consideRatio
Member Author

  • I think this PR, as it is, is perfectly fine, with a single k8s 1.17 job.
  • I think this PR, as it originally was, was also perfectly fine, without any k8s 1.17 job added.
  • I think it's a reasonable option to have a CI cronjob test for older versions as well.

The current status of this PR in my mind is that @betatim made the following suggestion, which I disagree with.

I propose using 1.19 or 1.20 because it means we test against a version a lot of people are using in production.

This is equivalent to not making a change at all (we currently test against k8s 1.19). I disagree with it because it wouldn't catch several relevant errors as new versions of k8s come out, and we have run into issues like that in BinderHub before, for example when the OVH federation bumped its k8s version, if I recall correctly.


Practically, this PR does not strictly block me, but it would have been great to have when developing the BinderHub equivalent of jupyterhub/zero-to-jupyterhub-k8s#2403, which would close #1396.

@consideRatio
Member Author

Thanks @betatim for chatting with me about things and this PR!

@betatim
Member

betatim commented Oct 5, 2021

Thanks for calling and talking Erik!

Let's use the latest version for tests and have one helm test against v1.17 (or maybe a newer version, to be decided in the future), as well as adding to our tests so that this PR's tests would have failed because of #1396.

@consideRatio consideRatio force-pushed the pr/test-against-k8s-1.22 branch from 9130051 to f560e55 Compare October 5, 2021 11:49
@consideRatio
Member Author

consideRatio commented Oct 5, 2021

I fixed a bug in the implementation of this PR, where we never passed the k3s-channel variable to the action that sets up a k8s cluster.

With that fixed, the latest-version test now correctly fails as expected. Fixing the failure is tracked in #1396, which I'm happy to address.
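
In workflow terms, the shape of the fix is roughly the following (a hedged reconstruction for illustration, not the literal diff of this PR):

    # Before: the action was invoked without the k3s-channel input, so it used its
    # default channel and the job silently tested a different k8s version than labeled.
    # After: the matrix value is passed through explicitly.
    - uses: jupyterhub/action-k3s-helm@v1
      with:
        k3s-channel: ${{ matrix.k3s-channel }}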

@consideRatio consideRatio marked this pull request as draft October 5, 2021 12:17
@consideRatio consideRatio force-pushed the pr/test-against-k8s-1.22 branch from b9ff306 to 7755683 Compare October 5, 2021 12:22
@consideRatio
Member Author

consideRatio commented Oct 5, 2021

There was a failure for k8s 1.17 to debug.

My conclusion from a lot of debugging is that k3s 1.17 has an issue that somehow makes us unable to send HTTP requests via our nodePorts, which we rely on for our tests. I believe it may be related to the unresolved issue k3s-io/k3s#763.

I don't believe this indicates that the Helm chart won't work on k8s 1.17, just that we can't test against k3s 1.17, a specific k8s distribution that is no longer maintained.

My suggestion is that we instead test against the oldest k8s version that can be tested against without running into a k3s issue, which is 1.18.
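
For context, the kind of nodePort access the tests rely on looks roughly like this (all names and port numbers below are illustrative, not the chart's actual values):

    # Illustrative Service only - the real chart defines its own services and ports.
    apiVersion: v1
    kind: Service
    metadata:
      name: binder
    spec:
      type: NodePort
      selector:
        app: binder
      ports:
        - port: 80
          targetPort: 8585
          nodePort: 30080   # the test suite then sends HTTP requests to localhost:30080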


@consideRatio consideRatio force-pushed the pr/test-against-k8s-1.22 branch from 0c2270a to 5235116 Compare October 5, 2021 13:29
@consideRatio consideRatio requested a review from betatim October 5, 2021 13:29
@consideRatio consideRatio marked this pull request as ready for review October 5, 2021 13:29
@consideRatio consideRatio changed the title Test against latest k3s (k8s 1.22 currently), and one test for k8s 1.17 Test against latest k3s (k8s 1.22 currently), and one test for k8s 1.18 Oct 5, 2021
@consideRatio consideRatio force-pushed the pr/test-against-k8s-1.22 branch from 5235116 to b6f2cac Compare October 5, 2021 13:31
@consideRatio consideRatio changed the title Test against latest k3s (k8s 1.22 currently), and one test for k8s 1.18 Test latest (1.22) and oldest supported and testable (1.18) k8s version Oct 5, 2021
Member

@sgibson91 sgibson91 left a comment

Thank you everybody for the discussion in this PR. I appreciate everyone taking the time to reach a healthy compromise that retains rigorous, useful testing infrastructure while minimising developer "flow" interruptions - which is a very tricky problem!

@choldgraf
Member

Hey all - is the decision made in this PR something broader than just this PR? Can we encode this discussion and where we've reached a compromise, so that future PRs like this can refer to the precedent without requiring lots of discussion?

E.g. should some of the extra explanation in this bit:

          # Chart.yaml contains the chart's oldest supported k8s version, we
          # want to test against that. We also test against the oldest known
          # supported helm cli version which isn't documented anywhere, but is
          # currently at 3.5.0. Currently we have listed 1.17 in Chart.yaml, but
          # due to a k3s issue with their k8s 1.17 release, we can't test
          # against k8s 1.17 and settle for k8s 1.18 for now.

could be encoded in the team compass somewhere? (feel free to tell me that the best place for it is as comments in the workflow file, as we have already)

@consideRatio
Member Author

Thank you for considering the long term angle of things @choldgraf!

could be encoded in the team compass somewhere? (feel free to tell me that the best place for it is as comments in the workflow file, as we have already)

I've seen a tendency to have "min/max" tests; for example, @minrk has made several repos start testing the oldest supported Python and library versions explicitly, besides the latest versions.

I'd be happy to approve/merge a guideline PR about aiming to do min/max tests in our repos, for example. I don't think this PR would have been easier to get merged with or without a guideline though, as I see the complexity of arriving at an agreement on this PR as caused by me pushing back on details that can't be captured by a guideline - they were too specific.

Member

@betatim betatim left a comment

Looks good to me. I restarted the failed test, assuming it is a flake.

@consideRatio
Member Author

Thanks @betatim!!

Looks good to me. I restarted the failed test, assuming it is a flake.

It isn't a flake, it is just a successful catch of an issue with k8s 1.22. We previously tested against 1.21, even though the workflow said "v1.19", because of a bug in the workflow. Now that I've fixed the bug, the test runs against k8s 1.22 rather than 1.21 and correctly fails.

I fixed a bug in the implementation of this PR, where we never passed the variable k3s-channel to the action to setup a k8s cluster.

By doing so, the latest version test now correctly fail as expected. Fixing the failure is represented by #1396 which I'm happy to address.

I've already opened the fix; it's in #1407.

@consideRatio
Member Author

I'll go for a merge now and rebase the fix in #1407 to verify that it makes our test suite no longer hit the error on k8s 1.22.

@consideRatio consideRatio merged commit ee78c16 into jupyterhub:master Oct 7, 2021
@choldgraf
Member

Thanks for clarifying for me @consideRatio - it feels to me like it'd be easiest to just get this PR in and not worry about encoding a general practice; perhaps these issues are just inherently complex!
