Testing integrations against unsupported stacks #3208

Closed
mtojek opened this issue Apr 27, 2022 · 22 comments
Labels: Stalled, Team:Ecosystem (Label for the Packages Ecosystem team [elastic/ecosystem])

Comments

@mtojek (Contributor) commented Apr 27, 2022

Hi,

We've started observing issues in stacks that we don't actively support. For example, in 7.14 we can see this flaky issue and we know that it won't be fixed. It impacts CI jobs by randomly failing them.

Since these stacks are unsupported, should we stop testing against anything < 7.17.x, considering that 7.17 is the latest 7.x line and will still receive bugfixes?

What should we do in such cases? Do you have any preference?

@mtojek (Contributor, Author) commented Apr 27, 2022

cc @jsoriano @jlind23

jlind23 added the Team:Ecosystem label Apr 27, 2022
@jlind23 (Contributor) commented Apr 27, 2022

For 7.x we should now test only 7.17.x, which is the last supported release line of the 7.x series.

For 8.x it is a trickier question. Since we aim for painless upgrades, I am in favor of testing only the two latest 8.x releases:

  • The one that is currently available
  • The one that will be shipped soon

For example, now we will have to test 8.2 and 8.3, nothing more.

@jsoriano (Member) commented:

I am ok with testing only with 7.17 at some point, but take into account that, for example, the line 47 error also happens in 7.17 at the moment. I would wait until we find an issue specific to older versions that complicates maintenance.

If we stop supporting versions older than 7.17, we should also decide what to do with integrations that support those versions; packages could inadvertently introduce breaking changes. We have packages targeting all 7.x versions from 7.14 to 7.17, and the same thing will probably happen with 8.x.

For flaky issues, maybe we can introduce some retry mechanism in CI. I don't like retries much, because they may end up increasing overall flakiness, but we could use this approach when supporting less-reliable versions.

@jlind23 (Contributor) commented Apr 28, 2022

Even if we have packages targeting older 7.x versions, no more fixes will be provided for those. So shouldn't integration developers rather update the constraints?

@mtojek (Contributor, Author) commented Apr 28, 2022

> I am ok with testing only with 7.17 at some point, but take into account that, for example, the line 47 error also happens in 7.17 at the moment. I would wait until we find an issue specific to older versions that complicates maintenance.

elastic/beats#31449

> If we stop supporting versions older than 7.17, we should also decide what to do with integrations that support those versions; packages could inadvertently introduce breaking changes. We have packages targeting all 7.x versions from 7.14 to 7.17, and the same thing will probably happen with 8.x.

Yes, and unfortunately the same question applies to 8.1.

> For flaky issues, maybe we can introduce some retry mechanism in CI. I don't like retries much, because they may end up increasing overall flakiness, but we could use this approach when supporting less-reliable versions.

I wouldn't play that card until we have no other choice. It would lead to worse quality of our solution.

> Even if we have packages targeting older 7.x versions, no more fixes will be provided for those. So shouldn't integration developers rather update the constraints?

It's a product-based decision, to be honest. But you're right: if we don't support part of the ingestion pipeline (Beats modules, Agent), why should we still support those versions with integrations? It doesn't make sense :)

@mtojek (Contributor, Author) commented Apr 28, 2022

Let's focus on actionable items. I see these options.

  1. Bump all Kibana constraints that still target unsupported stacks: anything below 7.17.x to 7.17.x, and anything below 8.2.x to 8.2.x.
  2. Force automation to use the latest supported stack for testing, but don't change constraints: at least 7.17.x and at least 8.2.x.
  3. Retry flaky tests up to 3 times.

The easiest option to implement would be 1. Voting?
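
For reference, option 1 boils down to editing the Kibana version condition in each affected package's manifest.yml. A minimal sketch of what the bumped constraint could look like (the package is hypothetical, and the exact key layout depends on the package-spec version in use):

```yaml
# manifest.yml of a hypothetical package; only the relevant part is shown.
conditions:
  # Restrict the package to maintained stacks only:
  # the 7.17 LTS line and 8.2 or newer.
  kibana.version: "^7.17.0 || ^8.2.0"
```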

@jsoriano (Member) commented:

> elastic/beats#31449

Great :)

> Even if we have packages targeting older 7.x versions, no more fixes will be provided for those. So shouldn't integration developers rather update the constraints?

> It's a product-based decision, to be honest. But you're right: if we don't support part of the ingestion pipeline (Beats modules, Agent), why should we still support those versions with integrations? It doesn't make sense :)

Even if no fixes are going to be provided, some users are going to continue using unmaintained versions. This may not be such a problem with 7.x if there are not many users of Fleet on versions < 7.17, but it will become more and more of a problem in newer versions with more users.

Adding a process that requires integration developers to update the constraints when a version of the stack is discontinued worsens the developer experience, as it requires communication and effort at unexpected moments. Also, relying on this may become unrealistic as we open package development to more teams and have more and more packages.

> It's a product-based decision, to be honest.

Yes, in the end I think these should be decisions for the developer teams: what versions to support and what versions to test with. We provide the tooling to do that.

Apart from flakiness problems, testing with old versions shouldn't be a problem; builds should be reproducible. If a package works with a pinned version, it is going to keep working with that version. If some change breaks it, developers can decide to discontinue those older versions at that point.

@jsoriano (Member) commented:

> The easiest option to implement would be 1. Voting?

I see an option 4: doing nothing 🙂 and waiting a bit longer to see whether flakiness really causes that many problems.

We could have a 5th option: require an extra confirmation before publishing a package that supports an unmaintained version of the stack. This could eventually lead to packages being more aligned, without imposing hard constraints.

Option 1 is indeed easy, but it requires bulk changes, something we have tried to avoid and that cannot be sustained if we open development to more teams or to the community. Using 8.2 may soon require backports for 8.1, a version users are likely to keep using.

Option 2 may introduce unexpected issues: packages would suddenly be tested with new versions, which may block developers if a new version has some unexpected regression. And if constraints are not updated, packages are not tested with older versions that are in theory still supported. We are talking about using 8.2, but many users are going to stay on 8.0/8.1 for some time.

Option 3 could be a short-term workaround for flakiness, but I agree that in the long term it can lead to more instability.

@mtojek (Contributor, Author) commented Apr 28, 2022

> I see an option 4: doing nothing 🙂 and waiting a bit longer to see whether flakiness really causes that many problems.

That's what's going on in Beats, and I'm unhappy with that approach. Integration tests should be as trustworthy as possible.

> We could have a 5th option: require an extra confirmation before publishing a package that supports an unmaintained version of the stack. This could eventually lead to packages being more aligned, without imposing hard constraints.

In the future we may face bugs in different stack versions, which leads to more confirmations; I'm not sure we want to end up with that user experience. I know of 2 bugs we can't easily fix (elastic/elastic-agent#98 and elastic/elastic-agent#144), and they are both hidden in either the Agent or Fleet.

Option 6 (too many options now):

"Teach" elastic-package how to work around these bugs. It might be hard to implement, and spaghetti to maintain in the future, but it is always an option.

> Option 2 may introduce unexpected issues: packages would suddenly be tested with new versions, which may block developers if a new version has some unexpected regression. And if constraints are not updated, packages are not tested with older versions that are in theory still supported. We are talking about using 8.2, but many users are going to stay on 8.0/8.1 for some time.

I think this is the option I would pick out of all of them. Look at it from the customer's perspective: if somebody uses 7.16 or 8.0 and faces that problem, the official Elastic recommendation would be to upgrade to the latest supported version, as we don't fully support that stack.

@jlind23 (Contributor) commented Apr 28, 2022

I would definitely vote for 1!

@mtojek (Contributor, Author) commented May 4, 2022

It looks like we don't have agreement on this issue yet. Let's look for an alternative solution:

Maybe we can detect the flaky situations I mentioned in the issue above by analyzing the Elastic stack logs and elastic-package's output, and "force" Jenkins to skip the test when it's flaky?

Otherwise, I'm in favor of going with a simpler option: not testing against unsupported stacks (option 2), or option 1.
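
For illustration only, the detection idea above could look roughly like the following on the CI side; the error signature, file layout, exit-code convention, and exact elastic-package flags are assumptions, not an agreed mechanism:

```bash
#!/usr/bin/env bash
set -euo pipefail

# After a failed test run, dump the stack logs and look for a known flaky
# signature. If it matches, report success so the pipeline can treat the
# failure as "known flaky" instead of a hard failure.
DUMP_DIR="stack-dump"
elastic-package stack dump --output "${DUMP_DIR}" -v

# Hypothetical signature of the known issue seen on unsupported stacks.
KNOWN_FLAKY_SIGNATURE="failed to enroll: fleet-server is not ready"

if grep -Rq "${KNOWN_FLAKY_SIGNATURE}" "${DUMP_DIR}"; then
  echo "Known flaky stack issue detected; skipping instead of failing the build."
  exit 0
fi

exit 1
```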

@jlind23 (Contributor) commented May 10, 2022

@mtojek let's start by not testing unsupported stack versions.

mtojek self-assigned this May 10, 2022
@mtojek (Contributor, Author) commented May 10, 2022

Status update: we had an internal chat about it. I sent an email to collect more feedback from package developers.

@mtojek (Contributor, Author) commented May 23, 2022

Status update: we agreed on the following action items.

  1. Configure PR jobs:
  • Test integrations using the latest supported version (8.x).
  • If the integration is deprecated (the latest snapshot is not included in the Kibana version constraint range; see the sketch below), don't run system tests. We don't need any flaky tests here either.
  2. Configure daily scheduled jobs:
  • Test all relevant packages against the latest stack version (8.x).
  • Test all relevant packages against the LTS version (7.17).
  3. Configure weekly scheduled jobs:
  • Test all packages against all supported stacks in a matrix (flaky). It would be an improvement compared to what we have today.
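
For the deprecation check in point 1, here is a rough sketch of the logic, using the github.com/Masterminds/semver/v3 library as one possible way to evaluate the constraint; the hard-coded versions are examples, and this is not necessarily how the CI will implement it:

```go
package main

import (
	"fmt"
	"log"

	"github.com/Masterminds/semver/v3"
)

// isDeprecated reports whether the latest stack version falls outside the
// package's Kibana version constraint, i.e. the package no longer targets
// the current stack and its system tests can be skipped.
func isDeprecated(kibanaConstraint, latestStackVersion string) (bool, error) {
	c, err := semver.NewConstraint(kibanaConstraint)
	if err != nil {
		return false, err
	}
	// Note: snapshot suffixes (e.g. "8.3.0-SNAPSHOT") are prereleases and would
	// need to be stripped or handled explicitly before this check.
	v, err := semver.NewVersion(latestStackVersion)
	if err != nil {
		return false, err
	}
	return !c.Check(v), nil
}

func main() {
	// Hypothetical package that only declares support for 7.x, checked against 8.2.0.
	deprecated, err := isDeprecated("^7.14.0", "8.2.0")
	if err != nil {
		log.Fatal(err)
	}
	if deprecated {
		fmt.Println("latest stack is outside the constraint range: skip system tests")
	} else {
		fmt.Println("constraint covers the latest stack: run system tests")
	}
}
```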

@jlind23 (Contributor) commented May 23, 2022

@mtojek stupid question: how are we going to be notified in case of failure? I am afraid that a weekly failure may simply get lost.

@mtojek (Contributor, Author) commented May 23, 2022

There are a couple of options possible:

  1. Slack notifications, similar to the E2E tests, but to be honest they might be too noisy.
  2. Auto-open GitHub issues in SDH-beats (or any other repo). The person on rotation can assign them to the appropriate teams, and treating them as SDHs may give them some priority.

I would be for option no. 2.
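
For illustration, option 2 could be wired into a scheduled job with the GitHub CLI. A minimal sketch, where the target repo, label, and report file are assumptions rather than the agreed setup:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical post-build step: if the scheduled run produced failures,
# open a tracking issue so the person on rotation can triage and assign it.
STACK_VERSION="${STACK_VERSION:-8.2.0}"
REPORT_FILE="${REPORT_FILE:-test-report.txt}"

if ! grep -q "FAIL" "${REPORT_FILE}"; then
  echo "No failures to report."
  exit 0
fi

gh issue create \
  --repo elastic/integrations \
  --title "Scheduled integration tests failed against stack ${STACK_VERSION}" \
  --label "flaky-test" \
  --body-file "${REPORT_FILE}"
```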

@ruflin (Contributor) commented May 23, 2022

I like the idea of automatically creating GitHub issues, but let's not mix them with the SDH repos.

@jsoriano (Member) commented:

With the current plan we won't see that a package is breaking backwards compatibility until it's too late.

Example: a package with the kibana.version constraint ^8.1.0 includes something that only works starting with 8.3.0, and the developers forget to update the constraint. This is not detected in PRs, so the package can be released right after the change is merged. Users of 8.1 and 8.2 can upgrade and face problems. The longer the tail of users on old versions, the riskier this is.

Are we ok with this risk?

@jsoriano (Member) commented:

Btw, going back to the idea of retries: we don't like retries, sure, but does the flakiness we are trying to address here happen when running the tests, or when starting the stack? If it happens when starting the stack, we wouldn't need to retry the tests, only the stack setup, and we could be quite generous with the retries there, as that is not actually testing the code in this repo.
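
A minimal sketch of what such a setup-only retry could look like, assuming a shell wrapper around elastic-package; the retry count and stack version are arbitrary examples:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Retry only the stack bootstrap; the package tests themselves run once.
STACK_VERSION="${STACK_VERSION:-7.17.3}"
MAX_ATTEMPTS=3

for attempt in $(seq 1 "${MAX_ATTEMPTS}"); do
  if elastic-package stack up -d -v --version "${STACK_VERSION}"; then
    echo "Stack is up (attempt ${attempt})."
    break
  fi
  echo "Stack setup failed (attempt ${attempt}); tearing down before retrying..."
  elastic-package stack down -v || true
  if [[ "${attempt}" -eq "${MAX_ATTEMPTS}" ]]; then
    echo "Stack setup failed after ${MAX_ATTEMPTS} attempts." >&2
    exit 1
  fi
done

# Run the system tests once, against the successfully started stack.
elastic-package test system -v
```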

@jlind23 (Contributor) commented May 23, 2022

@mtojek I do like option 2, but as stated by @ruflin let's rather use another repo and not the SDH one. The integrations repo sounds like a good place.

@mtojek (Contributor, Author) commented May 24, 2022

> Btw, going back to the idea of retries: we don't like retries, sure, but does the flakiness we are trying to address here happen when running the tests, or when starting the stack?

Both cases.

> If it happens when starting the stack, we wouldn't need to retry the tests, only the stack setup, and we could be quite generous with the retries there, as that is not actually testing the code in this repo.

This might be risky if we start skipping/retrying on startup errors. Package developers won't notify us when they have seen recovered stack issues, and it might be hard to fight those issues if they accumulate over time and the stack becomes unstable.

@botelastic (bot) commented May 24, 2023

Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

botelastic bot added the Stalled label May 24, 2023
botelastic bot closed this as completed Nov 20, 2023