[Security Solutions] Adds bsearch service to FTR e2e tests to reduce flake, boilerplate, and technique choices #116211

Merged
merged 9 commits into from
Oct 27, 2021

Conversation

FrankHassanabad
Contributor

@FrankHassanabad FrankHassanabad commented Oct 25, 2021

Summary

Fixes these flaky tests:
#115918
#103273
#108640
#109447
#100630
#94535
#104260

Security Solution has been using bsearch and has hit flakiness in various forms. Different developers (myself included) have been fixing the flake in a few ad hoc ways that aren't 100% reliable. This PR introduces a once-and-for-all REST API retry service called bsearch. It queries bsearch and, if the search has not completed because it went async on a slower CI run, keeps calling bsearch with the correct API until it gets a complete response, and only then returns.
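As a rough sketch of the polling idea described above (not the service code from this PR), the loop could look something like the following. The response shape (`id`, `isRunning`, `isPartial`, `result`), the `SendFn` transport type, and the `maxAttempts` and `delayMs` options are all assumptions made for illustration.

```typescript
// Hypothetical response shape: an incomplete search reports that it is still
// running and carries an id that can be used to ask for the result again.
interface SearchResponse<T> {
  id?: string;
  isRunning: boolean;
  isPartial: boolean;
  result?: T;
}

// Hypothetical transport: the first call sends the full request, follow-up
// calls send only the id returned by the previous response.
type SendFn<T> = (body: { id?: string; params?: unknown }) => Promise<SearchResponse<T>>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Sends the request once, then polls by id until the response is complete.
// At most `maxAttempts` requests are sent in total.
async function sendUntilComplete<T>(
  send: SendFn<T>,
  params: unknown,
  { maxAttempts = 20, delayMs = 10 } = {}
): Promise<T> {
  let response = await send({ params });
  for (let attempt = 1; response.isRunning || response.isPartial; attempt++) {
    if (attempt >= maxAttempts) {
      throw new Error(`search did not complete after ${maxAttempts} attempts`);
    }
    if (response.id === undefined) {
      throw new Error('incomplete response did not include an id to poll with');
    }
    await sleep(delayMs);
    response = await send({ id: response.id });
  }
  if (response.result === undefined) {
    throw new Error('complete response had no result');
  }
  return response.result;
}
```

The point of keeping this in one place is that a test only ever sees a complete result or a thrown error, never an async reference.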

Usage

Anyone can use this service like so:

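const bsearch = getService('bsearch');
// Assumes the default authenticated client from the FTR `supertest` service
const supertest = getService('supertest');
const response = await bsearch.send<MyType>({
  supertest,
  options: {
    defaultIndex: ['large_volume_dns_data'],
  },
  strategy: 'securitySolutionSearchStrategy',
});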

If you're using a custom auth then you can set that beforehand like so:

const bsearch = getService('bsearch');
const supertestWithoutAuth = getService('supertestWithoutAuth');
// Attach the custom credentials before handing the client to bsearch
const supertest = supertestWithoutAuth.auth(username, password);
const response = await bsearch.send<MyType>({
  supertest,
  options: {
    defaultIndex: ['large_volume_dns_data'],
  },
  strategy: 'securitySolutionSearchStrategy',
});

Misconceptions in the tests leading to flake

  • Can you just call the bsearch REST API and expect it to return data on the first call every time? Not always. When CI slows down or the data volume grows, bsearch gives you back an async reference instead of data, and your test blows up.
  • Can we wrap the REST API call in retry to fix the flake? Mostly, but not always. When CI slows down or the data volume grows, bsearch can keep returning the async version on every attempt, which can still fail your test. It is also tedious to tell everyone in code reviews to wrap everything in retry, and to explain to new people why we are constantly wrapping these tests in retry, instead of fixing it once with a service.
  • Can we manually parse the bsearch response for async in each test? Yes, but it is error prone. I did this for one test and it was ugly: I had to wrap two things in retry and test several conditions. The tests are also harder to read than a single service call, people in code reviews missed bugs I had in it, and it adds a lot of boilerplate.
  • Can we just increase the timeout with wait_for_completion_timeout so the tests are sure to pass? Not today, though maybe later, because the plumbing for it hasn't been added yet. See this open ticket (#107241). Even once it is added and we raise the timeout to a very large number, bsearch might still return an async response, or you might want to test the async path. Either way, if and when the ability is added, we can raise the timeout in one spot, this service, for everyone rather than in each individual test. And if that turns out to be deterministic enough, and no one wants to test bsearch features with their strategies down the road, we can remove the service later.

Manual test of bsearch service

If you want to manually watch bsearch operate as if the CI system were running slow, or to force an async response, modify the setting here:
https://github.com/elastic/kibana/blob/master/src/plugins/data/server/search/strategies/ese_search/request_utils.ts#L61

Set it to a lower value such as 1ms and you will see it enter the async code path within bsearch consistently.

Reference PRs

We cannot set wait_for_completion_timeout just yet (see #107241), so we decided this was the best way to reduce test flake for now.

Checklist

  • Unit or functional tests were updated or added to match the most common scenarios

@FrankHassanabad FrankHassanabad requested a review from a team as a code owner October 25, 2021 20:05
@FrankHassanabad FrankHassanabad self-assigned this Oct 25, 2021
@FrankHassanabad FrankHassanabad added Team:Security Solution Platform Security Solution Platform Team release_note:skip Skip the PR/issue when compiling release notes v8.0.0 v7.16.0 labels Oct 25, 2021
@FrankHassanabad FrankHassanabad added the auto-backport Deprecated - use backport:version if exact versions are needed label Oct 25, 2021
@FrankHassanabad
Contributor Author

@elasticmachine merge upstream

Contributor

@dhurley14 dhurley14 left a comment

One comment about extending functionality around the expects inside of the bsearch service. Other than that LGTM!

.post(`${spaceUrl}/internal/search/${strategy}`)
.set('kbn-xsrf', 'true')
.send(options)
.expect(200);

Contributor

I haven't found other services using expect inside of their functions. Not sure I see any issue with keeping it there but just wanted to see if there are other instances of expect used within FTR services.

Contributor

It might be cool to provide a parameter where users of the bsearch service could specify what HTTP status code to expect.

Contributor Author

The expect errors trigger the retries, which is why they're here. I will add the HTTP status as a parameter for people who need to expect something other than 200, but I think at the moment we aren't concerned about testing bsearch results, as we are just trying to ensure the endpoints all work.
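To illustrate the point about failed expectations driving the retry, here is a minimal sketch under stated assumptions. It is not the FTR retry service or the code in this PR; `retry`, `postExpectingStatus`, `HttpResponse`, and the `attempts` and `delayMs` options are hypothetical names. The status check throws inside the retried function, so a wrong status leads to another attempt, and the expected status is a parameter as suggested above.

```typescript
// Hypothetical minimal response shape for the sketch.
interface HttpResponse {
  status: number;
  body: unknown;
}

const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Re-runs `fn` until it resolves, or rethrows the last error once the
// attempts are used up.
async function retry<T>(fn: () => Promise<T>, attempts: number, delayMs: number): Promise<T> {
  let lastError: unknown = new Error('retry called with attempts < 1');
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) await delay(delayMs);
    }
  }
  throw lastError;
}

// The status check lives inside the retried function, so an unexpected
// status throws and triggers the next attempt.
async function postExpectingStatus(
  post: () => Promise<HttpResponse>,
  expectedStatus = 200,
  { attempts = 5, delayMs = 10 } = {}
): Promise<HttpResponse> {
  return retry(
    async () => {
      const res = await post();
      if (res.status !== expectedStatus) {
        throw new Error(`expected HTTP ${expectedStatus} but got ${res.status}`);
      }
      return res;
    },
    attempts,
    delayMs
  );
}
```

If the assertion sat outside the retried function, a transient non-200 response would fail the test immediately instead of being retried.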

@dhurley14
Contributor

Side note: Did we run these tests + fix through the flaky test suite?

@dhurley14
Contributor

Also should we update our test config to include timeouts.try? Looks like the retry service utilizes that.

timeout: this.config.get('timeouts.try'),

@FrankHassanabad
Contributor Author

> Side note: Did we run these tests + fix through the flaky test suite?

No, I just looked across the PRs that were already open.

@FrankHassanabad
Contributor Author

@elasticmachine merge upstream

@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @FrankHassanabad

@FrankHassanabad FrankHassanabad merged commit ae7b5a9 into elastic:master Oct 27, 2021
@kibanamachine
Contributor

💔 Backport failed

Status Branch Result
7.16 Commit could not be cherrypicked due to conflicts

To backport manually run:
node scripts/backport --pr 116211

FrankHassanabad added a commit to FrankHassanabad/kibana that referenced this pull request Oct 27, 2021
…flake, boilerplate, and technique choices (elastic#116211)

# Conflicts:
#	x-pack/test/api_integration/apis/security_solution/hosts.ts
FrankHassanabad added a commit to FrankHassanabad/kibana that referenced this pull request Oct 27, 2021
…flake, boilerplate, and technique choices (elastic#116211)

FrankHassanabad added a commit that referenced this pull request Oct 27, 2021
…flake, boilerplate, and technique choices (#116211) (#116500)

# Conflicts:
#	x-pack/test/api_integration/apis/security_solution/hosts.ts
FrankHassanabad added a commit that referenced this pull request Oct 28, 2021
…flake, boilerplate, and technique choices (#116211) (#116514)

Co-authored-by: Kibana Machine <[email protected]>
Labels
auto-backport Deprecated - use backport:version if exact versions are needed release_note:skip Skip the PR/issue when compiling release notes Team:Security Solution Platform Security Solution Platform Team v7.16.0 v8.0.0
3 participants