
Failing ES Promotion: Endpoint plugin Resolver tests Resolver tests for the entity route winlogbeat tests "before all" hook for "returns a winlogbeat sysmon event when the event matches the schema correctly" #100697

Closed
tylersmalley opened this issue May 26, 2021 · 13 comments · Fixed by #100835
Labels: failed-es-promotion, Feature:Resolver, Team: SecuritySolution, Team:Threat Hunting, triage_needed

Comments

@tylersmalley commented May 26, 2021

This failure is preventing the promotion of the current Elasticsearch nightly snapshot.

For more information on the Elasticsearch snapshot promotion process: https://www.elastic.co/guide/en/kibana/master/development-es-snapshots.html

https://kibana-ci.elastic.co/job/elasticsearch+snapshots+verify/2825/execution/node/512/log

13:25:42         │1)    Endpoint plugin
13:25:42         │       Resolver tests
13:25:42         │         Resolver tests for the entity route
13:25:42         │           winlogbeat tests
13:25:42         │             "before all" hook for "returns a winlogbeat sysmon event when the event matches the schema correctly":
13:25:42         │
13:25:42         │      ResponseError: resource_already_exists_exception
13:25:42         │       at onBody (/dev/shm/workspace/kibana/node_modules/@elastic/elasticsearch/lib/Transport.js:337:23)
13:25:42         │       at IncomingMessage.onEnd (/dev/shm/workspace/kibana/node_modules/@elastic/elasticsearch/lib/Transport.js:264:11)
13:25:42         │       at endReadableNT (internal/streams/readable.js:1336:12)
13:25:42         │       at processTicksAndRejections (internal/process/task_queues.js:82:21)
@tylersmalley added the failed-es-promotion and Team: SecuritySolution labels May 26, 2021
@elasticmachine

Pinging @elastic/security-solution (Team: SecuritySolution)

@jonathan-buttner added the Feature:Resolver and Team:Threat Hunting labels May 26, 2021
@elasticmachine

Pinging @elastic/security-threat-hunting (Team:Threat Hunting)

@elasticmachine

Pinging @elastic/security-threat-hunting (Feature:Resolver)

@jonathan-buttner

@tylersmalley I'm struggling to figure out why this test started failing. Is this the first time we're seeing this happen? We haven't modified this code directly since December, as far as I'm aware. The test's "before all" hook attempts to load an ES archive before the tests run. From the printed error, it seems like the winlogbeat archive was somehow already loaded.

I'll kick off a flaky test runner job and see if that finds anything.
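For readers unfamiliar with the setup, the hook in question looks roughly like this — a minimal sketch in the usual Kibana FTR style, where the import path and archive name are placeholders rather than the exact code from the failing test:

```ts
import type { FtrProviderContext } from '../../ftr_provider_context'; // placeholder path

export default function ({ getService }: FtrProviderContext) {
  const esArchiver = getService('esArchiver');

  describe('winlogbeat tests', () => {
    before(async () => {
      // Loading the archive creates the winlogbeat-* index; a leftover index
      // with the same name surfaces as resource_already_exists_exception.
      await esArchiver.load('placeholder/path/to/winlogbeat_archive');
    });

    after(async () => {
      await esArchiver.unload('placeholder/path/to/winlogbeat_archive');
    });

    it('returns a winlogbeat sysmon event when the event matches the schema correctly', async () => {
      // assertions against the entity route go here
    });
  });
}
```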

@spalger commented May 26, 2021

@jonathan-buttner this failure occurs when we upgrade ES and is blocking the promotion of the latest master version of ES for use in CI; it's not happening on currently tracked branches yet. The flaky test runner would need to run against code that has been updated to use the unverified snapshot to tell whether this is a new source of flakiness or a legitimate failure caused by a change in ES. I'm looking into this now and will hopefully have more information soon.

@spalger commented May 26, 2021

All that being said, this failure occurred on both retries two days in a row, so I'm pretty sure it's a legitimate bug and not flakiness.

@spalger commented May 26, 2021

Here is the range of ES changes that could have started causing this issue; I'm not seeing any suspects so far...

elastic/elasticsearch@12c2c1f...221c571

@spalger commented May 27, 2021

I'm able to reproduce this by running, in two tabs:

```sh
KBN_ES_SNAPSHOT_USE_UNVERIFIED=true node scripts/functional_tests_server.js --config x-pack/test/security_solution_endpoint_api_int/config.ts

node scripts/functional_test_runner.js --config x-pack/test/security_solution_endpoint_api_int/config.ts --bail --grep "Resolver tests for the entity route"
```

It seems that for some reason the winlogbeat-7.11.0-2020.12.03-000001 index already exists and also can't be deleted, but there isn't anything clear in the logs about what's happening.
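As an aside, the cluster state can be confirmed by hand with the @elastic/elasticsearch client — a hypothetical snippet, where the node URL is an assumption about the local FTR test cluster:

```ts
import { Client } from '@elastic/elasticsearch';

// Node URL is an assumption about the local FTR test cluster.
const es = new Client({ node: 'http://elastic:changeme@localhost:9220' });

async function inspect() {
  const index = 'winlogbeat-7.11.0-2020.12.03-000001';

  // Does the index (or an alias with that name) actually exist?
  const exists = await es.indices.exists({ index });
  console.log('exists:', exists.body);

  // Attempt the same delete the esArchiver performs.
  const del = await es.indices.delete({ index });
  console.log('delete acknowledged:', del.body.acknowledged);
}

inspect().catch(console.error);
```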

My breakdown of what's happening:

When the esArchiver tries to create the index for the first time, it gets a resource_already_exists_exception error and ES logs:

   │ info [o.e.c.m.MetadataCreateIndexService] [spalger.local] [winlogbeat-7.11.0-2020.12.03-000001] creating index, cause [api], templates [], shards [1]/[1]
   │ info [o.e.c.r.a.AllocationService] [spalger.local] current.health="RED" message="Cluster health status changed from [YELLOW] to [RED] (reason: [index [winlogbeat-7.11.0-2020.12.03-000001] created])." previous.health="YELLOW" reason="index [winlogbeat-7.11.0-2020.12.03-000001] created"

If the index already exists, why did ES log that it was created via the API? And why did the cluster health go RED?

The esArchiver then queries ES to resolve the alias to its underlying index name, which returns { error: 'alias [winlogbeat-7.11.0-2020.12.03-000001] missing', status: 404 }. It then attempts to delete the index by the name winlogbeat-7.11.0-2020.12.03-000001, which returns { acknowledged: true } and causes ES to log this:

   │ info [o.e.c.m.MetadataDeleteIndexService] [spalger.local] [winlogbeat-7.11.0-2020.12.03-000001/DCWNdS1CQp20aHIQxgf1Ww] deleting index
   │ info [r.suppressed] [spalger.local] path: /winlogbeat-7.11.0-2020.12.03-000001, params: {index=winlogbeat-7.11.0-2020.12.03-000001}
   │      org.elasticsearch.tasks.TaskCancelledException: response channel [Netty4HttpChannel{localAddress=/127.0.0.1:9220, remoteAddress=/127.0.0.1:59357}] closed
   │      	at org.elasticsearch.rest.action.RestActionListener.ensureOpen(RestActionListener.java:48) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.rest.action.RestActionListener.onResponse(RestActionListener.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.client.node.NodeClient.lambda$executeLocally$0(NodeClient.java:100) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:170) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:164) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:163) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:217) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:184) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.cluster.metadata.MetadataCreateIndexService.lambda$createIndex$0(MetadataCreateIndexService.java:275) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.action.support.ActiveShardsObserver$1.onTimeout(ActiveShardsObserver.java:84) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:324) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:241) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:584) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
   │      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
   │      	at java.lang.Thread.run(Thread.java:831) [?:?]

I don't know why the winlogbeat index exists, or why it can't be deleted, but there aren't any commits between the last good and first bad snapshots that suggest winlogbeat indices have become system indices or anything like that...
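For context, the esArchiver behavior traced above boils down to roughly this create-delete-retry loop — a simplified paraphrase of the shape of it, not the real implementation in the Kibana repo:

```ts
import { Client } from '@elastic/elasticsearch';

// Simplified paraphrase of the esArchiver flow traced above; not the real
// implementation, just the shape of it.
async function createIndexWithRetries(
  es: Client,
  index: string,
  body: Record<string, unknown>,
  attempts = 3
): Promise<void> {
  try {
    await es.indices.create({ index, body });
  } catch (err: any) {
    const alreadyExists = err?.body?.error?.type === 'resource_already_exists_exception';
    if (!alreadyExists || attempts <= 0) throw err;

    // Resolve any alias to its concrete index name (in this failure the lookup
    // 404s with "alias [...] missing", so the original name is used), then
    // delete the existing index and retry the create.
    await es.indices.delete({ index });
    await createIndexWithRetries(es, index, body, attempts - 1);
  }
}
```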

@jonathan-buttner

Nice work reproducing it! Is the best option to unblock you all to switch the archive index to something other than winlogbeat-*? That's a very easy fix for the tests and the archive. It still doesn't explain why we can't create indices with that name, though. I did a quick search and I don't see any other esArchiver mappings with winlogbeat- in Kibana.

One thought I had: resolver's tests install packages through the fleet docker registry before running the tests. It's possible that one of the packages installed by default when we hit this API is creating a data stream with that winlogbeat name: https://github.com/elastic/kibana/blob/master/x-pack/test/security_solution_endpoint_api_int/apis/index.ts#L28

But I would expect that to fail in the PR that bumped the docker hash here: https://github.com/elastic/kibana/blob/master/x-pack/test/fleet_api_integration/config.ts#L17

and that it would continue to fail when we run the tests locally without KBN_ES_SNAPSHOT_USE_UNVERIFIED=true.
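A quick way to test that data-stream hypothesis, sketched against the same assumed client as above (hypothetical, not part of the actual test suite):

```ts
import { Client } from '@elastic/elasticsearch';

// If a fleet package created a winlogbeat data stream, it should show up here.
async function listWinlogbeatDataStreams(es: Client) {
  const res = await es.indices.getDataStream({ name: 'winlogbeat-*' });
  console.log(res.body.data_streams.map((ds: { name: string }) => ds.name));
}
```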

@spalger commented May 27, 2021

For now, I hope you don't mind, I'm going to skip this test to unblock the promotion and then let y'all decide whether you just want to use a different index name or work out what the underlying cause is. I can't make any sense of what's going on here; I've reached out to the ES team to see if they have any ideas, but if you'd like to try switching to a new index name in the meantime, that works for me.

@jonathan-buttner

> For now, I hope you don't mind, I'm going to skip this test to unblock the promotion and then let y'all decide whether you just want to use a different index name or work out what the underlying cause is. I can't make any sense of what's going on here; I've reached out to the ES team to see if they have any ideas, but if you'd like to try switching to a new index name in the meantime, that works for me.

Yeah no worries. Thanks for helping reproduce it. I'll push up the index name change after your PR is merged.

@spalger commented May 27, 2021

Alright, I've confirmed with the ES team that the issue is triggered by elastic/elasticsearch@95bccda: index.routing.allocation.include._tier (defined here) is no longer a special attribute understood by ES, so ES tries to find a node carrying that tag to put the index on, and none exist. The API responses are a little misleading, and I'm asking whether these responses are intended, but it seems that if you just remove that settings block from the esArchiver, your PR should pass without any other changes once this job completes: https://kibana-ci.elastic.co/job/elasticsearch+snapshots+verify/2836/
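For reference, the kind of settings block being removed looks roughly like this in the archive's mappings.json — the surrounding values and the tier value shown are illustrative, not copied from the actual archive:

```diff
 {
   "type": "index",
   "value": {
     "index": "winlogbeat-7.11.0-2020.12.03-000001",
     "settings": {
       "index": {
-        "routing": {
-          "allocation": {
-            "include": {
-              "_tier": "data_content"
-            }
-          }
-        },
         "number_of_replicas": "1",
         "number_of_shards": "1"
       }
     }
   }
 }
```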

@jonathan-buttner

> Alright, I've confirmed with the ES team that the issue is triggered by elastic/elasticsearch@95bccda: index.routing.allocation.include._tier (defined here) is no longer a special attribute understood by ES, so ES tries to find a node carrying that tag to put the index on, and none exist. The API responses are a little misleading, and I'm asking whether these responses are intended, but it seems that if you just remove that settings block from the esArchiver, your PR should pass without any other changes once this job completes: https://kibana-ci.elastic.co/job/elasticsearch+snapshots+verify/2836/

Oh nice. Sounds good!
