
Failing ES Promotion: Endpoint plugin Resolver tests Resolver tests for the entity route winlogbeat tests "before all" hook for "returns a winlogbeat sysmon event when the event matches the schema correctly" #100697

Closed
tylersmalley opened this issue May 26, 2021 · 13 comments · Fixed by #100835
Labels: failed-es-promotion, Feature:Resolver, Team: SecuritySolution, Team:Threat Hunting, triage_needed

Comments

@tylersmalley commented May 26, 2021

This failure is preventing the promotion of the current Elasticsearch nightly snapshot.

For more information on the Elasticsearch snapshot promotion process: https://www.elastic.co/guide/en/kibana/master/development-es-snapshots.html

https://kibana-ci.elastic.co/job/elasticsearch+snapshots+verify/2825/execution/node/512/log

13:25:42         │1)    Endpoint plugin
13:25:42         │       Resolver tests
13:25:42         │         Resolver tests for the entity route
13:25:42         │           winlogbeat tests
13:25:42         │             "before all" hook for "returns a winlogbeat sysmon event when the event matches the schema correctly":
13:25:42         │
13:25:42         │      ResponseError: resource_already_exists_exception
13:25:42         │       at onBody (/dev/shm/workspace/kibana/node_modules/@elastic/elasticsearch/lib/Transport.js:337:23)
13:25:42         │       at IncomingMessage.onEnd (/dev/shm/workspace/kibana/node_modules/@elastic/elasticsearch/lib/Transport.js:264:11)
13:25:42         │       at endReadableNT (internal/streams/readable.js:1336:12)
13:25:42         │       at processTicksAndRejections (internal/process/task_queues.js:82:21)
@tylersmalley added the failed-es-promotion and Team: SecuritySolution labels May 26, 2021
@elasticmachine

Pinging @elastic/security-solution (Team: SecuritySolution)

@jonathan-buttner added the Feature:Resolver and Team:Threat Hunting labels May 26, 2021
@elasticmachine

Pinging @elastic/security-threat-hunting (Team:Threat Hunting)

@elasticmachine

Pinging @elastic/security-threat-hunting (Feature:Resolver)

@jonathan-buttner

@tylersmalley I'm struggling to figure out why this test started failing. Is this the first time we're seeing this happen? We haven't modified this code directly since December, as far as I'm aware. The test's "before all" hook attempts to load an ES archive before the tests run. From the printed error, it seems like the winlogbeat archive was somehow already loaded.

I'll kick off a flaky test runner job and see if that finds anything.
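For readers unfamiliar with the setup, the hook in question looks roughly like this — a minimal sketch in the usual Kibana FTR style, where the import path and archive name are placeholders rather than the exact code from the failing test:

```ts
import type { FtrProviderContext } from '../../ftr_provider_context'; // placeholder path

export default function ({ getService }: FtrProviderContext) {
  const esArchiver = getService('esArchiver');

  describe('winlogbeat tests', () => {
    before(async () => {
      // Loading the archive creates the winlogbeat-* index; a leftover index
      // with the same name surfaces as resource_already_exists_exception.
      await esArchiver.load('placeholder/path/to/winlogbeat_archive');
    });

    after(async () => {
      await esArchiver.unload('placeholder/path/to/winlogbeat_archive');
    });

    it('returns a winlogbeat sysmon event when the event matches the schema correctly', async () => {
      // assertions against the entity route go here
    });
  });
}
```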

@spalger commented May 26, 2021

@jonathan-buttner this failure occurs when we upgrade ES and is blocking the promotion of the latest master version of ES for use in CI; it's not happening on currently tracked branches yet. The flaky test runner would need to run against code that has been updated to use the unverified snapshot to tell whether this is a new source of flakiness or a legitimate failure caused by a change in ES. I'm looking into this now and will hopefully have more information soon.

@spalger commented May 26, 2021

All that being said, this failure occurred on both retries two days in a row, so I'm pretty sure it's a legitimate bug and not flakiness.

@spalger commented May 26, 2021

Here is the range of ES changes that could have started causing this issue; I'm not seeing any suspects so far...

elastic/elasticsearch@12c2c1f...221c571

@spalger commented May 27, 2021

I'm able to reproduce this by running, in two tabs:

```sh
KBN_ES_SNAPSHOT_USE_UNVERIFIED=true node scripts/functional_tests_server.js --config x-pack/test/security_solution_endpoint_api_int/config.ts

node scripts/functional_test_runner.js --config x-pack/test/security_solution_endpoint_api_int/config.ts --bail --grep "Resolver tests for the entity route"
```

It seems that for some reason the winlogbeat-7.11.0-2020.12.03-000001 index already exists and also can't be deleted, but there isn't anything clear in the logs about what's happening.
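As an aside, the cluster state can be confirmed by hand with the @elastic/elasticsearch client — a hypothetical snippet, where the node URL is an assumption about the local FTR test cluster:

```ts
import { Client } from '@elastic/elasticsearch';

// Node URL is an assumption about the local FTR test cluster.
const es = new Client({ node: 'http://elastic:changeme@localhost:9220' });

async function inspect() {
  const index = 'winlogbeat-7.11.0-2020.12.03-000001';

  // Does the index (or an alias with that name) actually exist?
  const exists = await es.indices.exists({ index });
  console.log('exists:', exists.body);

  // Attempt the same delete the esArchiver performs.
  const del = await es.indices.delete({ index });
  console.log('delete acknowledged:', del.body.acknowledged);
}

inspect().catch(console.error);
```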

My breakdown of what's happening:

When the esArchiver tries to create the index for the first time, it gets a resource_already_exists_exception error and ES logs:

   │ info [o.e.c.m.MetadataCreateIndexService] [spalger.local] [winlogbeat-7.11.0-2020.12.03-000001] creating index, cause [api], templates [], shards [1]/[1]
   │ info [o.e.c.r.a.AllocationService] [spalger.local] current.health="RED" message="Cluster health status changed from [YELLOW] to [RED] (reason: [index [winlogbeat-7.11.0-2020.12.03-000001] created])." previous.health="YELLOW" reason="index [winlogbeat-7.11.0-2020.12.03-000001] created"

If the index already exists, why did ES log that it was created via the API? And why did the cluster health go RED?

The esArchiver then queries ES to resolve the alias to its underlying index name, which returns { error: 'alias [winlogbeat-7.11.0-2020.12.03-000001] missing', status: 404 }. It then attempts to delete the index by the name winlogbeat-7.11.0-2020.12.03-000001, which returns { acknowledged: true } and causes ES to log this:

   │ info [o.e.c.m.MetadataDeleteIndexService] [spalger.local] [winlogbeat-7.11.0-2020.12.03-000001/DCWNdS1CQp20aHIQxgf1Ww] deleting index
   │ info [r.suppressed] [spalger.local] path: /winlogbeat-7.11.0-2020.12.03-000001, params: {index=winlogbeat-7.11.0-2020.12.03-000001}
   │      org.elasticsearch.tasks.TaskCancelledException: response channel [Netty4HttpChannel{localAddress=/127.0.0.1:9220, remoteAddress=/127.0.0.1:59357}] closed
   │      	at org.elasticsearch.rest.action.RestActionListener.ensureOpen(RestActionListener.java:48) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.rest.action.RestActionListener.onResponse(RestActionListener.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.client.node.NodeClient.lambda$executeLocally$0(NodeClient.java:100) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:170) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:164) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:163) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:217) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:184) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.cluster.metadata.MetadataCreateIndexService.lambda$createIndex$0(MetadataCreateIndexService.java:275) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.action.support.ActiveShardsObserver$1.onTimeout(ActiveShardsObserver.java:84) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:324) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:241) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:584) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
   │      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
   │      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
   │      	at java.lang.Thread.run(Thread.java:831) [?:?]

I don't know why the winlogbeat index exists, or why it can't be deleted, but there aren't any commits between the last good and first bad snapshots that suggest winlogbeat indices have become system indices or anything like that...
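For context, the esArchiver behavior traced above boils down to roughly this create-delete-retry loop — a simplified paraphrase of the shape of it, not the real implementation in the Kibana repo:

```ts
import { Client } from '@elastic/elasticsearch';

// Simplified paraphrase of the esArchiver flow traced above; not the real
// implementation, just the shape of it.
async function createIndexWithRetries(
  es: Client,
  index: string,
  body: Record<string, unknown>,
  attempts = 3
): Promise<void> {
  try {
    await es.indices.create({ index, body });
  } catch (err: any) {
    const alreadyExists = err?.body?.error?.type === 'resource_already_exists_exception';
    if (!alreadyExists || attempts <= 0) throw err;

    // Resolve any alias to its concrete index name (in this failure the lookup
    // 404s with "alias [...] missing", so the original name is used), then
    // delete the existing index and retry the create.
    await es.indices.delete({ index });
    await createIndexWithRetries(es, index, body, attempts - 1);
  }
}
```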

@jonathan-buttner

Nice work reproducing it! Is the best option to unblock you all to switch the archive index to something other than winlogbeat-*? That's a very easy fix for the tests and the archive. It still doesn't explain why we can't create indices with that name, though. I did a quick search and I don't see any other esArchiver mappings with winlogbeat- in Kibana.

One thought I had: resolver's tests install packages through the fleet docker registry before running the tests. It's possible that one of the packages installed by default when we hit this API is creating a data stream with that winlogbeat name: https://github.com/elastic/kibana/blob/master/x-pack/test/security_solution_endpoint_api_int/apis/index.ts#L28

But I would expect that to fail in the PR that bumped the docker hash here: https://github.com/elastic/kibana/blob/master/x-pack/test/fleet_api_integration/config.ts#L17

and that it would continue to fail when we run the tests locally without KBN_ES_SNAPSHOT_USE_UNVERIFIED=true.
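A quick way to test that data-stream hypothesis, sketched against the same assumed client as above (hypothetical, not part of the actual test suite):

```ts
import { Client } from '@elastic/elasticsearch';

// If a fleet package created a winlogbeat data stream, it should show up here.
async function listWinlogbeatDataStreams(es: Client) {
  const res = await es.indices.getDataStream({ name: 'winlogbeat-*' });
  console.log(res.body.data_streams.map((ds: { name: string }) => ds.name));
}
```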

@spalger commented May 27, 2021

For now, I hope you don't mind, I'm going to skip this test to unblock the promotion and then let y'all decide whether you just want to use a different index name or work out what the underlying cause is. I can't make any sense of what's going on here; I've reached out to the ES team to see if they have any ideas, but if you'd like to try switching to a new index name in the meantime, that works for me.

@jonathan-buttner

> For now, I hope you don't mind, I'm going to skip this test to unblock the promotion and then let y'all decide whether you just want to use a different index name or work out what the underlying cause is. I can't make any sense of what's going on here; I've reached out to the ES team to see if they have any ideas, but if you'd like to try switching to a new index name in the meantime, that works for me.

Yeah no worries. Thanks for helping reproduce it. I'll push up the index name change after your PR is merged.

@spalger commented May 27, 2021

Alright, I've confirmed with the ES team that the issue is triggered by elastic/elasticsearch@95bccda: index.routing.allocation.include._tier (defined here) is no longer a special attribute understood by ES, so ES tries to find a node carrying that tag to put the index on, and none exist. The API responses are a little misleading, and I'm asking whether these responses are intended, but it seems that if you just remove that settings block from the esArchiver, your PR should pass without any other changes once this job completes: https://kibana-ci.elastic.co/job/elasticsearch+snapshots+verify/2836/
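For reference, the kind of settings block being removed looks roughly like this in the archive's mappings.json — the surrounding values and the tier value shown are illustrative, not copied from the actual archive:

```diff
 {
   "type": "index",
   "value": {
     "index": "winlogbeat-7.11.0-2020.12.03-000001",
     "settings": {
       "index": {
-        "routing": {
-          "allocation": {
-            "include": {
-              "_tier": "data_content"
-            }
-          }
-        },
         "number_of_replicas": "1",
         "number_of_shards": "1"
       }
     }
   }
 }
```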

@jonathan-buttner

> Alright, I've confirmed with the ES team that the issue is triggered by elastic/elasticsearch@95bccda: index.routing.allocation.include._tier (defined here) is no longer a special attribute understood by ES, so ES tries to find a node carrying that tag to put the index on, and none exist. The API responses are a little misleading, and I'm asking whether these responses are intended, but it seems that if you just remove that settings block from the esArchiver, your PR should pass without any other changes once this job completes: https://kibana-ci.elastic.co/job/elasticsearch+snapshots+verify/2836/

Oh nice. Sounds good!
