-
Notifications
You must be signed in to change notification settings - Fork 282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] [CI] Investigate Flaky test failure for windows CI tasks #3021
Comments
[Triage] Thanks for filing, could we get a list of tests that are impacted? |
@DarshitChanpura it looks like there were a few fixes put into main for windows support that may not have been backported to 2.x. Particularly this one: #2180 There is a list of issues that @peternied added to this PR description: #2291 - it may be best to backport the one's of those that can be backported |
From #3032 (review) CI / test (citest, windows-latest, 11) (pull_request) Failing after 29m
|
Updated the issue description to include two sections: 1. Run Details. 2. Test Methods with failure |
I've seen these failures in a couple of places now:
When trying a single test suite locally like this:
I see that it fails in blocks like this and I can see the following error in the logs:
which I had seen before when testing out this PR: #2765 I created a branch to revert that change in 2.x locally to check if it would help with CI stability (https://github.com/cwperks/security/tree/check-2.x-build) and the tests did pass after the first try. Its puzzling why that error is platform specific according to CI. |
I think #2765 may need to be reverted. After seeing the error on both main and 2.x on Friday I did more sanity testing on the RC build to try to produce the error outside of CI, but I was not able to. To see examples of the error you can check a recent build on main and look for
or another on 2.x here From what I have put together the issue is reproducible on many of the ComplianceAuditlogTest like this one. and fails when writing the first document to the The error is coming up during the derivative BulkShardRequest[Primary] and BulkShardRequest[Replica] requests. After tracing through the code it looks like the assumption that the transient headers will still be available on the receiving end of a transport request on a local node may not be accurate. From doing some testing, it looks like direct requests still go through the same Flow in the OutboundHandler
Flow in the InboundHandler
I'm having some trouble tracing all of the calls, but it looks like more work may need to be done to skip the serialization-deserialization of the transient headers on local node calls. I'm still doing testing on the RC to see if the issue is reproducible there. |
CI passed on #2765's merge to main (without any retries) so I don't believe that is the root cause. However, the failures seem oddly specific to requests involving action: All failures point to this source:
|
[Triage] This is being actively looked at by @DarshitChanpura. |
There are 3 mains areas of concern that were identified:
@cwperks From what I see in OutboundMessage class, the transient headers are being written to output stream: https://github.com/opensearch-project/OpenSearch/blob/3ee42920fa886bdf566f1bb7b1666ffa72d36f54/server/src/main/java/org/opensearch/common/util/concurrent/ThreadContext.java#L321-L325 |
@DarshitChanpura do you need a hand with figuring out the cause? |
Yes, please. |
In my local testing, the sender (SecurityInterceptor) and the receiver (SecurityRequestHandler) are using different threadpools even when handling a local node request. I don't think persistent threadcontext headers would help here if there are different threadpools. A concept of a ThreadContextStatePropagator was added recently. This may help transmit the necessary transient headers on a local node. |
We've been looking with @DarshitChanpura and the hypothesis here (not confirmed yet) is that the thread context may not work here: the |
One thing that's interesting, is that I see that sometimes it bypasses the sender during local requests and the logic to bypass is actually in core.
In the failing tests the AsyncSender is called somehow (on a local node request) which always precedes the failures seen in CI. |
In doing more local testing, I think I see an identity crisis happening where core thinks the local node is node1 and the security plugin thinks its node2. I'm looking into how this can happen now. Both the security plugin and core should be aware of node they are running on without any conflicts. I added logging statements in |
I think I nailed it folks:
We run 3 nodes in single JVM, with 3 security plugins, only one of them wins. @DarshitChanpura I think at this moment - this is 100% test related problem that should not leak into production. |
This localNode is set by using clusterService.localNode() method upon node bootstrap: And security plugin uses this localNode reference to then compare if request is for sameNode: security/src/main/java/org/opensearch/security/transport/SecurityInterceptor.java Lines 149 to 150 in 37aacdc
i'm not super-convinced on what would happen in a multi-node cluster when the request ends up in the same node. Will the thread always be shared between SecurityInterceptor and SecurityRequestHandler in that case? |
I think so too, I was not able to replicate the issue on the RC. That being said I think the code path here in the SecurityInterceptor and the corresponding code in the receiver is dead code. In reality, the user is not serialized on local node requests because it takes the bypass route described here |
I will update here with an RCA and next steps soon. |
Closing this one as the major blocker was resolved. Future flaky tests can have their own issues. |
Posted results here: #2724 (comment) |
What is the bug?
Windows CI has been flaky with recent runs for tasks, specifically
citest
,dlicRestApiTest
.See these runs for example:
https://github.com/opensearch-project/security/actions/runs/5577190458/attempts/1 <-- 3 tasks failed
https://github.com/opensearch-project/security/actions/runs/5577190458/attempts/2 <-- 2 tasks failed (1 prev failed task passed)
https://github.com/opensearch-project/security/actions/runs/5577190458/attempts/3 <-- 1 task failed
https://github.com/opensearch-project/security/actions/runs/5577190458/attempts/5 <-- all tasks passed
Failure Details
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesEnabled
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testComplianceEnable
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesDisabled
- org.opensearch.security.multitenancy.test.MultitenancyTests.testTenantParametersSubstitution
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testDeleteByQuery
- org.opensearch.security.httpclient.HttpClientTest.testPlainConnection
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testComplianceEnable
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesEnabled
- com.amazon.dlic.auth.ldap2.LdapBackendIntegTest2.testIntegLdapAuthenticationSSL
- org.opensearch.security.SecurityAdminTests.testIsLegacySecurityIndexOnV7Index
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testComplianceEnable
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testBCryptHashRedaction
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesDisabled
- org.opensearch.security.multitenancy.test.TenancyPrivateTenantEnabledTests.testPrivateTenantDisabled_Update_EndToEnd
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testSensitiveMethodRedaction
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testScroll
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testScroll
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testBCryptHashRedaction
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testDeleteByQuery
- org.opensearch.security.multitenancy.test.TenancyMultitenancyEnabledTests.testMultitenancyDisabled_endToEndTest
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testDeleteByQuery
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testWriteHistory
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testBCryptHashRedaction
- org.opensearch.security.multitenancy.test.MultitenancyTests.testTenantParametersSubstitution
- org.opensearch.security.multitenancy.test.MultitenancyTests.testMt
- org.opensearch.security.multitenancy.test.MultitenancyTests.testMtMulti
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testBCryptHashRedaction
- org.opensearch.security.SecurityAdminTests.testSecurityAdminInvalidCert
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesEnabled
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiNewUser
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testUpdate
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiRolesEnabled
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testDeleteByQuery
- org.opensearch.security.multitenancy.test.MultitenancyTests.testMt
- org.opensearch.security.auditlog.integration.BasicAuditlogTest.testDeleteByQuery
- org.opensearch.security.multitenancy.test.TenancyMultitenancyEnabledTests.testMultitenancyDisabled_endToEndTest
- org.opensearch.security.auditlog.compliance.RestApiComplianceAuditlogTest.testRestApiNewUser
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testInternalConfig
- org.opensearch.security.auditlog.compliance.ComplianceAuditlogTest.testUpdate
Test Methods with failures across all mentioned runs
What is the expected behavior?
No flakiness.
The text was updated successfully, but these errors were encountered: