Reenable azure repository tests and lower down max error per request #48283
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore) |
@elasticmachine run elasticsearch-ci/1 (Test is green but I'd like to have more runs) |
@tlrx I'm not so sure about this one. Can slowness due to executing the handlers on the HTTP server's callbacks actually lead to EOF exceptions? (to me it seems no, but I haven't worked much with the |
See my question, I'm a little unsure about the fix, but if the failures are all due to the 3x retry I think we're looking at the right thing probably.
Well, I did not claim it's a fix :) I tried to reproduce the Azure failures by running the tests over the weekend, but I did not manage to reproduce them. Also, looking at the build stats, I found some GCS failures with read timeouts and assertion errors on the server side. I'd expect any failure related to read timeouts or errors/retries to be reproducible, but actually it is not because of the
Honestly, I'm not sure. But the internal HTTP servers are still single-threaded and we are adding more parallelization to snapshot/restore, so I think we should change that so that it better reflects how storage services work and avoids making one operation dependent on how another, unrelated operation performs.
That was my initial guess, but the fact that the failures do not reproduce easily and that the Azure retry tests exercise the max. number of retries makes me think that it's maybe a factor but maybe not the root cause of the failures. Since Azure executes more requests (HEAD+GET), maybe they are just failing more often than the others. At this point I agree that this PR is maybe changing too many things at once. We could maybe do instead:
Depending on the result, revisit the decrease from 3 to 2 max. errors per request. What do you think?
I like this a lot :) Making things more deterministic sounds like the right move.
I don't like this tbh. I can't see how lowering the latency of the IO thread by forking cheap operations will help us here. Also, even if it does help, then it's a strange spot because it basically means that our tests are broken and just accidentally work on fast boxes? On this topic:
I'm not sure this is a valid point. Even though not using a thread-pool for handling requests does make all the request handling sequential it does not however introduce any artificial ordering as far as I can see. In the same way I argued above, I'd say here: If we fail because of sequential handling of requests, then something must either be seriously wrong with the SDK or with our code mustn't it? :)
Jup let's do suggestion 1 (remove the random boolean) and reenable ? :) |
Thanks Armin, I tend to follow your networking expertise :) Still, I'm not sure all operations are that cheap on CI boxes (I'm thinking of uploads of a few megabytes, or multiple uploads/deletions at once that "hold on" to connections but do nothing with them).
Sure, I'll adapt this PR. |
Jenkins run elasticsearch-ci/bwc |
LGTM thanks Tanguy!
@elasticmachine run elasticsearch-ci/bwc |
@elasticmachine update branch |
Thanks @original-brownbear. |
* elastic/master:
  [Docs] Fix opType options in IndexRequest API example. (elastic#48290)
  Simplify Shard Snapshot Upload Code (elastic#48155)
  Mute ClassificationIT tests (elastic#48338)
  Reenable azure repository tests and remove some randomization in http servers (elastic#48283)
  Use an env var for the classpath of jar hell task (elastic#48240)
  Refactor FIPS BootstrapChecks to simple checks (elastic#47499)
  Add "format" to "range" queries resulted from optimizing a logical AND (elastic#48073)
  [DOCS][Transform] document limitation regarding rolling upgrade with 7.2, 7.3 (elastic#48118)
  Fail with a better error when if there are no ingest nodes (elastic#48272)
  Fix executing enrich policies stats (elastic#48132)
  Use MultiFileTransfer in CCR remote recovery (elastic#44514)
  Make BytesReference an interface (elastic#48171)
  Also validate source index at put enrich policy time. (elastic#48254)
  Add 'javadoc' task to lifecycle check tasks (elastic#48214)
  Remove option to enable direct buffer pooling (elastic#47956)
  [DOCS] Add 'Selecting gateway and seed nodes' section to CCS docs (elastic#48297)
  Add Enrich Origin (elastic#48098)
  fix incorrect comparison (elastic#48208)
This pull request reenables some repository integration tests, decreases the maximum number of failures per request in Azure tests and sets an ExecutorService for the HttpServer used in tests.

I wasn't able to reproduce the failures reported in #47948 and #47380, but I suspect that picking up the highest max. errors per request of 3, combined with the fact that requests are processed using the same thread that started the HTTP server, might be the cause of the failures. I also think that, with more parallelization being added to snapshot/restore, using an executor service in tests makes sense.

Closes #47948
Closes #47380
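
For readers unfamiliar with the JDK's built-in HTTP server, here is a minimal sketch of what "setting an ExecutorService for the HttpServer" can look like. It is not the code from this PR: the class name ExecutorBackedHttpServer, the thread count, and the handler that echoes the handling thread's name are all assumptions made for illustration. Without a call to setExecutor, com.sun.net.httpserver.HttpServer handles every request on the thread that start() created, which is the single-threaded behaviour discussed above.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical test fixture (not from the PR): an HTTP server whose handlers
// run on a thread pool instead of the server's own dispatcher thread.
public class ExecutorBackedHttpServer {
    private final HttpServer server;
    private final ExecutorService executor;

    public ExecutorBackedHttpServer(int threads) throws IOException {
        // Port 0 lets the OS pick a free port; backlog 0 uses the system default.
        server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        executor = Executors.newFixedThreadPool(threads);
        // The executor has to be set before start() is called.
        server.setExecutor(executor);
        // Illustrative handler: replies with the name of the thread that served the request.
        server.createContext("/", exchange -> {
            byte[] body = Thread.currentThread().getName().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
    }

    // Starts the server and returns the port it is bound to.
    public int start() {
        server.start();
        return server.getAddress().getPort();
    }

    // Stops the server immediately and shuts the pool down.
    public void stop() {
        server.stop(0);
        executor.shutdownNow();
    }

    // Small client helper: GET / on the given local port and return the response body.
    public static String get(int port) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL("http://127.0.0.1:" + port + "/").openConnection();
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        ExecutorBackedHttpServer fixture = new ExecutorBackedHttpServer(4);
        int port = fixture.start();
        try {
            System.out.println("handled by: " + get(port));
        } finally {
            fixture.stop();
        }
    }
}
```

With a pool in place, a slow handler (for example a large upload) no longer delays unrelated requests, which is the property argued for in the discussion. Whether that explains the reported failures remains an open question in this thread.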