-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
execbuilder: deflake TestExecBuild #75304
execbuilder: deflake TestExecBuild #75304
Conversation
Haven't looked around to see if this pattern applies to other test files, I suspect it does and worth looking out for going forward. I'm not really sure why #73876 caused this to flake more reliably. |
8c7a3d6
to
f3ec41e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we fix #75282 in this PR or handle it separately? distsql_enum seemed to be flakey too, should the EXPLAN (VEC) test in there also have a retry?
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @yuzefovich)
pkg/sql/opt/exec/execbuilder/testdata/inverted_filter_geospatial_dist, line 109 at r1 (raw file):
/1152921574000000000 NULL {2} 2 # Distributed. TODO(treilly): This claims to be distributed, but it isn't. What gives?
Good question! I think we can handle the bogus comments in this test out of band, not sure if the comments just bitrotted or something deeper is going on.
Fixes cockroachdb#74933. This test asserts on physical plans of statements after moving ranges around, i.e. whether distsql execution is "local" or "distributed". This is determined by: - where the distsql planner places processor cores, - which in turn is a function of the span resolver, - which delegates to the embedded distsender, - which operates off of a range cache. The range cache, like the name suggests, can hold stale data. This test moved replicas of indexes around to induce distributed execution, but was not accounting for the caches holding onto stale data. In my test runs every so often I was seeing stale range descriptors from before the relocate trip up the planner to generate a local execution, flaking the test. Fortunately for us, all we need to do is slap on a retry. To repro: # Took under a minute on my gceworker. Helped to trim down the test # file slightly. dev test pkg/sql/opt/exec/execbuilder \ -f 'TestExecBuild/5node' -v --show-logs \ --stress --stress-args '-p 4' Release note: None
We opted this test out of using the span configs infra in cockroachdb#75281 after observing a CI flake. Following the analysis in the last commit, it may have been due to stale distsender caches affecting the physical plan generated. This fix is speculative because I was unable to repro the original flake under stress. Attempt (successful after 1000s of runs on GCE worker over 20+ minutes): dev test pkg/sql/logictest \ -f 'TestLogic/^5node-disk$/distsql_enum' \ --show-logs -v --stress --stress-args '-p 4' -- --test_arg -show-sql Release note: None
f3ec41e
to
83f584d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we fix #75282 in this PR
Done, PTAL.
distsql_enum seemed to be flakey too, should the EXPLAN (VEC)
Done, though I couldn't repro the original flake. The suggested fix looks right to me though, lets see if it works.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
pkg/sql/opt/exec/execbuilder/testdata/inverted_filter_geospatial_dist, line 109 at r1 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
Good question! I think we can handle the bogus comments in this test out of band, not sure if the comments just bitrotted or something deeper is going on.
I'll leave this TODO here for you/your team.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
Thanks! bors r+ |
Build failed (retrying...): |
Build failed (retrying...): |
Build failed (retrying...): |
Build succeeded: |
Fixes #74933. This test asserts on physical plans of statements after
moving ranges around, i.e. whether distsql execution is "local" or
"distributed". This is determined by:
The range cache, like the name suggests, can hold stale data. This test
moved replicas of indexes around to induce distributed execution, but
was not accounting for the caches holding onto stale data. In my test
runs every so often I was seeing stale range descriptors from before the
relocate trip up the planner to generate a local execution, flaking the
test. Fortunately for us, all we need to do is slap on a retry. To
repro:
Release note: None