Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

execbuilder: deflake TestExecBuild #75304

Merged
merged 2 commits into from
Jan 25, 2022

Conversation

irfansharif
Copy link
Contributor

@irfansharif irfansharif commented Jan 22, 2022

Fixes #74933. This test asserts on physical plans of statements after
moving ranges around, i.e. whether distsql execution is "local" or
"distributed". This is determined by:

  • where the distsql planner places processor cores,
  • which in turn is a function of the span resolver,
  • which delegates to the embedded distsender,
  • which operates off of a range cache.

The range cache, like the name suggests, can hold stale data. This test
moved replicas of indexes around to induce distributed execution, but
was not accounting for the caches holding onto stale data. In my test
runs every so often I was seeing stale range descriptors from before the
relocate trip up the planner to generate a local execution, flaking the
test. Fortunately for us, all we need to do is slap on a retry. To
repro:

# Took under a minute on my gceworker. Helped to trim down the test
# file slightly.
dev test pkg/sql/opt/exec/execbuilder \
  -f 'TestExecBuild/5node' -v --show-logs \
  --stress --stress-args '-p 4'

Release note: None

@irfansharif irfansharif requested a review from a team as a code owner January 22, 2022 04:16
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@irfansharif
Copy link
Contributor Author

Haven't looked around to see if this pattern applies to other test files, I suspect it does and worth looking out for going forward. I'm not really sure why #73876 caused this to flake more reliably.

@irfansharif irfansharif force-pushed the 220121.deflake-TestExecBuild branch 2 times, most recently from 8c7a3d6 to f3ec41e Compare January 22, 2022 04:32
Copy link
Contributor

@cucaroach cucaroach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we fix #75282 in this PR or handle it separately? distsql_enum seemed to be flakey too, should the EXPLAN (VEC) test in there also have a retry?

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @yuzefovich)


pkg/sql/opt/exec/execbuilder/testdata/inverted_filter_geospatial_dist, line 109 at r1 (raw file):

/1152921574000000000  NULL                  {2}       2

# Distributed. TODO(treilly): This claims to be distributed, but it isn't. What gives?

Good question! I think we can handle the bogus comments in this test out of band, not sure if the comments just bitrotted or something deeper is going on.

Fixes cockroachdb#74933. This test asserts on physical plans of statements after
moving ranges around, i.e. whether distsql execution is "local" or
"distributed". This is determined by:
- where the distsql planner places processor cores,
- which in turn is a function of the span resolver,
- which delegates to the embedded distsender,
- which operates off of a range cache.

The range cache, like the name suggests, can hold stale data. This test
moved replicas of indexes around to induce distributed execution, but
was not accounting for the caches holding onto stale data. In my test
runs every so often I was seeing stale range descriptors from before the
relocate trip up the planner to generate a local execution, flaking the
test. Fortunately for us, all we need to do is slap on a retry. To
repro:

    # Took under a minute on my gceworker. Helped to trim down the test
    # file slightly.
    dev test pkg/sql/opt/exec/execbuilder \
      -f 'TestExecBuild/5node' -v --show-logs \
      --stress --stress-args '-p 4'

Release note: None
We opted this test out of using the span configs infra in cockroachdb#75281 after
observing a CI flake. Following the analysis in the last commit, it may
have been due to stale distsender caches affecting the physical plan
generated.

This fix is speculative because I was unable to repro the original flake
under stress. Attempt (successful after 1000s of runs on GCE worker over
20+ minutes):

    dev test pkg/sql/logictest \
    -f 'TestLogic/^5node-disk$/distsql_enum' \
    --show-logs -v --stress --stress-args '-p 4' -- --test_arg -show-sql

Release note: None
@irfansharif irfansharif force-pushed the 220121.deflake-TestExecBuild branch from f3ec41e to 83f584d Compare January 24, 2022 16:31
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we fix #75282 in this PR

Done, PTAL.

distsql_enum seemed to be flakey too, should the EXPLAN (VEC)

Done, though I couldn't repro the original flake. The suggested fix looks right to me though, lets see if it works.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/sql/opt/exec/execbuilder/testdata/inverted_filter_geospatial_dist, line 109 at r1 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

Good question! I think we can handle the bogus comments in this test out of band, not sure if the comments just bitrotted or something deeper is going on.

I'll leave this TODO here for you/your team.

Copy link
Contributor

@cucaroach cucaroach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

@irfansharif
Copy link
Contributor Author

Thanks!

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 25, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 25, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 25, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 25, 2022

Build succeeded:

@craig craig bot merged commit da5d0fc into cockroachdb:master Jan 25, 2022
@irfansharif irfansharif deleted the 220121.deflake-TestExecBuild branch January 25, 2022 12:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql/opt/exec/execbuilder: TestExecBuild failed
3 participants