Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[runners-spark] Do not set accTimestamp to null in SparkCombineFn (#28256) #29162

Merged
merged 1 commit into from
Oct 31, 2023

Conversation

je-ik
Copy link
Contributor

@je-ik je-ik commented Oct 27, 2023

Fixes #28256


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@github-actions
Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@aromanenko-dev
Copy link
Contributor

Run Spark ValidatesRunner

@aromanenko-dev
Copy link
Contributor

Run Spark Runner Tpcds Tests

@aromanenko-dev
Copy link
Contributor

Run Spark Runner Nexmark Tests

@github-actions
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @lostluck added as fallback since no labels match configuration

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@aromanenko-dev
Copy link
Contributor

CC: @mosche @echauchot

@je-ik
Copy link
Contributor Author

je-ik commented Oct 30, 2023

I'm not convinced this fix is 100% correct. I'd need an instance of a failing job to see what exactly causes the timestamp to be null.

@je-ik je-ik force-pushed the 28256-do-not-set-timestamp-to-null branch from 8b514db to e528134 Compare October 30, 2023 14:34
@je-ik je-ik changed the title [DO NOT MERGE] [runners-spark] Do not set accTimestamp to null in SparkCombineFn (#28256) [runners-spark] Do not set accTimestamp to null in SparkCombineFn (#28256) Oct 30, 2023
@je-ik je-ik force-pushed the 28256-do-not-set-timestamp-to-null branch from e528134 to d342b95 Compare October 30, 2023 14:44
@je-ik
Copy link
Contributor Author

je-ik commented Oct 30, 2023

@aromanenko-dev I think I have the complete understanding. The condition in the assignment was wrong (don't have any clue how it was reasoned), the problem is that it was hidden, because until there was at least one call to add to the accumulator (which is nearly always the case), the timestamp was overwritten.

The bug manifested itself under the following conditions:
a) Combine.withHotKeyFanout, because that merges accumulators without any elements being added (in the postHotFanoutCombine phase)
b) TimestampCombiner.EARLIEST, because that can create accumulator timestamps of BoundedWindow.TIMESTAMP_MIN_VALUE.

Both conditions are satisfied in the case of query 3 in TPC-DS.

The combination is somewhat niche, which is why it was not caught by VR tests. We can add a VR test for this, but I think that the bug was so specific, that it makes little sense.

@je-ik je-ik requested a review from aromanenko-dev October 30, 2023 14:49
@je-ik
Copy link
Contributor Author

je-ik commented Oct 30, 2023

Run Spark Runner Nexmark Tests

@je-ik
Copy link
Contributor Author

je-ik commented Oct 30, 2023

Run Spark Runner Tpcds Tests

@je-ik
Copy link
Contributor Author

je-ik commented Oct 30, 2023

Run Spark ValidatesRunner

@aromanenko-dev
Copy link
Contributor

Run Spark Runner Tpcds Tests

Copy link
Contributor

@aromanenko-dev aromanenko-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, LGTM, thanks a lot! It sounds reasonable for me

The condition in the assignment was wrong (don't have any clue how it was reasoned)

=) It was added 4+ years ago...

The combination is somewhat niche, which is why it was not caught by VR tests. We can add a VR test for this, but I think that the bug was so specific, that it makes little sense.

Actually, according to TPC-DS Grafana dashboard, it caused a lot of queries failed without catching it with VR tests, that worried me. TBH, I didn't see if any of users were affected (mostly because I guess SparkRunner is not so popular now) but it would be great to add any kind of regression test of this if possible.

Let's just wait for TPC-DS job results before merging.

@aromanenko-dev
Copy link
Contributor

Run Spark Runner Tpcds Tests

@je-ik
Copy link
Contributor Author

je-ik commented Oct 30, 2023

Added the test, it is surely better to have a test proving the fix, though in this case it is only "the code does not contain this bug". Nevertheless the line of thinking that introduced in could in theory reappear. 👍

@je-ik je-ik force-pushed the 28256-do-not-set-timestamp-to-null branch from fe68bdb to 3e3207f Compare October 30, 2023 16:34
@je-ik
Copy link
Contributor Author

je-ik commented Oct 30, 2023

Run Spark Runner Tpcds Tests

@je-ik
Copy link
Contributor Author

je-ik commented Oct 30, 2023

Run Java PreCommit

1 similar comment
@je-ik
Copy link
Contributor Author

je-ik commented Oct 31, 2023

Run Java PreCommit

@je-ik je-ik force-pushed the 28256-do-not-set-timestamp-to-null branch from 3e3207f to 524a7bf Compare October 31, 2023 09:23
@je-ik
Copy link
Contributor Author

je-ik commented Oct 31, 2023

Run Spark ValidatesRunner

@je-ik je-ik merged commit 6a66b72 into apache:master Oct 31, 2023
17 checks passed
@je-ik je-ik deleted the 28256-do-not-set-timestamp-to-null branch October 31, 2023 10:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Failing Test]: Various TPC-DS queries throw NPEs using SparkRunner
2 participants