Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multiple files spilled for the distinct hash table #9230

Closed
wants to merge 1 commit into from

Conversation

xiaoxmeng
Copy link
Contributor

@xiaoxmeng xiaoxmeng commented Mar 24, 2024

The existing distinct aggregation implementation assumes that there is one file generated
for each spill run. And use stream id 0 to detect if a row read from spilled file is distinct one
or not. This is no longer true after we add support to configure the max number of rows to
spill in each sorted spill file for aggregation which means stream id > 0 could also contains
the distinct values. This will cause incorrect data result and reported by issue.

This PR fixes this issue by recording the number of spilled files on the first spill in grouping
set to detect the spilled files that contain the seen distinct values. Unit test is added to reproduce
and verify the fix. Also removed the unused spill config

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 24, 2024
Copy link

netlify bot commented Mar 24, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 8b929cf
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6601e83d304d7200085b991d

@xiaoxmeng xiaoxmeng marked this pull request as ready for review March 24, 2024 05:36
@xiaoxmeng xiaoxmeng force-pushed the distinct branch 4 times, most recently from 39751fa to 2843ef4 Compare March 24, 2024 05:49
@facebook-github-bot
Copy link
Contributor

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Collaborator

@duanmeng duanmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great % some nits

/// partition if it has any data. This is to avoid spill from a partition with
/// a small amount of data which might result in generating too many small
/// spilled files.
static constexpr const char* kMinSpillRunSize = "min_spill_run_size";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we need to update the configs.rst?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove this in followup after Prestissimo has removed the config reference

velox/exec/fuzzer/AggregationFuzzerBase.cpp Outdated Show resolved Hide resolved
velox/exec/GroupingSet.h Outdated Show resolved Hide resolved
velox/exec/Spill.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

@facebook-github-bot
Copy link
Contributor

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Mar 25, 2024
…ator#9230)

Summary:
The existing distinct aggregation implementation assumes that there is one file generated
for each spill run. And use stream id 0 to detect if a row read from spilled file is distinct one
or not. This is no longer true after we add support to configure the max number of rows to
spill in each sorted spill file for aggregation which means stream id > 0 could also contains
the distinct values. This will cause incorrect data result and reported by [issue](facebookincubator#9219).

This PR fixes this issue by recording the number of spilled files on the first spill in grouping
set to detect the spilled files that contain the seen distinct values. Unit test is added to reproduce
and verify the fix. Also removed the unused spill config


Reviewed By: oerling

Differential Revision: D55288249

Pulled By: xiaoxmeng
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D55288249

…ator#9230)

Summary:
The existing distinct aggregation implementation assumes that there is one file generated
for each spill run. And use stream id 0 to detect if a row read from spilled file is distinct one
or not. This is no longer true after we add support to configure the max number of rows to
spill in each sorted spill file for aggregation which means stream id > 0 could also contains
the distinct values. This will cause incorrect data result and reported by [issue](facebookincubator#9219).

This PR fixes this issue by recording the number of spilled files on the first spill in grouping
set to detect the spilled files that contain the seen distinct values. Unit test is added to reproduce
and verify the fix. Also removed the unused spill config


Reviewed By: oerling

Differential Revision: D55288249

Pulled By: xiaoxmeng
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D55288249

@facebook-github-bot
Copy link
Contributor

@xiaoxmeng merged this pull request in 5c67de4.

Copy link

Conbench analyzed the 1 benchmark run on commit 5c67de40.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@xiaoxmeng xiaoxmeng deleted the distinct branch March 26, 2024 00:29
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…ator#9230)

Summary:
The existing distinct aggregation implementation assumes that there is one file generated
for each spill run. And use stream id 0 to detect if a row read from spilled file is distinct one
or not. This is no longer true after we add support to configure the max number of rows to
spill in each sorted spill file for aggregation which means stream id > 0 could also contains
the distinct values. This will cause incorrect data result and reported by [issue](facebookincubator#9219).

This PR fixes this issue by recording the number of spilled files on the first spill in grouping
set to detect the spilled files that contain the seen distinct values. Unit test is added to reproduce
and verify the fix. Also removed the unused spill config

Pull Request resolved: facebookincubator#9230

Reviewed By: oerling

Differential Revision: D55288249

Pulled By: xiaoxmeng

fbshipit-source-id: 0b96263ea3c08d8e5bd9e210f77547d642c2f2db
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants