Fix multiple files spilled for the distinct hash table #9230

xiaoxmeng · 2024-03-24T05:25:40Z

The existing distinct aggregation implementation assumes that each spill run generates
exactly one file, and uses stream id 0 to detect whether a row read from a spilled file is a
distinct one or not. This no longer holds after we added support for configuring the max number
of rows to spill in each sorted spill file for aggregation, which means streams with id > 0 could
also contain the distinct values. This causes incorrect query results, as reported in issue #9219.

This PR fixes the issue by recording, in the grouping set, the number of files spilled by the
first spill, and using that count to detect the spilled files that contain the seen distinct values.
A unit test is added to reproduce the bug and verify the fix. The unused spill config is also removed.
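
The check described above can be sketched roughly as follows. This is a minimal illustration and not the actual Velox code: `SpilledRow`, `seenBeforeFirstSpill`, `newDistinctKeys`, and `numFirstSpillFiles` are hypothetical names, and the sketch assumes (as the description implies) that every file written by the first spill holds distinct values that have already been seen, and that the merged rows arrive sorted by key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a row read back from a sorted spill file.
struct SpilledRow {
  int key;               // grouping key
  std::size_t streamId;  // index of the spill file (merge stream) it came from
};

// A row belongs to an already seen distinct value if it comes from any file
// written by the first spill, not only from stream 0.
bool seenBeforeFirstSpill(const SpilledRow& row, std::size_t numFirstSpillFiles) {
  return row.streamId < numFirstSpillFiles;
}

// Returns the keys that have not been seen yet. 'merged' must be sorted by
// key, which is what a sorted merge over the spill files would produce.
std::vector<int> newDistinctKeys(
    const std::vector<SpilledRow>& merged,
    std::size_t numFirstSpillFiles) {
  assert(std::is_sorted(
      merged.begin(), merged.end(),
      [](const SpilledRow& a, const SpilledRow& b) { return a.key < b.key; }));
  std::vector<int> result;
  std::size_t i = 0;
  while (i < merged.size()) {
    const int key = merged[i].key;
    bool alreadySeen = false;
    // Scan every row that shares this key across all streams.
    for (; i < merged.size() && merged[i].key == key; ++i) {
      alreadySeen = alreadySeen || seenBeforeFirstSpill(merged[i], numFirstSpillFiles);
    }
    if (!alreadySeen) {
      result.push_back(key);
    }
  }
  return result;
}
```

Passing `1` as `numFirstSpillFiles` reproduces the old stream-id-0 assumption: if the first spill was split into two files, a key stored in the second file is treated as new and is reported twice.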

netlify · 2024-03-24T05:25:58Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`8b929cf`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/6601e83d304d7200085b991d

facebook-github-bot · 2024-03-24T05:49:40Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

duanmeng

Looks great % some nits

duanmeng · 2024-03-24T06:39:32Z

velox/core/QueryConfig.h

-  /// partition if it has any data. This is to avoid spill from a partition with
-  /// a small amount of data which might result in generating too many small
-  /// spilled files.
-  static constexpr const char* kMinSpillRunSize = "min_spill_run_size";


Do we need to update configs.rst as well?

I will remove this in a follow-up after Prestissimo has removed its reference to the config.

velox/exec/fuzzer/AggregationFuzzerBase.cpp

velox/exec/GroupingSet.h

velox/exec/Spill.cpp

zhztheplayer

Thanks for the fix!

facebook-github-bot · 2024-03-25T16:58:56Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-03-25T17:29:26Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

…ator#9230)

Summary: The existing distinct aggregation implementation assumes that each spill run generates exactly one file, and uses stream id 0 to detect whether a row read from a spilled file is a distinct one or not. This no longer holds after we added support for configuring the max number of rows to spill in each sorted spill file for aggregation, which means streams with id > 0 could also contain the distinct values. This causes incorrect query results, as reported by [issue](facebookincubator#9219). This PR fixes the issue by recording, in the grouping set, the number of files spilled by the first spill, and using that count to detect the spilled files that contain the seen distinct values. A unit test is added to reproduce the bug and verify the fix. The unused spill config is also removed.

Reviewed By: oerling

Differential Revision: D55288249

Pulled By: xiaoxmeng

facebook-github-bot · 2024-03-25T19:12:54Z

This pull request was exported from Phabricator. Differential Revision: D55288249

facebook-github-bot · 2024-03-25T21:10:30Z

This pull request was exported from Phabricator. Differential Revision: D55288249

facebook-github-bot · 2024-03-25T23:36:07Z

@xiaoxmeng merged this pull request in 5c67de4.

conbench-facebook · 2024-03-26T00:07:45Z

Conbench analyzed the 1 benchmark run on commit 5c67de40.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

…ator#9230)

Summary: The existing distinct aggregation implementation assumes that each spill run generates exactly one file, and uses stream id 0 to detect whether a row read from a spilled file is a distinct one or not. This no longer holds after we added support for configuring the max number of rows to spill in each sorted spill file for aggregation, which means streams with id > 0 could also contain the distinct values. This causes incorrect query results, as reported by [issue](facebookincubator#9219). This PR fixes the issue by recording, in the grouping set, the number of files spilled by the first spill, and using that count to detect the spilled files that contain the seen distinct values. A unit test is added to reproduce the bug and verify the fix. The unused spill config is also removed.

Pull Request resolved: facebookincubator#9230

Reviewed By: oerling

Differential Revision: D55288249

Pulled By: xiaoxmeng

fbshipit-source-id: 0b96263ea3c08d8e5bd9e210f77547d642c2f2db

xiaoxmeng requested review from duanmeng and tanjialiang March 24, 2024 05:25

facebook-github-bot added the CLA Signed label Mar 24, 2024

xiaoxmeng requested review from mbasmanova and bikramSingh91 March 24, 2024 05:25

xiaoxmeng force-pushed the distinct branch from 6367e41 to 5844876 Compare March 24, 2024 05:36

xiaoxmeng marked this pull request as ready for review March 24, 2024 05:36

xiaoxmeng force-pushed the distinct branch 4 times, most recently from 39751fa to 2843ef4 Compare March 24, 2024 05:49

xiaoxmeng mentioned this pull request Mar 24, 2024

distinct hash aggregate returned duplicated value if spill happens #9219

Closed

duanmeng approved these changes Mar 24, 2024

duanmeng reviewed Mar 24, 2024

velox/exec/GroupingSet.h (outdated, resolved)

duanmeng reviewed Mar 24, 2024

velox/exec/Spill.cpp (outdated, resolved)

zhztheplayer approved these changes Mar 25, 2024

xiaoxmeng force-pushed the distinct branch from 2843ef4 to b9363a4 Compare March 25, 2024 16:58

xiaoxmeng force-pushed the distinct branch from b9363a4 to aa7fe2f Compare March 25, 2024 17:14

xiaoxmeng force-pushed the distinct branch from aa7fe2f to ee116d5 Compare March 25, 2024 19:12

facebook-github-bot added the fb-exported label Mar 25, 2024

xiaoxmeng force-pushed the distinct branch from ee116d5 to 8b929cf Compare March 25, 2024 21:10

facebook-github-bot closed this in 5c67de4 Mar 25, 2024

facebook-github-bot added the Merged label Mar 25, 2024

xiaoxmeng deleted the distinct branch March 26, 2024 00:29
