Fix multiple files spilled for the distinct hash table #9230
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Looks great % some nits
velox/core/QueryConfig.h
Outdated
/// partition if it has any data. This is to avoid spill from a partition with
/// a small amount of data which might result in generating too many small
/// spilled files.
static constexpr const char* kMinSpillRunSize = "min_spill_run_size";
Do we need to update configs.rst?
I will remove this in a follow-up after Prestissimo has removed the config reference.
Thanks for the fix!
…ator#9230) Summary: The existing distinct aggregation implementation assumes that there is one file generated for each spill run. And use stream id 0 to detect if a row read from spilled file is distinct one or not. This is no longer true after we add support to configure the max number of rows to spill in each sorted spill file for aggregation which means stream id > 0 could also contains the distinct values. This will cause incorrect data result and reported by [issue](facebookincubator#9219). This PR fixes this issue by recording the number of spilled files on the first spill in grouping set to detect the spilled files that contain the seen distinct values. Unit test is added to reproduce and verify the fix. Also removed the unused spill config Reviewed By: oerling Differential Revision: D55288249 Pulled By: xiaoxmeng
This pull request was exported from Phabricator. Differential Revision: D55288249
@xiaoxmeng merged this pull request in 5c67de4.
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…ator#9230) Summary: The existing distinct aggregation implementation assumes that there is one file generated for each spill run. And use stream id 0 to detect if a row read from spilled file is distinct one or not. This is no longer true after we add support to configure the max number of rows to spill in each sorted spill file for aggregation which means stream id > 0 could also contains the distinct values. This will cause incorrect data result and reported by [issue](facebookincubator#9219). This PR fixes this issue by recording the number of spilled files on the first spill in grouping set to detect the spilled files that contain the seen distinct values. Unit test is added to reproduce and verify the fix. Also removed the unused spill config Pull Request resolved: facebookincubator#9230 Reviewed By: oerling Differential Revision: D55288249 Pulled By: xiaoxmeng fbshipit-source-id: 0b96263ea3c08d8e5bd9e210f77547d642c2f2db
The existing distinct aggregation implementation assumes that each spill run generates exactly
one file, and uses stream id 0 to detect whether a row read from a spilled file is a distinct one
or not. This assumption no longer holds now that we support configuring the max number of rows to
spill in each sorted spill file for aggregation, which means streams with id > 0 can also contain
the distinct values. This produces incorrect query results, as reported in the issue (facebookincubator#9219).
This PR fixes the issue by recording the number of spilled files on the first spill in the grouping
set, and using that count to detect the spilled files that contain the already-seen distinct values.
A unit test is added to reproduce and verify the fix. The PR also removes the unused spill config.
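
To make the description above concrete, here is a minimal, self-contained sketch of the idea. It is not the Velox code: `MergedRow`, `newDistinctKeys`, and `numFirstSpillFiles` are hypothetical names introduced for illustration. It assumes the merged spill output is sorted by key and that each row carries the id of the stream (spill file) it came from.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one row produced by the sorted merge of the
// spilled files. Not a Velox type.
struct MergedRow {
  int64_t key;       // the distinct (grouping) value
  uint32_t streamId; // index of the spill file the row was read from
};

// Returns the keys that still have to be output after spilling.
// Assumptions for this sketch:
//  - 'merged' is sorted by key, so equal keys are adjacent;
//  - the first spill wrote 'numFirstSpillFiles' files, which hold the
//    distinct values seen before that spill;
//  - those files map to stream ids [0, numFirstSpillFiles).
std::vector<int64_t> newDistinctKeys(
    const std::vector<MergedRow>& merged,
    uint32_t numFirstSpillFiles) {
  std::vector<int64_t> result;
  std::size_t i = 0;
  while (i < merged.size()) {
    const int64_t key = merged[i].key;
    bool alreadySeen = false;
    // Scan the whole run of rows sharing this key.
    for (; i < merged.size() && merged[i].key == key; ++i) {
      // The old check was effectively 'streamId == 0', which is only
      // correct when the first spill produced a single file.
      if (merged[i].streamId < numFirstSpillFiles) {
        alreadySeen = true;
      }
    }
    if (!alreadySeen) {
      result.push_back(key);
    }
  }
  return result;
}
```

In this sketch, passing `numFirstSpillFiles = 1` reproduces the old single-file assumption: a key that lives only in the second file of the first spill would be treated as new and reported a second time.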