distinct hash aggregate returned duplicated value if spill happens #9219
Comments
CC: @xiaoxmeng
@FelixYBW Thanks for catching this!
Confirmed. As long as the initial spill is split into more than one file (which can be caused by maxSpillRunRows or maxSpillFileSize), the bug happens.
Here is the PR that fixes this issue: #9230. The bug is caused by maxSpillRunRows, which may split a large sorted spill run into smaller files; that setting was added after distinct aggregation spill support.
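To make the failure mode concrete, here is a minimal, self-contained toy sketch (illustrative names only, not the actual operator code). The model: the distinct values accumulated before the first spill are written to that spill's file(s), and on restore only file/stream 0 is treated as holding those already-produced values. If the first spill writes two files, the keys in the second file are emitted again.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// Illustrative toy model, not the real Velox code: each spill "file" is just
// a list of grouping keys. The file(s) written at the first spill hold keys
// that were already produced as distinct results; later files hold raw input
// keys spilled afterwards.
using SpillFile = std::vector<std::string>;

// Buggy restore: assumes the first spill produced exactly one file, so only
// keys found in file 0 are treated as "already output".
std::vector<std::string> restoreDistinctBuggy(const std::vector<SpillFile>& files) {
  std::unordered_set<std::string> alreadyOutput(files[0].begin(), files[0].end());
  std::unordered_set<std::string> emitted;
  std::vector<std::string> output;
  for (std::size_t fileId = 1; fileId < files.size(); ++fileId) {
    for (const auto& key : files[fileId]) {
      if (alreadyOutput.count(key) == 0 && emitted.insert(key).second) {
        output.push_back(key);
      }
    }
  }
  return output;
}

int main() {
  // The first spill was split into two files {"a"} and {"b"}, e.g. because of
  // a per-file row cap. "b" was already produced before the spill, yet it is
  // emitted again because only file 0 is recognized as "already output".
  std::vector<SpillFile> files = {{"a"}, {"b"}, {"b", "c"}};
  for (const auto& key : restoreDistinctBuggy(files)) {
    std::cout << key << "\n"; // prints: b, c -- "b" is a duplicate result.
  }
  return 0;
}
```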
@FelixYBW Thanks, we already have a fix for this issue.
Thank you so much for the quick fix! |
Summary: The existing distinct aggregation implementation assumes that one file is generated for each spill run, and uses stream id 0 to detect whether a row read from a spilled file is a distinct one. This is no longer true after adding support for configuring the maximum number of rows to spill into each sorted spill file for aggregation, which means streams with id > 0 can also contain the distinct values. This causes incorrect results, as reported in [issue](#9219). This PR fixes the issue by recording the number of spilled files produced by the first spill in the grouping set, in order to detect the spilled files that contain the already-seen distinct values. A unit test is added to reproduce the bug and verify the fix. Also removed an unused spill config.
Pull Request resolved: #9230
Reviewed By: oerling
Differential Revision: D55288249
Pulled By: xiaoxmeng
fbshipit-source-id: 0b96263ea3c08d8e5bd9e210f77547d642c2f2db
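Below is a sketch of the direction the fix takes, under the same toy model as the earlier snippet; the actual change lives in Velox's grouping-set/spill code and uses different names. The idea, as described in the commit message: record how many files the first spill produced, and treat keys from all of those files as already-produced distinct values.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

using SpillFile = std::vector<std::string>;

// Fixed restore in the toy model: the number of files written by the first
// spill is recorded at spill time and passed in here, so every one of those
// files is treated as holding already-produced distinct values.
std::vector<std::string> restoreDistinctFixed(
    const std::vector<SpillFile>& files,
    std::size_t numFirstSpillFiles) {
  std::unordered_set<std::string> alreadyOutput;
  for (std::size_t fileId = 0; fileId < numFirstSpillFiles; ++fileId) {
    alreadyOutput.insert(files[fileId].begin(), files[fileId].end());
  }
  std::unordered_set<std::string> emitted;
  std::vector<std::string> output;
  for (std::size_t fileId = numFirstSpillFiles; fileId < files.size(); ++fileId) {
    for (const auto& key : files[fileId]) {
      if (alreadyOutput.count(key) == 0 && emitted.insert(key).second) {
        output.push_back(key);
      }
    }
  }
  return output;
}
```

With the example input from the earlier sketch, calling restoreDistinctFixed(files, 2) emits only "c", which is the expected result.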
Bug description
If a spill happens, the values returned by the distinct hash aggregate contain duplicates. It is not yet fixed. The root cause is related to the spill files: the code assumes that all the records accumulated before the first spill is triggered are spilled into a single file, but in some cases the records are written into two files, which leads to them being output again when the spill files are processed.
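To make the triggering condition concrete, here is a small illustrative sketch (not the Velox spill writer) of how a per-file cap, such as the maxSpillRunRows / maxSpillFileSize settings mentioned in the comments above, can split a single sorted spill run across more than one file:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Split one sorted spill run into files of at most maxRowsPerFile rows.
// If the cap is smaller than the run, the first spill yields more than one
// file, which is exactly the condition that exposes the duplicate output.
std::vector<std::vector<std::string>> writeSpillRun(
    const std::vector<std::string>& sortedRun,
    std::size_t maxRowsPerFile) {
  std::vector<std::vector<std::string>> files;
  for (std::size_t i = 0; i < sortedRun.size(); i += maxRowsPerFile) {
    const std::size_t end = std::min(i + maxRowsPerFile, sortedRun.size());
    files.emplace_back(sortedRun.begin() + i, sortedRun.begin() + end);
  }
  return files;
}
```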
System information
Relevant logs
No response