
distinct hash aggregate returned duplicated value if spill happens #9219

Closed
FelixYBW opened this issue Mar 22, 2024 · 7 comments
@FelixYBW

Bug description

If a spill happens, the values returned from a distinct hash aggregate contain duplicates. It is not yet fixed. The root cause is related to the spill files.

```cpp
if (stream->id() == 0) {
  newDistinct = false;
}
```

It assumes that all the records accumulated before the first spill trigger are spilled to a single file, but in some cases the records are written into two files, which leads to the second file's rows being output again when the spill files are processed.

System information


Velox System Info v0.0.2
Commit: b4ce48390e055cdf307ba378b9dd305ce77027ec
CMake Version: 3.22.1
System: Linux-5.15.0-101-generic
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 11.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt


Relevant logs

No response

@FelixYBW FelixYBW added bug Something isn't working triage Newly created issue that needs attention. labels Mar 22, 2024
@mbasmanova
Contributor

CC: @xiaoxmeng

@xiaoxmeng
Contributor

@FelixYBW Thanks for catching this!

@xiaoxmeng xiaoxmeng self-assigned this Mar 22, 2024
@FelixYBW
Author

Confirmed: the bug happens whenever the initial spill produces more than one file, which may be caused by maxSpillRunRows or maxSpillFileSize.
The planned solution is to record the initial spill file ids and use them to exclude the data that has already been output. We will submit a PR later.

@FelixYBW
Author

@zhztheplayer
@xiaoxmeng
Contributor

Here is the PR fixing this issue: #9230. The bug is caused by maxSpillRunRows, which might split a large sorted spill run into smaller files for batching. This behavior was added after distinct aggregation spill support.

@xiaoxmeng
Contributor

> Confirmed, as long as the initial spill has more than 1 split file, may be caused by maxSpillRunRows or maxSpillFileSize, the bug happens. Planed solution is to record the initial spill file id and used it to exclude the data which is already output. We will submit a PR later.

@FelixYBW Thanks. We already have a fix for this issue.

@FelixYBW
Author

Thank you so much for the quick fix!

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this issue Mar 25, 2024
…ator#9230)

Summary:
The existing distinct aggregation implementation assumes that one file is generated
for each spill run, and uses stream id 0 to detect whether a row read from a spilled
file is a distinct one. This is no longer true after we added support for configuring
the max number of rows to spill in each sorted spill file for aggregation, which means
streams with id > 0 could also contain the already-seen distinct values. This causes
incorrect results, as reported in [issue](facebookincubator#9219).

This PR fixes the issue by recording the number of files spilled on the first spill in
the grouping set, and using it to detect the spilled files that contain the already-seen
distinct values. A unit test is added to reproduce the bug and verify the fix. The unused
spill config is also removed.


Reviewed By: oerling

Differential Revision: D55288249

Pulled By: xiaoxmeng
facebook-github-bot pushed a commit that referenced this issue Mar 25, 2024

Pull Request resolved: #9230

fbshipit-source-id: 0b96263ea3c08d8e5bd9e210f77547d642c2f2db
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this issue Jun 7, 2024