
distinct hash aggregate returned duplicated value if spill happens #9219

Closed
FelixYBW opened this issue Mar 22, 2024 · 7 comments
@FelixYBW

Bug description

If a spill happens, the values returned from a distinct hash aggregate contain duplicates. It is not yet fixed. The root cause is related to the spill files.

```cpp
if (stream->id() == 0) {
  newDistinct = false;
}
```

It assumes that all the records accumulated before the first spill trigger are spilled to a single file, but in some cases the records are written into two files, which leads to the second file's rows being output again when the spill files are processed.

System information


Velox System Info v0.0.2
Commit: b4ce48390e055cdf307ba378b9dd305ce77027ec
CMake Version: 3.22.1
System: Linux-5.15.0-101-generic
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 11.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt


Relevant logs

No response

@FelixYBW FelixYBW added bug Something isn't working triage Newly created issue that needs attention. labels Mar 22, 2024
@mbasmanova
Contributor

CC: @xiaoxmeng

@xiaoxmeng
Contributor

@FelixYBW Thanks for catching this!

@xiaoxmeng xiaoxmeng self-assigned this Mar 22, 2024
@FelixYBW
Author

Confirmed: the bug happens whenever the initial spill produces more than one file, which may be caused by maxSpillRunRows or maxSpillFileSize.
The planned solution is to record the initial spill file ids and use them to exclude the data that has already been output. We will submit a PR later.

@FelixYBW
Author

@zhztheplayer
@xiaoxmeng
Contributor

Here is the PR fixing this issue: #9230. The bug is caused by maxSpillRunRows, which might split a large sorted spill run into smaller files for batching. This behavior was added after distinct aggregation spill support.

@xiaoxmeng
Contributor

> Confirmed, as long as the initial spill has more than 1 split file, may be caused by maxSpillRunRows or maxSpillFileSize, the bug happens. Planed solution is to record the initial spill file id and used it to exclude the data which is already output. We will submit a PR later.

@FelixYBW Thanks. We already have a fix for this issue.

@FelixYBW
Author

Thank you so much for the quick fix!

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this issue Mar 25, 2024
…ator#9230)

Summary:
The existing distinct aggregation implementation assumes that one file is generated
for each spill run, and uses stream id 0 to detect whether a row read from a spilled
file is a distinct one. This is no longer true after we added support for configuring
the max number of rows to spill in each sorted spill file for aggregation, which means
streams with id > 0 could also contain the already-seen distinct values. This causes
incorrect results, as reported in [issue](facebookincubator#9219).

This PR fixes the issue by recording the number of files spilled on the first spill in
the grouping set, and using it to detect the spilled files that contain the already-seen
distinct values. A unit test is added to reproduce the bug and verify the fix. The unused
spill config is also removed.


Reviewed By: oerling

Differential Revision: D55288249

Pulled By: xiaoxmeng
facebook-github-bot pushed a commit that referenced this issue Mar 25, 2024

Pull Request resolved: #9230

fbshipit-source-id: 0b96263ea3c08d8e5bd9e210f77547d642c2f2db
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this issue Jun 7, 2024