Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet writer column_size() should return a size_t #12870

Merged
merged 5 commits into from
Mar 1, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Mar 1, 2023

Description

Fixes #12867.

Bug introduced in #12685. A calculation of total bytes in a column was returned in a 32-bit size_type rather than 64-bit size_t leading to overflow for tables with many millions of rows.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner March 1, 2023 01:28
@etseidl etseidl requested review from bdice and karthikeyann March 1, 2023 01:28
@rapids-bot
Copy link

rapids-bot bot commented Mar 1, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 1, 2023
@etseidl
Copy link
Contributor Author

etseidl commented Mar 1, 2023

Wanted to get this out fast for testing. I still have to add a unit test.

@vuule vuule added bug Something isn't working non-breaking Non-breaking change cuIO cuIO issue labels Mar 1, 2023
@vuule
Copy link
Contributor

vuule commented Mar 1, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Mar 1, 2023

Thanks for the quick PR @etseidl !

Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@galipremsagar
Copy link
Contributor

Thanks @etseidl, I'll test this out today and get back

Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix @etseidl ! Verified that this PR fixes the issue #12867

@galipremsagar
Copy link
Contributor

/merge

@etseidl
Copy link
Contributor Author

etseidl commented Mar 1, 2023

Thanks for the quick fix @etseidl ! Verified that this PR fixes the issue #12867

Sorry for the bug @galipremsagar ☹️

@etseidl
Copy link
Contributor Author

etseidl commented Mar 1, 2023

Since CI hates me, should I push the unit test I wrote for this? @vuule

@vuule
Copy link
Contributor

vuule commented Mar 1, 2023

Since CI hates me, should I push the unit test I wrote for this? @vuule

How big/slow is the test? My concern is that it requires a lot of device memory to execute, so it would not run on all GPUs.

@etseidl
Copy link
Contributor Author

etseidl commented Mar 1, 2023

How big/slow is the test? My concern is that it requires a lot of device memory to execute, so it would not run on all GPUs.

Well, it writes 300M int64's, so 2.4GB. Takes about 1.5sec on my A6000. nsys says 24GB memory used 😮

@vuule
Copy link
Contributor

vuule commented Mar 1, 2023

nsys says 24GB memory used 😮

FWIW, that's the size of the pool, so peak memory use could be a lot lower. Still, I have an 8GB GPU and it's unlikely that it's enough for a 2.4GB file. So let's leave this one without a test, the fix looks very safe anyway.

@vuule
Copy link
Contributor

vuule commented Mar 1, 2023

/merge

@vuule
Copy link
Contributor

vuule commented Mar 1, 2023

/ok to test

@galipremsagar
Copy link
Contributor

@vuule CI currently is blocked by #12874, I just canceled the jobs of this PR to free up machines for 12874 to finish quicker

@vuule
Copy link
Contributor

vuule commented Mar 1, 2023

@vuule CI currently is blocked by #12874, I just canceled the jobs of this PR to free up machines for 12874 to finish quicker

Thanks. I saw CI passing on another PR (probably ran at a different time) and got the wrong idea :)

@galipremsagar
Copy link
Contributor

@vuule CI currently is blocked by #12874, I just canceled the jobs of this PR to free up machines for 12874 to finish quicker

Thanks. I saw CI passing on another PR (probably ran at a different time) and got the wrong idea :)

CI is now ready. Thanks for bearing with #12874 😉

@galipremsagar galipremsagar added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Mar 1, 2023
@galipremsagar
Copy link
Contributor

/ok to test

1 similar comment
@galipremsagar
Copy link
Contributor

/ok to test

@rapids-bot rapids-bot bot merged commit 40e56c9 into rapidsai:branch-23.04 Mar 1, 2023
@etseidl etseidl deleted the feature/fix_col_size branch March 1, 2023 23:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] >2x memory usage in parquet writer
4 participants