Parquet: AsyncArrowWriter inner buffer is not correctly limited and causes OOM #4477

Closed
richox opened this issue Jul 6, 2023 · 1 comment · Fixed by #4478
Labels
bug parquet Changes to the parquet crate

Comments

@richox
Contributor

richox commented Jul 6, 2023

Describe the bug

When writing large Parquet files with AsyncArrowWriter, we found that memory usage is unexpectedly high, and sometimes the process runs out of memory.

The bug is likely in the following code. It tries to trigger a flush once the buffer size reaches half of the capacity. However, as data is written into the buffer, the capacity grows along with the size, so the condition does not work as expected.

if !force && buffer.len() < buffer.capacity() / 2 {
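
To illustrate why a threshold derived from `capacity()` is not a stable limit, here is a minimal sketch using a plain `Vec<u8>`. It assumes the inner buffer behaves like a `Vec<u8>`; the helper names `should_skip_flush` and `threshold_drift` are hypothetical and are not part of the parquet crate.

```rust
/// Mirrors the check quoted above: skip flushing while the buffer holds
/// less than half of its *current* capacity.
/// (Hypothetical helper, for illustration only.)
fn should_skip_flush(buffer: &Vec<u8>, force: bool) -> bool {
    !force && buffer.len() < buffer.capacity() / 2
}

/// Returns the flush threshold (capacity / 2) before and after one large
/// write followed by `clear()`.
/// (Hypothetical helper, for illustration only.)
fn threshold_drift(initial_capacity: usize, write_size: usize) -> (usize, usize) {
    let mut buffer: Vec<u8> = Vec::with_capacity(initial_capacity);
    let before = buffer.capacity() / 2;

    // A write larger than the spare capacity forces the Vec to reallocate,
    // so capacity() grows to at least the new length.
    buffer.extend_from_slice(&vec![0u8; write_size]);

    // clear() resets len to 0 but keeps the enlarged allocation.
    buffer.clear();

    let after = buffer.capacity() / 2;
    (before, after)
}
```

With a small initial capacity and a single large write, the second value returned by `threshold_drift` is at least half the size of that write, so every later call to `should_skip_flush` compares against the enlarged allocation rather than the size the writer was configured with.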

To Reproduce

Read a large Parquet file, then write it to another file with AsyncArrowWriter. Since reading is usually faster than writing, data is buffered but not flushed correctly, causing OOM.

Expected behavior

Trigger flushing based on the constant initial buffer capacity.
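
A rough sketch of that expected behavior, under stated assumptions: the threshold is recorded once at construction and never re-read from the `Vec`. `BoundedBuffer` and its methods are hypothetical names for illustration, not the parquet crate's types and not necessarily how #4478 resolves the issue. Keeping the "half of the configured size" ratio is also an assumption carried over from the quoted condition.

```rust
/// Hypothetical buffer wrapper; not the parquet crate's actual type.
struct BoundedBuffer {
    data: Vec<u8>,
    /// Fixed at construction and never updated when `data` reallocates.
    flush_threshold: usize,
}

impl BoundedBuffer {
    fn new(buffer_size: usize) -> Self {
        Self {
            data: Vec::with_capacity(buffer_size),
            // Assumption: keep the "half of the configured size" rule.
            flush_threshold: buffer_size / 2,
        }
    }

    fn write(&mut self, bytes: &[u8]) {
        self.data.extend_from_slice(bytes);
    }

    /// Compares against the constant threshold, not `data.capacity()`.
    fn needs_flush(&self, force: bool) -> bool {
        force || self.data.len() >= self.flush_threshold
    }

    /// Hands the buffered bytes to the caller and leaves the buffer empty.
    fn take(&mut self) -> Vec<u8> {
        std::mem::take(&mut self.data)
    }
}
```

Because `flush_threshold` is independent of the allocation, a single oversized write no longer raises the point at which later writes trigger a flush.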

Additional context

@tustvold
Contributor

label_issue.py automatically added labels {'parquet'} from #4478
