[BUG] OOM in has_next and read_chunk of chunked parquet reader #15812
Comments
Here is my test code; I did notice the OOM with it:

```python
def test_parquet_chunked_reader_oom():
    reader = ParquetReader(["/home/coder/datasets/lineitem.parquet"], chunk_read_limit=24000000)
    while reader._has_next():
        chunk = reader._read_chunk()
```
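For what it's worth, one way to quantify the spike reported here is to route allocations through RMM's statistics adaptor before running the loop. The sketch below is not from the original report: it assumes the same internal ParquetReader used in this thread, and the file path, chunk limit, and the allocation_counts property name are assumptions.

```python
# Sketch only: wrap the current RMM resource with a statistics adaptor so the
# peak device-memory footprint of the chunked read loop can be inspected.
# The allocation_counts property and the file path are assumptions here.
import rmm

from cudf._lib.parquet import ParquetReader  # internal class used in this thread

stats_mr = rmm.mr.StatisticsResourceAdaptor(rmm.mr.get_current_device_resource())
rmm.mr.set_current_device_resource(stats_mr)

reader = ParquetReader(
    ["/home/coder/datasets/lineitem.parquet"], chunk_read_limit=24000000
)
while reader._has_next():
    chunk = reader._read_chunk()

# Reports the current/peak/total bytes and allocation counts seen by RMM.
print(stats_mr.allocation_counts)
```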
Exactly, I noticed this too on a T4, and since the T4 can't handle that much memory we end up with an OOM there as well.
Thank you @mhaseeb123 and @galipremsagar for testing this. I don't believe the C++ API encounters this memory spike in the
First question: the code I'm seeing is just read_parquet(), not the chunked reader.
The regular reader is implemented in terms of the chunked reader, but with no limits set, i.e. infinite sizes. So if you're just using that, OOMs are absolutely possible. If this code is somehow using the chunked reader, note that there are two parameters: the output chunk limit (chunk_read_limit) and the input pass limit (pass_read_limit).
They can be set independently, but only the input (pass) limit will work to keep OOMs down.
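To make the two knobs concrete, here is a minimal sketch of setting both limits on the internal reader discussed in this thread. Note that the pass_read_limit keyword only exists with the patch posted further down, and the file path and byte values are placeholders, not values from the original discussion.

```python
# Minimal sketch, assuming the patched ParquetReader (below) that accepts
# pass_read_limit; the path and limit values are placeholders.
from cudf._lib.parquet import ParquetReader

reader = ParquetReader(
    ["/home/coder/datasets/lineitem.parquet"],
    chunk_read_limit=24_000_000,     # caps the size of each output table chunk
    pass_read_limit=2_000_000_000,   # caps intermediate (input) memory per pass
)

while reader._has_next():
    chunk = reader._read_chunk()     # process or concatenate each chunk here
```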
Thank you @nvdbaranec, yes, we are trying to use the chunked reader here in Python. It looks like we might not be setting the "input chunk limit".
Sorry, I misread. I thought the first block of code was where the bug was. It is odd that the one that uses the chunked reader directly would fail. There should be no difference between the two in overall memory usage, but maybe the small chunk value specified (24 MB) is throwing something for a loop. In this case, I would not expect the input limit to make a difference since it clearly loads in the non-chunked case.
The following test code and patch for #15728 make things smooth again:

```python
def test_parquet_chunked_reader_oom():
    # setting pass_read_limit to 16GB, but smaller values also work
    reader = ParquetReader(
        ["/home/coder/datasets/lineitem.parquet"],
        chunk_read_limit=24000000,
        pass_read_limit=16384000000,
    )
    table = []
    while reader._has_next():
        chunk = reader._read_chunk()
        # table = table + chunk  # concatenation not needed for testing
```

```diff
diff --git a/python/cudf/cudf/_lib/parquet.pyx b/python/cudf/cudf/_lib/parquet.pyx
index aa18002fe1..14c1d00c06 100644
--- a/python/cudf/cudf/_lib/parquet.pyx
+++ b/python/cudf/cudf/_lib/parquet.pyx
@@ -763,6 +763,7 @@ cdef class ParquetWriter:
 cdef class ParquetReader:
     cdef bool initialized
     cdef unique_ptr[cpp_chunked_parquet_reader] reader
+    cdef size_t pass_read_limit
     cdef size_t chunk_read_limit
     cdef size_t row_group_size_bytes
     cdef table_metadata result_meta
@@ -781,7 +782,7 @@ cdef class ParquetReader:
     def __cinit__(self, filepaths_or_buffers, columns=None, row_groups=None,
                   use_pandas_metadata=True,
-                  Expression filters=None, int chunk_read_limit=1024000000):
+                  Expression filters=None, size_t chunk_read_limit=1024000000, size_t pass_read_limit=0):
         # Convert NativeFile buffers to NativeFileDatasource,
         # but save original buffers in case we need to use
@@ -831,9 +832,10 @@ cdef class ParquetReader:
         self.allow_range_index &= filters is None
         self.chunk_read_limit = chunk_read_limit
+        self.pass_read_limit = pass_read_limit
         with nogil:
-            self.reader.reset(new cpp_chunked_parquet_reader(chunk_read_limit, args))
+            self.reader.reset(new cpp_chunked_parquet_reader(chunk_read_limit, pass_read_limit, args))
         self.initialized = False
         self.row_groups = row_groups
         self.filepaths_or_buffers = filepaths_or_buffers
```
The issue has been resolved by using pass_read_limit.
Describe the bug
While reading a dataframe of size ~8GB on a T4 (16GB), there seems to be an OOM exception when has_next is called.

Steps/Code to reproduce bug
nvidia-smi:
Bug:
Expected behavior
We should be able to read using the chunked parquet reader.

Environment overview (please complete the following information)
This issue will need changes from: #15728