
[BUG] Reduce peak memory usage for STRUCT decoding in parquet reader #14965

Closed

GregoryKimball opened this issue on Feb 4, 2024 · 4 comments

Labels: 0 - Backlog (in queue waiting for assignment), bug (something isn't working), cuIO (cuIO issue), libcudf (affects libcudf C++/CUDA code)

Comments

GregoryKimball (Contributor) commented on Feb 4, 2024

Describe the bug
In the libcudf benchmarks PARQUET_READER_NVBENCH, the STRUCT data type shows surprisingly high peak_memory_usage. For a 536 MB table, the INTEGRAL data type shows a 597 MiB peak memory usage. However, for the same 536 MB table size, the STRUCT data type shows 996 MiB peak memory usage. If there are good reasons for this difference, we can close the issue. Otherwise, we should reduce the extra memory overhead.

| data_type | io_type | cardinality | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| INTEGRAL | DEVICE_BUFFER | 1000 | 32 | 33x | 15.405 ms | 0.30% | 15.395 ms | 0.29% | 34872906834 | 597.127 MiB | 14.403 MiB |
| FLOAT | DEVICE_BUFFER | 1000 | 32 | 51x | 9.827 ms | 0.26% | 9.818 ms | 0.24% | 54685058116 | 563.539 MiB | 9.888 MiB |
| DECIMAL | DEVICE_BUFFER | 1000 | 32 | 66x | 7.701 ms | 0.49% | 7.691 ms | 0.47% | 69802302000 | 548.740 MiB | 7.213 MiB |
| TIMESTAMP | DEVICE_BUFFER | 1000 | 32 | 1152x | 8.416 ms | 3.03% | 8.406 ms | 3.03% | 63866354457 | 556.717 MiB | 8.719 MiB |
| DURATION | DEVICE_BUFFER | 1000 | 32 | 1392x | 7.919 ms | 2.12% | 7.909 ms | 2.11% | 67879410607 | 612.525 MiB | 8.113 MiB |
| STRING | DEVICE_BUFFER | 1000 | 32 | 928x | 13.539 ms | 1.62% | 13.530 ms | 1.62% | 39678673862 | 669.530 MiB | 8.504 MiB |
| LIST | DEVICE_BUFFER | 1000 | 32 | 7x | 72.190 ms | 0.29% | 72.180 ms | 0.29% | 7437971830 | 558.376 MiB | 24.246 MiB |
| STRUCT | DEVICE_BUFFER | 1000 | 32 | 13x | 41.528 ms | 0.14% | 41.518 ms | 0.14% | 12930954541 | 996.277 MiB | 15.399 MiB |

Steps/Code to reproduce bug
Here is an nvbench CLI command you can run to reproduce the above table:

```
./PARQUET_READER_NVBENCH --device 0 --benchmark 0 --axis cardinality=1000 --axis run_length=32
```

Expected behavior
INTEGRAL and STRUCT decode in the parquet reader should have a similar peak memory footprint.

Environment overview

  • docker image rapidsai/ci-conda:cuda12.1.1-ubuntu22.04-py3.11 pulled on 2024-02-03
  • cudf branch-24.02 and sha 6cebf2294ff

Additional context
The chunked parquet reader seems to reduce the memory footprint of STRUCT decode, but the trend still scales to a higher footprint than for other data types.
[figure: peak memory usage of the chunked parquet reader across data types]
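
For reference, the chunked reader bounds peak device memory by decoding the file in multiple passes. Below is a minimal sketch of driving it through libcudf's `cudf::io::chunked_parquet_reader`; the 512 MiB chunk limit and the `input.parquet` file name are illustrative assumptions, not values taken from the benchmark:

```cpp
#include <cudf/io/parquet.hpp>

// Read a parquet file in bounded-size chunks instead of a single pass.
// The 512 MiB output limit and the file name are illustrative only.
void read_in_chunks()
{
  auto opts = cudf::io::parquet_reader_options::builder(
                cudf::io::source_info{"input.parquet"})
                .build();

  // First argument: approximate limit on the output chunk size in bytes
  // (0 means no limit). Smaller limits trade throughput for lower peak memory.
  cudf::io::chunked_parquet_reader reader(512 * 1024 * 1024, opts);

  while (reader.has_next()) {
    auto chunk = reader.read_chunk();  // cudf::io::table_with_metadata
    // process chunk.tbl here before the next pass is decoded
  }
}
```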

GregoryKimball added the bug, 0 - Backlog, libcudf, and cuIO labels on Feb 4, 2024
vuule (Contributor) commented on Feb 14, 2024

There seems to be a bug in decode_page_data, which causes double allocation of the nested string column. Somehow two out_buf objects allocate string data based on the same src_col_index.
This does not happen when there are two columns in the struct.
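
To make the failure mode above concrete, here is a hypothetical sketch (not libcudf's actual decode_page_data code; out_buf, str_bytes, and allocate are made-up names): if two output buffers resolve to the same src_col_index and each allocates its own string storage, the column's string data is allocated twice, which would roughly match the ~2x peak memory seen for STRUCT. Deduplicating on src_col_index is one way such a double allocation can be avoided.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical illustration of the reported failure mode; these names do not
// come from libcudf's decode_page_data.
struct out_buf {
  int src_col_index;      // source column this output buffer decodes from
  std::size_t str_bytes;  // size of the string data to allocate
  void allocate() { /* device allocation of str_bytes would happen here */ }
};

// A naive loop allocates string data once per out_buf, so two buffers that
// map to the same src_col_index allocate the same string column twice.
// Tracking which source columns have already allocated avoids the duplicate.
void allocate_string_data(std::vector<out_buf>& bufs)
{
  std::unordered_set<int> seen;
  for (auto& b : bufs) {
    if (seen.insert(b.src_col_index).second) { b.allocate(); }
  }
}
```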

vuule (Contributor) commented on Feb 15, 2024

After further isolation: the bug happens only when the string column is the first child of the second column. It seems this case breaks the owning_schema logic.
CC @nvdbaranec
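
For anyone trying to reproduce this, here is a minimal sketch of the table shape described above: the second column is a STRUCT whose first child is a STRING. It uses libcudf's test column wrappers; the file name repro.parquet and the sample values are made up.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf_test/column_wrapper.hpp>

// Build a two-column table where column 1 is a struct whose FIRST child is a
// string column -- the case reported to break the owning_schema logic.
// Values and the output file name are illustrative.
void write_repro_table()
{
  cudf::test::fixed_width_column_wrapper<int32_t> col0{1, 2, 3};

  cudf::test::strings_column_wrapper str_child{"a", "b", "c"};
  cudf::test::fixed_width_column_wrapper<int32_t> int_child{10, 20, 30};
  cudf::test::structs_column_wrapper col1{{str_child, int_child}};

  cudf::table_view tbl{{col0, col1}};

  auto out_opts = cudf::io::parquet_writer_options::builder(
                    cudf::io::sink_info{"repro.parquet"}, tbl)
                    .build();
  cudf::io::write_parquet(out_opts);

  // Reading the file back exercises the struct/string decode path discussed here.
  auto in_opts = cudf::io::parquet_reader_options::builder(
                   cudf::io::source_info{"repro.parquet"})
                   .build();
  auto result = cudf::io::read_parquet(in_opts);
}
```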

vuule (Contributor) commented on Feb 15, 2024

Opened #15061, which fixes the peak memory use in benchmarks (structs are now in line with the memory use of their nested types).

GregoryKimball (Contributor, author) commented on Mar 4, 2024

Closed by #15061
