
Optimize Parquet string reading #3083

Closed
bmcdonald3 opened this issue Apr 5, 2024 · 0 comments · Fixed by #3082
bmcdonald3 commented Apr 5, 2024

This PR reworks the existing Parquet string read implementation by adding a distinct code path for strings, since strings are laid out differently from the other supported types.

Parquet strings cannot be read directly into Chapel buffers the way other Arkouda types such as int and bool can, because Arrow uses its own representation for strings.
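For context, with the low-level Parquet C++ API each string value comes back as a `parquet::ByteArray`, i.e. a (length, pointer) pair into reader-owned memory, rather than the contiguous byte buffer Arkouda's segmented strings expect, so an extra copy step is unavoidable. The sketch below is illustrative only; `flattenStrings` and the null-terminated buffer layout are assumptions, not the actual Arkouda code:

```cpp
// Minimal sketch (not Arkouda source): Parquet string values arrive as
// parquet::ByteArray { uint32_t len; const uint8_t* ptr; } entries that point
// into memory owned by the column reader, so they must be copied out into a
// contiguous buffer before Chapel can use them.
#include <parquet/api/reader.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: flatten ByteArray values into a single null-delimited
// byte buffer (assumed here to match Arkouda's segmented-string layout).
void flattenStrings(const std::vector<parquet::ByteArray>& values,
                    uint8_t* chplBuffer /* pre-sized, Chapel-owned */) {
  size_t offset = 0;
  for (const auto& v : values) {
    std::memcpy(chplBuffer + offset, v.ptr, v.len);  // copy the string bytes
    offset += v.len;
    chplBuffer[offset++] = '\0';                     // plus a terminator
  }
}
```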

Prior to this PR, Parquet string reading read each file twice: a first pass to determine the total number of bytes (which requires reading the entire file, since the byte count is not available in the metadata), and a second pass to actually read the data and store it into the Arkouda buffer. Each Parquet string was read one value at a time and then deep copied on the C++ side into the Chapel buffer.
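Roughly, the first pass of the old pattern looked like the following simplified sketch (not the actual Arkouda source; `countStringBytes` and the column-index handling are made up for illustration). A second, nearly identical pass then re-read the file and copied each value into the Chapel buffer:

```cpp
// Sketch of the pre-PR byte-counting pass: every value is read just to total
// the byte count, one value per ReadBatch call.
#include <parquet/api/reader.h>
#include <memory>
#include <string>

int64_t countStringBytes(const std::string& path, int colIdx) {
  auto reader = parquet::ParquetFileReader::OpenFile(path);
  int64_t totalBytes = 0;
  int numRowGroups = reader->metadata()->num_row_groups();
  for (int rg = 0; rg < numRowGroups; rg++) {
    auto colReader = std::static_pointer_cast<parquet::ByteArrayReader>(
        reader->RowGroup(rg)->Column(colIdx));
    parquet::ByteArray value;
    int16_t defLevel;
    int64_t valuesRead;
    while (colReader->HasNext()) {
      // One value at a time, as described above.
      colReader->ReadBatch(1, &defLevel, nullptr, &value, &valuesRead);
      if (valuesRead > 0) totalBytes += value.len + 1;  // +1 for terminator
    }
  }
  return totalBytes;
}
```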

This PR takes a different approach: all of the values are read in a single pass and stored in an intermediary data structure, avoiding the second file read.

To get the data from Arrow into Chapel, we create a row group reader for each row group in the file; each reader must be kept alive so the underlying Arrow string data is not cleaned up. Once all of the data is stored in the interim data structure, the copies into the Chapel buffer can be done in parallel on the Chapel side.
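A simplified sketch of that scheme, assuming the low-level Parquet C++ reader API; `StringColumnState`, `readStringColumn`, and the exact lifetime handling are illustrative, not the actual Arkouda symbols:

```cpp
// Sketch of the single-pass scheme: one column reader per row group is kept
// alive in the returned state object so the ByteArray (length, pointer) pairs
// stay valid until the Chapel side finishes its parallel copy.
#include <parquet/api/reader.h>
#include <memory>
#include <string>
#include <vector>

struct StringColumnState {
  std::unique_ptr<parquet::ParquetFileReader> fileReader;        // owns the file
  std::vector<std::shared_ptr<parquet::ColumnReader>> rgReaders;  // kept alive
  std::vector<parquet::ByteArray> values;                         // (len, ptr) pairs
  int64_t totalBytes = 0;                                         // sizes the Chapel buffer
};

StringColumnState readStringColumn(const std::string& path, int colIdx) {
  StringColumnState st;
  st.fileReader = parquet::ParquetFileReader::OpenFile(path);
  int numRowGroups = st.fileReader->metadata()->num_row_groups();
  for (int rg = 0; rg < numRowGroups; rg++) {
    auto rgReader = st.fileReader->RowGroup(rg);
    int64_t numRows = rgReader->metadata()->num_rows();
    auto colReader = rgReader->Column(colIdx);
    st.rgReaders.push_back(colReader);  // must outlive the ByteArray pointers

    auto* byteReader = static_cast<parquet::ByteArrayReader*>(colReader.get());
    std::vector<int16_t> defLevels(numRows);
    std::vector<parquet::ByteArray> batch(numRows);
    int64_t valuesRead = 0, totalRead = 0;
    while (totalRead < numRows && byteReader->HasNext()) {
      byteReader->ReadBatch(numRows - totalRead, defLevels.data(), nullptr,
                            batch.data() + totalRead, &valuesRead);
      totalRead += valuesRead;
    }
    for (int64_t i = 0; i < totalRead; i++) {
      st.totalBytes += batch[i].len + 1;  // +1 for the null terminator
      st.values.push_back(batch[i]);
    }
  }
  // The Chapel side can now size its buffer from totalBytes and copy the
  // values out in parallel before this state object is destroyed.
  return st;
}
```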

Performance results reading one large file collected on an XC with a Lustre file system:

| nl (locales) | master | my branch |
|--------------|--------|-----------|
| 1            | 8.17s  | 1.56s     |
| 2            | 8.06s  | 1.95s     |
| 4            | 8.32s  | 1.90s     |
| 8            | 8.31s  | 1.96s     |

So, this results in a 4-5x speedup.

The reason that 1 node reads faster than 2 nodes is that each file is assigned to a locale, rather than split into chunks across locales (as in the existing implementation). This means the first node reads the contents and then broadcasts them out to the other nodes, which adds a modest amount of overhead.

bmcdonald3 self-assigned this Apr 5, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 16, 2024
* Add standalone string read benchmark

* Fix typos

* Add standalone benchmarks

* remove C++ prints

* Add attempt for string read with reader in Chapel

* Seems to be working???

* Working baby

* Add timers

* Add timers

* Add timers

* Add C++ timers

* Pretty much working

* Working on single locale

* All working except copy

* Working about from start idx

* Row group start index determine

* Add free and remove write

* Free on locale owning data

* Parallelize over files with reader hash

* create readers serially, read parallel

* Fix row group segment calculation

* Clean up

* From horizon

* Think working, party

* Fix row group size

* Remove standalone benchmark directory

* Clean up

* Clean up Chapel standalone file

* Fix indentation from working on other machines

* Fix void pointers for 1.31

* Fix null values

* Clean up

* Add error handling

* Fix seg fault

* fix null check

* Fix error handling

* Clean up deprecation

* Fix null check

* Fix compat modules

* Address Tess feedback

* Oops...

* Fix wrong index

* Fix loop start

* Revert to old open file