
Optimize Parquet string reading #3083

Closed
bmcdonald3 opened this issue Apr 5, 2024 · 0 comments · Fixed by #3082
bmcdonald3 commented Apr 5, 2024

This PR reworks the existing Parquet string read implementation by adding a distinct code path for strings, since strings are laid out differently from the other supported types.

Parquet strings cannot be read directly into Chapel buffers the way other Arkouda types such as int and bool can, because Arrow uses its own representation for strings.
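For context, with the low-level Parquet C++ API each string value comes back as a `parquet::ByteArray`, i.e. a (length, pointer) pair into reader-owned memory, rather than the contiguous byte buffer Arkouda's segmented strings expect, so an extra copy step is unavoidable. The sketch below is illustrative only; `flattenStrings` and the null-terminated buffer layout are assumptions, not the actual Arkouda code:

```cpp
// Minimal sketch (not Arkouda source): Parquet string values arrive as
// parquet::ByteArray { uint32_t len; const uint8_t* ptr; } entries that point
// into memory owned by the column reader, so they must be copied out into a
// contiguous buffer before Chapel can use them.
#include <parquet/api/reader.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: flatten ByteArray values into a single null-delimited
// byte buffer (assumed here to match Arkouda's segmented-string layout).
void flattenStrings(const std::vector<parquet::ByteArray>& values,
                    uint8_t* chplBuffer /* pre-sized, Chapel-owned */) {
  size_t offset = 0;
  for (const auto& v : values) {
    std::memcpy(chplBuffer + offset, v.ptr, v.len);  // copy the string bytes
    offset += v.len;
    chplBuffer[offset++] = '\0';                     // plus a terminator
  }
}
```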

Prior to this PR, Parquet string reading read each file twice: a first pass to determine the total number of bytes (which requires reading the entire file, since the byte count is not available in the metadata), and a second pass to actually read the data and store it into the Arkouda buffer. Each Parquet string was read one value at a time and then deep copied on the C++ side into the Chapel buffer.
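Roughly, the first pass of the old pattern looked like the following simplified sketch (not the actual Arkouda source; `countStringBytes` and the column-index handling are made up for illustration). A second, nearly identical pass then re-read the file and copied each value into the Chapel buffer:

```cpp
// Sketch of the pre-PR byte-counting pass: every value is read just to total
// the byte count, one value per ReadBatch call.
#include <parquet/api/reader.h>
#include <memory>
#include <string>

int64_t countStringBytes(const std::string& path, int colIdx) {
  auto reader = parquet::ParquetFileReader::OpenFile(path);
  int64_t totalBytes = 0;
  int numRowGroups = reader->metadata()->num_row_groups();
  for (int rg = 0; rg < numRowGroups; rg++) {
    auto colReader = std::static_pointer_cast<parquet::ByteArrayReader>(
        reader->RowGroup(rg)->Column(colIdx));
    parquet::ByteArray value;
    int16_t defLevel;
    int64_t valuesRead;
    while (colReader->HasNext()) {
      // One value at a time, as described above.
      colReader->ReadBatch(1, &defLevel, nullptr, &value, &valuesRead);
      if (valuesRead > 0) totalBytes += value.len + 1;  // +1 for terminator
    }
  }
  return totalBytes;
}
```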

This PR takes a different approach: all of the values are read in a single pass and stored in an intermediary data structure, avoiding the second file read.

To get the data from Arrow into Chapel, we create a row group reader for each row group in the file; each reader must be kept alive so the underlying Arrow string data is not cleaned up. Once all of the data is stored in the interim data structure, the copies into the Chapel buffer can be done in parallel on the Chapel side.
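A simplified sketch of that scheme, assuming the low-level Parquet C++ reader API; `StringColumnState`, `readStringColumn`, and the exact lifetime handling are illustrative, not the actual Arkouda symbols:

```cpp
// Sketch of the single-pass scheme: one column reader per row group is kept
// alive in the returned state object so the ByteArray (length, pointer) pairs
// stay valid until the Chapel side finishes its parallel copy.
#include <parquet/api/reader.h>
#include <memory>
#include <string>
#include <vector>

struct StringColumnState {
  std::unique_ptr<parquet::ParquetFileReader> fileReader;        // owns the file
  std::vector<std::shared_ptr<parquet::ColumnReader>> rgReaders;  // kept alive
  std::vector<parquet::ByteArray> values;                         // (len, ptr) pairs
  int64_t totalBytes = 0;                                         // sizes the Chapel buffer
};

StringColumnState readStringColumn(const std::string& path, int colIdx) {
  StringColumnState st;
  st.fileReader = parquet::ParquetFileReader::OpenFile(path);
  int numRowGroups = st.fileReader->metadata()->num_row_groups();
  for (int rg = 0; rg < numRowGroups; rg++) {
    auto rgReader = st.fileReader->RowGroup(rg);
    int64_t numRows = rgReader->metadata()->num_rows();
    auto colReader = rgReader->Column(colIdx);
    st.rgReaders.push_back(colReader);  // must outlive the ByteArray pointers

    auto* byteReader = static_cast<parquet::ByteArrayReader*>(colReader.get());
    std::vector<int16_t> defLevels(numRows);
    std::vector<parquet::ByteArray> batch(numRows);
    int64_t valuesRead = 0, totalRead = 0;
    while (totalRead < numRows && byteReader->HasNext()) {
      byteReader->ReadBatch(numRows - totalRead, defLevels.data(), nullptr,
                            batch.data() + totalRead, &valuesRead);
      totalRead += valuesRead;
    }
    for (int64_t i = 0; i < totalRead; i++) {
      st.totalBytes += batch[i].len + 1;  // +1 for the null terminator
      st.values.push_back(batch[i]);
    }
  }
  // The Chapel side can now size its buffer from totalBytes and copy the
  // values out in parallel before this state object is destroyed.
  return st;
}
```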

Performance results reading one large file collected on an XC with a Lustre file system:

| nl (locales) | master | my branch |
|--------------|--------|-----------|
| 1            | 8.17s  | 1.56s     |
| 2            | 8.06s  | 1.95s     |
| 4            | 8.32s  | 1.90s     |
| 8            | 8.31s  | 1.96s     |

So, this results in a 4-5x speedup.

The reason that 1 node reads faster than 2 nodes is that each file is assigned to a locale, rather than split into chunks across locales (as in the existing implementation). This means the first node reads the contents and then broadcasts them out to the other nodes, which adds a modest amount of overhead.

bmcdonald3 self-assigned this Apr 5, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 16, 2024
* Add standalone string read benchmark

* Fix typos

* Add standalone benchmarks

* remove C++ prints

* Add attempt for string read with reader in Chapel

* Seems to be working???

* Working baby

* Add timers

* Add timers

* Add timers

* Add C++ timers

* Pretty much working

* Working on single locale

* All working except copy

* Working about from start idx

* Row group start index determine

* Add free and remove write

* Free on locale owning data

* Parallelize over files with reader hash

* create readers serially, read parallel

* Fix row group segment calculation

* Clean up

* From horizon

* Think working, party

* Fix row group size

* Remove standalone benchmark directory

* Clean up

* Clean up Chapel standalone file

* Fix indentation from working on other machines

* Fix void pointers for 1.31

* Fix null values

* Clean up

* Add error handling

* Fix seg fault

* fix null check

* Fix error handling

* Clean up deprecation

* Fix null check

* Fix compat modules

* Address Tess feedback

* Oops...

* Fix wrong index

* Fix loop start

* Revert to old open file