Closes #3083: Optimize Parquet string read code #3082

Merged: 45 commits into Bears-R-Us:master, Apr 16, 2024

Conversation

@bmcdonald3 (Contributor) commented Apr 5, 2024

This PR reworks the existing Parquet string read implementation by adding a distinct code path for strings, since their layout differs from that of the other types.

Unlike other Arkouda types such as int and bool, Parquet strings cannot be read directly into Chapel buffers, because Arrow uses its own representation for strings.
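For readers unfamiliar with that layout: Arrow stores a string column as a shared values buffer plus offsets, not as one independently owned byte buffer per element, so the values have to be copied out into the flat, null-terminated form Arkouda's Chapel buffers use. Below is a minimal, illustrative C++ sketch of that kind of flattening copy, assuming Arrow's `arrow::StringArray` API; it is not code from this PR.

```cpp
// Illustrative only, not code from this PR: flatten an Arrow string
// column into a single byte buffer with null terminators, the rough
// shape Arkouda's Chapel-side string buffers expect.
#include <arrow/array.h>
#include <cstdint>
#include <vector>

std::vector<uint8_t> flattenStrings(const arrow::StringArray& arr) {
  std::vector<uint8_t> out;
  for (int64_t i = 0; i < arr.length(); i++) {
    if (arr.IsNull(i)) {        // treat nulls as empty strings here
      out.push_back('\0');
      continue;
    }
    auto s = arr.GetView(i);    // view into Arrow's shared values buffer
    out.insert(out.end(), s.begin(), s.end());
    out.push_back('\0');        // Arkouda-style terminator
  }
  return out;
}
```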

Prior to this PR, Parquet string reading read each file twice: a first pass to compute the total number of bytes (which requires reading the entire file, since the byte count is not available in the Parquet metadata), and a second pass to actually read the data and store it into the Arkouda buffer. Each Parquet string was read one value at a time and then deep-copied on the C++ side into the Chapel buffer.
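As a rough sketch of what that first pass looks like with the low-level Parquet C++ reader API (illustrative only; `countStringBytes` is a hypothetical name, not one of the PR's functions, and this assumes a required, non-nullable string column for brevity):

```cpp
// Illustrative sketch, not the PR's code: the pre-PR first pass walked
// the whole file just to learn the total byte count, because that count
// is not stored in the Parquet metadata.
#include <parquet/api/reader.h>
#include <memory>
#include <string>

int64_t countStringBytes(const std::string& path, int colIdx) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  int64_t totalBytes = 0;
  int numRowGroups = reader->metadata()->num_row_groups();
  for (int rg = 0; rg < numRowGroups; rg++) {
    auto rgReader = reader->RowGroup(rg);
    auto col = std::static_pointer_cast<parquet::ByteArrayReader>(
        rgReader->Column(colIdx));
    parquet::ByteArray value;
    int64_t valuesRead = 0;
    while (col->HasNext()) {
      col->ReadBatch(1, nullptr, nullptr, &value, &valuesRead);
      totalBytes += value.len + 1;   // +1 for Arkouda's null terminator
    }
  }
  // The old flow then re-opened the file and repeated this walk,
  // copying each value one at a time into the Chapel buffer.
  return totalBytes;
}
```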

This PR takes a different approach: it reads all of the values in a single pass and stores them in an intermediate data structure, avoiding the second file read.

To get the data from Arrow into Chapel, we create a row group reader for each row group in the file; these readers need to be kept alive so that the memory backing the Arrow strings is not cleaned up. Once all of the data is stored in the intermediate data structure, the copies into the Chapel buffer can be done in parallel on the Chapel side.
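Below is a rough, illustrative C++ sketch of that single-pass shape, assuming the low-level `parquet::ByteArrayReader` API and a required (non-nullable) column. The names (`StringColumnChunks`, `readStringColumn`) are hypothetical and the exact ownership details differ from the PR; the point it illustrates is that the reader objects owning the decoded buffers must stay alive until the Chapel side has copied the bytes out.

```cpp
// Illustrative sketch, not the PR's code: read every string value once,
// keeping the readers alive so the memory the ByteArrays point into is
// not freed before the Chapel side copies it out.
#include <parquet/api/reader.h>
#include <memory>
#include <string>
#include <vector>

struct StringColumnChunks {            // hypothetical holder type
  // The file/column readers must outlive the ByteArrays below: each
  // ByteArray::ptr points into buffers owned by the readers' decoders.
  std::unique_ptr<parquet::ParquetFileReader> file;
  std::vector<std::shared_ptr<parquet::ColumnReader>> columnReaders;
  std::vector<std::vector<parquet::ByteArray>> values;  // one vector per row group
};

StringColumnChunks readStringColumn(const std::string& path, int colIdx) {
  StringColumnChunks out;
  out.file = parquet::ParquetFileReader::OpenFile(path);
  int numRowGroups = out.file->metadata()->num_row_groups();
  for (int rg = 0; rg < numRowGroups; rg++) {
    auto rgReader = out.file->RowGroup(rg);
    int64_t numRows = rgReader->metadata()->num_rows();
    auto col = std::static_pointer_cast<parquet::ByteArrayReader>(
        rgReader->Column(colIdx));
    std::vector<parquet::ByteArray> vals(numRows);
    int64_t totalRead = 0;
    while (col->HasNext() && totalRead < numRows) {
      int64_t valuesRead = 0;
      col->ReadBatch(numRows - totalRead, nullptr, nullptr,
                     vals.data() + totalRead, &valuesRead);
      totalRead += valuesRead;
    }
    out.columnReaders.push_back(col);   // keep alive past this loop
    out.values.push_back(std::move(vals));
  }
  // The Chapel side can now copy out.values[rg][i].ptr / .len into its
  // own buffer in parallel, and only then let this struct be destroyed.
  return out;
}
```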

Performance results reading one large file collected on an XC with a Lustre file system:

| nl | master | my branch |
|---:|-------:|----------:|
| 1  | 8.17s  | 1.56s     |
| 2  | 8.06s  | 1.95s     |
| 4  | 8.32s  | 1.90s     |
| 8  | 8.31s  | 1.96s     |

So, this results in a 4-5x speedup.

The reason that 1 node reads faster than 2 nodes is that each file is assigned to a single locale, rather than split into chunks as in the existing implementation. The first node therefore reads the file's contents and then broadcasts them out to the other nodes, which adds a modest amount of overhead.

Closes #3083

@bmcdonald3 bmcdonald3 changed the title Optimize Parquet string read code Closes #3083: Optimize Parquet string read code Apr 5, 2024
@stress-tess (Member) left a comment:
Looks good to me!! Thanks for doing this! I don't envy u having to figure out how to keep the maps from going out of scope switching from c++ to chpl lol

Review threads (all resolved):
- src/ArrowFunctions.h
- src/ParquetMsg.chpl
@bmcdonald3 (Contributor, Author) replied:
Also, regarding your higher-level comment: the hard part here wasn't getting the maps in place, it was figuring out that we needed them in the first place! There is no indication in the API that you need to keep column readers in scope to keep the strings' allocated data around, so that was a long rabbit chase to figure out.

@jaketrookman (Contributor) left a comment:

Looks good

@stress-tess (Member) commented Apr 15, 2024:

awesome @bmcdonald3 it looks like this will be good to go once the merge conflicts are resolved (sorry I messed urs up by merging mine first 😅 )

@bmcdonald3 (Contributor, Author) commented:
@stress-tess should be good to go

@stress-tess stress-tess added this pull request to the merge queue Apr 16, 2024
Merged via the queue into Bears-R-Us:master with commit f22e5b2 Apr 16, 2024
13 checks passed