Closes #3083: Optimize Parquet string read code #3082
Conversation
Looks good to me!! Thanks for doing this! I don't envy u having to figure out how to keep the maps from going out of scope switching from c++ to chpl lol
Also, in regards to your higher level comment, the hard part here wasn't getting the maps in place, it was figuring out that we needed them in the first place! There is no indication in the API that you need to keep column readers in scope to keep string allocated data around, so that was a long rabbit chase to figure that one out.
Looks good
awesome @bmcdonald3 it looks like this will be good to go once the merge conflicts are resolved (sorry I messed urs up by merging mine first 😅 )
@stress-tess should be good to go
This PR reworks the existing Parquet string read implementation by adding a distinct code path for strings, since their layout differs from that of other types.
Parquet strings cannot be read directly into Chapel buffers the way other Arkouda types, like int and bool, can, because Arrow uses its own representation for strings.
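To illustrate the mismatch, here is a minimal self-contained sketch. `ByteArrayView`, `FlatStrings`, and `flatten` are hypothetical names invented for this example (not Arkouda's actual code): Arrow hands back string values as pointer-plus-length views into Arrow-owned memory, while the Chapel side wants one flat bytes buffer with null terminators plus an offsets array, so every value has to be copied and re-laid-out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for Arrow's view of one string value:
// a pointer into Arrow-owned memory plus a length (not null-terminated).
struct ByteArrayView {
    const uint8_t* ptr;
    uint32_t len;
};

// Arkouda-style layout: one flat bytes buffer with a null terminator
// after each value, plus an offsets array marking where each value starts.
struct FlatStrings {
    std::vector<uint8_t> bytes;
    std::vector<int64_t> offsets;
};

// Copy each Arrow-side view into the flat Chapel-side layout.
FlatStrings flatten(const std::vector<ByteArrayView>& views) {
    FlatStrings out;
    for (const auto& v : views) {
        out.offsets.push_back(static_cast<int64_t>(out.bytes.size()));
        out.bytes.insert(out.bytes.end(), v.ptr, v.ptr + v.len);
        out.bytes.push_back(0);  // null terminator between values
    }
    return out;
}
```

The point of the sketch is only that a deep copy is unavoidable: the two layouts are different, so the views cannot simply be handed across the C++/Chapel boundary.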
Prior to this PR, each Parquet file was read twice: once to determine the number of bytes (which requires reading the entire file, since the byte count is not available in the metadata), and a second time to actually read and store the data into the Arkouda buffer. Each Parquet string was read one value at a time and then deep copied on the C++ side into the Chapel buffer.
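The two-pass scheme can be sketched as follows. `FakeFile`, `countBytes`, and `readInto` are hypothetical stand-ins (the real code goes through Arrow's column reader API, one value at a time); the sketch only shows why the file must be walked twice when the byte count is not in the metadata.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a Parquet file's string column: each
// "read" of the file walks every value, as the real reader must.
using FakeFile = std::vector<std::string>;

// Pass 1: the byte count is not in the metadata, so the whole column
// must be scanned just to size the destination buffer.
int64_t countBytes(const FakeFile& file) {
    int64_t total = 0;
    for (const auto& s : file)
        total += static_cast<int64_t>(s.size()) + 1;  // +1 for null terminator
    return total;
}

// Pass 2: re-read the column, deep copying one value at a time into
// the pre-sized buffer handed over from Chapel.
void readInto(const FakeFile& file, uint8_t* buf) {
    int64_t pos = 0;
    for (const auto& s : file) {
        std::memcpy(buf + pos, s.data(), s.size());
        pos += static_cast<int64_t>(s.size());
        buf[pos++] = 0;  // null terminator between values
    }
}
```

Every value is thus touched twice, which is the overhead this PR removes.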
This PR takes a different approach: all of the values are read in once and stored in an intermediary data structure, avoiding the second file read.
To get the data from Arrow into Chapel, we create a row group reader for each row group in the file; each reader must be kept alive, because destroying it would free the data backing the Arrow strings. Once all of the data is stored in the interim data structure, the copies into the Chapel buffer can be done in parallel on the Chapel side.
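A minimal sketch of the keep-alive idea, with hypothetical names (`RowGroupReader`, `InterimStrings`, `readOnce` are invented for this example, not Arkouda's actual code): the interim structure records cheap views into reader-owned memory along with a running byte count, and also holds the readers themselves so the memory the views point into stays valid until the copy into the Chapel buffer completes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow row group reader that owns the
// memory backing the string values it hands out.
struct RowGroupReader {
    std::vector<std::string> storage;  // reader-owned string data
};

// Interim structure: views into reader-owned memory, plus the readers
// themselves. Dropping a reader would free the memory its views point
// into, so the readers are kept alive here until the copies are done.
struct InterimStrings {
    std::vector<std::shared_ptr<RowGroupReader>> readers;  // keep-alive
    std::vector<std::pair<const char*, size_t>> views;
    int64_t totalBytes = 0;
};

// Single pass: visit every row group once, recording views and the
// running byte count, so no second file read is needed for sizing.
InterimStrings readOnce(std::vector<std::shared_ptr<RowGroupReader>> rgs) {
    InterimStrings out;
    for (auto& rg : rgs) {
        for (const auto& s : rg->storage) {
            out.views.emplace_back(s.data(), s.size());
            out.totalBytes += static_cast<int64_t>(s.size()) + 1;
        }
        out.readers.push_back(std::move(rg));  // keep the reader in scope
    }
    return out;
}
```

Once `readOnce` returns, the views and total size are known, so the final copies can be parallelized freely; only after they finish may the readers be released.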
Performance results reading one large file collected on an XC with a Lustre file system:
This results in a 4-5x speedup.
The reason that 1 node reads faster than 2 nodes is that each file is assigned to a locale, rather than a chunk of the file (as in the existing implementation). As a result, the first node reads the contents and then broadcasts them out to the other nodes, which adds a modest amount of overhead.
Closes #3083