This PR reworks the existing Parquet string read implementation by introducing a distinct read path for strings, since their layout differs from that of the other types.
Parquet strings cannot be read directly into Chapel buffers the way other Arkouda types such as int and bool can, because Arrow uses a distinct representation for strings.
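For illustration (not code from this PR): in the low-level Parquet C++ API, a string value arrives as a `parquet::ByteArray`, i.e. a length plus a pointer into reader-owned memory, so producing the flat, null-terminated byte buffer that a Chapel-side strings array uses requires copying each value out. The helper below is a hypothetical sketch of that flattening step.

```cpp
#include <parquet/types.h>  // parquet::ByteArray: { uint32_t len; const uint8_t* ptr; }

#include <cstdint>
#include <vector>

// Hypothetical helper: flatten variable-length ByteArray values into one
// contiguous, null-terminated byte buffer (the layout a flat Chapel buffer
// would expect). Fixed-width types such as int and bool skip this step entirely.
std::vector<uint8_t> flattenStrings(const std::vector<parquet::ByteArray>& vals) {
  size_t total = 0;
  for (const auto& v : vals)
    total += v.len + 1;  // +1 for a null terminator per string

  std::vector<uint8_t> out;
  out.reserve(total);
  for (const auto& v : vals) {
    out.insert(out.end(), v.ptr, v.ptr + v.len);
    out.push_back('\0');
  }
  return out;
}
```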
Prior to this PR, Parquet string reading made two passes over each file: a first pass to determine the number of bytes, which requires reading the entire file because the byte count is not available in the metadata, and a second pass to actually read the data and store it in the Arkouda buffer. Each Parquet string was read one value at a time and then deep copied on the C++ side into the Chapel buffer.
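Roughly, the pre-PR flow looked like the sketch below (simplified, with hypothetical names; not the actual Arkouda source): the first pass walks every value just to total the byte count, and a second pass, not shown, re-opens the file and deep copies each value into the Chapel buffer.

```cpp
#include <parquet/api/reader.h>

#include <memory>
#include <string>

// Pass 1 of the old approach: read the whole column just to learn how many
// bytes the strings occupy, since the Parquet metadata does not record it.
int64_t countStringBytes(const std::string& path, int colIdx) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  int64_t bytes = 0;
  for (int rg = 0; rg < reader->metadata()->num_row_groups(); rg++) {
    std::shared_ptr<parquet::ColumnReader> col = reader->RowGroup(rg)->Column(colIdx);
    auto* baReader = static_cast<parquet::ByteArrayReader*>(col.get());
    while (baReader->HasNext()) {
      parquet::ByteArray value;
      int16_t defLevel;
      int64_t valuesRead;
      baReader->ReadBatch(1, &defLevel, nullptr, &value, &valuesRead);
      if (valuesRead > 0)
        bytes += value.len + 1;  // +1 for the null terminator
    }
  }
  return bytes;  // pass 2 (not shown) re-reads the file and copies each value
}
```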
This PR takes a different approach: it reads all of the values once and stores them in an intermediate data structure, avoiding the second file read.
To get the data from Arrow into Chapel, we create a row group reader for each row group in the file and keep it alive so that the data backing the Arrow strings is not cleaned up. Once all of the data is stored in the intermediate data structure, the copies into the Chapel buffer can be done in parallel on the Chapel side.
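A hedged sketch of the single-pass idea (simplified, hypothetical names): the readers are held alongside the collected `ByteArray` values, because those values point into memory the readers own; Chapel can then copy out of `values` in parallel and free the whole structure afterwards.

```cpp
#include <parquet/api/reader.h>

#include <memory>
#include <string>
#include <vector>

// Everything needed to hand the strings to Chapel without a second file read.
// The readers are kept alive because the ByteArray pointers reference memory
// that they own.
struct StringColumnBatch {
  std::unique_ptr<parquet::ParquetFileReader> fileReader;
  std::vector<std::shared_ptr<parquet::ColumnReader>> readers;  // keep alive
  std::vector<parquet::ByteArray> values;  // len + ptr into reader-owned memory
  int64_t totalBytes = 0;
};

StringColumnBatch readStringColumn(const std::string& path, int colIdx) {
  StringColumnBatch batch;
  batch.fileReader = parquet::ParquetFileReader::OpenFile(path);
  for (int rg = 0; rg < batch.fileReader->metadata()->num_row_groups(); rg++) {
    auto col = batch.fileReader->RowGroup(rg)->Column(colIdx);
    batch.readers.push_back(col);  // hold a reference so the string data stays valid
    auto* baReader = static_cast<parquet::ByteArrayReader*>(col.get());
    while (baReader->HasNext()) {
      parquet::ByteArray value;
      int16_t defLevel;
      int64_t valuesRead;
      baReader->ReadBatch(1, &defLevel, nullptr, &value, &valuesRead);
      if (valuesRead > 0) {
        batch.values.push_back(value);
        batch.totalBytes += value.len + 1;  // +1 for the null terminator
      }
    }
  }
  return batch;  // Chapel copies batch.values[i] into its buffer in parallel
}
```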
Performance results for reading one large file, collected on an XC with a Lustre file system:
| nl | master | my branch |
|---:|-------:|----------:|
| 1  | 8.17s  | 1.56s     |
| 2  | 8.06s  | 1.95s     |
| 4  | 8.32s  | 1.90s     |
| 8  | 8.31s  | 1.96s     |
So, this results in a 4-5x speedup.
The reason that 1 node reads faster than 2 nodes is that each file is assigned to a single locale rather than being split into chunks (as in the existing implementation): the first node reads the contents and then broadcasts them to the other nodes, which adds a modest amount of overhead.
* Add standalone string read benchmark
* Fix typos
* Add standalone benchmarks
* remove C++ prints
* Add attempt for string read with reader in Chapel
* Seems to be working???
* Working baby
* Add timers
* Add timers
* Add timers
* Add C++ timers
* Pretty much working
* Working on single locale
* All working except copy
* Working about from start idx
* Row group start index determine
* Add free and remove write
* Free on locale owning data
* Parallelize over files with reader hash
* create readers serially, read parallel
* Fix row group segment calculation
* Clean up
* From horizon
* Think working, party
* Fix row group size
* Remove standalone benchmark directory
* Clean up
* Clean up Chapel standalone file
* Fix indentation from working on other machines
* Fix void pointers for 1.31
* Fix null values
* Clean up
* Add error handling
* Fix seg fault
* fix null check
* Fix error handling
* Clean up deprecation
* Fix null check
* Fix compat modules
* Address Tess feedback
* Oops...
* Fix wrong index
* Fix loop start
* Revert to old open file