Apache Parquet Support #903
Also, if anyone is interested in seeing the work that is being done, here is the branch that I have been working on: https://github.com/bmcdonald3/chapel/tree/parquet
@ben-albrecht Awesome! Thanks for the update and really glad that you're bringing this across the finish line. Just my two cents regarding your questions:
Again, just my opinions; I will definitely defer to @mhmerrill and @reuster986.
@bmcdonald3 welcome aboard and thanks for taking on this important work! I appreciate and agree with your general overview of parquet format. One thing to note up front is that we will almost always want to read the same columns from multiple files at once, since our data feeds come in as many files per day, all with the same schema. To your questions:
Unfortunately, I can't give you any sample files at the moment, but I can work on generating some based on an educated guess. Thanks!
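To make that multi-file, same-schema pattern a bit more concrete, here is a rough pyarrow sketch (file and column names are just placeholders, not our actual feeds):

```python
# Sketch of the multi-file, same-schema pattern described above: read the
# same columns from every file of a daily feed, then stitch them together.
# File and column names are placeholders.
import pyarrow as pa
import pyarrow.parquet as pq

files = ["feed_part1.parquet", "feed_part2.parquet"]   # placeholder paths
columns = ["src", "dst", "bytes"]                      # placeholder columns

tables = [pq.read_table(f, columns=columns) for f in files]
combined = pa.concat_tables(tables)   # works because every file shares a schema

print(combined.num_rows, combined.schema.names)
```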
Thanks for the feedback guys, I have been working on getting everything in good shape and will hopefully have something to show for it soon. One question, however, that hadn't occurred to me when creating this issue: is the ability to write to a Parquet file also valuable/important? Or is the Parquet use case going to be based entirely on reading Parquet files, with any writing done to an HDF5 file rather than a Parquet file? If writing to Parquet files is important, what functionality would you envision being useful for that? Thank you.
@bmcdonald3 yes, writing to Parquet is very important! Basically, however we read in, we should be able to write out.
@bmcdonald3 I second @hokiegeek2 that being able to round-trip arrays to/from parquet is important. One place where this shows up is in the unit tests/CI, where we will typically generate a few random arrays in arkouda, write them to disk, and read them back in to make sure the result is the same as the original. Having that capability with parquet will be very important for writing useful tests (as well as for users to get data out).
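For reference, the round-trip check itself is tiny on the pyarrow side; a unit test boils down to something like the sketch below (names and sizes are arbitrary). The Arkouda-facing API for this is exactly what is being worked out in this issue.

```python
# Minimal round-trip sketch: write a random array to Parquet, read it back,
# and check that the result matches the original. Names/sizes are arbitrary.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

original = np.random.randint(0, 2**31, size=1_000_000).astype(np.int64)

pq.write_table(pa.table({"values": original}), "roundtrip.parquet")
readback = pq.read_table("roundtrip.parquet").column("values").to_numpy()

assert np.array_equal(original, readback)
```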
Here is a quick summary of the progress that has been made with Parquet so far.

Summary

Using the Arrow GLib C interface, we have been able to get Parquet I/O working in Arkouda, but it is by no means perfect just yet and there is quite a bit of remaining work. Having a discussion at next week's Arkouda call (or another time) about what is needed on the Arkouda end to get this to a usable place would be valuable for determining where to focus our efforts. So far we are able to:
- read from a Parquet file
- write to a Parquet file (on 2 locales)

Challenges

Conclusion

I am mainly wondering:
I am planning to give a quick demo of the code at the next Arkouda call and would appreciate any feedback there (tagging @glitch from Mike's recommendation).
@bmcdonald3 this is really cool, really exciting! Are you also planning on adding read-write capability to the Plasma Object Store? |
@hokiegeek2 that has not been a priority of mine as of late, since I have been focusing on Parquet I/O through Arrow, but I did spend some time looking into the Plasma Object Store prior to this work. It seemed to me there were some pretty strong limitations with it (objects are read-only, must be in shared memory, etc.), so we put that on hold for the time being, but we could invest more into that effort if it would be valuable/helpful to the Arkouda team. Is that something that would be valuable to you/others? If so, could you provide an example use-case that you would like to be able to use in Arkouda?
@bmcdonald3 My thoughts are that enabling Chapel I/O with the Plasma Object Store could enable zero-copy data sharing between Chapel locales and distributed Python frameworks such as Ray, Dask, and PySpark, as well as distributed Java/Scala frameworks such as Spark. The main use case would be for an Arkouda user to downselect an Arkouda array and then do a zero-copy read of that array into one of these other tools: the Python ML/DL ecosystem (Ray, Dask, PySpark), Spark for integration with Spark tools and frameworks, and basically any other distributed framework that utilizes the Plasma Object Store for in-memory data sharing. Chapel-Plasma Object Store integration is definitely not an immediate need, as this is not an Arkouda development priority. This is an idea I've had for a while, and my plan is to investigate it on my own time to see what's possible. I've just recently started looking into how to make this happen on the Ray/Dask/PySpark/Spark side. The Ray option seems to be the most accessible one, as Ray distributes a plasma object store with each Ray worker.
Here is a quick update on the performance that has been observed reading Parquet files into Chapel arrays with various configurations, collected on a Linux server:
For context, from the Apache Arrow documentation: "A Row Group is a logical horizontal partitioning of the data into rows." Regarding file-writing performance: it is about 15x slower than HDF5, but no work has been done yet to investigate how we could improve it, so I didn't think it was fair to display it in a table alongside the read speeds. Edit: the Arrow documentation recommends large row group sizes (~512MB-1GB), which causes each column to be stored contiguously; that makes reading an entire column faster, which is exactly what the benchmark above does (it reads one entire column).
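To make the row-group point concrete: in pyarrow the row group size is just a knob on the write call, so the same data can be laid out very differently on disk. The sketch below is illustrative only, not the benchmark code.

```python
# Illustration of how the row group size is chosen at write time in pyarrow:
# larger row groups keep each column in fewer, larger contiguous chunks,
# which is what makes reading an entire column faster. Sizes are arbitrary.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"values": np.arange(10_000_000, dtype=np.int64)})

pq.write_table(table, "small_groups.parquet", row_group_size=100_000)
pq.write_table(table, "large_groups.parquet", row_group_size=5_000_000)

for path in ("small_groups.parquet", "large_groups.parquet"):
    print(path, pq.ParquetFile(path).metadata.num_row_groups, "row groups")
```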
As I start working on reading Parquet files into Chapel arrays, I thought I would summarize what I have found so far to get feedback and make sure I am going down the right path.
Summary of Parquet I/O using GLib Interface
Interacting with Parquet files
From what I have gathered, the industry norm for interacting with Parquet files looks something like the following:
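Roughly, in pyarrow terms (placeholder names; the real implementation goes through the GLib interface): the file is read into an in-memory Arrow table once, and columns are then accessed from that table rather than re-read from disk.

```python
# Sketch of the common Parquet access pattern: read the file into an
# in-memory Arrow table once, then access columns from the table.
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")   # one pass over the file

ids = table.column("id")                # in-memory column access
values = table.column("value")
print(table.num_rows, ids.type, values.type)
```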
Chapel Implementation
First proposal: reading specific columns from a file
Based on this information, I have been taking the approach of allowing specific columns to be read from the Parquet file, which looks something like this:
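(In pyarrow terms, the idea is roughly the following; the names are placeholders and this is not the actual Chapel code, which lives on the branch linked earlier.)

```python
# Rough pyarrow analogue of the first proposal: each request reads one named
# column directly from the Parquet file. Path/column names are placeholders.
import pyarrow.parquet as pq

def read_column_from_file(path: str, column: str):
    # Only the requested column is read from disk on each call.
    return pq.read_table(path, columns=[column]).column(column)

values = read_column_from_file("data.parquet", "value")
```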
Second proposal: reading entire files at a time
edit: adding this third proposal
Third proposal: reading columns from in-memory table (not file)
When files are read through Arrow, they are stored in tables. Since the implementation of how this is done is largely out of our control (and based on the info in the "Interacting with Parquet files" section above), this seems to be the most performant method for large files: the table is read in once and then passed to the various methods, with the data already in memory.
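As a sketch of the difference, again in pyarrow terms with placeholder names: the table is loaded once and every subsequent column request is served from memory.

```python
# Rough pyarrow analogue of the third proposal: load the table once, then
# serve every subsequent column request from memory (no further file I/O).
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")   # single read of the file

def read_column_from_table(tbl, column: str):
    # Purely in-memory access; the file is never touched again.
    return tbl.column(column)

ids = read_column_from_table(table, "id")
values = read_column_from_table(table, "value")
```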
Remaining questions
Would it be preferable to have a generic getColumn() rather than getIntColumn() and other type-specific variants? (A sketch of the generic approach is at the end of this comment.)
Lastly, any sample files that match the expected size and format of the Parquet files you intend to work with would be greatly appreciated.
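To make the getColumn() question above a bit more concrete: in Arrow, a column carries its own type, so a generic getter can inspect it and dispatch. Here is a pyarrow-flavored sketch (the helper names are just illustrative, not a proposed Chapel signature):

```python
# Sketch of the "generic getColumn()" idea: an Arrow column carries its own
# type, so a single entry point can dispatch on it instead of needing
# getIntColumn() and other type-specific variants.
import pyarrow as pa
import pyarrow.parquet as pq

def get_column(table: pa.Table, name: str):
    col = table.column(name)
    if pa.types.is_integer(col.type):
        return col.to_numpy()      # numeric columns come back as arrays
    if pa.types.is_string(col.type):
        return col.to_pylist()     # strings handled differently
    raise TypeError(f"unsupported column type: {col.type}")

table = pq.read_table("data.parquet")   # placeholder path
values = get_column(table, "value")     # placeholder column name
```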