Implement returning dictionary arrays from parquet reader #171
Comments
Comment from Andrew Lamb (alamb) @ 2021-02-02T11:50:29.862+0000: [~yordan-pavlov] I think this would be amazing -- and we would definitely use it in IOx. This is the kind of thing that is on our longer term roadmap and I would love to help (e.g. code review, testing, or documentation). Let me know!
Further to this, if the expected output schema is a dictionary array it will cast back and compute a new dictionary! See here. I'm going to take a stab at implementing this and see where I get to.
Spent some time digging into this and think I have a plan of action. I am off for the next day or so but will be looking to make a start on my return.

Background

The lowest level parquet API:
The next level up is
The next level is
Next up we have the
Finally we have
There is an additional
Given this I think it is acceptable to add a new Problems
Proposal Add two settings to
When reading if Within
This Otherwise it will produce DictionaryArray preserving the RLE dictionary. It will likely need to duplicate some logic from I think this will avoid making any breaking changes to clients, whilst allowing them to opt in to better behaviour when they know that their parquet files are such that this optimisation can be performed.
FYI @yordan-pavlov
In general sounds very cool @tustvold
What is the reason for defaulting this one to false? For any string dictionary column this is likely to almost always be what is wanted.
I can't remember if a single column can have pages encoded using both dictionary and non-dictionary encoding; I think it is possible, but I don't know if any real files have such a thing.
@tustvold thank you for looking into this, and for the excellent summary of the parquet reader stack - this should probably go in documentation somewhere as it takes a while to figure out. The main reason for the stalling of work on the Here are my thoughts on preserving dictionary arrays:
Looks like there already is a proposed implementation of comparison operations for Dictionary Arrays here: #984
@matthewmturner and I are working on it :) We still have a ways to go but I think we are making progress!
Congratulations 🎉
Unfortunately they definitely can, it is in fact explicitly highlighted in the arrow C++ blog post on the same topic here. IIRC it occurs when the dictionary grows too large for a single page.
The problem I was trying to avoid is However, having looked again I'm confident that I can avoid needing to expose My reasoning for having
Indeed, if the upstream is just going to fully materialize the dictionary in order to do anything with it, there is limited benefit to using dictionaries at all 😆
In the case of partial dictionary encoding, I wonder if the returned record batches could be split so that a single record batch only contains column data based on a single encoding (only dictionary or only plain encoded values for each column)? Wouldn't this enable that
I think it would be pretty confusing, and break quite a lot of code if
The encoding is per-page, and there is no relationship between what rows belong to what pages across column chunks. To get around this, the current logic requires that all ArrayReader return the batch_size number of rows, unless the This is why I originally proposed a However, my current plan I think sidesteps the need for this:
This avoids the need for config options, or changes to APIs with ambiguous termination criteria, whilst ensuring that most workloads will only compute dictionaries for sections of the parquet file where the dictionary encoding was incomplete. This should make it strictly better than current master, which always computes the dictionaries again having first fully materialized the values.
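To make the idea concrete, here is a minimal sketch of reusing the parquet dictionary where pages are dictionary encoded and only computing dictionary entries for plain-encoded fallback pages. It is not the arrow-rs implementation: the `Page` enum and `read_chunk` function are hypothetical names, pages are assumed to be already decoded, and the dictionary page is assumed to contain no duplicate values.

```rust
use std::collections::HashMap;

/// One decoded data page of a string column (hypothetical, simplified).
enum Page {
    /// Keys that index into the column chunk's dictionary page.
    Dictionary(Vec<u32>),
    /// Plain-encoded values written after the writer fell back.
    Plain(Vec<String>),
}

/// Reads a column chunk into `(keys, dictionary values)`.
/// Dictionary pages only copy keys; plain pages are interned on demand.
fn read_chunk(dictionary: Vec<String>, pages: &[Page]) -> (Vec<u32>, Vec<String>) {
    let mut values = dictionary;
    let mut keys = Vec::new();
    // The lookup table is only built if a plain page is actually seen.
    let mut index: Option<HashMap<String, u32>> = None;

    for page in pages {
        match page {
            // Fast path: the parquet dictionary is reused as-is.
            Page::Dictionary(page_keys) => keys.extend_from_slice(page_keys),
            // Slow path: extend the dictionary with any unseen values.
            Page::Plain(plain) => {
                let idx = index.get_or_insert_with(|| {
                    values
                        .iter()
                        .enumerate()
                        .map(|(i, v)| (v.clone(), i as u32))
                        .collect()
                });
                for v in plain {
                    let key = match idx.get(v) {
                        Some(&k) => k,
                        None => {
                            let k = values.len() as u32;
                            values.push(v.clone());
                            idx.insert(v.clone(), k);
                            k
                        }
                    };
                    keys.push(key);
                }
            }
        }
    }
    (keys, values)
}
```

In this sketch a chunk made up entirely of dictionary pages never touches the hash map, so the extra cost is only paid for the plain-encoded sections.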
@tustvold this latest approach you describe will probably work in many cases, but usually there is a reason for having partial dictionary encoding in parquet files - my understanding is that the reason usually is that the dictionary grew too big. And to have to reconstruct a big dictionary from plain-encoded parquet data sounds expensive and I suspect this will result in suboptimal performance and increased memory use.

If the possibility to have a mix of dictionary-encoded and plain-encoded pages is just how parquet works, then is this something that has to be abstracted / hidden? Furthermore, if we take the DataFusion Parquet reader as an example, here https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/file_format/parquet.rs#L436 we can see that it doesn't care about the number of rows in the record batches as long as the record batch iterator doesn't return

Finally, how would the user know that they have to make changes to
@tustvold I should have read the blog post you linked earlier (https://arrow.apache.org/blog/2019/09/05/faster-strings-cpp-parquet/) before commenting; it appears that the C++ implementation of the arrow parquet reader converts plain-encoded fallback pages into a dictionary similar to the latest approach you described:
This is my understanding also, with the current defaults "too big" would be a dictionary larger than 1MB
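As a rough illustration of how a size limit leads to mixed encodings within one column chunk, the sketch below picks an encoding per page and switches to plain once the accumulated dictionary exceeds a byte limit. The names `Encoding` and `page_encodings` are hypothetical, the size accounting only counts string bytes, and the exact point at which a real writer falls back may differ.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Encoding {
    Dictionary,
    Plain,
}

/// Hypothetical writer-side sketch: returns the encoding chosen for each page.
/// Once the dictionary grows past `dict_limit_bytes`, the page that caused the
/// overflow and every later page are written plain.
fn page_encodings(pages: &[Vec<&str>], dict_limit_bytes: usize) -> Vec<Encoding> {
    let mut dict: HashMap<&str, u32> = HashMap::new();
    let mut dict_bytes = 0usize;
    let mut fallen_back = false;
    let mut out = Vec::with_capacity(pages.len());

    for page in pages {
        if !fallen_back {
            for &v in page {
                if !dict.contains_key(v) {
                    let next = dict.len() as u32;
                    dict.insert(v, next);
                    // Simplification: only the string bytes are counted.
                    dict_bytes += v.len();
                }
            }
            if dict_bytes > dict_limit_bytes {
                fallen_back = true;
            }
        }
        out.push(if fallen_back {
            Encoding::Plain
        } else {
            Encoding::Dictionary
        });
    }
    out
}
```

With a limit of `1024 * 1024` bytes (the 1MB figure mentioned above), low-cardinality columns would stay dictionary encoded throughout, and only columns with many large distinct values would produce plain fallback pages.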
I don't disagree with this, this would represent a somewhat degenerate edge case. As currently articulated, we will only attempt to preserve the dictionary if the arrow schema encoded within the parquet file is for a dictionary type. Specifically this means the data that was written was already contained in arrow dictionary arrays. In order to yield RecordBatch with the same schema as was written we would therefore need to construct one or more dictionaries anyway. This ticket, or at least my interpretation of it, is an optimisation to avoid needing to compute new dictionaries where the parquet dictionary can be reused, and if not, to avoid fully hydrating the values first as the logic currently does.
In my opinion, yes. If I write a dictionary array, I expect to get one back. If for some crazy reason I write a column with a dictionary that exceeds the capabilities of the parquet format to store natively, I would expect that to be abstracted away. I do not think there being a performance and compression penalty for over-sized dictionaries is unreasonable? As an aside the maximum dictionary size is configurable, although I'm not really sure what the implications of increasing it are
Indeed,
My proposal is for this to be an internal optimisation within FWIW if |
… 60x perf improvement (#171) (#1180)
* Preserve dictionary encoding from parquet (#171)
* Use OffsetBuffer::into_array for dictionary
* Fix and test handling of empty dictionaries
  Don't panic if missing dictionary page
* Use ArrayRef instead of Arc<ArrayData>
* Update doc comments
* Add integration test
  Tweak RecordReader buffering logic
* Add benchmark
* Set write batch size in parquet fuzz tests
  Fix bug in column writer with small page sizes
* Fix test_dictionary_preservation
* Add batch_size comment
Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11410
Currently the Rust parquet reader returns a regular array (e.g. string array) even when the column is dictionary encoded in the parquet file.
If the parquet reader had the ability to return dictionary arrays for dictionary encoded columns this would bring many benefits such as:
[~nevime], [~alamb], let me know what you think.
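For readers unfamiliar with the distinction, the following std-only sketch contrasts a dictionary-encoded column with its materialized form. It does not use the arrow crate; `DictColumn`, `encode`, `materialize` and `filter_eq` are hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;

/// A dictionary-encoded string column: `keys[i]` indexes into `values`.
#[derive(Debug, PartialEq)]
struct DictColumn {
    keys: Vec<u32>,
    values: Vec<String>,
}

/// Dictionary-encodes a plain column, assigning keys in order of first appearance.
fn encode(plain: &[&str]) -> DictColumn {
    let mut index: HashMap<&str, u32> = HashMap::new();
    let mut values = Vec::new();
    let mut keys = Vec::with_capacity(plain.len());
    for &s in plain {
        let key = *index.entry(s).or_insert_with(|| {
            values.push(s.to_string());
            (values.len() - 1) as u32
        });
        keys.push(key);
    }
    DictColumn { keys, values }
}

/// Expands the column to one owned string per row, i.e. a regular array.
fn materialize(col: &DictColumn) -> Vec<String> {
    col.keys
        .iter()
        .map(|&k| col.values[k as usize].clone())
        .collect()
}

/// Equality filter that compares strings once per distinct value,
/// then compares integer keys per row.
fn filter_eq(col: &DictColumn, needle: &str) -> Vec<usize> {
    match col.values.iter().position(|v| v == needle) {
        Some(target) => col
            .keys
            .iter()
            .enumerate()
            .filter_map(|(row, &k)| (k as usize == target).then_some(row))
            .collect(),
        None => Vec::new(),
    }
}
```

Materializing copies every string per row, whereas keeping the dictionary form stores each distinct value once and lets operations such as `filter_eq` work mostly on integer keys.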