Add async arrow parquet reader #1154
Conversation
```diff
@@ -78,7 +78,6 @@ pub fn parse_metadata<R: ChunkReader>(chunk_reader: &R) -> Result<ParquetMetaData>
     // build up the reader covering the entire metadata
     let mut default_end_cursor = Cursor::new(default_len_end_buf);
-    let metadata_read: Box<dyn Read>;
```
Drive-by cleanup: this dynamic dispatch isn't necessary any more.
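For illustration, a minimal sketch of the difference this cleanup exploits -- reading through a `Box<dyn Read>` versus using the concrete `Cursor` directly (the names below are stand-ins, not the actual `parse_metadata` internals):

```rust
use std::io::{Cursor, Read};

fn main() {
    let buf = vec![1u8, 2, 3, 4];

    // Before: a trait object forces dynamic dispatch on every read call.
    let mut boxed: Box<dyn Read> = Box::new(Cursor::new(buf.clone()));
    let mut a = Vec::new();
    boxed.read_to_end(&mut a).unwrap();

    // After: the concrete Cursor lets the compiler dispatch statically
    // (and potentially inline the reads).
    let mut cursor = Cursor::new(buf);
    let mut b = Vec::new();
    cursor.read_to_end(&mut b).unwrap();

    assert_eq!(a, b);
}
```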
```rust
    fn column_chunks(&self, i: usize) -> Result<Box<dyn PageIterator>>;
}

impl RowGroupCollection for Arc<dyn FileReader> {
```
This does mean we have double dynamic dispatch; given these methods are only called a couple of times per file, I'm inclined to consider this largely irrelevant.
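A self-contained illustration of the double dispatch in question, using stand-in types rather than the real parquet traits: one virtual call goes through the outer trait object, and a second through the boxed iterator it returns:

```rust
use std::sync::Arc;

// Stand-ins for PageIterator / FileReader, not the real parquet traits.
trait Pages: Iterator<Item = u32> {}
impl<T: Iterator<Item = u32>> Pages for T {}

trait RowGroups {
    fn column_chunks(&self, i: usize) -> Box<dyn Pages>;
}

struct FileReaderLike;

impl RowGroups for FileReaderLike {
    fn column_chunks(&self, i: usize) -> Box<dyn Pages> {
        Box::new((0u32..3).map(move |p| (i as u32) * 10 + p))
    }
}

fn main() {
    // First dynamic dispatch: the method call through dyn RowGroups.
    let reader: Arc<dyn RowGroups> = Arc::new(FileReaderLike);
    // Second dynamic dispatch: each .next() goes through the dyn Pages vtable.
    let pages: Vec<u32> = reader.column_chunks(1).collect();
    assert_eq!(pages, vec![10, 11, 12]);
}
```

Since `column_chunks` runs only a handful of times per file, the extra vtable hop is negligible next to the I/O and decode work.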
Codecov Report
```
@@            Coverage Diff             @@
##           master    #1154      +/-   ##
==========================================
- Coverage   82.96%   82.90%    -0.07%
==========================================
  Files         178      180        +2
  Lines       51522    51969      +447
==========================================
+ Hits        42744    43083      +339
- Misses       8778     8886      +108
```
Continue to review full report at Codecov.
Thanks @tustvold -- this is really cool.

I suggest the following actions:

- Do a POC to use this `async` reader in DataFusion
- If that looks good, then fill out the tests for this

I'll try and find time later this week or this weekend to help if no one else beats me to it.
Pretty cool demo @tustvold 👍

Exciting news, thanks @tustvold
```diff
     type Item = Result<Page>;

     fn next(&mut self) -> Option<Self::Item> {
         self.get_next_page().transpose()
     }
 }

-impl<T: Read> PageReader for SerializedPageReader<T> {
+impl<T: Read + Send> PageReader for SerializedPageReader<T> {
```
As `PageReader: Send`, it can only be implemented for types that are `Send`, which is only the case for `SerializedPageReader<T>` if `T: Send`.
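A compilable sketch of how the bound propagates; the types below are simplified stand-ins, not the real `parquet` definitions:

```rust
use std::io::Read;

// Mirrors `trait PageReader: Send` -- Send is a supertrait.
trait PageReaderLike: Send {}

// Stand-in for SerializedPageReader<T>: a generic struct owning a reader.
struct SerializedLike<T: Read> {
    _inner: T,
}

// Because PageReaderLike requires Send, this impl must also require
// T: Send -- SerializedLike<T> is only Send when T is.
impl<T: Read + Send> PageReaderLike for SerializedLike<T> {}

fn assert_send<T: Send>(_: &T) {}

fn main() {
    // std::io::Empty is Read + Send, so the wrapper is Send too.
    let reader = SerializedLike { _inner: std::io::empty() };
    assert_send(&reader);
}
```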
```diff
@@ -43,8 +43,8 @@ pub trait Length {
 /// The ChunkReader trait generates readers of chunks of a source.
 /// For a file system reader, each chunk might contain a clone of File bounded on a given range.
 /// For an object store reader, each read can be mapped to a range request.
-pub trait ChunkReader: Length {
+pub trait ChunkReader: Length + Send + Sync {
     type T: Read;
```
These traits need to be both `Send + Sync`, as they are used through immutable references, e.g. `Arc`.
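A small sketch (again with a stand-in trait, not the real `ChunkReader`) of why sharing through `Arc` across threads forces both bounds:

```rust
use std::sync::Arc;
use std::thread;

// Stand-in with the same shape of bounds as ChunkReader.
trait ChunkReaderLike: Send + Sync {
    fn get_bytes(&self, offset: usize) -> Vec<u8>;
}

struct InMemory(Vec<u8>);

impl ChunkReaderLike for InMemory {
    fn get_bytes(&self, offset: usize) -> Vec<u8> {
        self.0[offset..].to_vec()
    }
}

fn main() {
    let reader: Arc<dyn ChunkReaderLike> = Arc::new(InMemory(vec![1, 2, 3]));

    // thread::spawn requires its closure (and thus the captured Arc) to be
    // Send; Arc<T> is only Send when T: Send + Sync, hence the trait bounds.
    let shared = Arc::clone(&reader);
    let handle = thread::spawn(move || shared.get_bytes(1));
    assert_eq!(handle.join().unwrap(), vec![2, 3]);
}
```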
With apache/datafusion#1617 I'm happy with the interface, so I'm marking this ready for review. I'll work on getting some better coverage, e.g. the fuzz tests, over the coming days.
```diff
@@ -74,7 +74,7 @@ fn create_table(results: &[RecordBatch]) -> Result<Table> {
     let mut cells = Vec::new();
     for col in 0..batch.num_columns() {
         let column = batch.column(col);
-        cells.push(Cell::new(&array_value_to_string(&column, row)?));
+        cells.push(Cell::new(&array_value_to_string(column, row)?));
```
Activating pretty_print in parquet appears to have made clippy find a load of new stuff in arrow 😅
Looks good @tustvold -- thank you. "give the people what they want!"

My largest potential concern is the introduction of `Send` and `Sync`. I would like to try this change against some other crate (datafusion or IOx perhaps) to make sure the new `Send` trait requirement doesn't cause undue challenges when upgrading.

As a follow on to this PR I think we (not you necessarily) should take a look through the documentation, document the new `async` feature flag, and add a doc example to this page: https://docs.rs/parquet/8.0.0/parquet/arrow/index.html (will file a ticket to do so).

I also think this change (especially the newly added `Send` and `Sync` trait boundaries) deserves some broader attention, so I'll send a note to the mailing list too.
"+----------+-------------+-----------+", | ||
], | ||
); | ||
} |
Other tests that might be cool:

- error cases (where the projection is out of bounds, or a row group index is out of bounds)
- row group filtering (as in read a multi-row-group parquet file but only read one of the row groups) -- a hedged sketch follows
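A sketch of the row-group-filter case, assuming a builder-style API with a `with_row_groups` method (the exact builder and method names may differ from what this PR exposes):

```rust
// Hypothetical test sketch -- API names are assumptions, not this
// PR's confirmed surface.
use futures::TryStreamExt;
use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;

#[tokio::test]
async fn test_read_single_row_group() -> Result<(), Box<dyn std::error::Error>> {
    let file = tokio::fs::File::open("multi_row_group.parquet").await?;

    // Read only row group 0 from a multi-row-group file.
    let stream = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .with_row_groups(vec![0])
        .build()?;

    let batches = stream.try_collect::<Vec<_>>().await?;
    assert!(!batches.is_empty());
    Ok(())
}
```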
Actually, don't we have to add …
Actually, I see apache/datafusion#1617 demonstrates what impact this has on DataFusion, which seems just fine 👍
https://github.com/apache/arrow-rs/runs/5004473864?check_suite_focus=true
Unless anyone else has any comments, I'll plan to merge this tomorrow
Added an example in #1253

Thanks again @tustvold 👍

tracking API change in #1264 (for changelog)
I see a few issues with this. First, the notion that the column chunk is the basic I/O unit for Parquet is somewhat outdated with the introduction of the page index. Second, a major premise of Parquet is "read only what you need", where what you need is usually dictated by some query engine, so continuously downloading data in the background that the client may not even want or need doesn't seem right, especially as the cost is complicating all existing clients with the added `Send` constraint.
I agree, insomuch as whatever mechanism we eventually add for more granular filter pushdown, be it the page index or something more sophisticated such as described in #1191, I would anticipate being used to refine the data that gets fetched.

This PR does not add functionality for doing this; it adds hooks for a query engine to use for doing this by providing something implementing …
Are these additions causing an issue for you? I have to confess I did not anticipate this causing issues, as almost all types are `Send`.
I am also interested in what issues (if any) adding the `Send` constraint causes. @zeevm if you have some time and are willing to help make the …
Which issue does this PR close?
Closes #111.
Rationale for this change
See ticket; in particular I wanted to confirm that it is possible to create an async parquet reader without any major changes to the parquet crate. This seems to come up as a frequent ask from the community, and I think we could support it without any major churn.
What changes are included in this PR?
Adds a layer of indirection to `array_reader` to abstract it away from files; I think this change may stand on its own merits.

It then adds a `ParquetRecordBatchStream`, which is a `Stream` that yields `RecordBatch`. Under the hood, this uses async to read row groups into memory and then feeds these into the non-async decoders.

The parquet docs describe the column chunk as the unit of IO, and so I think buffering compressed row groups in memory is not an impractical approach. It also avoids having to maintain sync and async versions of all the decoders, readers, etc...
Are there any user-facing changes?
This adds `Send + Sync` to `DataType`, `RowGroupReader`, `FileReader`, and `ChunkReader`.

It also adds `Send` to the various `std::io::Read` bounds.