Skip to content
This repository has been archived by the owner on Jul 25, 2022. It is now read-only.

Implement PyArrow Dataset TableProvider #59

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

kylebrooks-8451
Copy link

This implements a PyArrow Dataset TableProvider that allows for using Datasets as tables in Datafusion.

Fixes #10

Copy link

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great. You write really clean PyO3 code and I learned a lot going through this PR :)

I'm requesting changes, but the only thing I think you need to do is add the test cases I suggested. Everything else is a suggestion or FYI.

I'm not a contributor on this repo so can't approve the CI or merge, but happy to review again.

@houqp @jimexist One of you willing to approve CI?

datafusion/tests/test_context.py Show resolved Hide resolved
src/dataset.rs Outdated Show resolved Hide resolved
type Item = ArrowResult<RecordBatch>;

fn next(&mut self) -> Option<Self::Item> {
Python::with_gil(|py| {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires an upgrade in the version of datafusion, but FYI we have a zero-copy conversion from RecordBatchReaders into arrow-rs ArrowArrayStreamReader. Requires arrow-rs 17.0 and higher.

https://github.com/apache/arrow-rs/blob/b2cf02c7a8a5027d037fc359323bc0ed45b943de/arrow/src/pyarrow.rs#L204

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! I will plan on using that once Datafusion is upgraded in this project. At the time I authored this code for our internal project this did not exist.

fragments: fragments.into(),
columns,
filter_expr,
projected_statistics: Default::default(),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'm not sure of is whether we should be doing anything to fill in these statistics.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure on this either. I think supporting this is optional and we can add statistics collection on as a future improvement.

src/pyarrow_filter_expression.rs Outdated Show resolved Hide resolved
// Note that pyarrow.compute.{field,scalar} are put into Python globals() when evaluated
// isin, is_null, and is_valid (~is_null) are methods of pyarrow.dataset.Expression
// https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Expression.html#pyarrow-dataset-expression
fn try_from(expr: &Expr) -> Result<Self, Self::Error> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this will be limited to the max recursion depth of Rust. You might want to instead implement an ExpressionVisitor. https://docs.rs/datafusion/10.0.0/datafusion/logical_plan/trait.ExpressionVisitor.html

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that ExpressionVisitor would be a better approach due to no recursion. I noticed that your PR does this. I will give implementing it a shot for this PR.

}
}

impl TryFrom<&Expr> for PyArrowFilterExpression {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I think the long-term solution for this will be going through Substrait https://issues.apache.org/jira/browse/ARROW-16844

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to replace this code with Substait once it's ready.

@wjones127
Copy link

Also FYI we are moving this repo, so they might ask to re-open this PR at the new location soon. apache/datafusion-python#5

@wjones127
Copy link

cc @andygrove

@andygrove
Copy link
Contributor

Thanks @kylebrooks-8451 this is looking very cool. As @wjones127 mentioned we are just in the process of moving development to https://github.com/apache/arrow-datafusion-python so would you mind opening the PR there?

@kylebrooks-8451
Copy link
Author

Thanks @kylebrooks-8451 this is looking very cool. As @wjones127 mentioned we are just in the process of moving development to https://github.com/apache/arrow-datafusion-python so would you mind opening the PR there?

Thanks @andygrove! This PR has been moved to apache/datafusion-python#9. We can close this PR if you wish. I'm unable to make sure this works with DataFusion 10.0.0 because of this error:

error: failed to select a version for the requirement `parquet = "^18.0.0"`
candidate versions found which didn't match: 15.0.0, 14.0.0, 13.0.0, ...

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support reading from PyArrow datasets
3 participants