Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create ListingTableConfig which includes file format and schema inference #1715

Merged
merged 38 commits into from
Feb 10, 2022

Conversation

matthewmturner
Copy link
Contributor

Which issue does this PR close?

Closes #1705

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Jan 31, 2022
@matthewmturner
Copy link
Contributor Author

Just as FYI I havent had time this week to work on DataFusion PRs. I plan to pick up work on this early next week.

@matthewmturner
Copy link
Contributor Author

@alamb @houqp would you mind checking this out before I go updating all the other tests and places where this is used? I confess i struggled a bit with the borrow checker here and there is likely more idiomatic ways to achieve the goal of this - but i think that it at least serves as a starting point.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @matthewmturner -- I think this is looking good

file_schema: SchemaRef,
options: ListingOptions,
) -> Self {
pub async fn new(mut config: ListingTableConfig) -> Result<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this API -- it would be cool if we could add some more documentation (either here or in ListingTableConfig explaining that the options / schema are inferred if not explicitly created

Also, as this can now return an Err perhaps renaming to pub async fn try_new(...) would be a more idiomatic / clearer name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes was planning on adding docs once i was sure API was good. and will update to try_new!

@matthewmturner matthewmturner marked this pull request as ready for review February 7, 2022 17:37
@matthewmturner
Copy link
Contributor Author

@alamb any idea why only one CI check is running?

@alamb
Copy link
Contributor

alamb commented Feb 7, 2022

@alamb any idea why only one CI check is running?

I think it is because the PR has a conflict

@matthewmturner
Copy link
Contributor Author

I think it is because the PR has a conflict

Oh, duh, that makes sense. Sry. Will fix.

@matthewmturner
Copy link
Contributor Author

matthewmturner commented Feb 8, 2022

@alamb do you know of any Rust magic that would allow me to use an async method for impl TryInto<LogicalPlan> for &protobuf::LogicalPlanNode to get the ballista errors resolved? #[async_trait] doesnt work as it doesnt match the trait definition for TryInto

@matthewmturner
Copy link
Contributor Author

Thinking out loud here. Im wondering if i can move the required async functionality to all be within ListingTableConfig and the user can decide if its needed. Then ListingTable::new doesnt need to be async.

@alamb
Copy link
Contributor

alamb commented Feb 8, 2022

@alamb do you know of any Rust magic that would allow me to use an async method for impl TryInto for &protobuf::LogicalPlanNode to get the ballista errors resolved?

I do not know 😞

Thinking out loud here. Im wondering if i can move the required async functionality to all be within ListingTableConfig and the user can decide if its needed. Then ListingTable::new doesnt need to be async.

I think that sounds like a very good idea to me 👍

@matthewmturner
Copy link
Contributor Author

As an update, I need to handle the case of inferring file format when a partitioned file / directory is provided.

@matthewmturner
Copy link
Contributor Author

matthewmturner commented Feb 9, 2022

@alamb i think this is finally ready.

I havent included partition column inference, but i think that can be added as a follow on PR if needed.

let table =
ListingTable::new(Arc::new(LocalFileSystem {}), filename, schema, opt);
let config = ListingTableConfig::new(Arc::new(LocalFileSystem {}), filename)
.infer()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is cool

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is looking really good @matthewmturner 👍

@@ -271,7 +274,7 @@ async fn parquet_overlapping_columns() -> Result<()> {
Ok(())
}

fn register_partitioned_aggregate_csv(
async fn register_partitioned_aggregate_csv(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need to be async anymore does it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. ive fixed.

@matthewmturner
Copy link
Contributor Author

@alamb thx! as always appreciate your guidance.

is it ok for me to include this feature in post for 7.0?

@alamb
Copy link
Contributor

alamb commented Feb 9, 2022

Ok no now it has a conflict!

@matthewmturner
Copy link
Contributor Author

Ok no now it has a conflict!

working on it

@matthewmturner
Copy link
Contributor Author

@alamb hopefully were good now

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"json" => Ok(Arc::new(JsonFormat::default())),
"parquet" => Ok(Arc::new(ParquetFormat::default())),
_ => Err(DataFusionError::Internal(
"Unable to infer file type".into(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make the log more specifically with the suffix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I'll update.

@alamb alamb merged commit e5f6969 into apache:master Feb 10, 2022
@alamb
Copy link
Contributor

alamb commented Feb 10, 2022

Thanks again for the good work @matthewmturner

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify creating new ListingTable
4 participants