-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Custom / Dynamic table provider factories #3311
Conversation
@andygrove @alamb @houqp This was a quick and dirty implementation, but I thought I would throw it out there to collect your thoughts on it. If something like this is deemed acceptable and merged, we'd extend Ballista to use it as well, then either:
|
datafusion/sql/src/parser.rs
Outdated
"expect one of PARQUET, AVRO, NDJSON, or CSV, found: {}", | ||
other | ||
))), | ||
name => Ok(FileType::Custom(name.to_string())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually I'd like to un-hard-code the built-in tables and just have a ListingTableFactory
in the map so all these behave the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I'd love to use the postgres with options
syntax to pass an arbitrary hashmap for credentials, host & port, etc.
https://www.postgresql.org/docs/current/sql-createforeigntable.html
CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [
{ column_name data_type [ OPTIONS ( option 'value' [, ... ] ) ]
On second thought, credentials could be controversial... forget I said that :)
I think this looks like a great start. Thanks @avantgardnerio |
Oh, also related to #2025 |
I guess I should mention @matthewmturner since I linked that ticket. |
@avantgardnerio This is very cool. Unfortunately I haven't been able to work on datafusion lately but I could definitely see integrating this into datafusion tui again when I start again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also really like where this PR is heading
ctx.register_table_factory("DELTATABLE", Arc::new(TestTableFactory {})); | ||
|
||
let sql = "CREATE EXTERNAL TABLE dt STORED AS DELTATABLE LOCATION 's3://bucket/schema/table';"; | ||
ctx.sql(sql).await.unwrap(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API is very cool -- I like it a lot 💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, affirmation of general direction was what I was looking for, I'll clean it up and get it submitted, ty!
Codecov Report
@@ Coverage Diff @@
## master #3311 +/- ##
==========================================
+ Coverage 85.48% 85.51% +0.02%
==========================================
Files 294 294
Lines 54115 54120 +5
==========================================
+ Hits 46259 46279 +20
+ Misses 7856 7841 -15
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great -- thank you @avantgardnerio
It does contain the changes from #3333 as well, so I think we should probably merge #3333 first and then rebase this one.
I had several small comments / style suggestions, but overall I think this PR is basically ready to go. I didn't "approve" it as it has the #3333 changes in it as well and wanted to give you a chance to respond to suggestions
Something I suggest as well is a datafusion-example in https://github.com/apache/arrow-datafusion/tree/master/datafusion-examples/examples
The rationale for doing so is:
- Serves as end-to-end test (that uses DataFusion's public API only) so we don't inadvertently break something if code is moved around (we did that in the past when we moved something and it became non-
pub
🤦 ) as the scoping / visibility rules are different in crate / out of crate. - Serves as a documentation / starting point for others to actually use it.
Such an end to end can be done as a follow on PR (or we could just file a ticket to track the work)
@@ -277,7 +277,9 @@ jobs: | |||
rustup default stable | |||
rustup component add rustfmt | |||
- name: Run | |||
run: ci/scripts/rust_fmt.sh | |||
run: | | |||
echo '' > datafusion/proto/src/generated/datafusion.rs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a left over from #3333 -- but I also don't think it hurts to leave it in this PR
let pb_file_type: protobuf::FileType = | ||
create_extern_table.file_type.try_into()?; | ||
match create_extern_table.file_type.as_str() { | ||
"CSV" | "JSON" | "PARQUET" | "AVRO" => {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to avoid having these four strings hardcoded in several places, so that if we add another format, we don't have to make changes all over the place.
Maybe we could put it into a struct like BuiltInTableProviders
or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I 100% agree, but I was hoping to do it in a follow-on PR, as there seems to be a lot of action going on with TableProviders right now, and I was hoping to not keep this one long-standing.
@@ -1214,19 +1214,6 @@ pub struct CreateView { | |||
pub definition: Option<String>, | |||
} | |||
|
|||
/// Types of files to parse as DataFrames |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another potential way to write this (and maybe reduce the diff) would be to add a new enum variant like "Dynamic":
pub enum FileType {
/// Newline-delimited JSON
NdJson,
...
/// Custom factory
Dynamic(String),
}
This strategy might be more "idiomatic" as it encodes more information in the type system rather than a String
I don't feel strongly about this, I just wanted to offer it as an option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started with that solution, but I think it makes more sense to go:
- enum (as it was)
- string (with hard-coded values)
- everything-as-a-TableProviderFactory (string with no hard-coded values)
If this PR is accepted, I will move immediately onto step #3.
319619b
to
761f384
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @avantgardnerio |
/// | ||
/// For example, this can be used to create a table "on the fly" | ||
/// from a directory of files only when that name is referenced. | ||
pub trait TableProviderFactory: Sync + Send { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should create
be async
?
The TableProvider
will have to know the Schema
which will most likely have to be inferred by reading from the ObjectStore
(or involve a network call otherwise if there is some sort of metastore involved).
Benchmark runs are scheduled for baseline = b175f9a and contender = bb08d31. bb08d31 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #3310.
Rationale for this change
I would like to allow users to register custom table types at runtime.
What changes are included in this PR?
TableProviderFactory
FileType::Custom(String)
to allow the parser to pass information about these tablesHashMap
ofTableProviderFactory
s on theSessionContext
to allow them to be registered programmatically but used dynamicallyAre there any user-facing changes?
They can now register custom tables.