Add support for reading distributed datasets #616
Comments
Given the number of possible remote file systems, and the extensibility of DataFusion itself (e.g. implementing a TableProvider), this might be an excellent use case for a "plugin" crate. Bringing the dependency stack of S3 into datafusion would be tough, as any project that used datafusion would then also pick up a lot of code and compile time even if it never used the s3 feature.
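As a minimal sketch of how that could work without a separate crate, optional backends can be hidden behind a Cargo feature so downstream users don't pay for them; the feature name `s3` and all types below are hypothetical, not actual datafusion code:

```rust
// Hypothetical sketch: hide the S3 dependency stack behind a Cargo feature
// (declared in Cargo.toml as e.g. `s3 = ["some-s3-client"]`), so crates
// depending on datafusion only compile it when they explicitly opt in.

#[cfg(feature = "s3")]
pub mod s3 {
    /// Only compiled when the `s3` feature is enabled; a real
    /// implementation would pull in the S3 client dependencies here.
    pub struct S3FileReader {
        pub bucket: String,
        pub key: String,
    }
}

// Without the feature, the rest of the crate compiles with no trace of S3.
```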
Thank you for the idea @yjshen
Yes, this would be great to have, also for something like delta-rs @houqp
Yeah, this will be very useful for datafusion integration with delta-rs. As @Dandandan mentioned earlier in slack, we need to update ParquetExec to take […]. To make the existing csv and parquet table provider implementations more reusable, we should probably extend […].
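For illustration, a rough sketch of what a bring-your-own-IO seam for ParquetExec could look like; the trait name and methods here are hypothetical, not an existing datafusion API:

```rust
use std::io::{Read, Result, Seek};

/// Hypothetical abstraction over one readable object (local file, S3 object,
/// HDFS file, ...). Parquet needs seekable access to locate the footer, so
/// Seek is required; CSV and JSON scans could get by with Read alone.
pub trait ObjectReader: Send + Sync {
    /// Total size of the object in bytes.
    fn length(&self) -> Result<u64>;
    /// Open a fresh reader positioned at the start of the object.
    fn open(&self) -> Result<Box<dyn ReadSeek + Send>>;
}

/// Combined trait object so callers can hold a single boxed reader.
pub trait ReadSeek: Read + Seek {}
impl<T: Read + Seek> ReadSeek for T {}
```

ParquetExec and CsvExec would then accept boxed ObjectReaders (or a factory of them) instead of paths resolved through std::fs.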
How about an async library that supports different protocols: fs, s3, azure blob, etc.? Would there be parquet-specific functionality required, or would such a general-purpose library be usable with parquet, csv, json, etc.?
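As a sketch of the dispatch such a library implies (all names hypothetical), the URI scheme would pick the backend, and the format readers would stay agnostic:

```rust
/// Hypothetical backends; real ones would live behind feature flags.
#[derive(Debug, PartialEq)]
enum ObjectStoreKind {
    LocalFs,
    S3,
    AzureBlob,
    Hdfs,
}

/// Pick a backend from the URI scheme; parquet/csv/json readers would then
/// consume bytes through a common trait and never see the scheme at all.
fn store_for(uri: &str) -> Result<ObjectStoreKind, String> {
    match uri.splitn(2, "://").collect::<Vec<_>>().as_slice() {
        [_bare_path] => Ok(ObjectStoreKind::LocalFs), // no scheme: local path
        ["file", _] => Ok(ObjectStoreKind::LocalFs),
        ["s3", _] => Ok(ObjectStoreKind::S3),
        ["abfs", _] | ["wasb", _] => Ok(ObjectStoreKind::AzureBlob),
        ["hdfs", _] => Ok(ObjectStoreKind::Hdfs),
        [scheme, _] => Err(format!("unsupported scheme: {}", scheme)),
        _ => Err("empty uri".to_string()),
    }
}
```

On the parquet-specific question: the main difference is random access. Parquet wants ranged/seekable reads for the footer and row groups, while csv and json can stream sequentially, so a general-purpose library works as long as it exposes both.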
As a digression: after searching the Internet for a while, I found no suitable Rust HDFS client library that can be used directly for my use case. I may need to re-implement a Rust-native HDFS client speaking the HDFS protocol, or create a wrapper over libhdfs3. Could someone give any suggestions or recommendations for existing HDFS client crates?
I think we should be able to provide a bring-your-own-IO abstraction to make it easy to plug different IO extensions into the existing parquet, csv and json table implementations. @yjshen, there is also https://github.com/frqc/rust-hdfs. All of them seem to be unmaintained though. I think wrapping the C++ implementation is probably the easiest route in the short term.
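If the wrapper route goes forward, here is a minimal FFI sketch against the libhdfs C API that libhdfs3 mirrors; the extern signatures follow the documented hdfs.h header, but the Rust wrapper itself (port number, error handling) is illustrative only:

```rust
use std::ffi::CString;
use std::os::raw::{c_char, c_int, c_short, c_void};

// Opaque handles, as exposed by the libhdfs / libhdfs3 C API (hdfs.h).
type HdfsFs = *mut c_void;
type HdfsFile = *mut c_void;

extern "C" {
    // Signatures follow the documented hdfs.h header; libhdfs3 keeps the
    // same ABI, which is what makes the wrapper approach attractive.
    fn hdfsConnect(nn: *const c_char, port: u16) -> HdfsFs;
    fn hdfsOpenFile(
        fs: HdfsFs,
        path: *const c_char,
        flags: c_int,
        buffer_size: c_int,
        replication: c_short,
        block_size: i32,
    ) -> HdfsFile;
    fn hdfsRead(fs: HdfsFs, file: HdfsFile, buffer: *mut c_void, length: i32) -> i32;
}

/// Read up to `buf.len()` bytes from the start of an HDFS file.
/// Connection reuse, error checks and hdfsCloseFile/hdfsDisconnect are
/// elided; 8020 is just the conventional namenode port.
unsafe fn read_prefix(namenode: &str, path: &str, buf: &mut [u8]) -> i32 {
    let nn = CString::new(namenode).unwrap();
    let p = CString::new(path).unwrap();
    let fs = hdfsConnect(nn.as_ptr(), 8020);
    let file = hdfsOpenFile(fs, p.as_ptr(), 0 /* O_RDONLY */, 0, 0, 0);
    hdfsRead(fs, file, buf.as_mut_ptr() as *mut c_void, buf.len() as i32)
}
```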
There is also webhdfs-rs (https://github.com/vvvy/webhdfs-rs), which seems to be maintained. It is built on top of Tokio & Hyper and offers both sync and async access.
@houqp: delta-rs
@dispanser I am thinking we will eventually migrate delta-rs to use @yjshen's implementation in datafusion. In the long run, delta-rs needs to be coupled with a distributed compute engine anyway, so I am expecting it to have a hard dependency on ballista and, by extension, datafusion. I also see the value in moving the IO abstraction into its own crate once it's proven within datafusion.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, we can only read files from LocalFS since we use std::fs in ParquetExec & CsvExec. It would be nice to add support to read files that reside on storage sources such as HDFS, Amazon S3, etc.

Describe the solution you'd like
Describe alternatives you've considered
Additional context