
Add support for reading distributed datasets #616

Closed · yjshen opened this issue Jun 24, 2021 · 10 comments · Fixed by #950
Labels: enhancement (New feature or request)

yjshen (Member) commented Jun 24, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, we can only read files from the local filesystem, since we use std::fs in ParquetExec and CsvExec. It would be nice to add support for reading files that reside on remote storage such as HDFS, Amazon S3, etc.


yjshen added the enhancement (New feature or request) label on Jun 24, 2021
alamb (Contributor) commented Jun 24, 2021

Given the number of possible remote filesystems, and the extensibility of DataFusion itself (e.g. by implementing a TableProvider), this might be an excellent use case for "plugin" crates: something like datafusion-s3 for each type of filesystem we want to support.

Bringing the S3 dependency stack into DataFusion itself would be tough, as any project that uses DataFusion would then also pick up a lot of code and compile time, even if it never used the S3 feature.
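As a sketch of the crate boundary being proposed (all names here are hypothetical, not DataFusion's actual API): the core crate would expose a small object-reading trait, and each backend crate would implement it, so the S3 client dependency lives only in the plugin crate.

```rust
// Hypothetical sketch of the plugin-crate split; none of these names are
// DataFusion's real API.
use std::io::{Read, Result};

/// What the core crate might expose: a way to open objects for reading.
pub trait ObjectReader: Send + Sync {
    fn open(&self, path: &str) -> Result<Box<dyn Read + Send>>;
}

/// What a `datafusion-s3`-style plugin crate would provide. The S3 SDK
/// dependency lives only in that crate, so core users never compile it.
pub struct S3Reader {
    pub bucket: String,
}

impl ObjectReader for S3Reader {
    fn open(&self, path: &str) -> Result<Box<dyn Read + Send>> {
        // A real implementation would issue an S3 GetObject request here;
        // this stub only marks the crate boundary.
        unimplemented!("GetObject s3://{}/{}", self.bucket, path)
    }
}
```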

alamb (Contributor) commented Jun 24, 2021

Thank you for the idea, @yjshen

Dandandan (Contributor) commented:

Yes, this would be great to have, also for something like delta-rs. cc @houqp

houqp (Member) commented Jun 25, 2021

Yeah, this will be very useful for datafusion integration with delta-rs.

As @Dandandan mentioned earlier on Slack, we need to update ParquetExec to take datasource::Source as input instead of path strings. We also need to update datasource::Source to make it async-compatible.

To make the existing CSV and Parquet table provider implementations more reusable, we should probably extend datasource::Source to also handle directory listing, so we won't need to re-implement the table providers for csv/json/parquet in each IO extension. datafusion-s3 would then only need to provide an S3 Source that handles object listing, get, and put.
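A rough sketch of what such an async, listing-capable Source could look like (names and signatures are illustrative, not the actual datasource::Source API; the async-trait crate is used since Rust traits cannot have async methods natively):

```rust
// Illustrative only: an async object-store abstraction with the listing,
// get, and put operations mentioned above.
use async_trait::async_trait;
use std::io::Result;

#[async_trait]
pub trait ObjectStore: Send + Sync {
    /// List all object paths under `prefix`, so the csv/json/parquet
    /// table providers can discover files without backend-specific code.
    async fn list(&self, prefix: &str) -> Result<Vec<String>>;

    /// Read the entire object at `path`.
    async fn get(&self, path: &str) -> Result<Vec<u8>>;

    /// Write `data` to the object at `path`.
    async fn put(&self, path: &str, data: &[u8]) -> Result<()>;
}
```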

nevi-me (Contributor) commented Jun 25, 2021

How about an async library that supports different protocols: local fs, S3, Azure Blob, etc.?

Would there be Parquet-specific functionality required, or would such a general-purpose library be usable with Parquet, CSV, JSON, etc.?

yjshen (Member, Author) commented Jun 25, 2021

As a digression: after searching the Internet for a while, I have found no Rust HDFS client library that I can use directly for my use case. ☹️

I may need to re-implement a Rust-native HDFS client that speaks the HDFS protocol, or create a wrapper over libhdfs3.

Could anyone suggest or recommend existing crates for an HDFS client?

houqp (Member) commented Jun 27, 2021

> Would there be Parquet-specific functionality required, or would such a general-purpose library be usable with Parquet, CSV, JSON, etc.?

I think we should be able to provide a "bring your own IO" abstraction that makes it easy to plug different IO extensions into the existing Parquet, CSV, and JSON table implementations.
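Building on the ObjectStore sketch above, a single listing-based table implementation could then be shared across IO extensions (again, the names are hypothetical):

```rust
// Hypothetical: one generic table implementation, parameterized over the
// store, instead of one implementation per backend. Assumes the
// `ObjectStore` trait sketched earlier is in scope.
use std::sync::Arc;

pub struct ListingTable {
    store: Arc<dyn ObjectStore>, // S3, HDFS, local fs, ...
    prefix: String,
}

impl ListingTable {
    /// Discover the files backing this table through the trait, so the
    /// same code works for any backend that implements `ObjectStore`.
    pub async fn files(&self) -> std::io::Result<Vec<String>> {
        self.store.list(&self.prefix).await
    }
}
```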

@yjshen, there is also https://github.com/frqc/rust-hdfs. All of them seem to be unmaintained, though. I think wrapping the C++ implementation is probably the easiest route in the short term.

andrei-ionescu commented:

There is also webhdfs-rs (https://github.com/vvvy/webhdfs-rs), which seems to be maintained. It is built on top of Tokio and Hyper and offers both sync and async access.

dispanser commented:

@houqp: delta-rs's StorageBackend seems similar in spirit, though it possibly lacks the capability to read only a specific slice/chunk of the data (for fetching just one or more consecutive row groups). Maybe the concept could be extended and moved into arrow-rs, or even a separate library?
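For the range-read piece specifically, a hypothetical extension of the sketch above might expose byte-range gets, so a reader can fetch just the span covering one or more row groups:

```rust
// Hypothetical: fetch only a byte range of an object, e.g. the span of a
// single Parquet row group, instead of the whole file.
use async_trait::async_trait;
use std::ops::Range;

#[async_trait]
pub trait RangeRead: Send + Sync {
    /// Read only `range` (in bytes) of the object at `path`.
    async fn get_range(&self, path: &str, range: Range<u64>) -> std::io::Result<Vec<u8>>;
}
```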

yjshen changed the title from "Add support for reading distributed datasets (files on HDFS for example)" to "Add support for reading distributed datasets" on Aug 25, 2021
houqp (Member) commented Aug 28, 2021

@dispanser I am thinking we will eventually migrate delta-rs to use the implementation that @yjshen is building in DataFusion. In the long run, delta-rs needs to be coupled with a distributed compute engine anyway, so I expect it to have a hard dependency on Ballista and, by extension, DataFusion.

I also see value in moving the IO abstraction into its own crate once it has proven itself within DataFusion.
