
Add support for reading distributed datasets #616

Closed · yjshen opened this issue Jun 24, 2021 · 10 comments · Fixed by #950
Labels: enhancement (New feature or request)

yjshen (Member) commented Jun 24, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, we can only read files from the local filesystem, since we use std::fs in ParquetExec and CsvExec. It would be nice to add support for reading files that reside on remote storage such as HDFS, Amazon S3, etc.


yjshen added the enhancement (New feature or request) label on Jun 24, 2021
alamb (Contributor) commented Jun 24, 2021

Given the number of possible remote filesystems, and the extensibility of DataFusion itself (e.g. by implementing a TableProvider), this might be an excellent use case for "plugin" crates: something like datafusion-s3 for each type of filesystem we want to support.

Bringing the S3 dependency stack into DataFusion itself would be tough, as any project that uses DataFusion would then also pick up a lot of code and compile time, even if it never used the S3 feature.
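As a sketch of the crate boundary being proposed (all names here are hypothetical, not DataFusion's actual API): the core crate would expose a small object-reading trait, and each backend crate would implement it, so the S3 client dependency lives only in the plugin crate.

```rust
// Hypothetical sketch of the plugin-crate split; none of these names are
// DataFusion's real API.
use std::io::{Read, Result};

/// What the core crate might expose: a way to open objects for reading.
pub trait ObjectReader: Send + Sync {
    fn open(&self, path: &str) -> Result<Box<dyn Read + Send>>;
}

/// What a `datafusion-s3`-style plugin crate would provide. The S3 SDK
/// dependency lives only in that crate, so core users never compile it.
pub struct S3Reader {
    pub bucket: String,
}

impl ObjectReader for S3Reader {
    fn open(&self, path: &str) -> Result<Box<dyn Read + Send>> {
        // A real implementation would issue an S3 GetObject request here;
        // this stub only marks the crate boundary.
        unimplemented!("GetObject s3://{}/{}", self.bucket, path)
    }
}
```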

alamb (Contributor) commented Jun 24, 2021

Thank you for the idea, @yjshen

Dandandan (Contributor) commented:

Yes, this would be great to have, also for something like delta-rs. cc @houqp

houqp (Member) commented Jun 25, 2021

Yeah, this will be very useful for datafusion integration with delta-rs.

As @Dandandan mentioned earlier on Slack, we need to update ParquetExec to take datasource::Source as input instead of path strings. We also need to update datasource::Source to make it async-compatible.

To make the existing CSV and Parquet table provider implementations more reusable, we should probably extend datasource::Source to also handle directory listing, so we won't need to re-implement the table providers for csv/json/parquet in each IO extension. datafusion-s3 would then only need to provide an S3 Source that handles object listing, get, and put.
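A rough sketch of what such an async, listing-capable Source could look like (names and signatures are illustrative, not the actual datasource::Source API; the async-trait crate is used since Rust traits cannot have async methods natively):

```rust
// Illustrative only: an async object-store abstraction with the listing,
// get, and put operations mentioned above.
use async_trait::async_trait;
use std::io::Result;

#[async_trait]
pub trait ObjectStore: Send + Sync {
    /// List all object paths under `prefix`, so the csv/json/parquet
    /// table providers can discover files without backend-specific code.
    async fn list(&self, prefix: &str) -> Result<Vec<String>>;

    /// Read the entire object at `path`.
    async fn get(&self, path: &str) -> Result<Vec<u8>>;

    /// Write `data` to the object at `path`.
    async fn put(&self, path: &str, data: &[u8]) -> Result<()>;
}
```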

nevi-me (Contributor) commented Jun 25, 2021

How about an async library that supports different protocols: local fs, S3, Azure Blob, etc.?

Would there be Parquet-specific functionality required, or would such a general-purpose library be usable with Parquet, CSV, JSON, etc.?

yjshen (Member, Author) commented Jun 25, 2021

As a digression: after searching the Internet for a while, I have found no Rust HDFS client library that I can use directly for my use case. ☹️

I may need to re-implement a Rust-native HDFS client that speaks the HDFS protocol, or create a wrapper over libhdfs3.

Could anyone suggest or recommend existing crates for an HDFS client?

houqp (Member) commented Jun 27, 2021

> Would there be Parquet-specific functionality required, or would such a general-purpose library be usable with Parquet, CSV, JSON, etc.?

I think we should be able to provide a "bring your own IO" abstraction that makes it easy to plug different IO extensions into the existing Parquet, CSV, and JSON table implementations.
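Building on the ObjectStore sketch above, a single listing-based table implementation could then be shared across IO extensions (again, the names are hypothetical):

```rust
// Hypothetical: one generic table implementation, parameterized over the
// store, instead of one implementation per backend. Assumes the
// `ObjectStore` trait sketched earlier is in scope.
use std::sync::Arc;

pub struct ListingTable {
    store: Arc<dyn ObjectStore>, // S3, HDFS, local fs, ...
    prefix: String,
}

impl ListingTable {
    /// Discover the files backing this table through the trait, so the
    /// same code works for any backend that implements `ObjectStore`.
    pub async fn files(&self) -> std::io::Result<Vec<String>> {
        self.store.list(&self.prefix).await
    }
}
```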

@yjshen, there is also https://github.com/frqc/rust-hdfs. All of them seem to be unmaintained, though. I think wrapping the C++ implementation is probably the easiest route in the short term.

andrei-ionescu commented:

There is also webhdfs-rs (https://github.com/vvvy/webhdfs-rs), which seems to be maintained. It is built on top of Tokio and Hyper and offers both sync and async access.

dispanser commented:

@houqp: delta-rs's StorageBackend seems similar in spirit, though it possibly lacks the capability to read only a specific slice/chunk of the data (for fetching just one or more consecutive row groups). Maybe the concept could be extended and moved into arrow-rs, or even a separate library?
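For the range-read piece specifically, a hypothetical extension of the sketch above might expose byte-range gets, so a reader can fetch just the span covering one or more row groups:

```rust
// Hypothetical: fetch only a byte range of an object, e.g. the span of a
// single Parquet row group, instead of the whole file.
use async_trait::async_trait;
use std::ops::Range;

#[async_trait]
pub trait RangeRead: Send + Sync {
    /// Read only `range` (in bytes) of the object at `path`.
    async fn get_range(&self, path: &str, range: Range<u64>) -> std::io::Result<Vec<u8>>;
}
```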

yjshen changed the title from "Add support for reading distributed datasets (files on HDFS for example)" to "Add support for reading distributed datasets" on Aug 25, 2021
houqp (Member) commented Aug 28, 2021

@dispanser I am thinking we will eventually migrate delta-rs to use the implementation that @yjshen is building in DataFusion. In the long run, delta-rs needs to be coupled with a distributed compute engine anyway, so I expect it to have a hard dependency on Ballista and, by extension, DataFusion.

I also see value in moving the IO abstraction into its own crate once it has proven itself within DataFusion.
