
Allow SessionContext::read_csv, etc to read multiple files #4908

Merged: 8 commits into apache:main, Feb 20, 2023

Conversation

@saikrishna1-bidgely (Contributor) commented Jan 14, 2023

closes #4909

The github-actions bot added the core (Core DataFusion crate) label Jan 14, 2023
@saikrishna1-bidgely (Contributor, Author) commented Jan 15, 2023

I'll add support for read_avro, read_json, and read_parquet too once the approach is finalised.

Edit: done for these functions too.

@saikrishna1-bidgely marked this pull request as draft January 15, 2023 10:07
@saikrishna1-bidgely marked this pull request as ready for review January 17, 2023 15:52
@alamb (Contributor) left a comment

This looks like a nice improvement @saikrishna1-bidgely

I think we should add a test for this new functionality so that we don't accidentally break the new APIs going forward.

Also I was wondering about the signature

Rather than a slice, what would you think about taking something that could be turned into an iter:

So instead of

    pub async fn read_parquet_with_multi_paths(
        &self,
        table_paths: &[impl AsRef<str>],
        options: ParquetReadOptions<'_>,
    ) -> Result<DataFrame> {

Something more like

    pub async fn read_parquet_with_multi_paths(
        &self,
        table_paths: impl IntoIterator<Item = &str>,
        options: ParquetReadOptions<'_>,
    ) -> Result<DataFrame> {

I also think it would be ideal to figure out some way to have the same API take both a single path and an iterator -- do you think the above would work?
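
As an aside on why that last question is tricky, here is a minimal sketch (a hypothetical collect_paths free function, not code from this PR) showing that an IntoIterator-based bound accepts vectors and arrays of paths but not a bare &str:

    // Hypothetical helper illustrating the IntoIterator-style signature above.
    fn collect_paths<'a>(table_paths: impl IntoIterator<Item = &'a str>) -> Vec<String> {
        table_paths.into_iter().map(str::to_string).collect()
    }

    fn main() {
        // Vectors and arrays of &str satisfy the bound ...
        assert_eq!(collect_paths(vec!["a.parquet", "b.parquet"]).len(), 2);
        assert_eq!(collect_paths(["a.parquet"]).len(), 1);
        // ... but `collect_paths("a.parquet")` would not compile, because a bare
        // &str does not implement IntoIterator<Item = &str>. That gap is what the
        // rest of this thread tries to close.
    }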

@@ -551,12 +551,14 @@ impl SessionContext {
}

/// Creates a [`DataFrame`] for reading an Avro data source.
pub async fn read_avro(
pub async fn read_avro_with_multi_paths(
Review comment (Contributor):

What do you think about calling this read_avro_from_paths rather than _with_multi_paths?

Review comment (Contributor, Author):

I used _with_multi_paths since arrow uses a similar name but _from_paths seems cleaner. I'm fine with either.

return self.read_avro_with_multi_paths(&[table_path], options).await;
}

/// Creates a [`DataFrame`] for reading an Json data source.
Review comment (Contributor):

Perhaps we can update the docstring as well?

@saikrishna1-bidgely (Contributor, Author) commented Jan 18, 2023

> I think we should add a test for this new functionality so that we don't accidentally break the new APIs going forward.

I will add the tests once we finalise the function signature.

> Rather than a slice, what would you think about taking something that could be turned into an iter:

I agree, something like IntoIterator seems better than a simple slice.

> I also think it would be ideal to figure out some way to have the same API take both a single path and an iterator -- do you think the above would work?

It is not simple to implement a function that can take both a str and an iterator of str, since Rust doesn't have function overloading or variadic arguments. We can look into the following:

  1. Enums for arguments and then do pattern matching. I'm trying to implement this.
  2. Union type for arguments. This might not be possible, see link.
  3. Create a custom trait and then implement it for str and for a slice/Vec. See Link.

What do you think about having a single method which only takes a list of paths? For a single path, the caller can create a slice/Vec. This would be a lot simpler to do.

@alamb (Contributor) commented Jan 19, 2023

> What do you think about having a single method which only takes a list of paths? For a single path, the caller can create a slice/Vec. This would be a lot simpler to do.

I was thinking about this PR and I have an alternate suggestion

It seems to me that read_parquet, read_avro, etc. are wrappers that simplify the process of creating a ListingTable. Supporting multiple paths starts to complicate the API further -- instead of adding read_parquet_from_paths, what do you think about making it easier to see how to read multiple files using the ListingTable API directly?

For example, I bet if we added a doc example like the following

    /// Creates a [`DataFrame`] for reading a Parquet data source from a single file or directory. 
    ///
    /// Note: if you want to read from multiple files, or control other behaviors
    /// you can use the [`ListingTable`] API directly. For example to read multiple files
    /// 
    /// ```
    /// Example here (basically copy/paste the implementation of read_parquet and support multiple files)
    /// ```
    pub async fn read_parquet(
        &self,
        table_path: impl AsRef<str>,
        options: ParquetReadOptions<'_>,
    ) -> Result<DataFrame> {
...

We could give similar treatment to the docstrings for read_avro and read_csv (perhaps by pointing to the docs for read_parquet for an example of creating ListingTables)
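
For reference, a rough sketch of what such a doc example might look like: reading several Parquet files through the ListingTable API directly. The file paths are placeholders, and the code follows the general ListingTable usage pattern rather than anything merged in this PR.

    use std::sync::Arc;

    use datafusion::datasource::file_format::parquet::ParquetFormat;
    use datafusion::datasource::listing::{
        ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
    };
    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        let ctx = SessionContext::new();

        // Placeholder paths; any Parquet files sharing a schema would do.
        let paths = vec!["data/file1.parquet", "data/file2.parquet"];
        let urls = paths
            .iter()
            .map(|p| ListingTableUrl::parse(p))
            .collect::<Result<Vec<_>>>()?;

        // Describe how the files should be listed and read.
        let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()))
            .with_file_extension(".parquet");

        // Build a ListingTable over all paths, inferring the schema from the first one.
        let schema = listing_options.infer_schema(&ctx.state(), &urls[0]).await?;
        let config = ListingTableConfig::new_with_multi_paths(urls)
            .with_listing_options(listing_options)
            .with_schema(schema);
        let table = ListingTable::try_new(config)?;

        // Read all files as a single DataFrame.
        let df = ctx.read_table(Arc::new(table))?;
        df.show().await?;
        Ok(())
    }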

@saikrishna1-bidgely (Contributor, Author) commented:

Quick question, wouldn't we want to support multiple paths anyways since we would want to use it in DataFusion-CLI?

Also, if we are able to crack the implementation which can take both single and multiple paths in the same function, API itself should be unchanged, right?

So, I think the following are our options:

  1. As you said, don't support it but provide an example in the docs.
  2. Have multiple methods.
  3. Same method takes both single and multiple paths.
  4. Change the current methods to take only multiple paths.

1 is definitely the simplest, but I think we should support multiple paths, as it would make using DataFusion simpler.
2 is simple, but having multiple functions is a downside.
3, if possible, is the best case IMO.
4 is also simple, but needs a lot of changes downstream.

@alamb (Contributor) commented Jan 20, 2023

> Quick question, wouldn't we want to support multiple paths anyways since we would want to use it in DataFusion-CLI?

I think that is likely a separate question -- DataFusion CLI can potentially create a ListingTable directly rather than using the higher level SessionContext API as well

> Also, if we are able to crack the implementation which can take both single and multiple paths in the same function, API itself should be unchanged, right?

Yes, I agree

> So, I think the following are our options:

I agree with your assessments -- I do think documentation on how to create a ListingTable would go a long way. It seems we are lacking in such docs now https://docs.rs/datafusion/16.0.0/datafusion/datasource/listing/struct.ListingTable.html

I think documentation will help regardless of what else we choose to do -- I'll go write some now. Thank you for this good discussion.

@alamb (Contributor) commented Jan 20, 2023

Here is a proposal to at least add some more docs: #5001 -- it is not necessarily mutually exclusive with updating the signatures as well

@saikrishna1-bidgely (Contributor, Author) commented:

@alamb I finally found a way to implement overloading using traits.

    use std::vec;

    fn main() {
        struct PATH {}

        impl PATH {
            pub fn new() -> Self {
                Self {}
            }
        }

        pub trait Reader<T>: Sized {
            fn read_csv(&self, value: T) -> i32;
        }

        // Single path: delegate to the Vec impl.
        impl<'a> Reader<&'a str> for PATH {
            fn read_csv(&self, p: &'a str) -> i32 {
                self.read_csv(vec![p])
            }
        }

        // Multiple paths.
        impl<'a> Reader<Vec<&'a str>> for PATH {
            fn read_csv(&self, v: Vec<&'a str>) -> i32 {
                v.len() as i32
            }
        }

        let p = PATH::new();
        println!("{:?}", p.read_csv("path"));
        println!("{:?}", p.read_csv(vec!["path", "paths2"]));
    }

This way, callers using str or String can continue using the function, and we can add support for vectors/iterators. I will write a more general solution for the read functions.

I tried to implement this using an enum and a union, but I wasn't able to, and they would change the function signature anyway.
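
The generalised solution that eventually merged follows the same idea: a DataFilePaths trait (named in the squashed commit messages at the bottom of this page) whose to_urls method turns either a single path or a vector of paths into a Vec<ListingTableUrl>. A rough sketch of that shape, with the exact set of impls being an assumption:

    use datafusion::datasource::listing::ListingTableUrl;
    use datafusion::error::Result;

    /// Types that can be turned into one or more table-path URLs.
    pub trait DataFilePaths {
        fn to_urls(self) -> Result<Vec<ListingTableUrl>>;
    }

    // A single path: &str (a String impl would look the same).
    impl DataFilePaths for &str {
        fn to_urls(self) -> Result<Vec<ListingTableUrl>> {
            Ok(vec![ListingTableUrl::parse(self)?])
        }
    }

    // Many paths: any Vec of string-like items.
    impl<P: AsRef<str>> DataFilePaths for Vec<P> {
        fn to_urls(self) -> Result<Vec<ListingTableUrl>> {
            self.iter().map(|p| ListingTableUrl::parse(p)).collect()
        }
    }

With such a trait, read_csv and friends can take `table_paths: P` where `P: DataFilePaths`, so both `ctx.read_csv("a.csv", options)` and `ctx.read_csv(vec!["a.csv", "b.csv"], options)` compile without changing existing call sites.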

@alamb (Contributor) commented Jan 21, 2023

Thank you @saikrishna1-bidgely -- sounds like great progress. If we go the trait route, as long as we document how to use it (basically a doc example showing the use of a &str), I think we'll be good.

Thanks again!

The github-actions bot added the documentation, logical-expr, optimizer, physical-expr, sql, sqllogictest, and substrait labels Jan 21, 2023
The github-actions bot removed the logical-expr, sql, documentation, optimizer, substrait, sqllogictest, and physical-expr labels Jan 21, 2023
}

#[async_trait]
impl<'a> Reader<'a, String> for SessionContext {
Review comment from @saikrishna1-bidgely (Contributor, Author), Jan 21, 2023:

We can't use AsRef here. If we do, we won't be able to implement the trait for a Vec of strings, so we have to implement &str and String separately.

///
/// For more control such as reading multiple files, you can use
/// [`read_table`](Self::read_table) with a [`ListingTable`].
async fn read_csv(&self, table_paths: T, options: CsvReadOptions<'_>) -> Result<DataFrame>
Review comment (Contributor, Author):

For now this is implemented only for read_csv; once it is finalised, I'll implement it for the rest of the methods.

@saikrishna1-bidgely (Contributor, Author) commented:

@alamb @tustvold I updated the code with the suggestion from @tustvold. The change is now much smaller and cleaner. Please review the code.

@alamb changed the title from "added a method to read multiple locations at the same time." to "Allow SessionContext::read_csv, etc to read multiple files" Feb 15, 2023
@alamb (Contributor) left a comment

Thank you @saikrishna1-bidgely -- I really like this PR ❤️ -- thank you for sticking with it.

The only thing I think is missing is a test (mostly to ensure this ability is not removed by accident in the future).

For a test, what do you think about:

  1. adding a doc example on one of the methods (e.g. SessionContext::read_parquet) that shows reading from a list of strings
  2. Add a note on the other methods (e.g. SessionContext::read_csv) pointing at the method with the example.

I can help with this documentation if you like
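
A minimal sketch of the kind of test being asked for here, using placeholder CSV paths and the multi-path read_csv call this PR adds:

    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::test]
    async fn read_csv_accepts_multiple_paths() -> Result<()> {
        let ctx = SessionContext::new();
        // Placeholder paths; any two CSV files with the same schema would do.
        let df = ctx
            .read_csv(vec!["tests/data/a.csv", "tests/data/b.csv"], CsvReadOptions::new())
            .await?;
        // Collecting shows the data is readable; a real test would assert the
        // combined row count of both files.
        let batches = df.collect().await?;
        assert!(batches.iter().map(|b| b.num_rows()).sum::<usize>() > 0);
        Ok(())
    }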

options: impl ReadOptions<'a>,
) -> Result<DataFrame> {
let table_path = ListingTableUrl::parse(table_path)?;
let table_paths = table_paths.to_urls()?;
Review comment (Contributor):

❤️

@saikrishna1-bidgely requested review from tustvold and removed the review request for alamb February 15, 2023 16:05
@saikrishna1-bidgely (Contributor, Author) commented:

Regarding the docs, I think we should extend the example in SessionContext and add an example to all the read_* methods.
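
For reference, the kind of doc example being discussed, shown here for read_parquet with placeholder file paths; the single-path call keeps working unchanged:

    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        let ctx = SessionContext::new();

        // A single path still works exactly as before (placeholder file name).
        let _single = ctx
            .read_parquet("data/file1.parquet", ParquetReadOptions::default())
            .await?;

        // A Vec of paths reads several files into one DataFrame.
        let df = ctx
            .read_parquet(
                vec!["data/file1.parquet", "data/file2.parquet"],
                ParquetReadOptions::default(),
            )
            .await?;
        df.show().await?;

        Ok(())
    }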

@saikrishna1-bidgely (Contributor, Author) commented:

@alamb updated the docs. Should I remove call_read_csvs and the related methods?

@saikrishna1-bidgely (Contributor, Author) commented:

@alamb removed methods from CallReadTrait too.

@alamb (Contributor) left a comment

Thanks @saikrishna1-bidgely -- I think just a few more comment updates and this will be ready to go. Thanks for sticking with it!

@alamb (Contributor) left a comment

Thank you @saikrishna1-bidgely -- I think this is looking ready to go.

@saikrishna1-bidgely (Contributor, Author) commented:

Cool! Do we need more approvals before merging?

@alamb (Contributor) commented Feb 19, 2023

> Cool! Do we need more approvals before merging?

Nope, I was just giving it some time after approval for other maintainers to have a look if they wanted.

🚀

@alamb (Contributor) commented Feb 19, 2023

Closing/reopening to rerun CI (for some reason several of the tests were canceled).

@alamb closed this Feb 19, 2023
@alamb reopened this Feb 19, 2023
@alamb merged commit cfbb14d into apache:main Feb 20, 2023
@alamb (Contributor) commented Feb 20, 2023

Thanks again for sticking with this @saikrishna1-bidgely

@ursabot commented Feb 20, 2023

Benchmark runs are scheduled for baseline = ae89960 and contender = cfbb14d. cfbb14d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

jiangzhx pushed a commit to jiangzhx/arrow-datafusion that referenced this pull request Feb 24, 2023
…4908)

* Added a trait DataFilePaths to convert strings and vectors of strings to a vector of URLs.

* Added docs and tests. Updated DataFilePaths to accept any vector of items implementing AsRef<str>.

* Added docs to read_ methods and extended the SessionContext doc.

* Ran Cargo fmt

* removed CallReadTrait methods

* Update read_csv example

Co-authored-by: Andrew Lamb <[email protected]>

* removed addition to SessionContext example

---------

Co-authored-by: Lakkam Sai Krishna Reddy <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read multiple files/folders using read_csv
5 participants