
Allow SessionContext::read_csv, etc to read multiple files #4908

Merged: 8 commits into apache:main, Feb 20, 2023

Conversation

@saikrishna1-bidgely (Contributor) commented Jan 14, 2023

closes #4909

The github-actions bot added the core (Core DataFusion crate) label Jan 14, 2023
@saikrishna1-bidgely (Contributor, Author) commented Jan 15, 2023

I'll add support for read_avro, read_json, and read_parquet too once the approach is finalised.

Edit: done for these functions too.

@saikrishna1-bidgely marked this pull request as draft January 15, 2023 10:07
@saikrishna1-bidgely marked this pull request as ready for review January 17, 2023 15:52
@alamb (Contributor) left a comment

This looks like a nice improvement @saikrishna1-bidgely

I think we should add a test for this new functionality so that we don't accidentally break the new APIs going forward.

Also I was wondering about the signature

Rather than a slice, what would you think about taking something that could be turned into an iter:

So instead of

    pub async fn read_parquet_with_multi_paths(
        &self,
        table_paths: &[impl AsRef<str>],
        options: ParquetReadOptions<'_>,
    ) -> Result<DataFrame> {

Something more like

    pub async fn read_parquet_with_multi_paths(
        &self,
        table_paths: impl IntoIterator<Item = &str>,
        options: ParquetReadOptions<'_>,
    ) -> Result<DataFrame> {

I also think it would be ideal to figure out some way to have the same API take both a single path and an iterator -- do you think the above would work?
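
As an aside on why that last question is tricky, here is a minimal sketch (a hypothetical collect_paths free function, not code from this PR) showing that an IntoIterator-based bound accepts vectors and arrays of paths but not a bare &str:

    // Hypothetical helper illustrating the IntoIterator-style signature above.
    fn collect_paths<'a>(table_paths: impl IntoIterator<Item = &'a str>) -> Vec<String> {
        table_paths.into_iter().map(str::to_string).collect()
    }

    fn main() {
        // Vectors and arrays of &str satisfy the bound ...
        assert_eq!(collect_paths(vec!["a.parquet", "b.parquet"]).len(), 2);
        assert_eq!(collect_paths(["a.parquet"]).len(), 1);
        // ... but `collect_paths("a.parquet")` would not compile, because a bare
        // &str does not implement IntoIterator<Item = &str>. That gap is what the
        // rest of this thread tries to close.
    }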

@@ -551,12 +551,14 @@ impl SessionContext {
}

/// Creates a [`DataFrame`] for reading an Avro data source.
pub async fn read_avro(
pub async fn read_avro_with_multi_paths(
Review comment (Contributor):

What do you think about calling this read_avro_from_paths rather than _with_multi_paths?

Review comment (Contributor, Author):

I used _with_multi_paths since arrow uses a similar name but _from_paths seems cleaner. I'm fine with either.

return self.read_avro_with_multi_paths(&[table_path], options).await;
}

/// Creates a [`DataFrame`] for reading an Json data source.
Review comment (Contributor):

Perhaps we can update the docstring as well?

@saikrishna1-bidgely (Contributor, Author) commented Jan 18, 2023

> I think we should add a test for this new functionality so that we don't accidentally break the new APIs going forward.

I will add the tests once we finalise the function signature.

> Rather than a slice, what would you think about taking something that could be turned into an iter:

I agree, something like IntoIterator seems better than a simple slice.

> I also think it would be ideal to figure out some way to have the same API take both a single path and an iterator -- do you think the above would work?

It is not simple to implement a function that can take both a str and an iterator of str, since Rust doesn't have function overloading or variadic arguments. We can look into the following:

  1. Enums for arguments and then do pattern matching. I'm trying to implement this.
  2. Union type for arguments. This might not be possible, see link.
  3. Create a custom trait and then implement it for str and for a slice/Vec. See Link.

What do you think about having a single method which only takes a list of paths? For a single path, the caller can create a slice/Vec. This would be a lot simpler to do.

@alamb (Contributor) commented Jan 19, 2023

> What do you think about having a single method which only takes a list of paths? For a single path, the caller can create a slice/Vec. This would be a lot simpler to do.

I was thinking about this PR and I have an alternate suggestion

It seems to me that read_parquet, read_avro, etc. are wrappers that simplify the process of creating a ListingTable. Supporting multiple paths starts to complicate the API further -- instead of adding read_parquet_from_paths, what do you think about making it easier to see how to read multiple files using the ListingTable API directly?

For example, I bet if we added a doc example like the following

    /// Creates a [`DataFrame`] for reading a Parquet data source from a single file or directory. 
    ///
    /// Note: if you want to read from multiple files, or control other behaviors
    /// you can use the [`ListingTable`] API directly. For example to read multiple files
    /// 
    /// ```
    /// Example here (basically copy/paste the implementation of read_parquet and support multiple files)
    /// ```
    pub async fn read_parquet(
        &self,
        table_path: impl AsRef<str>,
        options: ParquetReadOptions<'_>,
    ) -> Result<DataFrame> {
...

We could give similar treatment to the docstrings for read_avro and read_csv (perhaps by pointing to the docs for read_parquet for an example of creating ListingTables)
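
For reference, a rough sketch of what such a doc example might look like: reading several Parquet files through the ListingTable API directly. The file paths are placeholders, and the code follows the general ListingTable usage pattern rather than anything merged in this PR.

    use std::sync::Arc;

    use datafusion::datasource::file_format::parquet::ParquetFormat;
    use datafusion::datasource::listing::{
        ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
    };
    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        let ctx = SessionContext::new();

        // Placeholder paths; any Parquet files sharing a schema would do.
        let paths = vec!["data/file1.parquet", "data/file2.parquet"];
        let urls = paths
            .iter()
            .map(|p| ListingTableUrl::parse(p))
            .collect::<Result<Vec<_>>>()?;

        // Describe how the files should be listed and read.
        let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()))
            .with_file_extension(".parquet");

        // Build a ListingTable over all paths, inferring the schema from the first one.
        let schema = listing_options.infer_schema(&ctx.state(), &urls[0]).await?;
        let config = ListingTableConfig::new_with_multi_paths(urls)
            .with_listing_options(listing_options)
            .with_schema(schema);
        let table = ListingTable::try_new(config)?;

        // Read all files as a single DataFrame.
        let df = ctx.read_table(Arc::new(table))?;
        df.show().await?;
        Ok(())
    }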

@saikrishna1-bidgely (Contributor, Author) commented:

Quick question, wouldn't we want to support multiple paths anyways since we would want to use it in DataFusion-CLI?

Also, if we are able to crack the implementation which can take both single and multiple paths in the same function, API itself should be unchanged, right?

So, I think the following are our options:

  1. As you said, don't support it but provide an example in the docs.
  2. Have multiple methods.
  3. Same method takes both single and multiple paths.
  4. Change the current methods to take only multiple paths.

1 is definitely the simplest, but I think we should support multiple paths, as it would make using DataFusion simpler.
2 is simple, but having multiple functions is a downside.
3, if possible, is the best case IMO.
4 is also simple, but needs a lot of changes downstream.

@alamb (Contributor) commented Jan 20, 2023

> Quick question, wouldn't we want to support multiple paths anyways since we would want to use it in DataFusion-CLI?

I think that is likely a separate question -- DataFusion CLI can potentially create a ListingTable directly rather than using the higher level SessionContext API as well

> Also, if we are able to crack the implementation which can take both single and multiple paths in the same function, API itself should be unchanged, right?

Yes, I agree

> So, I think the following are our options:

I agree with your assessments -- I do think documentation on how to create a ListingTable would go a long way. It seems we are lacking in such docs now https://docs.rs/datafusion/16.0.0/datafusion/datasource/listing/struct.ListingTable.html

I think documentation will help regardless of what else we choose to do -- I'll go write some now. Thank you for this good discussion.

@alamb (Contributor) commented Jan 20, 2023

Here is a proposal to at least add some more docs: #5001 -- it is not necessarily mutually exclusive with updating the signatures as well

@saikrishna1-bidgely (Contributor, Author) commented:

@alamb I finally found a way to implement overloading using traits.

    use std::vec;

    fn main() {
        struct PATH {}

        impl PATH {
            pub fn new() -> Self {
                Self {}
            }
        }

        pub trait Reader<T>: Sized {
            fn read_csv(&self, value: T) -> i32;
        }

        // Single path: delegate to the Vec impl.
        impl<'a> Reader<&'a str> for PATH {
            fn read_csv(&self, p: &'a str) -> i32 {
                self.read_csv(vec![p])
            }
        }

        // Multiple paths.
        impl<'a> Reader<Vec<&'a str>> for PATH {
            fn read_csv(&self, v: Vec<&'a str>) -> i32 {
                v.len() as i32
            }
        }

        let p = PATH::new();
        println!("{:?}", p.read_csv("path"));
        println!("{:?}", p.read_csv(vec!["path", "paths2"]));
    }

This way, callers using str or String can continue using the function, and we can add support for vectors/iterators. I will write a more general solution for the read functions.

I tried to implement this using an enum and a union, but I wasn't able to, and they would change the function signature anyway.
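
The generalised solution that eventually merged follows the same idea: a DataFilePaths trait (named in the squashed commit messages at the bottom of this page) whose to_urls method turns either a single path or a vector of paths into a Vec<ListingTableUrl>. A rough sketch of that shape, with the exact set of impls being an assumption:

    use datafusion::datasource::listing::ListingTableUrl;
    use datafusion::error::Result;

    /// Types that can be turned into one or more table-path URLs.
    pub trait DataFilePaths {
        fn to_urls(self) -> Result<Vec<ListingTableUrl>>;
    }

    // A single path: &str (a String impl would look the same).
    impl DataFilePaths for &str {
        fn to_urls(self) -> Result<Vec<ListingTableUrl>> {
            Ok(vec![ListingTableUrl::parse(self)?])
        }
    }

    // Many paths: any Vec of string-like items.
    impl<P: AsRef<str>> DataFilePaths for Vec<P> {
        fn to_urls(self) -> Result<Vec<ListingTableUrl>> {
            self.iter().map(|p| ListingTableUrl::parse(p)).collect()
        }
    }

With such a trait, read_csv and friends can take `table_paths: P` where `P: DataFilePaths`, so both `ctx.read_csv("a.csv", options)` and `ctx.read_csv(vec!["a.csv", "b.csv"], options)` compile without changing existing call sites.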

@alamb (Contributor) commented Jan 21, 2023

Thank you @saikrishna1-bidgely -- sounds like great progress. If we go the trait route, as long as we document how to use it (basically a doc example showing the use of a &str), I think we'll be good.

Thanks again!

The github-actions bot added the documentation, logical-expr, optimizer, physical-expr, sql, sqllogictest, and substrait labels Jan 21, 2023
The github-actions bot removed the logical-expr, sql, documentation, optimizer, substrait, sqllogictest, and physical-expr labels Jan 21, 2023
}

#[async_trait]
impl<'a> Reader<'a, String> for SessionContext {
Review comment from @saikrishna1-bidgely (Contributor, Author), Jan 21, 2023:

We can't use AsRef here. If we do, we won't be able to implement the trait for a Vec of strings, so we have to implement &str and String separately.

///
/// For more control such as reading multiple files, you can use
/// [`read_table`](Self::read_table) with a [`ListingTable`].
async fn read_csv(&self, table_paths: T, options: CsvReadOptions<'_>) -> Result<DataFrame>
Review comment (Contributor, Author):

For now this is implemented only for read_csv; once it is finalised, I'll implement it for the rest of the methods.

@saikrishna1-bidgely (Contributor, Author) commented:

@alamb @tustvold I updated the code with the suggestion from @tustvold. The change is now much smaller and cleaner. Please review the code.

@alamb changed the title from "added a method to read multiple locations at the same time." to "Allow SessionContext::read_csv, etc to read multiple files" Feb 15, 2023
@alamb (Contributor) left a comment

Thank you @saikrishna1-bidgely -- I really like this PR ❤️ -- thank you for sticking with it.

The only thing I think is missing is a test (mostly to ensure this ability is not removed by accident in the future).

For a test, what do you think about:

  1. adding a doc example on one of the methods (e.g. SessionContext::read_parquet) that shows reading from a list of strings
  2. Add a note on the other methods (e.g. SessionContext::read_csv) pointing at the method with the example.

I can help with this documentation if you like
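
A minimal sketch of the kind of test being asked for here, using placeholder CSV paths and the multi-path read_csv call this PR adds:

    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::test]
    async fn read_csv_accepts_multiple_paths() -> Result<()> {
        let ctx = SessionContext::new();
        // Placeholder paths; any two CSV files with the same schema would do.
        let df = ctx
            .read_csv(vec!["tests/data/a.csv", "tests/data/b.csv"], CsvReadOptions::new())
            .await?;
        // Collecting shows the data is readable; a real test would assert the
        // combined row count of both files.
        let batches = df.collect().await?;
        assert!(batches.iter().map(|b| b.num_rows()).sum::<usize>() > 0);
        Ok(())
    }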

options: impl ReadOptions<'a>,
) -> Result<DataFrame> {
let table_path = ListingTableUrl::parse(table_path)?;
let table_paths = table_paths.to_urls()?;
Review comment (Contributor):

❤️

@saikrishna1-bidgely requested review from tustvold and removed the review request for alamb February 15, 2023 16:05
@saikrishna1-bidgely (Contributor, Author) commented:

Regarding the docs, I think we should extend the example in SessionContext and add an example to all the read_* methods.
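
For reference, the kind of doc example being discussed, shown here for read_parquet with placeholder file paths; the single-path call keeps working unchanged:

    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        let ctx = SessionContext::new();

        // A single path still works exactly as before (placeholder file name).
        let _single = ctx
            .read_parquet("data/file1.parquet", ParquetReadOptions::default())
            .await?;

        // A Vec of paths reads several files into one DataFrame.
        let df = ctx
            .read_parquet(
                vec!["data/file1.parquet", "data/file2.parquet"],
                ParquetReadOptions::default(),
            )
            .await?;
        df.show().await?;

        Ok(())
    }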

@saikrishna1-bidgely (Contributor, Author) commented:

@alamb updated the docs. Should I remove call_read_csvs and the related methods?

@saikrishna1-bidgely (Contributor, Author) commented:

@alamb removed methods from CallReadTrait too.

@alamb (Contributor) left a comment

Thanks @saikrishna1-bidgely -- I think just a few more comment updates and this will be ready to go. Thanks for sticking with it!

@alamb (Contributor) left a comment

Thank you @saikrishna1-bidgely -- I think this is looking ready to go.

@saikrishna1-bidgely (Contributor, Author) commented:

Cool! Do we need more approvals before merging?

@alamb (Contributor) commented Feb 19, 2023

> Cool! Do we need more approvals before merging?

Nope, I was just giving it some time after approval for other maintainers to have a look if they wanted.

🚀

@alamb (Contributor) commented Feb 19, 2023

Closing/reopening to rerun CI (for some reason several of the tests were canceled).

@alamb closed this Feb 19, 2023
@alamb reopened this Feb 19, 2023
@alamb merged commit cfbb14d into apache:main Feb 20, 2023
@alamb (Contributor) commented Feb 20, 2023

Thanks again for sticking with this @saikrishna1-bidgely

@ursabot commented Feb 20, 2023

Benchmark runs are scheduled for baseline = ae89960 and contender = cfbb14d. cfbb14d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

jiangzhx pushed a commit to jiangzhx/arrow-datafusion that referenced this pull request Feb 24, 2023
…4908)

* Added a trait DataFilePaths to convert strings and vectors of strings to a vector of URLs.

* Added docs and tests. Updated DataFilePaths to accept any vector of items implementing AsRef<str>.

* Added docs to read_ methods and extended the SessionContext doc.

* Ran Cargo fmt

* removed CallReadTrait methods

* Update read_csv example

Co-authored-by: Andrew Lamb <[email protected]>

* removed addition to SessionContext example

---------

Co-authored-by: Lakkam Sai Krishna Reddy <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read multiple files/folders using read_csv
5 participants