Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Re-)add support for glob patterns in ListingTableUrl #3261

Closed
timvw opened this issue Aug 25, 2022 · 2 comments
Closed

(Re-)add support for glob patterns in ListingTableUrl #3261

timvw opened this issue Aug 25, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@timvw
Copy link
Contributor

timvw commented Aug 25, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Since the much needed cleanup and rationalization in #2578 of ListingTableUrl there is only support for glob patterns when no scheme is provided (in practice: only supported on local filesystem and not on other object_stores anymore).

Describe the solution you'd like
To have proper support for glob patterns. Eg, updating the documentation (and implementation) for ListingTableUrl to the following:

/// Parse a provided string as a `ListingTableUrl`
///
/// # Glob File Paths
///
/// If the path contains any of `'?', '*', '['`, it will be considered
/// a glob expression and resolved as following:
///
/// The string up to the first path segment containing a glob expression will be extracted,
/// and resolved as any other provided string.
///
/// The remaining string will be interpreted as a [`glob::Pattern`] and used as a
/// filter when listing files from object storage
///
/// # Paths without a Scheme
///
/// If no scheme is provided, or the string is an absolute filesystem path
/// as determined [`std::path::Path::is_absolute`], the string will be
/// interpreted as a path on the local filesystem using the operating
/// system's standard path delimiter, i.e. `\` on Windows, `/` on Unix.
///
/// If you wish to specify a path that does not exist on the local
/// machine you must provide it as a fully-qualified [file URI]
/// e.g. `file:///myfile.txt`
///
/// [file URI]: https://en.wikipedia.org/wiki/File_URI_scheme

Describe alternatives you've considered
We could keep things as they are and push support for globbing further into user-space.
In that case I suggest removing the support for glob altogether in ListingTableUrl.

Today, when a path/string contains an '*' or '[' the user is greeted with a BadSegment error anyway.

@tustvold WDYT?

@timvw timvw added the enhancement New feature or request label Aug 25, 2022
@tustvold
Copy link
Contributor

The reason I didn't do this is glob characters aren't URL-safe, so something like s3://bucket/path/*.parquet isn't a valid URL. I could only find examples of systems that supported glob expressions to local filesystem, and so I wasn't really sure how best to encode globs in URLs and opted to just punt on it.

Some possible ideas:

  • Just ignore that it isn't a valid URL and accept the fact it is potentially very confusing (what this ticket proposes)
  • Provide a programmatic interface to construct a ListingTableUrl with a custom scheme and glob
  • Encode the glob expression as a URL-encoded query parameter
  • Something else

It is also potentially worth highlighting that IIRC the logical plan serialization currently doesn't handle glob expressions and just drops them on the floor.

I think it would really help move this forward if we could find an example of a system that supports glob expressions to object stores, otherwise we end up having to design something custom which we will inevitably get wrong

@timvw
Copy link
Contributor Author

timvw commented Aug 25, 2022

A couple of thoughts:

Anyway, instead of making all these breaking changes without too much thinking I propose to introduce a GlobbingTable which has Globs (similar to ListingTable and it's ListingTableUrl) in datafusion-contrib and see how it works out...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants