Consider adopting IOx ObjectStore abstraction #2489
Comments
Indeed, I also prefer the more generic API interface in the IOx implementation; I had actually planned on proposing something similar on #2445. From an S3 perspective the only things that come to mind are:
As for the actual implementation, given the IOx implementation already has AWS, GCP, and Azure functionality, could we just create |
I believe @tustvold is actively working on preparing the iox code for crates.io release in https://github.com/influxdata/influxdb_iox/pull/4534 |
Yeah as alluded to by @alamb, my plan is to get the iox code released to crates.io so that DataFusion could use it. There would then be a couple of potential courses of action for DataFusion:
|
I've released the crate to crates.io: https://crates.io/crates/object_store. I'm going to take a stab at integrating this into DataFusion over this weekend. Hopefully I'll get something up as a workable draft |
Wrt fetching to local disk, we have an implementation of (datafusion) |
Yeah, buffered prefetch is one way to mitigate the small read problem. However, it does not allow for coalescing adjacent reads - i.e. you will still likely end up with one request per column chunk unless you have tiny columns. To be clear, my preference is for 3, which mirrors the new vectored API in S3A, but I'm currently working on 2 first to ensure there aren't any fundamental integration issues. |
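To illustrate what coalescing adjacent reads means here, below is a minimal sketch in Rust. It is not code from IOx, object_store, or DataFusion; the function name `coalesce_ranges` and the `max_gap` parameter are hypothetical and only serve the illustration. The idea is that byte ranges (for example, one per column chunk) that sit close together are merged, so each merged range could be served by a single request.

```rust
use std::ops::Range;

/// Hypothetical helper: merge byte ranges that overlap or are separated by
/// at most `max_gap` bytes, so nearby reads can share one request.
fn coalesce_ranges(ranges: &[Range<usize>], max_gap: usize) -> Vec<Range<usize>> {
    // Sort by start offset so only neighbouring ranges need comparing.
    let mut sorted: Vec<Range<usize>> = ranges.to_vec();
    sorted.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<usize>> = Vec::with_capacity(sorted.len());
    for r in sorted {
        if let Some(last) = merged.last_mut() {
            // Close enough to the previous range: extend it instead of
            // starting a new request.
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

With a larger `max_gap` you trade some wasted bytes for fewer requests, which is usually the right trade against an object store where per-request latency dominates.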
I believe @tustvold is working on this issue |
Cool. In our case the buffered prefetch helps marginally (but we also have a lot of sparse columns so it is a slightly special case which does a reasonable job at coalescing adjacent reads). apache/arrow-rs#1605 looks like a really good idea. We're also working on trying to optimize S3 reads at the moment so if there's any way I can help please let me know! |
that's great, at a time, I want to create |
* Switch to object_store crate (#2489)
* Test fixes
* Update to object_store 0.2.0
* More windows pacification
* Fix windows test
* Fix windows test_prefix_path
* More windows fixes
* Simplify ListingTableUrl::strip_prefix
* Review feedback
* Update to latest arrow-rs
* Use ParquetRecordBatchStream
* Simplify predicate pruning
* Add host to ObjectStoreRegistry

Co-authored-by: Andrew Lamb <[email protected]>
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In another issue @alamb and @tustvold suggested we might want to use the IOx ObjectStore implementation.
A few nice points I'll mention about the IOx one:
&str paths.
put() for writing. There doesn't seem to be streaming write support (multi-part upload).
There are a few differences in the API:
Current API: https://github.com/apache/arrow-datafusion/blob/dfdeb42d7d646cffcf3cff26beefcecffc6cbe62/data-access/src/object_store/mod.rs#L77
IOx API: https://github.com/influxdata/influxdb_iox/blob/94e9ac610acfb94870154d976f66a4d4111b5668/object_store/src/lib.rs#L74
list() implementation evaluated prefixes on path segments: "Prefixes are evaluated on a path segment basis, i.e. foo/bar/ is a prefix of foo/bar/x but not of foo/bar_baz/x."
There of course exist other repos that this has implications for:
From what I've seen, it seems like we could reasonably shift to simply use the IOx ObjectStore. But if there's a good reason, we could also reuse useful parts of the implementation to keep the existing API.
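For anyone unsure what the path-segment prefix rule quoted above means in practice, here is a rough sketch in Rust. It is not the IOx implementation; `is_segment_prefix` is a hypothetical name, and it assumes `/` as the delimiter and ignores empty segments. It only shows how segment-wise matching differs from a plain string prefix check.

```rust
/// Hypothetical helper: returns true if every segment of `prefix` equals the
/// corresponding leading segment of `path` (segments are split on '/').
fn is_segment_prefix(prefix: &str, path: &str) -> bool {
    // Empty segments (from leading, trailing, or doubled '/') are skipped.
    let mut path_parts = path.split('/').filter(|s| !s.is_empty());
    prefix
        .split('/')
        .filter(|s| !s.is_empty())
        .all(|p| path_parts.next() == Some(p))
}
```

Under this sketch, `foo/bar/` matches `foo/bar/x` but not `foo/bar_baz/x`, whereas a raw `"foo/bar_baz/x".starts_with("foo/bar")` check would match, which is the difference the quoted documentation is pointing at.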
cc @matthewmturner @kyotoYaho @roeap