-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ Read GeoTIFF files from remote urls via object_store #5
Conversation
A generic object store interface for uniformly interacting with AWS S3, Google Cloud Storage, Azure Blob Storage and local files! Also added the url crate.
An event-driven, non-blocking I/O platform for writing asynchronous I/O backed applications!
Doing test driven development by trying to load a remote GeoTIFF file via http into a numpy array.
Allow passing remote URLs to the read_geotiff_py function. URLs are handled via object_store's parse_url function. Data is read asynchronously into an in-memory buffer (within a tokio runtime), and passed into the synchronous io::geotiff::read_geotiff function. Also added a short example in the Python docstring on how this can be used.
src/lib.rs
Outdated
// Parse URL | ||
let url = Url::parse(path).expect(&format!("Cannot parse path: {path}")); | ||
let (store, location) = parse_url(&url).expect(&format!("Cannot parse url: {url}")); | ||
|
||
// Return cursor to in-memory buffer | ||
let result = store.get(&location).await.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to find a way here to let object_store
read from local filepaths like /tmp/float32.tif
or data/float32.tif
. Currently at commit 33faafb, one would need to prepend the filepath with file://
for the read to work, otherwise there will be an error like pyo3_runtime.PanicException: Cannot parse path: /tmp/tmp6ip79h4_/float32.tif: RelativeUrlWithoutBase
(see https://github.com/weiji14/cog3pio/actions/runs/8115987634/job/22185036928?pr=5#step:6:89).
Maybe add some logic to determine the URL scheme, and if it is local, prepend the url with file://
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait, RelativeUrlWithoutBase
is raised on Url::parse
, not ObjectStore 😅 Edit: Should be handled in f9607f3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, need to figure out how to handle Windows paths - https://github.com/weiji14/cog3pio/actions/runs/8118215740/job/22192052940?pr=5#step:6:95:
E pyo3_runtime.PanicException: Cannot parse url: c:\Users\RUNNER~1\AppData\Local\Temp\tmpv5aci1cm\float32.tif: Generic { store: "URL", source: Unrecognised { url: Url { scheme: "c", cannot_be_a_base: true, username: "", password: None, host: None, port: None, path: "\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmpv5aci1cm\\float32.tif", query: None, fragment: None } } }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, figured out that there's a Url::from_file_path
function that handles Windows paths (and also prepends file://
to local filepaths). Implemented in commit 3c8a604
Cargo.toml
Outdated
numpy = "0.20.0" | ||
object_store = { version = "0.9.0", features = ["http"] } | ||
pyo3 = { version = "0.20.3", features = ["abi3-py310", "extension-module"] } | ||
tiff = "0.9.1" | ||
tokio = "1.36.0" | ||
url = "2.5.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably need to hide some of the Python-specific crates behind a feature flag (if the pure Rust parts will be published as a crate). Hoping that most of the Rust code can be upstreamed though. See PyO3/pyo3#2271 for reference.
Handle the RelativeUrlWithoutBase parsing error raised when local filepaths are given without `file://` in front. Also moved the parse url logic out of the async block.
CodSpeed Performance ReportMerging #5 will degrade performances by 33.24%Comparing Summary
Benchmarks breakdown
|
Resolve `warning: [email protected]: #error "ARM assembler must define __ARM_ARCH"` on aarch64. Xref PyO3/maturin-action#222
Getting an error like `/usr/bin/python3: No module named pip` when using the quay.io/pypa/manylinux_2_28_x86_64:latest docker image, so trying with ghcr.io/rust-cross/manylinux_2_28-cross instead. Note that the aarch64/armv7/s390x/ppc64le targets use this rust-cross image by default.
Don't compile on i686 since manylinux_2_28 doesn't support it according to https://github.com/rust-cross/manylinux-cross/blob/1c46d3b27a55eff53d675bf515d6361d7cc1abde/README.md?plain=1#L25-L35 and https://github.com/pypa/manylinux/blob/dbe0aa5c374c5b69769253d2a77820b08fe80a7d/README.rst?plain=1#L46. Note that x86 is an alias of i686 (see https://github.com/PyO3/maturin-action/blob/60d11847b29f81ca5375519a8eb33cc336ba4bfa/src/index.ts#L119).
Use built-in function at https://docs.rs/url/2.5.0/url/struct.Url.html#method.from_file_path to add the file:// prefix instead of the DIY method at f9607f3. Hoping that this would handle Windows paths too. Also changed .expect to .map_err to raise PyValueError instead of panicking.
Add instructions on how to install cog3pio in Rust or Python (need to pull development version from git for now), and sample code to read a GeoTIFF file from a http url. Included some tips on pinning the specific commit hash, and mentioned that the current crate/library only supports reading single-band float32 GeoTIFF files. Also checked the box that reading from remote storage is implemented.
Gracefully handle errors when retrieving the GeoTIFF data in the async block, using turbofish operator as suggested by https://rust-lang.github.io/async-book/07_workarounds/02_err_in_async_blocks.html. Fixed some errors missing the format! macro, and added unit tests to catch parsing and decoding errors. In the read_geotiff function, changed the function to return a TiffError instead of ShapeError, so that ? can be used in the decoding portions.
Copy the Rust code from 15b0dbc to src/lib.rs as a doctest. Need to change `crate-type = ["rlib"]` in Cargo.toml so that `cargo test --doc` works, but will configure that in CI later.
// Open TIFF file from path | ||
let file = File::open(path).expect("Cannot find GeoTIFF file"); | ||
// Parse URL into ObjectStore and path | ||
let file_or_url = match Url::from_file_path(path) { | ||
// Parse local filepath | ||
Ok(filepath) => filepath, | ||
// Parse remote URL | ||
Err(_) => Url::parse(path) | ||
.map_err(|_| PyValueError::new_err(format!("Cannot parse path: {path}")))?, | ||
}; | ||
let (store, location) = parse_url(&file_or_url) | ||
.map_err(|_| PyValueError::new_err(format!("Cannot parse url: {file_or_url}")))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CodSpeed Performance Report
Merging #5 will degrade performances by 32.21%
Comparing
object_store
(33f4a73) withmain
(0c37e32)Summary
❌ 1 (👁 1)
regressions
🆕 1
new benchmarksBenchmarks breakdown
Benchmark
main
object_store
Change
👁test_read_geotiff_local
423 µs 624 µs -32.21%
🆕test_read_geotiff_remote
N/A 49 ms N/A
Acknowledging the slowdown in reading local GeoTIFF files as acceptable for now since it is only ~200µs slower, but could revisit this next time if we should have separate branches for remote files (read using object_store
) and local files (read using std::fs::File
).
Enable GeoTIFFs stored on remote urls (e.g. http, s3, azure, gcp, etc) to be read into numpy arrays.
Uses the
object_store
crate, which pulls data via an async API. Note that only the network read is async, the TIFF decoding (using thetiff
crate) is still synchronous.Usage (in Python):
TODO:
object_store
'sparse_url
function to handle remote URL readsdata/float32.tif
) work alongside remote URLs (e.g.https://somewhere.com/float32.tif
).unwrap()
statementsOther things done in this PR out of necessity:
References: