GDAL Virtual Rasters #166

Open
TomNicholas opened this issue Jun 29, 2024 · 9 comments
Labels
enhancement New feature or request help wanted Extra attention is needed references generation Reading byte ranges from archival files

Comments

@TomNicholas
Member

From https://docs.csc.fi/support/tutorials/gis/virtual-rasters/ (emphasis mine):

Virtual rasters is useful GDAL concept for managing large raster datasets that are split into not overlapping map sheets. Virtual rasters are not useful for managing time-series or overlapping rasters, for example remote sensing tiles.

Technically a virtual raster is just a small xml file that tells GDAL where the actual data files are, but from user's point of view virtual rasters can be treated much like any other raster format. Virtual rasters can include raster data in any file format GDAL supports. Virtual rasters are useful because they allow handling of large datasets as if they were a single file eliminating need for locating correct files.

It is possible to use virtual rasters so, that only the small xml-file is stored locally and the big raster files are in Allas, Amazon S3, publicly on server or any other place supported by GDAL virtual drivers. The data is moved to local only for the area and zoom level requested when the virtual raster is opened. The best performing format to save your raster data in remote service is Cloud optimized GeoTIFF, but other formats are also possible.

That sounds a lot like a set of reference files, doesn't it... Maybe we could ingest those virtual raster files and turn them into chunk manifests, like we're doing with DMR++ in #113?

Also, we can definitely open Cloud optimized GeoTIFFs now (since #162).

Thanks to @scottyhq for mentioning this idea. Maybe he, @abarciauskas-bgse, or someone else who knows more about GDAL can say whether they think this idea might actually work or not.

@TomNicholas TomNicholas added enhancement New feature or request help wanted Extra attention is needed references generation Reading byte ranges from archival files labels Jun 29, 2024
@abarciauskas-bgse
Collaborator

I'm not an expert on VRTs but I think it could work. It could potentially be useful if you want to create a dataset from rasters which are overlapping and the VRT represents an already deduplicated version of the data (assuming the logic for deduplication is appropriate). Mostly, I'm not sure how useful it is to have this functionality because I am not familiar with VRTs that are made publicly available or published for general use. I have heard of VRTs being used for on-the-fly definition of mosaics.

I am also going to tag my colleagues @wildintellect and @vincentsarago who have more experience with VRTs than I do and may be able to think of reasons this may or may not work.

@wildintellect

@abarciauskas-bgse converting a VRT to a Reference File for Zarr seems fine. I'm not sure the VRT would contain all the chunk information you need, so the source files may also need to be scanned. At that point it's not super different from just being given a list of files to include in a manifest.

Example:

<VRTDataset rasterXSize="512" rasterYSize="512">
    <GeoTransform>440720.0, 60.0, 0.0, 3751320.0, 0.0, -60.0</GeoTransform>
    <VRTRasterBand dataType="Byte" band="1">
        <ColorInterp>Gray</ColorInterp>
        <SimpleSource>
            <SourceFilename relativeToVRT="1">utm.tif</SourceFilename>
            <SourceBand>1</SourceBand>
            <SrcRect xOff="0" yOff="0" xSize="512" ySize="512"/>
            <DstRect xOff="0" yOff="0" xSize="512" ySize="512"/>
        </SimpleSource>
    </VRTRasterBand>
</VRTDataset>
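
To give a rough idea of what can be harvested from a VRT like this without opening the source files, here is a minimal sketch using only Python's standard library. The helper names `list_vrt_sources` and `_rect` are made up for this illustration and are not part of GDAL or virtualizarr. Note that it recovers file names and pixel windows only; nothing in the VRT gives tile byte offsets, which is why the source files would still need scanning.

```python
import xml.etree.ElementTree as ET

# The example VRT from this comment, embedded so the sketch is self-contained.
VRT_XML = """<VRTDataset rasterXSize="512" rasterYSize="512">
    <GeoTransform>440720.0, 60.0, 0.0, 3751320.0, 0.0, -60.0</GeoTransform>
    <VRTRasterBand dataType="Byte" band="1">
        <ColorInterp>Gray</ColorInterp>
        <SimpleSource>
            <SourceFilename relativeToVRT="1">utm.tif</SourceFilename>
            <SourceBand>1</SourceBand>
            <SrcRect xOff="0" yOff="0" xSize="512" ySize="512"/>
            <DstRect xOff="0" yOff="0" xSize="512" ySize="512"/>
        </SimpleSource>
    </VRTRasterBand>
</VRTDataset>"""


def _rect(elem):
    """Turn a SrcRect/DstRect element into a dict of floats (None if absent)."""
    if elem is None:
        return None
    return {key: float(elem.get(key)) for key in ("xOff", "yOff", "xSize", "ySize")}


def list_vrt_sources(vrt_xml):
    """Hypothetical helper: list the sources referenced by each band of a VRT."""
    root = ET.fromstring(vrt_xml)
    sources = []
    for band in root.iter("VRTRasterBand"):
        for source in band:
            # Only plain pixel sources are handled in this sketch.
            if source.tag not in ("SimpleSource", "ComplexSource"):
                continue
            filename = source.find("SourceFilename")
            if filename is None:
                continue
            sources.append({
                "band": int(band.get("band", "1")),
                "filename": filename.text,
                "relative_to_vrt": filename.get("relativeToVRT") == "1",
                "source_band": int(source.findtext("SourceBand", default="1")),
                "src_rect": _rect(source.find("SrcRect")),
                "dst_rect": _rect(source.find("DstRect")),
            })
    return sources


sources = list_vrt_sources(VRT_XML)
for entry in sources:
    print(entry["filename"], entry["dst_rect"])
```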

Fun, I didn't know about https://gdal.org/drivers/raster/vrt_multidimensional.html. Not sure I've ever seen one of these.

To be clear, a VRT does not de-duplicate anything. When using a VRT with GDAL:

If there is some amount of spatial overlapping between files, the order of files appearing in the list of source matter: files that are listed at the end are the ones from which the content will be fetched
https://gdal.org/programs/gdalbuildvrt.html
https://gdal.org/drivers/raster/vrt.html
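
A toy sketch of that "last listed wins" rule, in case it helps. This is not GDAL code: `source_for_pixel` is a hypothetical function, the file names are invented, and it ignores nodata and alpha handling entirely, so treat it as an illustration of the ordering only.

```python
def source_for_pixel(sources, x, y):
    """Return the file that supplies pixel (x, y), or None if nothing covers it.

    `sources` is an ordered list of (filename, (x_off, y_off, x_size, y_size))
    tuples in destination pixel coordinates, in the order they are listed.
    """
    winner = None
    for filename, (x_off, y_off, x_size, y_size) in sources:
        covers = x_off <= x < x_off + x_size and y_off <= y < y_off + y_size
        if covers:
            # A later source overrides any earlier one covering the same pixel.
            winner = filename
    return winner


# Two invented, partially overlapping sources: columns 256..511 are shared.
sources = [
    ("a.tif", (0, 0, 512, 512)),
    ("b.tif", (256, 0, 512, 512)),
]

print(source_for_pixel(sources, 100, 100))  # only a.tif covers this pixel
print(source_for_pixel(sources, 300, 100))  # both cover it, b.tif is listed last
```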

So up to you if you'd want a VRT which takes effort, or would rather just be passed a list of files to include in a mosaiced reference file.

Here's a great one you can experiment with https://github.com/scottstanie/sardem/blob/master/sardem/data/cop_global.vrt
This shows nested VRTs and points to a public dataset on AWS that is a global DEM with no overlaps, 1 projection, only 1 band, and 1 time point. So in some ways the simplest possible scenario.

@abarciauskas-bgse
Copy link
Collaborator

Interesting, thanks @wildintellect.

Thanks for clearing that up about de-duplication. I was under the impression that VRTs could represent a mosaic after deduplication of source files (e.g. spatial overlapping is resolved through logic while building the VRT). But I suppose that use case would be choosing overlapping data preference by block level, not pixel level.

@scottyhq
Contributor

scottyhq commented Jul 3, 2024

Thanks for the ping @TomNicholas! Some good points have already been mentioned. I think I just brought up VRTs because they are another example of lightweight sidecar metadata that simplifies the user experience of data management :) ... I haven't thought too much about integrations with virtualizarr, but some ideas below:

I suppose in the same way you create a reference file for NetCDF/DMR++ to bypass HDF and use Zarr instead, you could do the same for TIFF/VRT to bypass GDAL. Would probably want to do some benchmarking there, because, unlike HDF, GDAL is pretty good at using overviews and efficiently figuring out range requests during reads (for the common case of a VRT pointing at cloud-optimized geotiffs).

I think another connection here is what is the serialization format for virtualizarr and what is its scope? My understanding is the eventual goal is to save directly to ZARR v3 format and there are I'm sure lots of existing discussions that I'm not up to speed on. But my mental model is that VRT, STAC, ZARR, KerchunkJSON are all lightweight metadata mappings that can encode many things (file and byte locations, arbitrary metadata, "on read" computations like scale and offset, subset, reprojection).

It seems these lightweight mappings work well up to a limit, and then you encounter the need for some sort of spatial index or database system :) So again, my mapping becomes (KerchunkJSON -> Parquet, VRT -> GTI, STAC -> pgSTAC, ZARR -> Earthmover?)

@TomNicholas
Member Author

Thanks @scottyhq !

lightweight metadata mappings that can encode many things (file and byte locations, arbitrary metadata, "on read" computations like scale and offset, subset, reprojection).

I see the chunk manifest as exclusively dealing with file and byte locations, and everything else in that list should live elsewhere in zarr (e.g. codecs or metadata following a certain convention).
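
To make "exclusively file and byte locations" concrete, here's a minimal sketch of that kind of mapping as a plain dict. It assumes a tiled TIFF whose tile offsets and byte counts have already been read from the file (TIFF lists tiles left-to-right, top-to-bottom). The function name `build_manifest`, the bucket path, and all the numbers are invented for illustration; this is not the virtualizarr API.

```python
def build_manifest(path, tile_offsets, tile_byte_counts, tiles_across):
    """Map chunk keys like "row.col" to a byte range in one file.

    Only locations are recorded here: no scale/offset, no reprojection,
    no other metadata.
    """
    if len(tile_offsets) != len(tile_byte_counts):
        raise ValueError("need exactly one byte count per tile offset")
    manifest = {}
    for index, (offset, length) in enumerate(zip(tile_offsets, tile_byte_counts)):
        # Tiles are stored row-major, so the flat index splits into (row, col).
        row, col = divmod(index, tiles_across)
        manifest[f"{row}.{col}"] = {"path": path, "offset": offset, "length": length}
    return manifest


# Invented values: a 2x2 grid of uncompressed 4096-byte tiles.
manifest = build_manifest(
    "s3://example-bucket/utm.tif",
    tile_offsets=[8, 4104, 8200, 12296],
    tile_byte_counts=[4096, 4096, 4096, 4096],
    tiles_across=2,
)
for key, entry in manifest.items():
    print(key, entry)
```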

I would be very curious to hear @mdsumner's thoughts on all the above.

@mdsumner
Contributor

mdsumner commented Aug 9, 2024

I think @scottyhq captured my stance well, and I'm glad to see GTI mentioned here - that's really important, and new.

I actually see this completely in the opposite direction, and I wish there was more use of GDAL and VRT itself, it's amazing - but there's these heavy lenses in R and Python over the actual API (but, we have very good support in {gdalraster} and in osgeo.gdal already) - that's a story for elsewhere.

VRT is already an extremely lightweight virtualization, and I went looking for a similar serialization/description for xarray and ended up here. kerchunk/virtualizarr is perfect for hdf/grib IMO but not for the existing GDAL suite. Apart from harvesting filepaths, urls, connections (database strings, vrt:// strings, /vsi* protocols) I don't see what would be the point. There certainly could be a Zarr description of a mosaic, but I'd be adding that as a feature to GDAL as the software to convert it from VRT or from a WMTS connection, etc, not trying to bypass it. VRT can mix formats too, it's a very general way to craft a virtual dataset from disparate and even depauperate sources.

If you want to bypass GDAL for TIFF I think you've already got what's needed, but to support VRT you would need to recreate GDAL in large part. How would it take a subset/decimation/rescaling/set-missing-metadata description for a file? I don't think you can sensibly write reference byte ranges for parts of native tiles.
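
One way to picture the problem with partial tiles: a source window can only be expressed as whole-chunk byte ranges if it is copied at 1:1 scale and starts on a native tile boundary in both the source and the destination. A rough sketch of that check follows. `is_chunk_aligned` is a hypothetical helper, rects are `(x_off, y_off, x_size, y_size)` tuples, and it deliberately does not check whether the window ends on a tile boundary or the image edge.

```python
def is_chunk_aligned(src_rect, dst_rect, block_size):
    """Return True if a source window could map onto whole native tiles.

    src_rect / dst_rect: (x_off, y_off, x_size, y_size) in pixels.
    block_size: (block_width, block_height) of the source's native tiling.
    """
    src_x, src_y, src_w, src_h = src_rect
    dst_x, dst_y, dst_w, dst_h = dst_rect
    block_w, block_h = block_size

    # Any rescaling or decimation means pixels must be recomputed on read.
    same_scale = (src_w, src_h) == (dst_w, dst_h)
    # Offsets must fall on tile boundaries, otherwise tiles get cut apart.
    src_aligned = src_x % block_w == 0 and src_y % block_h == 0
    dst_aligned = dst_x % block_w == 0 and dst_y % block_h == 0
    return same_scale and src_aligned and dst_aligned


print(is_chunk_aligned((0, 0, 512, 512), (0, 0, 512, 512), (256, 256)))    # whole-tile copy
print(is_chunk_aligned((0, 0, 512, 512), (0, 0, 256, 256), (256, 256)))    # decimated 2x
print(is_chunk_aligned((100, 0, 512, 512), (0, 0, 512, 512), (256, 256)))  # starts mid-tile
```

Sources that fail a check like this are the ones that would need GDAL's resampling and windowing logic rather than plain byte-range references.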

All that said, I'm extremely interested in the relationship between image-tile-servers/GTI/VRT and the various vrt:// and /vsi* abstractions, and how Zarr and its virtualizations work. There are gaps in both, but together they cover a huge gamut of capability and I'm exploring as fast as I can to be able to talk more sensibly about all that.

@mdsumner
Contributor

mdsumner commented Aug 9, 2024

oh one technical point on the mention of "byte locations", which I misplaced in my first read

lightweight metadata mappings that can encode many things (file and byte locations, arbitrary metadata, "on read" computations like scale and offset, subset, reprojection).

That is not a general VRT thing (I think that also wasn't being suggested, but still I think it's worth adding more here), apart from being able to describe "raw" sources.

You can craft a VRT that wraps a blob in a file or in memory, described by shape, bbox, crs, dtype, address, and size for example, but it's not something that's used for formats with an official driver.

The documentation is here:

Virtual raster:

https://gdal.org/drivers/raster/vrt.html

Virtual file systems:

https://gdal.org/user/virtual_file_systems.html

VRT for raw binary files:

https://gdal.org/drivers/raster/vrt.html#vrt-descriptions-for-raw-files

MEM or in-memory raster:
https://gdal.org/drivers/raster/mem.html
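
For a feel of what a "raw" description looks like, here's a small sketch that writes one with Python's standard library, following the element names in the raw-files section of the VRT docs linked above. The function `raw_vrt` and the file name `blob.bin` are made up for the example, and it assumes a single band stored row by row with no padding.

```python
import xml.etree.ElementTree as ET


def raw_vrt(filename, width, height, dtype="Byte", image_offset=0,
            bytes_per_pixel=1, byte_order="LSB"):
    """Build VRT XML describing a flat binary blob as a single raster band."""
    root = ET.Element("VRTDataset", rasterXSize=str(width), rasterYSize=str(height))
    band = ET.SubElement(
        root, "VRTRasterBand", dataType=dtype, band="1", subClass="VRTRawRasterBand"
    )
    source = ET.SubElement(band, "SourceFilename", relativeToVRT="1")
    source.text = filename
    # Byte address of the first pixel, then the strides between pixels and rows.
    ET.SubElement(band, "ImageOffset").text = str(image_offset)
    ET.SubElement(band, "PixelOffset").text = str(bytes_per_pixel)
    ET.SubElement(band, "LineOffset").text = str(bytes_per_pixel * width)
    ET.SubElement(band, "ByteOrder").text = byte_order
    return ET.tostring(root, encoding="unicode")


vrt_text = raw_vrt("blob.bin", width=512, height=512)
print(vrt_text)
```

The offset and stride fields are the closest thing here to the byte locations in a kerchunk-style reference, which is where the crossover seems most direct.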

I think it's interesting in its relationship to how virtualizarr/kerchunk works and there's a lot of potential crossover.

@TomNicholas
Member Author

VRT is already an extremely lightweight virtualization

I don't think the idea here would be to replace or add a new layer, but instead to create tools that can easily translate VRT to virtual Zarr or possibly vice versa. See the issues on DMR++ for a similar idea for another virtualization format.

@TomAugspurger
Contributor

TomAugspurger commented Aug 9, 2024 via email
