Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add automatic chunking to open_rasterio #2255

Closed
wants to merge 2 commits into from

Conversation

mrocklin
Copy link
Contributor

This uses the automatic chunking in dask 0.18+ to chunk rasterio
datasets in a nicely aligned way.

Currently this doesn't implement tests due to a difficulty in creating
chunked tiff images.

This also uncovered some inefficiencies in how Dask doesn't align rechunking to existing chunk schemes.

  • Closes Default chunking in GeoTIFF images #2093
  • Tests added (for all bug fixes or enhancements)
  • Tests passed (for all non-documentation changes)
  • Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

I could use help on how the following:

  • How to create tiled TIFF files in the tests
  • The right way to merge different dtypes and block shapes in the tiff file. Currently I'm assuming that they're uniform

This uses the automatic chunking in dask 0.18+ to chunk rasterio
datasets in a nicely aligned way.

Currently this doesn't implement tests due to a difficulty in creating
chunked tiff images.
@mrocklin mrocklin force-pushed the rasterio-auto-chunking branch from 3ccf864 to ef8f193 Compare June 27, 2018 21:01
assert actual.chunks[0] == (1, 1, 1)
assert actual.chunks[1] == (256,) * 4
assert actual.chunks[2] == (256,) * 8
with xr.open_rasterio(tmp_file, chunks=(3, 'auto', 'auto')) as actual:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E501 line too long (86 > 79 characters)

@mrocklin
Copy link
Contributor Author

mrocklin commented Jun 28, 2018

import os
if not os.path.exists('myfile.tif'):
    import requests
    response = requests.get('https://oin-hotosm.s3.amazonaws.com/5abae68e65bd8f00110f3e42/0/5abae68e65bd8f00110f3e43.tif')
    with open('myfile.tif', 'wb') as f:
        f.write(response.content)

import dask
dask.config.set({'array.chunk-size': '1MiB'})

import xarray as xr
ds = xr.open_rasterio('myfile.tif', chunks=True)  # this only reads metadata to start

>>> ds.chunks
((1, 1, 1),
 (1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 136),
 (1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 995))

Also depends on dask/dask#3679 . Without that PR it will use values that are similar, but don't precisely align with 1024.

Oh, I should point out that the image has tiles of size (512, 512)

previous_chunks = tuple((c,) for c in block_shape)
shape = (img.count, img.height, img.width)
dtype = img.dtypes[0]
chunks = dask.array.core.normalize_chunks(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we expose normalize_chunks() as a top-level API in dask.array, e.g., dask.array.normalize_chunks? I'm generally a little nervous about dipping into dask.array.core.

@snowman2
Copy link
Contributor

snowman2 commented Jul 24, 2019

The right way to merge different dtypes and block shapes in the tiff file. Currently I'm assuming that they're uniform

I have yet to run into a raster that varies dtypes and block shapes across bands. Most of the time, they are single band rasters. And if they are not, they have had the same dtype and block shape. So, I think your assumption is a good one for most use cases.

Also, only a single dtype is allowed currently:

if not np.all(np.asarray(dtypes) == dtypes[0]):
raise ValueError('All bands should have the same dtype')

@snowman2
Copy link
Contributor

snowman2 commented Jul 24, 2019

How to create tiled TIFF files in the tests

One thing I would like to note is that the automatic chunking would be useful if the raster is tiled or not. I tested out a raster that was not tiled, but it still had chunks. This is due to the raster being written in stripes. So, I would recommend removing the restriction to only tiled rasters.

Also, to create a tiled raster:

import rasterio
import numpy
from affine import Affine

with rasterio.open(
    "tiled.tif",
    "w",
    driver="GTiff",
    count=2,
    width=1024,
    height=1024,
    crs="+init=epsg:4326",
    transform=Affine(0.0083333333, 0.0, -180.00416666665, 0.0, -0.0083333333, 75.00416666665),
    dtype=rasterio.float32,
    tiled=True,
    blockxsize=512,
    blockysize=512,
) as rds:
    rds.write((numpy.random.rand(2, 1024, 1024)*10).astype(numpy.float32))

Looks like they have this option in the tests:

open_kwargs=dict(
    tiled=True,
    blockxsize=512,
    blockysize=512
)
with create_tmp_geotiff(nx=1024, ny=1024, nz=2, open_kwargs=open_kwargs) as (tmp_file, expected):
    ....

@mrocklin
Copy link
Contributor Author

I've abandoned this PR. If anyone has time to pick it up, that would be welcome. I think that it would have positive impact.

@snowman2
Copy link
Contributor

snowman2 commented Jul 24, 2019

I've abandoned this PR. If anyone has time to pick it up, that would be welcome.

I appreciate you staring this! Based on this PR, I added the feature into rioxarray here: corteva/rioxarray#31 (released in version 0.0.9). (Example usage can be seen here.)

@mrocklin
Copy link
Contributor Author

mrocklin commented Jul 24, 2019 via email

@snowman2
Copy link
Contributor

I'm curious, are there features in rioxarray that could be pushed upstream?

Depends on what the xarray maintainers would like to add. I would definitely like to see the open_rasterio function in rioxarray absorbed back into xarray. Things have just been really slow moving with the other PRs/issues.

@fmaussion
Copy link
Member

I'm curious, are there features in rioxarray that could be pushed upstream?

Depends on what the xarray maintainers would like to add. I would definitely like to see the open_rasterio function in rioxarray absorbed back into xarray. Things have just been really slow moving with the other PRs/issues.

Put another way: why don't we put all the logic in rioxarray and make rioxarray an optional dependency of xarray to open rio files?

@snowman2
Copy link
Contributor

Put another way: why don't we put all the logic in rioxarray and make rioxarray an optional dependency of xarray to open rio files?

That is an option. All of the logic has already been moved over.

@keewis
Copy link
Collaborator

keewis commented Mar 26, 2021

with #4697 and the fact that this seems to have been included in rioxarray: should we close this?

@keewis keewis added the plan to close May be closeable, needs more eyeballs label Oct 11, 2021
@dcherian dcherian closed this Apr 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to close May be closeable, needs more eyeballs topic-backends
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Default chunking in GeoTIFF images
8 participants