
Updates to the way satip uses JPEG XL compression? #129

Closed · 2 of 17 tasks · JackKelly opened this issue Sep 13, 2022 · 7 comments
Assignee: JackKelly
Labels: enhancement (New feature or request)


JackKelly commented Sep 13, 2022

Motivation & aims

  • Make the satellite Zarr easy to sample from on-the-fly during ML training (without having to pre-prepare batches or pre-load a subset of the training data into system RAM).
  • Also ensure the data is as easy as possible to use for collaborators. Small size on disk. Relatively easy toolchain.

Plan

Encode multiple satellite channels into a single JPEG XL file

  • See if libjxl 0.6.1 can already encode arbitrary channels, and whether imagecodecs v2022.8.8 can already handle this if we pass in a 4-dimensional array (time, y, x, channels) with a single timestep, and also pass photometric=JXL_COLOR_SPACE_GRAY (see the sketch after this list). #131
  • Benchmark basic read performance with the existing Zarr. Write a simple multi-threaded app to load data as fast as possible. Try with and without dask, and compare to uint8 bz2 (two ways: one channel per chunk, and all channels in a single chunk).
  • Try different chunking with uint8 bzip2.
  • Ask if I can use the development version of imagecodecs, see libjxl v0.7rc1 cgohlke/imagecodecs#46
  • Try updating imagecodecs to libjxl v0.7rc. Submit a PR to imagecodecs if necessary.
  • Try encoding all non-HRV channels in a single jxl file.
  • Try encoding all satellite channels (including HRV) in a single jxl file, because jxl can express that additional channels might be at different scales (e.g. see the dim_shift member of the JxlExtraChannelInfo struct). This might be nice so we can avoid having tiny HRV-only chunks on disk, which are inefficient to load. But don't worry if this is too much work to integrate with Zarr and xarray. Note that non-HRV extends further east and west than HRV, so we probably want to crop the non-HRV data to the same shape as HRV.
  • Try encoding the "most visible" satellite channels as RGB, and the remaining channels as JXL_CHANNEL_OPTIONAL. Ensure JPEG XL doesn't try to do any colour space conversion! If JPEG XL insists on doing colour space conversion with RGB data, then just give it a single "GRAY" channel, and all remaining channels as OPTIONAL.
  • Encode missing data without the slightly hacky approach I previously suggested satip should use:
    • Try JXL_CHANNEL_SELECTION_MASK.
    • Try the alpha channel.
  • Try naming the channels within JPEG XL.
  • Try getting JPEG XL to use information from one channel to help compress another channel. Jon says: "JPEG XL can in fact use information across channels to better compress. However the current libjxl only does a limited amount of that - we haven't really been optimizing for images with more than just alpha (i.e. a single extra channel). You'll want to set the encode option JXL_ENC_FRAME_SETTING_MODULAR_NB_PREV_CHANNELS to a higher value than zero, and I think you probably also need to set effort to 8 or 9 to actually make use of this."
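
A minimal sketch of the first checklist item above, assuming the imagecodecs Python API roughly as documented. The exact keyword names (photometric, lossless, effort) and the spelling of the greyscale colour-space value may differ between imagecodecs releases, so treat this as illustrative rather than a confirmed call signature:

import numpy as np
import imagecodecs

# One timestep of all 11 non-HRV channels, already scaled to uint8:
# shape = (time, y, x, channels) = (1, 512, 512, 11).
block = np.random.randint(0, 256, size=(1, 512, 512, 11), dtype=np.uint8)

# Ask for a greyscale colour space so libjxl doesn't attempt any RGB
# colour-space conversion (JXL_COLOR_SPACE_GRAY in libjxl terms).
encoded = imagecodecs.jpegxl_encode(
    block,
    photometric="gray",  # assumption: the wrapper maps this to JXL_COLOR_SPACE_GRAY
    lossless=True,
    effort=8,            # per Jon's advice, effort 8-9 is needed for cross-channel modelling
)

decoded = imagecodecs.jpegxl_decode(encoded)
print(len(encoded), decoded.shape, decoded.dtype)

Exposing JXL_ENC_FRAME_SETTING_MODULAR_NB_PREV_CHANNELS from Python would probably need the libjxl v0.7 update and/or the imagecodecs PR mentioned above.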

Benchmarking, compatibility, integration with Zarr & xarray

  • Try opening the chunks in a naive JPEG XL viewer
  • Benchmark. Measure file size, encode speed and decode speed on Google Cloud. Try JPEG XL "all in one file" vs JPEG XL "separate files" vs "standard" compression. Try different chunks. Try different types of GCP VM, and GCP VM vs donatello SSD. When reading, try loading as many 512x512 image patches per second as possible: multi-process, with (chunks={}) and without (chunks=None) dask, etc. (see the timing sketch after this list).
  • Maybe create a separate little Python package for encoding multi-channel data in Zarr using JPEG XL. Or maybe fold all this into a PR for imagecodecs.
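
A rough timing harness for the "512x512 patches per second" benchmark. ZARR_PATH is a placeholder, and the dimension names follow the snippets later in this issue:

import time

import xarray as xr

ZARR_PATH = "path/to/satellite.zarr"  # placeholder

def patches_per_second(chunks, n_patches: int = 100) -> float:
    """Time sequential reads of 512x512 single-timestep patches."""
    ds = xr.open_dataset(ZARR_PATH, engine="zarr", chunks=chunks)
    t0 = time.perf_counter()
    for i in range(n_patches):
        ds.isel(
            time=i,
            y_geostationary=slice(0, 512),
            x_geostationary=slice(0, 512),
        ).load()
    return n_patches / (time.perf_counter() - t0)

# With dask (chunks={}) vs plain numpy-backed arrays (chunks=None).
for chunks in ({}, None):
    print(chunks, patches_per_second(chunks))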

Misc

References

I've asked questions about all these things on the JPEG XL reddit and one of the main JPEG XL developers has replied with lots of details!

JackKelly commented:

With the satellite data, I think one of the main challenges is creating chunks which are large enough to be read efficiently from cloud buckets, which means the chunks should be around 1 MByte each. At the moment, when using JPEG XL with 1 channel per chunk, the chunks are tiny (about 100 kB). So I think the key is to get all channels into a single chunk.

Hopefully this will be possible with the next release of imagecodecs and libjxl v0.7.

If not, a hacky solution might be to put the jxl chunks into a tar or zip file for storage.

Or, a simple solution might be to use bz2 compression and uint8 images (and multiple channels per chunk).

After reading a bit more about atmospheric physics, I'm pretty confident that we'll want to give all the satellite channels to our ML models, at least for the "pseudo-labeller". So it makes sense to put all channels into each chunk (because we'll be loading all channels).
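
A minimal sketch of the "simple solution" mentioned above (uint8 images, bz2 compression, all channels per chunk). The rescaling step, the output store name and the data-variable name "data" are assumptions; NON_HRV_ZARR_PATH is the same path used in the snippets below:

import numcodecs
import xarray as xr

ds = xr.open_dataset(NON_HRV_ZARR_PATH, engine="zarr", chunks={})

# Illustrative rescale to uint8 (real code should use per-channel scaling).
ds_uint8 = ((ds - ds.min()) / (ds.max() - ds.min()) * 255).astype("uint8")

# One timestep per chunk, 512x512 spatial tiles, and *all* channels in each chunk.
ds_uint8 = ds_uint8.chunk(
    {"time": 1, "y_geostationary": 512, "x_geostationary": 512, "variable": -1}
)

ds_uint8.to_zarr(
    "non_hrv_uint8_bz2.zarr",  # hypothetical output store
    encoding={"data": {"compressor": numcodecs.BZ2(level=9)}},  # "data" is an assumed variable name
)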

JackKelly commented Sep 16, 2022

A naive way of loading a complete batch of 32 examples × 24 satellite images per example × 512 pixels × 512 pixels × 11 channels takes 39 s per batch, using the v8 non-HRV Zarr on Google Cloud Storage with dask chunks={}. But that's only pulling about 15 MB/s over the network, and average CPU load doesn't exceed 24% on a 48-core VM.

Loading a single variable takes just 5.3s.

chunks="auto" gives similar behaviour. But takes longer (44s).

Using chunks=None saturates a single CPU core, and the network bandwidth peaks at more like 80 MB/s. But it takes far longer: 3 min 38 s to complete!

JackKelly commented Sep 16, 2022

Using multiple processes is faster. This takes 20 seconds. Avg CPU = 75%. Network = about 20 MB/s.

# Assumes NON_HRV_ZARR_PATH (the path to the v8 non-HRV Zarr) is defined elsewhere.
from concurrent.futures import ProcessPoolExecutor

import xarray as xr


def get_data(time_idx: int):
    # Open the Zarr lazily (dask-backed, chunks={}) and load one 512x512 crop for one timestep.
    non_hrv_ds = xr.open_dataset(NON_HRV_ZARR_PATH, mode="r", engine="zarr", chunks={})
    return non_hrv_ds.isel(
        time=time_idx,
        y_geostationary=slice(0, 512),
        x_geostationary=slice(0, 512),
    ).load()


# One worker process per CPU core by default; each call loads one timestep.
with ProcessPoolExecutor() as executor:
    i = 0
    for ds in executor.map(get_data, range(32 * 24)):
        i += 1
        print(i, flush=True, end=", ")

JackKelly commented Sep 16, 2022

The above code with chunks=None is even faster: it takes 17 s. Avg CPU = 50%. Network = 25-38 MB/s.

Loading a single variable takes 5.2s.

Setting max_workers higher or lower than the number of CPUs seems to hurt performance a little; the best seems to be max_workers = number of CPU cores.
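
For reference, a trivial way to pin the pool size to the core count (max_workers already defaults to os.cpu_count() for ProcessPoolExecutor, but making it explicit keeps the benchmark easy to vary):

import os
from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor(max_workers=os.cpu_count())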

JackKelly commented:

Using a hybrid of dask and ProcessPoolExecutor takes longer: 32.2s, and saturates the CPUs:

# Same imports and NON_HRV_ZARR_PATH as above. Each worker now loads a
# 24-timestep slice (one example) rather than a single timestep, so dask
# also parallelises within each process.
def get_data(time_idx: int):
    non_hrv_ds = xr.open_dataset(NON_HRV_ZARR_PATH, mode="r", engine="zarr", chunks={})
    return non_hrv_ds.isel(
        time=slice(time_idx * 24, (time_idx + 1) * 24),
        y_geostationary=slice(0, 512),
        x_geostationary=slice(0, 512),
        # variable=0,
    ).load()


with ProcessPoolExecutor() as executor:
    i = 0
    for ds in executor.map(get_data, range(32)):
        i += 1
        print(i, flush=True, end=", ")

JackKelly commented:

Maybe one solution is to include multiple timesteps per chunk, e.g. have three arrays: one holding the timesteps at minutes 00, 15, 30 and 45 past the hour (in roughly two-hour chunks); another at 05, 20, 35, 50; and a third at 10, 25, 40, 55.
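
A rough sketch of that idea, reusing NON_HRV_ZARR_PATH from the snippets above and assuming time is a datetime64 coordinate. Output store names are placeholders, and in practice the inherited chunk encoding may need clearing before writing:

import xarray as xr

ds = xr.open_dataset(NON_HRV_ZARR_PATH, engine="zarr", chunks={})

for offset in (0, 5, 10):
    minutes = [offset, offset + 15, offset + 30, offset + 45]
    mask = ds["time"].dt.minute.isin(minutes)
    subset = ds.sel(time=ds["time"][mask])
    # 8 timesteps at 15-minute spacing ≈ two hours per chunk.
    subset = subset.chunk({"time": 8})
    subset.to_zarr(f"non_hrv_minute_offset_{offset:02d}.zarr")  # placeholder store name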

jacobbieker commented:

Closing this as we use ocf_blosc2 now instead.
