
Updates to the way satip uses JPEG XL compression? #129

Closed · 2 of 17 tasks · JackKelly opened this issue Sep 13, 2022 · 7 comments
Assignee: JackKelly
Labels: enhancement (New feature or request)


JackKelly commented Sep 13, 2022

Motivation & aims

  • Make the satellite Zarr easy to sample from on-the-fly during ML training (without having to pre-prepare batches or pre-load a subset of the training data into system RAM).
  • Also ensure the data is as easy as possible to use for collaborators. Small size on disk. Relatively easy toolchain.

Plan

Encode multiple satellite channels into a single JPEG XL file

  • See if libjxl 0.6.1 can already encode arbitrary channels, and whether imagecodecs v2022.8.8 can already handle this if we pass in a 4-dimensional array (time, y, x, channels) with a single timestep, and also pass photometric=JXL_COLOR_SPACE_GRAY (see the sketch after this list). #131
  • Benchmark basic read performance with the existing Zarr. Write a simple multi-threaded app to load data as fast as possible. Try with and without dask, and compare to uint8 bz2 (two ways: one channel per chunk, and all channels in a single chunk).
  • Try different chunking with uint8 bzip2.
  • Ask if I can use the development version of imagecodecs, see libjxl v0.7rc1 cgohlke/imagecodecs#46
  • Try updating imagecodecs to libjxl v0.7rc. Submit a PR to imagecodecs if necessary.
  • Try encoding all non-HRV channels in a single jxl file.
  • Try encoding all satellite channels (including HRV) in a single jxl file, because jxl can express that additional channels might be at different scales (e.g. see the dim_shift member of the JxlExtraChannelInfo struct). This might be nice so we can avoid having tiny HRV-only chunks on disk, which are inefficient to load. But don't worry if this is too much work to integrate with Zarr and xarray. Note that non-HRV extends further east and west than HRV, so we probably want to crop the non-HRV data to the same shape as HRV.
  • Try encoding the "most visible" satellite channels as RGB, and the remaining channels as JXL_CHANNEL_OPTIONAL. Ensure JPEG XL doesn't try to do any colour space conversion! If JPEG XL insists on doing colour space conversion with RGB data, then just give it a single "GRAY" channel, and all remaining channels as OPTIONAL.
  • Encode missing data without the slightly hacky approach I previously suggested satip should use:
    • Try JXL_CHANNEL_SELECTION_MASK.
    • Try the alpha channel.
  • Try naming the channels within JPEG XL.
  • Try getting JPEG XL to use information from one channel to help compress another channel. Jon says: "JPEG XL can in fact use information across channels to better compress. However the current libjxl only does a limited amount of that - we haven't really been optimizing for images with more than just alpha (i.e. a single extra channel). You'll want to set the encode option JXL_ENC_FRAME_SETTING_MODULAR_NB_PREV_CHANNELS to a higher value than zero, and I think you probably also need to set effort to 8 or 9 to actually make use of this."
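
A minimal sketch of the first checklist item above, assuming the imagecodecs Python API roughly as documented. The exact keyword names (photometric, lossless, effort) and the spelling of the greyscale colour-space value may differ between imagecodecs releases, so treat this as illustrative rather than a confirmed call signature:

import numpy as np
import imagecodecs

# One timestep of all 11 non-HRV channels, already scaled to uint8:
# shape = (time, y, x, channels) = (1, 512, 512, 11).
block = np.random.randint(0, 256, size=(1, 512, 512, 11), dtype=np.uint8)

# Ask for a greyscale colour space so libjxl doesn't attempt any RGB
# colour-space conversion (JXL_COLOR_SPACE_GRAY in libjxl terms).
encoded = imagecodecs.jpegxl_encode(
    block,
    photometric="gray",  # assumption: the wrapper maps this to JXL_COLOR_SPACE_GRAY
    lossless=True,
    effort=8,            # per Jon's advice, effort 8-9 is needed for cross-channel modelling
)

decoded = imagecodecs.jpegxl_decode(encoded)
print(len(encoded), decoded.shape, decoded.dtype)

Exposing JXL_ENC_FRAME_SETTING_MODULAR_NB_PREV_CHANNELS from Python would probably need the libjxl v0.7 update and/or the imagecodecs PR mentioned above.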

Benchmarking, compatibility, integration with Zarr & xarray

  • Try opening the chunks in a naive JPEG XL viewer
  • Benchmark. Measure file size, encode speed and decode speed on Google Cloud. Try JPEG XL "all in one file" vs JPEG XL "separate files" vs "standard" compression. Try different chunks. Try different types of GCP VM, and GCP VM vs donatello SSD. When reading, try loading as many 512x512 image patches per second as possible: multi-process, with (chunks={}) and without (chunks=None) dask, etc. (see the timing sketch after this list).
  • Maybe create a separate little Python package for encoding multi-channel data in Zarr using JPEG XL. Or maybe fold all this into a PR for imagecodecs.
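
A rough timing harness for the "512x512 patches per second" benchmark. ZARR_PATH is a placeholder, and the dimension names follow the snippets later in this issue:

import time

import xarray as xr

ZARR_PATH = "path/to/satellite.zarr"  # placeholder

def patches_per_second(chunks, n_patches: int = 100) -> float:
    """Time sequential reads of 512x512 single-timestep patches."""
    ds = xr.open_dataset(ZARR_PATH, engine="zarr", chunks=chunks)
    t0 = time.perf_counter()
    for i in range(n_patches):
        ds.isel(
            time=i,
            y_geostationary=slice(0, 512),
            x_geostationary=slice(0, 512),
        ).load()
    return n_patches / (time.perf_counter() - t0)

# With dask (chunks={}) vs plain numpy-backed arrays (chunks=None).
for chunks in ({}, None):
    print(chunks, patches_per_second(chunks))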

Misc

References

I've asked questions about all these things on the JPEG XL reddit and one of the main JPEG XL developers has replied with lots of details!

JackKelly commented:

With the satellite data, I think one of the main challenges is creating chunks which are large enough to be read efficiently from cloud buckets, which means the chunks should be around 1 MByte each. At the moment, when using JPEG XL with 1 channel per chunk, the chunks are tiny (about 100 kB). So I think the key is to get all channels into a single chunk.

Hopefully this will be possible with the next release of imagecodecs and libjxl v0.7.

If not, a hacky solution might be to put the jxl chunks into a tar or zip file for storage.

Or, a simple solution might be to use bz2 compression and uint8 images (and multiple channels per chunk).

After reading a bit more about atmospheric physics, I'm pretty confident that we'll want to give all the satellite channels to our ML models, at least for the "pseudo-labeller". So it makes sense to put all channels into each chunk (because we'll be loading all channels).
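
A minimal sketch of the "simple solution" mentioned above (uint8 images, bz2 compression, all channels per chunk). The rescaling step, the output store name and the data-variable name "data" are assumptions; NON_HRV_ZARR_PATH is the same path used in the snippets below:

import numcodecs
import xarray as xr

ds = xr.open_dataset(NON_HRV_ZARR_PATH, engine="zarr", chunks={})

# Illustrative rescale to uint8 (real code should use per-channel scaling).
ds_uint8 = ((ds - ds.min()) / (ds.max() - ds.min()) * 255).astype("uint8")

# One timestep per chunk, 512x512 spatial tiles, and *all* channels in each chunk.
ds_uint8 = ds_uint8.chunk(
    {"time": 1, "y_geostationary": 512, "x_geostationary": 512, "variable": -1}
)

ds_uint8.to_zarr(
    "non_hrv_uint8_bz2.zarr",  # hypothetical output store
    encoding={"data": {"compressor": numcodecs.BZ2(level=9)}},  # "data" is an assumed variable name
)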

JackKelly commented Sep 16, 2022

A naive way of loading a complete batch of 32 examples × 24 satellite images per example × 512 pixels × 512 pixels × 11 channels takes 39 s per batch, using the v8 non-HRV Zarr on Google Cloud Storage with dask chunks={}. But that's only pulling about 15 MB/s over the network, and average CPU load doesn't exceed 24% on a 48-core VM.

Loading a single variable takes just 5.3s.

chunks="auto" gives similar behaviour. But takes longer (44s).

Using chunks=None saturates a single CPU core, and the network bandwidth peaks at more like 80 MB/s. But it takes far longer: 3 min 38 s to complete!

JackKelly commented Sep 16, 2022

Using multiple processes is faster. This takes 20 seconds. Avg CPU = 75%. Network = about 20 MB/s.

# Assumes NON_HRV_ZARR_PATH (the path to the v8 non-HRV Zarr) is defined elsewhere.
from concurrent.futures import ProcessPoolExecutor

import xarray as xr


def get_data(time_idx: int):
    # Open the Zarr lazily (dask-backed, chunks={}) and load one 512x512 crop for one timestep.
    non_hrv_ds = xr.open_dataset(NON_HRV_ZARR_PATH, mode="r", engine="zarr", chunks={})
    return non_hrv_ds.isel(
        time=time_idx,
        y_geostationary=slice(0, 512),
        x_geostationary=slice(0, 512),
    ).load()


# One worker process per CPU core by default; each call loads one timestep.
with ProcessPoolExecutor() as executor:
    i = 0
    for ds in executor.map(get_data, range(32 * 24)):
        i += 1
        print(i, flush=True, end=", ")

JackKelly commented Sep 16, 2022

The above code with chunks=None is even faster: it takes 17 s. Avg CPU = 50%. Network = 25-38 MB/s.

Loading a single variable takes 5.2s.

Setting max_workers higher or lower than the number of CPUs seems to hurt performance a little; the best seems to be max_workers = number of CPU cores.
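
For reference, a trivial way to pin the pool size to the core count (max_workers already defaults to os.cpu_count() for ProcessPoolExecutor, but making it explicit keeps the benchmark easy to vary):

import os
from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor(max_workers=os.cpu_count())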

JackKelly commented:

Using a hybrid of dask and ProcessPoolExecutor takes longer: 32.2s, and saturates the CPUs:

# Same imports and NON_HRV_ZARR_PATH as above. Each worker now loads a
# 24-timestep slice (one example) rather than a single timestep, so dask
# also parallelises within each process.
def get_data(time_idx: int):
    non_hrv_ds = xr.open_dataset(NON_HRV_ZARR_PATH, mode="r", engine="zarr", chunks={})
    return non_hrv_ds.isel(
        time=slice(time_idx * 24, (time_idx + 1) * 24),
        y_geostationary=slice(0, 512),
        x_geostationary=slice(0, 512),
        # variable=0,
    ).load()


with ProcessPoolExecutor() as executor:
    i = 0
    for ds in executor.map(get_data, range(32)):
        i += 1
        print(i, flush=True, end=", ")

JackKelly commented:

Maybe one solution is to include multiple timesteps per chunk, e.g. have three arrays: one holding the timesteps at minutes 00, 15, 30 and 45 past the hour (in roughly two-hour chunks); another at 05, 20, 35, 50; and a third at 10, 25, 40, 55.
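
A rough sketch of that idea, reusing NON_HRV_ZARR_PATH from the snippets above and assuming time is a datetime64 coordinate. Output store names are placeholders, and in practice the inherited chunk encoding may need clearing before writing:

import xarray as xr

ds = xr.open_dataset(NON_HRV_ZARR_PATH, engine="zarr", chunks={})

for offset in (0, 5, 10):
    minutes = [offset, offset + 15, offset + 30, offset + 45]
    mask = ds["time"].dt.minute.isin(minutes)
    subset = ds.sel(time=ds["time"][mask])
    # 8 timesteps at 15-minute spacing ≈ two hours per chunk.
    subset = subset.chunk({"time": 8})
    subset.to_zarr(f"non_hrv_minute_offset_{offset:02d}.zarr")  # placeholder store name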

jacobbieker commented:

Closing this as we use ocf_blosc2 now instead.
