Updates to the way satip uses JPEG XL compression? #129
Comments
With the satellite data, I think one of the main challenges is creating chunks that are large enough to be read efficiently from cloud buckets, which means the chunks should be around 1 MByte each. At the moment, when using JPEG XL with 1 channel per chunk, the chunks are tiny (100 kB). So I think the key is to get all channels into a single chunk. Hopefully this will be possible with the next release of imagecodecs and libjxl v0.7. If not, a hacky solution might be to put the [...]. Or, a simple solution might be to use [...].
After reading a bit more about atmospheric physics, I'm pretty confident that we'll want to give all the satellite channels to our ML models, at least for the "pseudo-labeller". So it makes sense to put all channels into each chunk (because we'll be loading all channels).
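As a rough back-of-the-envelope check of the chunk-size argument, here is a minimal sketch. It assumes the compressed size grows roughly linearly with the number of channels per chunk, which is only an assumption (JPEG XL may do better or worse on correlated channels). The 100 kB and 1 MByte figures come from the comment; the function name and the choice of 11 channels (the non-HRV channel count used in the benchmark later in this thread) are illustrative.

```python
TARGET_CHUNK_BYTES = 1_000_000        # ~1 MByte target quoted in the comment
SINGLE_CHANNEL_CHUNK_BYTES = 100_000  # ~100 kB observed with 1 channel per chunk


def estimated_chunk_bytes(
    n_channels: int, bytes_per_channel: int = SINGLE_CHANNEL_CHUNK_BYTES
) -> int:
    """Estimate the compressed chunk size, ASSUMING linear scaling with channels."""
    return n_channels * bytes_per_channel


if __name__ == "__main__":
    for n_channels in (1, 11):
        size = estimated_chunk_bytes(n_channels)
        print(
            f"{n_channels:2d} channel(s): ~{size / 1e6:.1f} MB, "
            f"meets target: {size >= TARGET_CHUNK_BYTES}"
        )
```

Under that assumption, putting all 11 channels into one chunk would land close to the 1 MByte target without changing the spatial chunk shape.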
A naive way of loading a complete batch of 32 examples x 24 satellite images per example x 512 pixels x 512 pixels x 11 channels takes 39 s per batch, using the v8 non-HRV Zarr on Google Cloud Storage. But that only pulls about 15 MB/s over the network, and the average CPU load doesn't exceed 24% on a 48-core VM.
Using dask [...]. Loading a single variable takes just 5.3 s.
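To put those numbers in perspective, the sketch below works out how many values one batch contains and roughly how many bytes crossed the network, using only the figures quoted in the comment. The function names are illustrative. The bytes-per-value figure is only indicative, because whole chunks are fetched even when only a 512 x 512 crop is requested, and the 15 MB/s is an approximate average.

```python
from math import prod

# Figures quoted in the comment.
BATCH_SHAPE = (32, 24, 512, 512, 11)  # examples, timesteps, y, x, channels
LOAD_SECONDS = 39
NETWORK_BYTES_PER_SECOND = 15e6       # "about 15 MB/sec"


def n_values(shape) -> int:
    """Number of array elements in a batch of the given shape."""
    return prod(shape)


def approx_bytes_transferred(bytes_per_second: float, seconds: float) -> float:
    """Rough network volume, assuming the quoted rate held for the whole load."""
    return bytes_per_second * seconds


if __name__ == "__main__":
    values = n_values(BATCH_SHAPE)
    transferred = approx_bytes_transferred(NETWORK_BYTES_PER_SECOND, LOAD_SECONDS)
    print(f"values per batch: {values:,}")
    print(f"approx. bytes over the network: {transferred / 1e6:.0f} MB")
    print(f"approx. network bytes per requested value: {transferred / values:.2f}")
```

Both the network rate and the CPU load are well below what the VM can sustain, which suggests the naive load is limited by latency or a lack of parallelism rather than by raw bandwidth or compute.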
Using [...]
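Using multiple processes is faster. This takes 20 seconds. Avg CPU = 75%. Network = about 20 MB/s.

    from concurrent.futures import ProcessPoolExecutor

    import xarray as xr

    # NON_HRV_ZARR_PATH (the path to the non-HRV Zarr store) is assumed to be defined elsewhere.

    def get_data(time_idx: int):
        # Open the dataset lazily in each worker process, then load a single timestep.
        non_hrv_ds = xr.open_dataset(NON_HRV_ZARR_PATH, mode="r", engine="zarr", chunks={})
        return non_hrv_ds.isel(
            time=time_idx,
            y_geostationary=slice(0, 512),
            x_geostationary=slice(0, 512),
        ).load()

    # One task per timestep: 32 examples x 24 timesteps.
    with ProcessPoolExecutor() as executor:
        i = 0
        for ds in executor.map(get_data, range(32*24)):
            i += 1
            print(i, flush=True, end=", ")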
The above code with [...]. Loading a single variable takes 5.2 s. Using more or less [...]
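Using a hybrid of dask and [...]

    from concurrent.futures import ProcessPoolExecutor

    import xarray as xr

    # NON_HRV_ZARR_PATH (the path to the non-HRV Zarr store) is assumed to be defined elsewhere.

    def get_data(time_idx: int):
        # Each task loads one whole example (24 consecutive timesteps) via dask.
        non_hrv_ds = xr.open_dataset(NON_HRV_ZARR_PATH, mode="r", engine="zarr", chunks={})
        return non_hrv_ds.isel(
            time=slice(time_idx*24, (time_idx+1)*24),
            y_geostationary=slice(0, 512),
            x_geostationary=slice(0, 512),
            # variable=0,
        ).load()

    # One task per example: 32 examples per batch.
    with ProcessPoolExecutor() as executor:
        i = 0
        for ds in executor.map(get_data, range(32)):
            i += 1
            print(i, flush=True, end=", ")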
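The difference between the two process-based snippets is only in how the 32 x 24 timesteps are split into tasks: the first submits 768 single-timestep tasks, the hybrid one submits 32 tasks of 24 timesteps each. The sketch below illustrates that partitioning without touching any data. The helper names are hypothetical and not part of satip.

```python
EXAMPLES_PER_BATCH = 32
TIMESTEPS_PER_EXAMPLE = 24


def time_slice_for_example(
    example_idx: int, timesteps_per_example: int = TIMESTEPS_PER_EXAMPLE
) -> slice:
    """Return the slice of time indices that one worker task loads."""
    start = example_idx * timesteps_per_example
    return slice(start, start + timesteps_per_example)


def covered_time_indices(
    n_examples: int = EXAMPLES_PER_BATCH,
    timesteps_per_example: int = TIMESTEPS_PER_EXAMPLE,
) -> list:
    """List every time index touched across all tasks, in task order."""
    indices = []
    for example_idx in range(n_examples):
        s = time_slice_for_example(example_idx, timesteps_per_example)
        indices.extend(range(s.start, s.stop))
    return indices


if __name__ == "__main__":
    covered = covered_time_indices()
    # Every timestep is loaded exactly once, with no gaps or overlaps.
    print(len(covered), covered == list(range(len(covered))))
```

Fewer, larger tasks mean fewer dataset opens per batch, at the cost of coarser load balancing across processes.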
Maybe one solution is to include multiple timesteps per chunk. E.g. have three arrays: one with two hours at 00, 15, 30, and 45; another at 05, 20, 35, and 50; and a third at 10, 25, 40, and 55.
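A minimal sketch of how timestamps could be routed to the three arrays follows. It assumes a regular 5-minute cadence and reads "two hours" as the time span covered by one chunk, which would give 8 timesteps per chunk at a 15-minute stride; both are assumptions, and the function name is hypothetical.

```python
from datetime import datetime

ARRAY_STRIDE_MINUTES = 15  # each array holds every third 5-minute image
CHUNK_SPAN_MINUTES = 120   # ASSUMPTION: "two hours" is the span of one chunk
TIMESTEPS_PER_CHUNK = CHUNK_SPAN_MINUTES // ARRAY_STRIDE_MINUTES


def array_index_for_timestamp(t: datetime) -> int:
    """Return 0 for minutes 00/15/30/45, 1 for 05/20/35/50, 2 for 10/25/40/55."""
    if t.minute % 5 != 0:
        raise ValueError(f"timestamp {t} is not on the 5-minute grid")
    return (t.minute % ARRAY_STRIDE_MINUTES) // 5


if __name__ == "__main__":
    for minute in (0, 5, 10, 15, 20, 55):
        t = datetime(2022, 1, 1, 12, minute)
        print(t.strftime("%H:%M"), "-> array", array_index_for_timestamp(t))
    print("timesteps per chunk:", TIMESTEPS_PER_CHUNK)
```

With this layout, a model that consumes images at a 15-minute stride reads from a single array, and each chunk read returns several useful timesteps instead of one.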
Closing this as we use [...]
Motivation & aims
Plan
Encode multiple satellite channels into a single JPEG XL file
- [...] libjxl 0.6.1 can already encode arbitrary channels, and that imagecodecs v2022.8.8 can already handle this if we pass in an array with 4 dimensions (time, y, x, channels), where the number of timesteps is 1, and also pass in photometric=JXL_COLOR_SPACE_GRAY. #131
- [...] uint8 [...] bzip2.
- [...] imagecodecs, see libjxl v0.7rc1 cgohlke/imagecodecs#46
- [...] imagecodecs to libjxl v0.7rc. Submit a PR to imagecodecs if necessary.
- [...] jxl file.
- [...] jxl file, because jxl can express that additional channels might be at different scales (e.g. see the dim_shift member of the JxlExtraChannelInfo struct). This might be nice so we can avoid having tiny HRV-only chunks on disk, which are inefficient to load. But don't worry if this is too much work to integrate with Zarr and xarray. Note that non-HRV extends further east and west than HRV, so we probably want to crop the non-HRV to the same shape as HRV.
- [...] JXL_CHANNEL_OPTIONAL. Ensure JPEG XL doesn't try to do any colour space conversion! If JPEG XL insists on doing colour space conversion with RGB data, then just give it a single "GRAY" channel, and all remaining channels as OPTIONAL.
- [...] JXL_CHANNEL_SELECTION_MASK.
Benchmarking, compatibility, integration with Zarr & xarray
- [...] (chunks={}) and without (chunks=None) dask, etc.
- [...] imagecodecs.
Misc
- [...] conda version of imagecodecs (instead of installing using pip) (see code comment in satip/environment.yml)? #130
References
I've asked questions about all these things on the JPEG XL reddit and one of the main JPEG XL developers has replied with lots of details!