
Improve reslc example scripts #25

Open · wants to merge 7 commits into main
Conversation

rogerkuou (Member)

Fix #22

Added the following parts to the reslc example scripts:

  1. add mother image
  2. add h2ph
  3. add time coords
  4. add lat and lon

@rogerkuou rogerkuou marked this pull request as ready for review October 2, 2024 14:31
@rogerkuou rogerkuou changed the title Add to example script: Improve reslc example scripts Oct 2, 2024
@rogerkuou (Member Author)

Hi @FreekvanLeijen, could you please review this example?

In this PR I added the h2ph and the mother SLC to the output of the reslc process. In addition, the temporal info and the lat/lon coordinates are included. There is a successfully executed example in /project/caroline/Share/share-oku/PyDePSI/debug.

@FreekvanLeijen left a comment

Hi Ou, looks nice. Two small requests from my side.

@rogerkuou (Member Author)

Hi @FreekvanLeijen, thanks for the review! I applied the comments you gave. Can you check again and see if you have further comments?

@rogerkuou (Member Author)

rogerkuou commented Oct 7, 2024

Hi @FreekvanLeijen, regarding your comment on the precision loss when saving h2ph in float16, I will transfer the conversation here for proper documentation.

And I will also add @SarahAlidoost and @fnattino here to see if they also have comments on this.

To summarize the problem for everyone: this PR adds an example script for PyDePSI. At the end of the script, we save a 3D array h2ph with dimensions (azimuth, range, time) in float16 to Zarr. Originally this array was saved as a binary file in float32. At present, we have a precision loss on the scale of ~1e-8, which is not acceptable according to Freek.

@FreekvanLeijen unfortunately I think your proposal of multiplying by a factor does not work. I found this explanation, which roughly explains why.

However, considering that the h2ph values per image (i.e. per instance along the time dimension) vary within a small range, I did an experiment applying "data normalization" to the float16 data. The idea: calculate an offset (the mean value) and a scale (a multiplication factor) per image, subtract the offset, then multiply by the scale. This converts all h2ph values to numbers close to zero, preserving the decimal digits beyond the offset value. Since we can store the offset and scale at high precision, we can reconstruct the values with relatively small precision loss. When I tested on a small crop (azimuth 2000, range 4000), the precision loss was about 1e-10.
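The per-image normalization described above can be sketched roughly as follows. This is a minimal illustration with made-up array shapes and simulated values, not the actual experiment code; the variable names are assumptions:

```python
import numpy as np

# Simulated h2ph-like stack (azimuth, range, time): values per image vary
# within a small range around a per-image mean. Shapes and magnitudes are
# hypothetical, chosen only to illustrate the idea.
rng = np.random.default_rng(42)
h2ph = (1e-4 + rng.normal(0.0, 1e-6, size=(100, 200, 5))).astype(np.float32)

# One offset (mean) and one scale per image, i.e. per time step
offset = h2ph.mean(axis=(0, 1), keepdims=True)
scale = 1.0 / np.abs(h2ph - offset).max(axis=(0, 1), keepdims=True)

# Normalize to values near zero, then cast to float16 for storage
encoded = ((h2ph - offset) * scale).astype(np.float16)

# Reconstruct using the high-precision offset and scale
decoded = encoded.astype(np.float32) / scale + offset

max_err = np.abs(decoded - h2ph).max()
```

Because the normalized values sit in [-1, 1], the float16 rounding error is divided back down by the (large) scale on reconstruction, which is why the loss ends up far below that of a direct float16 cast.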

Attached is a visualization of the differences in h2ph between float32 and float16:

[figure: difference between float32 and float16 h2ph]

I am not sure if this precision loss of 1e-10 is acceptable. If yes, then there are still two drawbacks of this solution:

  1. the precision loss will not be homogeneous per image, since we use a single offset value. In this case I use the average, so the lowest precision loss occurs around the middle of the value range. I am not sure if the precision will be acceptable when we apply this to the full scale.
  2. we need to save an offset and a scale per image (in the time dimension). I did not find an encoding solution to specify this offset and scale in a standard way. There are "add_offset" and "scale_factor" keywords, but they apply a single value to the entire variable. We could write our own customized encoding, though. Maybe @SarahAlidoost can comment more on this?

On the other hand, for the same 2000x4000x409 SLC stack, saving h2ph in float32 costs an extra 3 GB of storage. Scaled to the entire stack, I expect an extra 300~400 GB.

```
Apptainer> du -h --max-depth=1 slcs*
3.0G    slcs_h2ph_float32.zarr/h2ph
6.1G    slcs_h2ph_float32.zarr/imag
3.0K    slcs_h2ph_float32.zarr/time
6.5M    slcs_h2ph_float32.zarr/lon
2.0K    slcs_h2ph_float32.zarr/range
1.5K    slcs_h2ph_float32.zarr/azimuth
2.4M    slcs_h2ph_float32.zarr/lat
6.1G    slcs_h2ph_float32.zarr/real
16G     slcs_h2ph_float32.zarr
53M     slcs.zarr/h2ph
6.1G    slcs.zarr/imag
2.0K    slcs.zarr/time
6.5M    slcs.zarr/lon
2.0K    slcs.zarr/range
1.5K    slcs.zarr/azimuth
2.4M    slcs.zarr/lat
6.1G    slcs.zarr/real
13G     slcs.zarr
```

With this info, at present I would still recommend saving h2ph in np.float32, since @FreekvanLeijen you also mentioned that we are working with a temporary solution here, and the gain would be small.

Attaching the notebook I used to run the experiment, in case you would like to check the details:
inspect_h2ph.zip

@Simon-van-Diepen left a comment

@rogerkuou going through the code of the example I noticed two things:

  1. packages datetime (from datetime import datetime), dask.array (as da) and xarray (as xr) are not imported
  2. I think slcs_output should be written to zarr instead of slcs_recon

@Simon-van-Diepen left a comment

Please add to the imports:

```python
from datetime import datetime
import xarray as xr
import dask.array as da
```

@rogerkuou (Member Author)

Hi @Simon-van-Diepen, thanks for the comments. The missing imports have been added. Can you check again?

@Simon-van-Diepen

Hi @rogerkuou , I tested the appending of the time dimension on a live stack, and got some unexpected behaviour. In summary:

  • first run on stack of 200 images --> time axis is 200 long (generated by mode = "w")
  • second run on stack of 200 original images + 1 new --> time axis is 401 long (generated by mode="a")

Append thus does not check for duplicates, and simply appends everything. I am currently testing the behaviour of mode="w" in case the stack already exists.

@Simon-van-Diepen

Simon-van-Diepen commented Oct 17, 2024

Hi @rogerkuou , mode="w" works, but I found another bug. If you print the time axis of the resulting zarr, the following shows:

```
    time     datetime64[ns] 2020-03-22
<xarray.DataArray 'time' ()>
array('2020-03-28T00:00:00.000000000', dtype='datetime64[ns]')
Coordinates:
    time     datetime64[ns] 2020-03-28
<xarray.DataArray 'time' ()>
array('2020-03-28T00:00:00.000000000', dtype='datetime64[ns]')
Coordinates:
    time     datetime64[ns] 2020-03-28
<xarray.DataArray 'time' ()>
array('2020-04-03T00:00:00.000000000', dtype='datetime64[ns]')
Coordinates:
    time     datetime64[ns] 2020-04-03
<xarray.DataArray 'time' ()>
array('2020-04-09T00:00:00.000000000', dtype='datetime64[ns]')
```

2020-03-28 is the mother and now appears twice. Can you have a look at what causes this?

@Simon-van-Diepen

@rogerkuou I found that replacing line 165 by

```python
slcs_output = xr.concat([slc_recon_output, slc_mother], dim="time").drop_duplicates(dim="time", keep="last").sortby("time")
```

properly removes the duplicated mother image.

@rogerkuou (Member Author)

rogerkuou commented Oct 21, 2024

Hi @Simon-van-Diepen, thanks for the feedback! I took a deeper look and thought about this a bit more. My opinion is that we should invest a bit more in mode="w" and mode="a".

On the one hand, I am glad mode="w" works. Ideally, if Zarr is smart enough to recognize the existing time coordinates and only write the new image, this could be our ultimate solution. On the other hand, if Zarr is not, we are in a situation where we rewrite the whole stack every time a new image comes in, which is really not preferred.

I will experiment whether it helps to read only the binary files (phase, h2ph) of the new image and the mother image, and read the existing images from Zarr.

I need a bit more time to investigate this, probably after I come back from holiday (Nov 11). I will keep this PR open for now.

@Simon-van-Diepen

Hi @rogerkuou , I agree we should investigate whether we can make mode='a' work. If Zarr truly does not check for duplicate coordinates, perhaps we could do that ourselves: if a zarr store exists, check which timestamps it contains and drop those from the xarray dataset before running to_zarr with mode='a'.

@FreekvanLeijen left a comment

Ou, one more small request. Rest is fine.

Successfully merging this pull request may close these issues.

Improve script_reslc.py
3 participants