
Improve reslc example scripts #25

Open · wants to merge 7 commits into main
Conversation

rogerkuou (Member)

Fix #22

Added the following parts to the reslc example scripts:

  1. add mother image
  2. add h2ph
  3. add time coords
  4. add lat and lon

@rogerkuou rogerkuou marked this pull request as ready for review October 2, 2024 14:31
@rogerkuou rogerkuou changed the title Add to example script: Improve reslc example scripts Oct 2, 2024
@rogerkuou (Member Author)

Hi @FreekvanLeijen, could you please review this example?

In this PR I added the h2ph and the mother SLC to the output of the reslc process. In addition, the temporal info and the lat/lon coordinates are included. There is a successfully executed example in /project/caroline/Share/share-oku/PyDePSI/debug.

@FreekvanLeijen left a comment

Hi Ou, looks nice. Two small requests from my side.

@rogerkuou (Member Author)

Hi @FreekvanLeijen, thanks for the review! I applied the comments you gave. Can you check again and see if you have further comments?

@rogerkuou (Member Author)

rogerkuou commented Oct 7, 2024

Hi @FreekvanLeijen, regarding your comment on the precision loss when saving h2ph in float16, I will transfer the conversation here for proper documentation.

And I will also add @SarahAlidoost and @fnattino here to see if they also have comments on this.

To summarize the problem for everyone: this PR adds an example script for PyDePSI. At the end of the script, we save a 3D array h2ph with dimensions (azimuth, range, time) in float16 to Zarr. Originally this array was saved as a binary file in float32. At present, we have a precision loss on the scale of ~1e-8, which is not acceptable according to Freek.

@FreekvanLeijen unfortunately I think your proposal of multiplying by a factor does not work. I found this explanation, which roughly explains why.

However, considering that the h2ph values per image (i.e. per instance along the time dimension) vary within a small range, I did an experiment applying "data normalization" to the float16 data. The idea: calculate an offset (the mean value) and a scale (a multiplication factor) per image, subtract the offset, then multiply by the scale. This converts all h2ph values to numbers close to zero, preserving the decimal digits beyond the offset value. Since we can store the offset and scale at high precision, we can reconstruct the values with relatively small precision loss. When I tested on a small crop (azimuth 2000, range 4000), the precision loss was about 1e-10.
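The per-image normalization described above can be sketched roughly as follows. This is a minimal illustration with made-up array shapes and simulated values, not the actual experiment code; the variable names are assumptions:

```python
import numpy as np

# Simulated h2ph-like stack (azimuth, range, time): values per image vary
# within a small range around a per-image mean. Shapes and magnitudes are
# hypothetical, chosen only to illustrate the idea.
rng = np.random.default_rng(42)
h2ph = (1e-4 + rng.normal(0.0, 1e-6, size=(100, 200, 5))).astype(np.float32)

# One offset (mean) and one scale per image, i.e. per time step
offset = h2ph.mean(axis=(0, 1), keepdims=True)
scale = 1.0 / np.abs(h2ph - offset).max(axis=(0, 1), keepdims=True)

# Normalize to values near zero, then cast to float16 for storage
encoded = ((h2ph - offset) * scale).astype(np.float16)

# Reconstruct using the high-precision offset and scale
decoded = encoded.astype(np.float32) / scale + offset

max_err = np.abs(decoded - h2ph).max()
```

Because the normalized values sit in [-1, 1], the float16 rounding error is divided back down by the (large) scale on reconstruction, which is why the loss ends up far below that of a direct float16 cast.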

Attached is a visualization of the differences in h2ph between float32 and float16:

[figure: difference between float32 and float16 h2ph]

I am not sure if this precision loss of 1e-10 is acceptable. If yes, then there are still two drawbacks of this solution:

  1. the precision loss will not be homogeneous per image, since we use a single offset value. In this case I use the average, so the lowest precision loss occurs around the middle of the value range. I am not sure if the precision will be acceptable when we apply this to the full scale.
  2. we need to save an offset and a scale per image (in the time dimension). I did not find an encoding solution to specify this offset and scale in a standard way. There are "add_offset" and "scale_factor" keywords, but they apply a single value to the entire variable. We could write our own customized encoding, though. Maybe @SarahAlidoost can comment more on this?

On the other hand, for the same 2000x4000x409 SLC stack, saving h2ph in float32 costs an extra 3 GB of storage. Scaled to the entire stack, I expect an extra 300~400 GB.

```
Apptainer> du -h --max-depth=1 slcs*
3.0G    slcs_h2ph_float32.zarr/h2ph
6.1G    slcs_h2ph_float32.zarr/imag
3.0K    slcs_h2ph_float32.zarr/time
6.5M    slcs_h2ph_float32.zarr/lon
2.0K    slcs_h2ph_float32.zarr/range
1.5K    slcs_h2ph_float32.zarr/azimuth
2.4M    slcs_h2ph_float32.zarr/lat
6.1G    slcs_h2ph_float32.zarr/real
16G     slcs_h2ph_float32.zarr
53M     slcs.zarr/h2ph
6.1G    slcs.zarr/imag
2.0K    slcs.zarr/time
6.5M    slcs.zarr/lon
2.0K    slcs.zarr/range
1.5K    slcs.zarr/azimuth
2.4M    slcs.zarr/lat
6.1G    slcs.zarr/real
13G     slcs.zarr
```

With this info, at present I would still recommend saving h2ph in np.float32, since @FreekvanLeijen you also mentioned that we are working with a temporary solution here, and the gain would be small.

Attaching the notebook I used to run the experiment, in case you would like to check the details:
inspect_h2ph.zip

@Simon-van-Diepen left a comment

@rogerkuou going through the code of the example I noticed two things:

  1. packages datetime (from datetime import datetime), dask.array (as da) and xarray (as xr) are not imported
  2. I think slcs_output should be written to zarr instead of slcs_recon

@Simon-van-Diepen left a comment

Please add to the imports:

```python
from datetime import datetime
import xarray as xr
import dask.array as da
```

@rogerkuou (Member Author)

Hi @Simon-van-Diepen, thanks for the comments. The missing imports have been added. Can you check again?

@Simon-van-Diepen

Hi @rogerkuou , I tested the appending of the time dimension on a live stack, and got some unexpected behaviour. In summary:

  • first run on stack of 200 images --> time axis is 200 long (generated by mode = "w")
  • second run on stack of 200 original images + 1 new --> time axis is 401 long (generated by mode="a")

Append thus does not check for duplicates, and simply appends everything. I am currently testing the behaviour of mode="w" in case the stack already exists.

@Simon-van-Diepen

Simon-van-Diepen commented Oct 17, 2024

Hi @rogerkuou , mode="w" works, but I found another bug. If you print the time axis of the resulting zarr, the following shows:

```
    time     datetime64[ns] 2020-03-22
<xarray.DataArray 'time' ()>
array('2020-03-28T00:00:00.000000000', dtype='datetime64[ns]')
Coordinates:
    time     datetime64[ns] 2020-03-28
<xarray.DataArray 'time' ()>
array('2020-03-28T00:00:00.000000000', dtype='datetime64[ns]')
Coordinates:
    time     datetime64[ns] 2020-03-28
<xarray.DataArray 'time' ()>
array('2020-04-03T00:00:00.000000000', dtype='datetime64[ns]')
Coordinates:
    time     datetime64[ns] 2020-04-03
<xarray.DataArray 'time' ()>
array('2020-04-09T00:00:00.000000000', dtype='datetime64[ns]')
```

2020-03-28 is the mother and now appears twice. Can you have a look at what causes this?

@Simon-van-Diepen

@rogerkuou I found that replacing line 165 by

```python
slcs_output = xr.concat([slc_recon_output, slc_mother], dim="time").drop_duplicates(dim="time", keep="last").sortby("time")
```

properly removes the duplicated mother image.

@rogerkuou (Member Author)

rogerkuou commented Oct 21, 2024

Hi @Simon-van-Diepen, thanks for the feedback! I took a deeper look and thought about this a bit more. My opinion is that we should invest a bit more in mode="w" and mode="a".

On the one hand, I am glad mode="w" works. Ideally, if Zarr is smart enough to recognize the existing time coordinates and only write the new image, this could be our ultimate solution. On the other hand, if Zarr is not, we are in a situation where we rewrite the whole stack every time a new image comes in, which is really not preferred.

I will experiment whether it helps to read only the binary files (phase, h2ph) of the new image and the mother image, and read the existing images from Zarr.

I need a bit more time to investigate this, probably after I come back from holiday (Nov 11). I will keep this PR open for now.

@Simon-van-Diepen

Hi @rogerkuou , I agree we should investigate whether we can make mode='a' work. If Zarr truly does not check for duplicate coordinates, perhaps we could do that ourselves: if a zarr store exists, check which timestamps it contains and drop those from the xarray dataset before running to_zarr with mode='a'.

@FreekvanLeijen left a comment

Ou, one more small request. Rest is fine.

Successfully merging this pull request may close these issues.

Improve script_reslc.py
3 participants