Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support lazy-loaded EK80 broadband-complex data #1311

Merged
merged 12 commits into from
Jun 11, 2024

Conversation

leewujung
Copy link
Member

@leewujung leewujung commented Apr 24, 2024

This PR addresses the issues with for "Indexing with a boolean dask array is not allowed" encountered when calibrating EK80 complex data that is lazy-loaded.

This issue essentially calls for forcing to compute in the places where such indexing occurs. The cases I've seen (and "fixed") so far are using boolean indexing to select channels (based on data from the first ping), so the impact on performance is likely small. Right now this is fine since we restrict the use case to when the transmit signals are identical across all ping_time (but can vary across channel). Performance issue may arise when we want to accommodate transmit signal change across ping _time (#925).

I also adjusted one place in the convolution code to allow dask to function in parallel, by rechunking the range_sample dimension into 1 chunk.

Note: this is not yet ready for review.

@leewujung leewujung added the enhancement This makes echopype better label Apr 24, 2024
@leewujung leewujung added this to the v0.9.0 milestone Apr 24, 2024
@leewujung
Copy link
Member Author

The failed tests are a little weird. I've just looked into one of them: test_ek80_BB_power_from_complex. My changes didn't actually change the type of operations being performed on the data, but the convolution output seems different.

@ctuguinay ctuguinay marked this pull request as ready for review May 23, 2024 20:04
@ctuguinay
Copy link
Collaborator

@leewujung if everything here looks good, could you merge this?

replica = np.flipud(np.conj(tx))
replica_dict[str(channel.values)] = np.flipud(np.conj(tx))
# Concatenate backscatter channels with dropped NaN beam dimensions.
backscatter_NaN_beam_drop_all = xr.concat(backscatter_NaN_beam_drop_all, dim="channel")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: I think the original intention to dropna with dim="beam" is to remove the padded sections along the range_sample dimension, since some channels contain data that are shorter. So, if you dropna for a particular channel, but other channels are longer, wouldn't this xr.concat recreate the dropped NaN values?

Copy link
Collaborator

@ctuguinay ctuguinay May 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I think you're right. Let me try a toy example

Copy link
Collaborator

@ctuguinay ctuguinay May 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code:

data = np.array([
    [np.nan, np.nan], [4, 5], [np.nan, 6], [8, 9]
])
da = xr.DataArray(
    data,
    dims=["channel", "beam"],
    coords={
        "channel": np.arange(4),
        "beam": np.arange(2)
    })
dropped_nan_beam_da_list = []
for channel_index in range(len(da["channel"])):
    dropped_da = da.isel(channel=channel_index).dropna("beam", how="all")
    print(f"`dropped_da` shape (chan idx {channel_index}): {dropped_da.shape[0]}")
    dropped_nan_beam_da_list.append(dropped_da)
print("concatenated da:", xr.concat(dropped_nan_beam_da_list, dim="channel"))

Output:

`dropped_da` shape (chan idx 0): 0
`dropped_da` shape (chan idx 1): 2
`dropped_da` shape (chan idx 2): 1
`dropped_da` shape (chan idx 3): 2
concatenated da: <xarray.DataArray (channel: 4, beam: 2)> Size: 64B
array([[nan, nan],
       [ 4.,  5.],
       [nan,  6.],
       [ 8.,  9.]])
Coordinates:
  * beam     (beam) int64 16B 0 1
  * channel  (channel) int64 32B 0 1 2 3

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yeah you're right, this padding does happen. Let me tinker with the implementation that's currently in this PR. There's probably a clever way skip to skip this concat.

Copy link
Collaborator

@ctuguinay ctuguinay May 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I guess even if we can skip this concat, this concat and padding happens anyway during the apply_ufunc so that the final array that comes out of apply_ufunc could have all-NaN beam vectors. The benefit of the old implementation where we looped through channels was that we could skip having to convolve on all-NaN beam vectors, and the tradeoff was that the operation could not be delayed. So tricky 😆

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think your code above is missing the range_sample dimension, which is where that padding happens.

But yeah, I think the easiest is to just not dropna here. We have 2 options:

  • try drop the NaNs in the _convolve_per_channel function, since there you do have a loop that is channel by channel. The only thing is that you have to drop the NaN and add them back to ensure the output length is the same
  • just pay the computational price

Maybe you could try seeing how the speed differ for these 2 approaches? It is possible that with all the optimization people have put into convolution, the speed difference is small (and hopefully the memory consumption is reasonable too).

Copy link
Collaborator

@ctuguinay ctuguinay May 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good, I'll try it out. I have a suspicion that the 2nd option of just not dropping NaNs might be faster, since the additional time needed to convolve on arrays that contain a lot of NaNs may be faster than the additional time needed to remove and add back NaNs, because the writers of the convolve operation made it fast for arrays with NaNs (I think). I could be completely wrong though haha

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree, it very possibly would be that way!

@ctuguinay ctuguinay marked this pull request as draft May 25, 2024 06:01
@ctuguinay
Copy link
Collaborator

Recent changes will come from: leewujung#37

ctuguinay and others added 2 commits May 29, 2024 03:45
@codecov-commenter
Copy link

codecov-commenter commented May 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.19%. Comparing base (9f56124) to head (3f230ed).
Report is 53 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1311      +/-   ##
==========================================
+ Coverage   83.52%   89.19%   +5.67%     
==========================================
  Files          64       11      -53     
  Lines        5686      990    -4696     
==========================================
- Hits         4749      883    -3866     
+ Misses        937      107     -830     
Flag Coverage Δ
unittests 89.19% <100.00%> (+5.67%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ctuguinay ctuguinay marked this pull request as ready for review May 29, 2024 19:26
if dask_array:
ep_vals = ds_Sv["Sv"].sel(channel=ch_sel).squeeze().data.compute()
else:
ep_vals = ds_Sv["Sv"].sel(channel=ch_sel).squeeze().data
ep_vals = ds_Sv["Sv"].sel(channel=ch_sel).squeeze().data.compute()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that right now with the dask_array=False test case the data is still lazy-loaded? Asking because I seem to remember the original intention of this is to see if lazy-loading works explicitly, but I could also just be misremembering things.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, it's because of the .chunk({"range_sample": -1, "channel": -1}) that's now in the convolution. Let me make this chunk conditional on whether or not the backscatter variable is already chunked.

@ctuguinay
Copy link
Collaborator

@leewujung Thanks! Should be ready for another short review haha

@leewujung
Copy link
Member Author

Awesome! Thanks @ctuguinay, this looks great. I'll merge it now.

@leewujung leewujung merged commit 46b6296 into OSOceanAcoustics:main Jun 11, 2024
5 checks passed
@leewujung leewujung deleted the dask_bool_index branch July 21, 2024 00:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement This makes echopype better
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants