Support lazy-loaded EK80 broadband-complex data #1311
Conversation
The failed tests are a little weird. I've just looked into one of them:
@leewujung if everything here looks good, could you merge this?
echopype/calibrate/ek80_complex.py
Outdated
```python
replica = np.flipud(np.conj(tx))
replica_dict[str(channel.values)] = np.flipud(np.conj(tx))

# Concatenate backscatter channels with dropped NaN beam dimensions.
backscatter_NaN_beam_drop_all = xr.concat(backscatter_NaN_beam_drop_all, dim="channel")
```
Question: I think the original intention of `dropna` with `dim="beam"` is to remove the padded sections along the `range_sample` dimension, since some channels contain data that are shorter. So, if you `dropna` for a particular channel but other channels are longer, wouldn't this `xr.concat` recreate the dropped NaN values?
Oh yeah, I think you're right. Let me try a toy example.
Code:

```python
import numpy as np
import xarray as xr

data = np.array([
    [np.nan, np.nan], [4, 5], [np.nan, 6], [8, 9]
])
da = xr.DataArray(
    data,
    dims=["channel", "beam"],
    coords={
        "channel": np.arange(4),
        "beam": np.arange(2),
    },
)

dropped_nan_beam_da_list = []
for channel_index in range(len(da["channel"])):
    dropped_da = da.isel(channel=channel_index).dropna("beam", how="all")
    print(f"`dropped_da` shape (chan idx {channel_index}): {dropped_da.shape[0]}")
    dropped_nan_beam_da_list.append(dropped_da)

print("concatenated da:", xr.concat(dropped_nan_beam_da_list, dim="channel"))
```
Output:

```
`dropped_da` shape (chan idx 0): 0
`dropped_da` shape (chan idx 1): 2
`dropped_da` shape (chan idx 2): 1
`dropped_da` shape (chan idx 3): 2
concatenated da: <xarray.DataArray (channel: 4, beam: 2)> Size: 64B
array([[nan, nan],
       [ 4.,  5.],
       [nan,  6.],
       [ 8.,  9.]])
Coordinates:
  * beam     (beam) int64 16B 0 1
  * channel  (channel) int64 32B 0 1 2 3
```
So yeah, you're right, this padding does happen. Let me tinker with the implementation that's currently in this PR. There's probably a clever way to skip this concat.
But I guess even if we can skip this concat, the concat and padding happen anyway during the `apply_ufunc` call, so the final array that comes out of `apply_ufunc` could have all-NaN beam vectors. The benefit of the old implementation, where we looped through channels, was that we could skip having to convolve on all-NaN beam vectors; the tradeoff was that the operation could not be delayed. So tricky 😆
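For a concrete picture of that tradeoff, here is a toy standalone sketch (not echopype's actual implementation; `np.convolve` just stands in for the real pulse-compression convolution, and the array shapes are made up). Once the channels sit in one padded array, `apply_ufunc` convolves the NaN padding along with the real samples, and the padding simply re-emerges as NaNs in the output:

```python
import numpy as np
import xarray as xr

# Channel 0 carries NaN padding along range_sample; apply_ufunc convolves the
# whole padded array in one delayed-friendly call, so the padding is convolved too.
backscatter = xr.DataArray(
    np.array([[1.0, 2.0, 3.0, np.nan, np.nan, np.nan],
              [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]),
    dims=["channel", "range_sample"],
)
replica = np.array([1.0, 1.0])

convolved = xr.apply_ufunc(
    lambda arr: np.apply_along_axis(np.convolve, -1, arr, replica, mode="full"),
    backscatter,
    input_core_dims=[["range_sample"]],
    output_core_dims=[["range_sample_conv"]],
)
print(convolved.values)  # samples overlapping the NaN padding come out as NaN
```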
Oh, I think your code above is missing the `range_sample` dimension, which is where that padding happens.

But yeah, I think the easiest is to just not `dropna` here. We have 2 options:

- try dropping the NaNs in the `_convolve_per_channel` function, since there you do have a loop that goes channel by channel; the only catch is that you have to drop the NaNs and add them back to ensure the output length is the same (see the sketch after this list)
- just pay the computational price and convolve with the NaN padding in place

Maybe you could try seeing how the speed differs for these 2 approaches? It is possible that with all the optimization people have put into convolution, the speed difference is small (and hopefully the memory consumption is reasonable too).
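A rough sketch of what option 1 could look like inside the per-channel loop (the helper name is hypothetical, and `np.convolve` again stands in for the actual convolution): trim the NaN padding before convolving, then pad the result back to a common length so all channels still align.

```python
import numpy as np

def convolve_channel_dropping_nans(vals: np.ndarray, replica: np.ndarray) -> np.ndarray:
    """Hypothetical per-channel helper (option 1): trim NaN padding, convolve,
    then restore NaNs so every channel returns the same output length."""
    out_len = vals.size + replica.size - 1           # "full" convolution length for the padded input
    valid = vals[~np.isnan(vals)]                    # drop the NaN padding
    conv = np.convolve(valid, replica, mode="full")  # convolve only the real samples
    padded = np.full(out_len, np.nan, dtype=conv.dtype)
    padded[:conv.size] = conv                        # add the NaNs back at the tail
    return padded

# Example: a channel whose last three range samples are padding
print(convolve_channel_dropping_nans(
    np.array([1.0, 2.0, 3.0, np.nan, np.nan, np.nan]), np.array([1.0, 1.0])
))
```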
That sounds good, I'll try it out. I have a suspicion that the 2nd option of just not dropping NaNs might be faster: the additional time needed to convolve arrays that contain a lot of NaNs may be less than the additional time needed to remove the NaNs and add them back, because the writers of the convolve operation made it fast for arrays with NaNs (I think). I could be completely wrong though haha
Yeah, I agree, it very possibly would be that way!
Recent changes will come from: leewujung#37 (No Longer Drop Beam in Pulse Compression)
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1311      +/-   ##
==========================================
+ Coverage   83.52%   89.19%    +5.67%
==========================================
  Files          64       11       -53
  Lines        5686      990     -4696
==========================================
- Hits         4749      883     -3866
+ Misses        937      107      -830
```
```diff
- if dask_array:
-     ep_vals = ds_Sv["Sv"].sel(channel=ch_sel).squeeze().data.compute()
- else:
-     ep_vals = ds_Sv["Sv"].sel(channel=ch_sel).squeeze().data
+ ep_vals = ds_Sv["Sv"].sel(channel=ch_sel).squeeze().data.compute()
```
Does this mean that right now, with the `dask_array=False` test case, the data is still lazy-loaded? Asking because I seem to remember the original intention of this was to check explicitly that lazy-loading works, but I could also just be misremembering things.
Ah yeah, it's because of the `.chunk({"range_sample": -1, "channel": -1})` that's now in the convolution. Let me make this chunk conditional on whether or not the backscatter variable is already chunked.
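Something like this minimal sketch could work (variable names here are illustrative, not the actual echopype code): only force single chunks along `range_sample`/`channel` when the backscatter is already dask-backed, so the in-memory case stays un-chunked.

```python
import numpy as np
import xarray as xr

# Illustrative only: `backscatter` stands in for the variable fed to the convolution.
backscatter = xr.DataArray(
    np.zeros((2, 8)), dims=["channel", "range_sample"]
).chunk({"range_sample": 4})

# .chunks is None for a plain numpy-backed DataArray, so the rechunk is skipped there
if backscatter.chunks is not None:
    backscatter = backscatter.chunk({"range_sample": -1, "channel": -1})
```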
@leewujung Thanks! Should be ready for another short review haha
Awesome! Thanks @ctuguinay, this looks great. I'll merge it now.
This PR addresses the "Indexing with a boolean dask array is not allowed" issues encountered when calibrating EK80 complex data that is lazy-loaded.

This issue essentially calls for forcing a compute in the places where such indexing occurs. The cases I've seen (and "fixed") so far use boolean indexing to select channels (based on data from the first ping), so the impact on performance is likely small. Right now this is fine since we restrict the use case to when the transmit signals are identical across all `ping_time` (but can vary across `channel`). Performance issues may arise when we want to accommodate transmit signals that change across `ping_time` (#925).

I also adjusted one place in the convolution code to allow dask to function in parallel, by rechunking the `range_sample` dimension into 1 chunk.

Note: this is not yet ready for review.
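As an illustration of the pattern behind the channel-selection fixes described above (variable and dimension names here are made up for the example, not echopype internals): the small per-channel mask derived from the first ping is computed up front, and the now in-memory mask is then used for boolean indexing.

```python
import numpy as np
import xarray as xr

# Hypothetical example data: 3 channels x 2 pings, chunked so the values are dask-backed
ds = xr.Dataset(
    {"backscatter_r": (("channel", "ping_time"), np.arange(6.0).reshape(3, 2))},
    coords={"channel": ["ch1", "ch2", "ch3"], "ping_time": [0, 1]},
).chunk({"ping_time": 1})

# Mask based on the first ping only; .compute() forces the tiny boolean mask into memory
# (cheap, since it is one flag per channel) so it can be used as a boolean index
mask = (ds["backscatter_r"].isel(ping_time=0) > 1.0).compute()

selected_channels = ds["channel"].values[mask.values]  # plain numpy boolean indexing
selected = ds.sel(channel=selected_channels)
print(selected["channel"].values)  # ['ch2' 'ch3']
```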