This is originally from #407, but the focus of that issue turned to EK echosounder data instead. Issues related to EK files were addressed in #1185.
This new issue captures the similar needs for AD2CP files. The same approach as in #1185 would likely work here too, with the caveat that some part of it may need to happen at the parser stage if the file is very large: if an AD2CP file is a few GB and system memory is small, parser.parse_raw may fail due to insufficient memory.
During OceanHackWeek '21, a participant attempted to convert ~1 GB files from Nortek AD2CP instruments (Signature 100 and Signature 250) and failed both on the JupyterHub and on their personal machine. This is probably related to the exploding in-memory xarray.merge situation we have seen before.
@imranmaj, @lsetiawan, and I discussed this yesterday, and an interim solution is an under-the-hood procedure that does the following (a rough sketch follows the list):
1. Parse binary data.
2. Once the parsed data reaches a certain size, save the already parsed data into a small file.
3. Repeat steps 1-2 until the end of the binary file.
4. Merge all small files under the hood; this step can be delayed and distributed to dask workers.
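A rough, hedged sketch of what this could look like. The parse_packets and packets_to_dataset helpers, the ping_time dimension name, and the size threshold are all placeholders, not echopype's actual parser interface:

```python
import tempfile
from pathlib import Path

import xarray as xr

MAX_BATCH_BYTES = 100 * 1024**2  # flush once a parsed batch grows past ~100 MB


def convert_in_batches(ad2cp_path, parse_packets, packets_to_dataset):
    """Parse an AD2CP file in bounded-memory batches.

    ``parse_packets`` yields parsed data packets one at a time and
    ``packets_to_dataset`` turns a list of packets into an xarray Dataset;
    both are placeholders for the real parser internals.
    """
    tmp_dir = Path(tempfile.mkdtemp())
    batch, batch_bytes, batch_paths = [], 0, []

    def flush():
        path = tmp_dir / f"batch_{len(batch_paths):04d}.zarr"
        packets_to_dataset(batch).to_zarr(path)   # step 2: save parsed data to a small file
        batch_paths.append(path)

    for packet in parse_packets(ad2cp_path):      # step 1: parse binary data
        batch.append(packet)
        batch_bytes += packet.nbytes
        if batch_bytes >= MAX_BATCH_BYTES:
            flush()                               # step 3: repeat 1-2 until end of file
            batch, batch_bytes = [], 0

    if batch:                                     # flush the final partial batch
        flush()

    # Step 4: open every small store lazily and concatenate along time;
    # with dask-backed arrays the combine work is deferred to workers.
    return xr.concat([xr.open_zarr(p) for p in batch_paths], dim="ping_time")
```

The threshold trades peak memory against the number of temporary stores; whatever the real parser ends up doing, the key point is that nothing larger than one batch is ever held in memory at once.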
Looking back:
Steps 1-3 could use the same approach as for EK data: save parsed data into a temporary zarr file.
Step 4 could use the same xr.concat approach used for EK data, to avoid the potentially large overhead of xr.merge.
A caveat is that, without parsing all AD2CP data packets (analogous to datagrams in EK raw files), the "final" shape of the entire zarr store may change across the batches of sequentially parsed data packets. Some work is needed to figure out a strategy for handling this; one possibility is sketched below.
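One candidate strategy, shown only as an illustration (the variable and dimension names below are made up and not echopype's actual data model), is to lean on xr.concat's outer join, which aligns and pads batches whose non-time dimensions differ:

```python
import numpy as np
import xarray as xr

# Two batches whose non-time dimension ("range_sample") differs in length,
# mimicking packets whose profile length changes partway through a file.
batch1 = xr.Dataset(
    {"velocity": (("ping_time", "range_sample"), np.random.rand(5, 10))},
    coords={"ping_time": np.arange(5), "range_sample": np.arange(10)},
)
batch2 = xr.Dataset(
    {"velocity": (("ping_time", "range_sample"), np.random.rand(5, 15))},
    coords={"ping_time": np.arange(5, 10), "range_sample": np.arange(15)},
)

# join="outer" (the default) aligns the differing "range_sample" coordinates
# and pads the shorter batch with NaN, so the final shape does not need to be
# known before all batches have been parsed.
combined = xr.concat([batch1, batch2], dim="ping_time", join="outer")
print(dict(combined.sizes))  # {'ping_time': 10, 'range_sample': 15}
```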
@leewujung has there been any progress on this issue? I have a ~4 GB AD2CP file from a Signature 100 instrument that I cannot convert. After a few minutes I get several warnings ("UserWarning: Converting non-nanosecond precision datetime..."), followed by a notice that the process was killed and another warning ("UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown"). My laptop has 16 GB of RAM and should be able to handle the conversion without problems. I was monitoring the process memory usage and it was not excessive.
Hey @jessecusack: we haven't been able to work on this further because we're over-committed with other priorities.
Would you be interested in working on it? If I remember correctly from prior investigations, the main thing that created the large memory expansion was xr.merge, which we can sidestep by changing how data from different modes are stored, and probably also by writing parsed data to disk directly. File size and memory usage are not always a one-to-one match; it depends on the computation details involved.
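For the "writing parsed data to disk directly" part, a hedged sketch (the store path and dimension name are made up) would be to append each parsed batch to a single zarr store instead of accumulating Datasets for an in-memory xr.merge:

```python
import xarray as xr

store = "converted_ad2cp.zarr"  # hypothetical output path


def write_batch(ds: xr.Dataset, first: bool) -> None:
    if first:
        ds.to_zarr(store, mode="w")                            # create the store
    else:
        ds.to_zarr(store, mode="a", append_dim="ping_time")    # grow along time

# Usage with some iterable of per-batch Datasets (placeholder name):
# for i, ds in enumerate(parsed_batches):
#     write_batch(ds, first=(i == 0))
```

Note that append_dim requires all non-appended dimensions to match across batches, so this only works once the shape-change caveat discussed above is handled.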
For "UserWarning: Converting non-nanosecond precision datetime..." -- this is something we know how to fix, as we've fixed that for other echosounder models.
For "UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown", could you copy-paste the entire error message, or better yet, upload a notebook gist so that there's a reproducible example?