
subset command uses a lot of RAM when downloading large subsets #111

Closed
spiani opened this issue Aug 15, 2024 · 16 comments

@spiani

spiani commented Aug 15, 2024

I continuously get an "Out of RAM" error when I run Copernicus Marine using the subset command. I was unable to find any minimum hardware requirements for the Copernicus Marine Toolbox, but after a few tests, I have the impression that the toolbox uses an amount of RAM roughly equal to the size of the requested subset.

For example, the following script runs out of RAM on a machine with 32GB of RAM:

import copernicusmarine

MAX_DEPTH = 200

LATITUDE_RANGE = (38.894, 46.0)
LONGITUDE_RANGE = (11.346, 21.206)

SCRATCH_DIR = "/path/to/scratch"  # placeholder; defined elsewhere in my script

my_dataset = copernicusmarine.subset(
    dataset_id='med-cmcc-tem-rean-d',
    minimum_longitude=LONGITUDE_RANGE[0],
    maximum_longitude=LONGITUDE_RANGE[1],
    minimum_latitude=LATITUDE_RANGE[0],
    maximum_latitude=LATITUDE_RANGE[1],
    minimum_depth=0,
    maximum_depth=MAX_DEPTH,
    variables=['thetao'],
    output_directory=SCRATCH_DIR,
    force_download=True
)

but it works if I reduce the domain by a factor of 2. If I use the "zarr" format instead of netcdf, I don't have any problem.

Can you confirm this? Is it possible to reduce the amount of required RAM?

I am using Copernicus Marine Toolbox version 1.3.2 on a CentOS 8 machine (kernel 4.18.0).

Thank you!

@veenstrajelmer

You might want to include start_datetime and end_datetime in your query to avoid retrieving the entire time range at once.
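
For example, something along these lines (the dates are placeholders; start_datetime and end_datetime are the toolbox parameters for time bounds, as far as I can tell):

import copernicusmarine

# Same request as in the first comment, but bounded to one year at a time.
copernicusmarine.subset(
    dataset_id='med-cmcc-tem-rean-d',
    variables=['thetao'],
    minimum_longitude=11.346, maximum_longitude=21.206,
    minimum_latitude=38.894, maximum_latitude=46.0,
    minimum_depth=0, maximum_depth=200,
    start_datetime="2020-01-01",  # placeholder dates
    end_datetime="2020-12-31",
    force_download=True,
)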

@spiani
Author

spiani commented Aug 15, 2024

Yes, I know that I could, but I wonder: should I? I tried to download the entire dataset as a single file because I need to compute some statistical indicators over the whole time series, and it is easier to write the code for a single file than to jump among several ones.

Is the "subset" command intended to be used only for smaller scenarios, such as situations where "open_dataset" would be used? In my case, is it better to split the file?

@veenstrajelmer

I am just another user and I do not know the official take on this, but smaller requests are often also faster. What we do is retrieve the data per period (days, months, or anything convenient), then read the data with xarray.open_mfdataset() and either write it to a single file or use the xarray dataset directly. This can probably also be done with nco.
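
Roughly, the workflow looks like this (the year, the monthly split, and the file names are only illustrative; output_filename and the datetime parameters are the toolbox options as I understand them):

import copernicusmarine
import pandas as pd
import xarray as xr

# Download one NetCDF file per month, then merge them locally.
for start in pd.date_range("2020-01-01", "2020-12-01", freq="MS"):
    end = start + pd.offsets.MonthEnd(1)
    copernicusmarine.subset(
        dataset_id='med-cmcc-tem-rean-d',
        variables=['thetao'],
        start_datetime=f"{start:%Y-%m-%d}",
        end_datetime=f"{end:%Y-%m-%d}",
        output_filename=f"thetao_{start:%Y%m}.nc",
        force_download=True,
    )

# Read all monthly files as one lazy dataset; write a single file if needed.
ds = xr.open_mfdataset("thetao_*.nc", combine="by_coords")
ds.to_netcdf("thetao_merged.nc")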

@spiani
Author

spiani commented Aug 19, 2024

Thank you for your reply! I think I will use the Zarr format, or follow your advice and use NCO to concatenate several files. I usually avoid open_mfdataset because I have noticed that Dask's performance degrades noticeably when the number of timesteps in each file is not consistent (for example, months with 30 or 31 days). In that case, it would be better to download NetCDF files covering, say, 1,000 days each, but this approach starts to become somewhat cumbersome.

@veenstrajelmer

Interesting, was this reported to dask as well? I can imagine this happens since the chunks are inconsistent, but instead of downloading 1,000 days per file it might also work to download per month and then read the files so that the times are chunked consistently (e.g. per day). However, this might be more cumbersome in your case than using nco, and it might not be best for performance either; that would depend a bit on the other dimensions of your dataset.
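
For example, something like this after downloading (the chunk length of 30 is just illustrative):

import xarray as xr

# Monthly files hold 28-31 daily timesteps each; re-chunking after the
# concatenation gives dask uniform blocks along time instead of
# file-sized, uneven ones.
ds = xr.open_mfdataset("thetao_*.nc", combine="by_coords")
ds = ds.chunk({"time": 30})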

@renaudjester
Collaborator

renaudjester commented Aug 19, 2024

Hi @spiani thanks for reporting this issue! And @veenstrajelmer thanks for replying!

I haven't seen this bug before; this is definitely an issue. The idea is that the toolbox should let you download any amount of data with subset. The difference between the zarr and netcdf formats is very interesting:

If I use the "zarr" format instead of the netcdf, I don't have any problem.

It could be an issue with the xarray library, and it's a limitation we need to take into account in the toolbox.
By any chance, have you tried applying xarray.Dataset.to_netcdf yourself directly after downloading the data in zarr format?

@renaudjester
Collaborator

@spiani I tried to reproduce the bug, unfortunately without success.

I tried the same request in a notebook:

[Screenshot: the same subset request run in a notebook]

and memory usage seems stable and not particularly high:

[Screenshot: memory usage during the run]

I have a MacBook Pro with 8GB of RAM. Could it be that the problem is OS-specific?

@veenstrajelmer

I encountered significant memory differences in xarray.open_dataset() with different backends: engine="netcdf4" sometimes consumes much more memory (but much less time) than engine="h5netcdf", although this might be the case only in very specific situations. Could it be that a different engine is used because of your environment? I can't tell exactly what subset() does and whether this is relevant, but I thought it might be useful to add this suggestion anyway.
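
For reference, the backend can be forced explicitly, e.g. (the file name is a placeholder, and h5netcdf has to be installed separately):

import xarray as xr

# The same file opened with two different backends; memory use and speed can differ.
ds_nc4 = xr.open_dataset("subset.nc", engine="netcdf4")
ds_h5 = xr.open_dataset("subset.nc", engine="h5netcdf")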

@spiani
Author

spiani commented Aug 23, 2024

Hello,
Apologies for the delayed response; it's been a very busy week. I've conducted some additional experiments on my workstation (Ubuntu 22.04.4, kernel 6.8.0-40-generic, x86, 128 GB of RAM, Copernicus Marine 1.3.1 installed via Conda).

When I run the same script I mentioned in my initial comment (netcdf format), the result is as follows:

INFO - 2024-08-21T09:10:32Z - Dataset version was not specified, the latest one was selected: "202012"
INFO - 2024-08-21T09:10:32Z - Dataset part was not specified, the first one was selected: "default"
INFO - 2024-08-21T09:10:33Z - Service was not specified, the default one was selected: "arco-time-series"
WARNING - 2024-08-21T09:10:34Z - Some or all of your subset selection [38.894, 46.0] for the latitude dimension  exceed the dataset coordinates [30.1875, 45.97916793823242]
INFO - 2024-08-21T09:10:34Z - Downloading using service arco-time-series...
INFO - 2024-08-21T09:10:40Z - Estimated size of the dataset file is 70062.329 MB.
INFO - 2024-08-21T09:10:40Z - Writing to local storage. Please wait...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300722/300722 [8:51:50<00:00,  9.42it/s]
INFO - 2024-08-21T18:03:01Z - Successfully downloaded to /data/temp/delete_me/med-cmcc-tem-rean-d_thetao_11.38E-21.17E_38.90N-45.98N_1.02-192.48m_1987-01-01-2022-07-31.nc

The download takes 9 hours. During execution, the process gradually increases its RAM usage; when I checked at around 50% complete, it was using approximately 30 GB of RAM.
If I use zarr instead, the download takes 9 minutes!

INFO - 2024-08-22T09:22:03Z - Dataset version was not specified, the latest one was selected: "202012"
INFO - 2024-08-22T09:22:03Z - Dataset part was not specified, the first one was selected: "default"
INFO - 2024-08-22T09:22:05Z - Service was not specified, the default one was selected: "arco-time-series"
WARNING - 2024-08-22T09:22:06Z - Some or all of your subset selection [38.894, 46.0] for the latitude dimension  exceed the dataset coordinates [30.1875, 45.97916793823242]
INFO - 2024-08-22T09:22:06Z - Downloading using service arco-time-series...
INFO - 2024-08-22T09:22:11Z - Estimated size of the dataset file is 70062.329 MB.
INFO - 2024-08-22T09:22:11Z - Writing to local storage. Please wait...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300722/300722 [08:31<00:00, 587.68it/s]
INFO - 2024-08-22T09:31:09Z - Successfully downloaded to /data/temp/delete_me2/med-cmcc-tem-rean-d_thetao_11.38E-21.17E_38.90N-45.98N_1.02-192.48m_1987-01-01-2022-07-31.zarr

What’s particularly interesting to me is what happens when I limit the process to a maximum of 16 GB of RAM using a cgroup:

INFO - 2024-08-22T09:37:24Z - Dataset version was not specified, the latest one was selected: "202012"
INFO - 2024-08-22T09:37:24Z - Dataset part was not specified, the first one was selected: "default"
INFO - 2024-08-22T09:37:25Z - Service was not specified, the default one was selected: "arco-time-series"
WARNING - 2024-08-22T09:37:26Z - Some or all of your subset selection [38.894, 46.0] for the latitude dimension  exceed the dataset coordinates [30.1875, 45.97916793823242]
INFO - 2024-08-22T09:37:26Z - Downloading using service arco-time-series...
INFO - 2024-08-22T09:37:29Z - Estimated size of the dataset file is 70062.329 MB.
INFO - 2024-08-22T09:37:29Z - Writing to local storage. Please wait...
 22%|██████████████████████████████████████████████████████████                                                                                                                                                                                                                 | 65444/300722 [14:16:29<55:39:35,  1.17it/s]
Killed

The surprising part is that this process was terminated after 15 hours. So, roughly speaking, even if it hadn't run out of RAM, the download would have taken more than 3 days to complete. I’m not sure if this is because more RAM allows the kernel to use a larger cache, but this is definitely a different behavior.

Now, I will try opening the Zarr file and using xarray.Dataset.to_netcdf to see if the behavior is related to the Xarray library. If it is, we can consider opening a ticket with them.

@renaudjester
Collaborator

Thanks for the investigation!

@spiani
Author

spiani commented Sep 2, 2024

Hello,
I tried to save the content of the zarr file using xarray.Dataset.to_netcdf (with the default engine, netcdf4). In this case, I don't see any problem. If I change the engine and use scipy instead, the script goes out of RAM.
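
Roughly, the test looked like this (file names are placeholders):

import xarray as xr

ds = xr.open_dataset("my_subset.zarr", engine="zarr")
ds.to_netcdf("subset_netcdf4.nc", engine="netcdf4")  # default engine: no problem on my machine
ds.to_netcdf("subset_scipy.nc", engine="scipy")      # scipy engine: goes out of RAM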

In any case, I think that copernicusmarine uses the default engine and therefore xarray is not the root of the problem.

Can anybody try to reproduce the problem on a Linux machine? Thank you!

@renaudjester
Collaborator

renaudjester commented Sep 24, 2024

@spiani Could you indicate which versions of the packages and dependencies you are using? (with a pip freeze or something similar, depending on your setup)

Thanks in advance!

@renaudjester
Collaborator

Hi @spiani and @veenstrajelmer,

I was doing some tests related to this issue. I used the same Python command as the one in this issue, on an Ubuntu machine with 128GB of RAM and 16 cores, with a 9267.84 Mbps download connection.

First, I could indeed reproduce a time difference between saving to zarr format and to netcdf format using the toolbox:

  • zarr format: 3 min 50 s
  • netcdf format: 29 min 15 s

There was no memory usage increase (around 30GB used by the Python process the whole time).

Now I also wanted to test to_netcdf from xarray using the zarr file I just downloaded (which is 20GB in total), and I get this:

>>> import xarray
>>> dataset = xarray.open_dataset("todelete/med-cmcc-tem-rean-d_thetao_11.38E-21.17E_38.90N-45.98N_1.02-192.48m_1987-01-01-2022-07-31.zarr", engine="zarr")
>>> dataset.to_netcdf("from_zarr.nc")
Killed

The process gets killed due to a memory problem:

[Screenshot: memory usage climbing until the process is killed]

FYI, @veenstrajelmer I tried both engines ("netcdf4" and "h5netcdf") and I obtain the same result: memory usage keeps increasing until it reaches 128GB and crashes.

So @spiani when you say:

I tried to save the content of the zarr file using xarray.Dataset.to_netcdf (and the default engine netcdf4). In this case, I don't see any problem.

I obtain a different result: I couldn't convert the zarr file to netcdf.

So from what I just saw here, I see two things (hypotheses):

  1. It seems to me that it's more a problem between xarray and netcdf than a specific configuration of the toolbox.
  2. It looks like transforming to netcdf is computationally intensive, which would explain the difference in times between zarr and netcdf. Is it an acceptable difference? Not sure...

What the toolbox could do is find a workaround for those problems, but I am not sure how to do this 🤔 For example, doing some multiprocessing (so computing on several cores) might not suit everybody's infrastructure, or it would have to be optional, and I don't really know how it could be put in place (dask.distributed?).
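
To make that last idea concrete, a minimal sketch of what the dask.distributed option could look like (worker count, memory limit, and file names are made up, and whether to_netcdf cooperates nicely with the distributed scheduler would still need testing):

import xarray as xr
from dask.distributed import Client

# Local cluster with an explicit per-worker memory budget.
client = Client(n_workers=4, threads_per_worker=2, memory_limit="4GB")

# Open lazily and let the cluster drive the computation while writing.
ds = xr.open_dataset("my_subset.zarr", engine="zarr", chunks={})
ds.to_netcdf("my_subset.nc")

client.close()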

@renaudjester
Collaborator

After reading this Stack Overflow issue, I am going to see if we can at least avoid the memory issues by setting a smart chunk size.

Though it seems that the time overhead induced by converting the file from zarr to netcdf is rather difficult to reduce.
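
For what it's worth, the kind of thing I have in mind (chunk size and file names are illustrative, not tested yet):

import xarray as xr

# Open the zarr store lazily; chunks={} keeps the on-disk zarr chunking.
ds = xr.open_dataset("my_subset.zarr", engine="zarr", chunks={})

# Re-chunk to a modest size along time so each dask block fits in memory,
# then let to_netcdf write block by block instead of materialising everything.
ds = ds.chunk({"time": 100})
ds.to_netcdf("my_subset.nc")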

@uriii3
Collaborator

uriii3 commented Oct 2, 2024

I was reading about it, and dask itself also recommends using xarray specifically (see their documentation). Whether the problem is specific to this case or to the 'build', I'm not sure we can do better than xarray, no?

@renaudjester
Collaborator

@spiani and @veenstrajelmer Closing this issue for now!

If you pinpoint the problem, please feel free to write it down here :D

Also, maybe v2.0.0a4 can help, so don't hesitate to try it out.
