Issues with intake-esm #6
@wachsylon, how many results does the search find? Are these results independent, i.e. are there no results that would request the same tar file?
This seems to be an issue with how the sources are loaded:

```python
with concurrent.futures.ThreadPoolExecutor(max_workers=dask.system.CPU_COUNT) as executor:
    future_tasks = [
        executor.submit(_load_source, key, source) for key, source in sources.items()
    ]
```

Due to these independent jobs, the …
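The duplicate-request concern raised above can be sketched with the standard library alone: group the catalog entries by the tar archive named after the `::` separator, and submit one job per archive instead of one per entry. The URIs below are shortened, hypothetical examples (only the chained-URL shape matches the catalog), and `retrieve` is a placeholder, not the real slk call:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical catalog URIs in the chained-URL style "tar://<member>::slk://<archive>".
uris = [
    "tar://./a_201501-203412.nc::slk:///arch/ScenarioMIP_3964.tar",
    "tar://./a_203501-205412.nc::slk:///arch/ScenarioMIP_3964.tar",
    "tar://./b_201501-203412.nc::slk:///arch/ScenarioMIP_3965.tar",
]

def archive_of(uri: str) -> str:
    # The part after "::" identifies the tar file on the archive.
    return uri.split("::", 1)[1]

# Group entries by tar so each archive is requested only once.
groups = defaultdict(list)
for uri in uris:
    groups[archive_of(uri)].append(uri)

def retrieve(item):
    archive, members = item
    # Placeholder for one retrieval covering all members of this tar.
    return (archive, len(members))

with ThreadPoolExecutor(max_workers=4) as executor:
    results = dict(executor.map(retrieve, groups.items()))

print(results)  # two retrievals instead of three; the 3964 tar is fetched once
```

With grouping, the number of retrieval jobs equals the number of unique tars, not the number of catalog entries.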
@wachsylon could you give access to the catalog or point me to a public catalog where the same problem occurs?
Sorry, I put the catalog to … I either ran into …

Or with …

where it seems like …
@wachsylon I had to adjust your example a bit. The arguments you were using do not exist in the current intake-esm version:

```python
import intake
import json
import pandas as pd

with open("/pool/data/Catalogs/dkrz_cmip6_disk.json") as f:
    catconfig = json.load(f)
df = pd.read_csv("/work/ik1017/Catalogs/dkrz_cmip6_archive.csv.gz")
testcat = intake.open_esm_datastore(obj={"esmcat": catconfig, "df": df})
subset = testcat.search(
    source_id="MPI-ESM1-2-LR",
    experiment_id="ssp370",
    variable_id="tas",
    table_id="Amon",
)
subset.to_dataset_dict(xarray_open_kwargs=dict(engine="h5netcdf"))
```

The issue here is how intake-esm creates the datasets. As mentioned in #6 (comment), intake-esm opens every catalog entry independently. Because the …

```
In [4]: testcat['ScenarioMIP.MPI-ESM1-2-LR.ssp370.Amon.gn'].df.iloc[0]["uri"]
Out[4]: 'tar://./ScenarioMIP/MPI-M/MPI-ESM1-2-LR/ssp370/r10i1p1f1/Amon/cct/gn/v20190710/cct_Amon_MPI-ESM1-2-LR_ssp370_r10i1p1f1_gn_201501-203412.nc::slk:///arch/ik1017/cmip6/CMIP6/ScenarioMIP_3964.tar'

In [5]: testcat['ScenarioMIP.MPI-ESM1-2-LR.ssp370.Amon.gn'].df.iloc[1]["uri"]
Out[5]: 'tar://./ScenarioMIP/MPI-M/MPI-ESM1-2-LR/ssp370/r10i1p1f1/Amon/cct/gn/v20190710/cct_Amon_MPI-ESM1-2-LR_ssp370_r10i1p1f1_gn_203501-205412.nc::slk:///arch/ik1017/cmip6/CMIP6/ScenarioMIP_3964.tar'

In [6]: testcat['ScenarioMIP.MPI-ESM1-2-LR.ssp370.Amon.gn'].df.iloc[2]["uri"]
Out[6]: 'tar://./ScenarioMIP/MPI-M/MPI-ESM1-2-LR/ssp370/r10i1p1f1/Amon/cct/gn/v20190710/cct_Amon_MPI-ESM1-2-LR_ssp370_r10i1p1f1_gn_205501-207412.nc::slk:///arch/ik1017/cmip6/CMIP6/ScenarioMIP_3964.tar'
```

In addition, and this is certainly something to fix upstream: a local tar file also cannot be opened with intake-esm, independent of …

```python
subset.df["uri"] = subset.df["uri"].replace("slk://", "file:///scratch/<path>/<to>/<SLK-CACHE>")
subset.to_dataset_dict()
```

Error message:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/.conda/envs/slkspec_dev/lib/python3.10/site-packages/intake_esm/source.py:240, in ESMDataSource._open_dataset(self)
    220 datasets = [
    221     _open_dataset(
    222         record[self.path_column_name],
    (...)
    237     for _, record in self.df.iterrows()
    238 ]
--> 240 datasets = dask.compute(*datasets)
    241 if len(datasets) == 1:
```

Maybe you can come up with a minimal example and raise an issue upstream. After that is fixed, we can see what we are still missing here.
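The "every catalog entry is opened independently" behaviour can be illustrated with the standard library alone (no intake-esm, no slk; the member names are made up): opening N members of one tar through N independent opens parses the archive N times, while one shared open reads it once. This is the access pattern, not intake-esm's actual code path:

```python
import io
import tarfile

# Build a small in-memory tar with three members (hypothetical names).
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w") as tar:
    for name in ("a.nc", "b.nc", "c.nc"):
        payload = name.encode()
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
data = raw.getvalue()

# Pattern 1: one open per member, as when every catalog entry is resolved
# independently -- the archive bytes are parsed three separate times.
independent = []
for name in ("a.nc", "b.nc", "c.nc"):
    with tarfile.open(fileobj=io.BytesIO(data)) as tar:
        independent.append(tar.extractfile(name).read())

# Pattern 2: one shared open serving all members.
with tarfile.open(fileobj=io.BytesIO(data)) as tar:
    shared = [tar.extractfile(n).read() for n in ("a.nc", "b.nc", "c.nc")]

assert independent == shared  # same data, very different I/O cost
```

For a local file on disk this only wastes reads; for a tar behind a tape archive, each independent open can turn into a separate retrieval request.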
The queuing actually seems to work correctly now when patched with #18. Therefore, I am changing the title.
I am sorry, but the most recent version has changed the argument keywords, and I will eventually have to update the entire intake-esm workflow, which will be... annoying...

But how can that be a problem if it works when I apply your idea?
Sorry, maybe I wasn't clear. I meant that this is not working either and is just an option to test intake-esm for tar files. Or are you saying that this is working with your intake-esm version?
It is working. After the retrieval, when the tars exist, this works:

…

which is the same old outdated code but with the extra line

…
Interesting! It fails for me. Which version of …?
With the recent version, I end up with …

Annoying.
Great! Well, great that you can reproduce my issue; not so great that the feature we need here only worked in the past. Would you mind opening an issue at intake-esm and linking it here? Which version did work for you?
But extrapolating the release cadence of intake-esm, we can expect the next one in 2024 :(
I will open up an issue there! Nevertheless, we could also try to get rid of …
@wachsylon I went ahead and raised an issue upstream.
Is that really true? The GIL should take care of the thread lock. If the thread lock doesn't work properly, we'll have to find a better way of implementing it.
Yes, this is true. I think we should create additional tests for all these cases that don't yet work. As a first step it would be okay if those tests require data on Levante and can only be executed there.
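On the GIL point discussed above: the GIL serializes individual bytecode operations, but a check-then-act sequence ("is this tar already queued? if not, queue it") spans several operations, so duplicate retrievals are still possible without an explicit lock. A minimal sketch of the guarded pattern, with invented names and a placeholder instead of the real slk call:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

queued = set()
lock = threading.Lock()
retrievals = []  # placeholder log of actual retrievals triggered

def request(tar: str) -> None:
    # The membership check and the insert must happen under one lock,
    # otherwise two threads can both decide to retrieve the same tar.
    with lock:
        if tar in queued:
            return
        queued.add(tar)
    retrievals.append(tar)  # stands in for launching one slk retrieval

# 60 requests for only 2 unique tars, handled by 8 worker threads.
tars = ["A.tar", "A.tar", "B.tar"] * 20
with ThreadPoolExecutor(max_workers=8) as ex:
    list(ex.map(request, tars))

print(sorted(retrievals))  # exactly one retrieval per unique tar
```

A test along these lines could run anywhere; only the cases that need real archive data would be restricted to Levante.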
… calls 33 `slk retrieve`s, which call 33 `/sw/spack-levante/slk-3.3.67-jrygfs/lib/slk-cli-tools-3.3.67.jar retrieve` processes (66 processes in total) for 10 unique tars from the same directory on HSM. That can't be right.

Originally posted by @wachsylon in #3 (comment)