Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeDecodeError when run multithreaded #589

Closed
djhoese opened this issue Apr 1, 2020 · 15 comments
Closed

UnicodeDecodeError when run multithreaded #589

djhoese opened this issue Apr 1, 2020 · 15 comments
Labels

Comments

@djhoese
Copy link
Contributor

djhoese commented Apr 1, 2020

Code Sample, a copy-pastable example if possible

# TODO: working on it

Problem description

This is something that's been noticed in Satpy specifically and is being tracked here: pytroll/satpy#1114

The bottom line is that a couple of our users have been getting UnicodeDecodeErrors or errors about bad proj definitions. The really annoying bit is that is seems to be some sort of race condition or other multi-threading related issue. We are using xarray+dask and have a pyproj CRS object in the .coords of our DataArrays. We get errors like:

    return [_execute_task(a, cache) for a in arg]
  File "/work/geo2grid/lib/python3.7/site-packages/dask/core.py", line 122, in _execute_task
    elif arg in cache:
  File "/work/geo2grid/lib/python3.7/site-packages/pyproj/crs/crs.py", line 869, in __hash__
    return hash(self.to_wkt())
  File "pyproj/_crs.pyx", line 451, in pyproj._crs.Base.to_wkt
  File "pyproj/_crs.pyx", line 120, in pyproj._crs._to_wkt
  File "pyproj/_crs.pyx", line 24, in pyproj._crs.cstrdecode
  File "/work/geo2grid/lib/python3.7/site-packages/pyproj/compat.py", line 21, in pystrdecode
    return cstr.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte
Command exited with non-zero status 1

Or:

  File "C:\ProgramData\Miniconda3\lib\site-packages\pyresample\geometry.py", line 1012, in invproj
    target_proj = Proj(proj_dict)
  File "C:\ProgramData\Miniconda3\lib\site-packages\pyresample\_spatial_mp.py", line 121, in __init__
    **kwargs)
  File "C:\ProgramData\Miniconda3\lib\site-packages\pyproj\proj.py", line 171, in __init__
    super().__init__(cstrencode(projstring.strip()))
  File "pyproj/_proj.pyx", line 30, in pyproj._proj.Proj.__init__
pyproj.exceptions.ProjError: Invalid projection b'C'.: (Internal Proj Error: proj_create: unrecognized format / unknown name)

And other times it will print out the invalid projection with characters mixed in where they shouldn't be. Like very clearly wrong changes where +proj=merc is changed to some odd unicode character in place of the p in proj.

I'm trying my best to reproduce this, but so far have been unsuccessful which is why I don't have a reproducible example yet. I've only ever noticed this in logs.

Expected Output

No error.

Environment Information

  • Output from: python -m pyproj -v
pyproj info:
    pyproj: 2.5.0
      PROJ: 6.3.0
  data dir: /data1/users/davidh/miniconda3/envs/geo2grid_dist/share/proj

System:
    python: 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48)  [GCC 7.3.0]
executable: /data1/users/davidh/miniconda3/envs/geo2grid_dist/bin/python
   machine: Linux-2.6.32-573.12.1.el6.x86_64-x86_64-with-centos-6.10-Final

Python deps:
       pip: 20.0.2
setuptools: 45.2.0.post20200209
    Cython: None

Specific conda-forge builds:

proj                      6.3.0                hc80f0dc_0    conda-forge
pyproj                    2.5.0            py37h8ff28aa_0    conda-forge

Installation method

  • conda, pip wheel, from source, etc...

Conda environment information (if you installed with conda):

I mentioned specific conda packages above, but we've seen this now on Ubuntu, Windows, and a CentOS 7 docker container running a conda-pack'd version of a conda-forge environment.

@djhoese djhoese added the bug label Apr 1, 2020
@snowman2
Copy link
Member

snowman2 commented Apr 2, 2020

My suspicion is that it has to do with sharing a pyproj.CRS class across threads.

Here is how it is recommended to use the pyproj.CRS (or other pyproj classes) class with multithreading:
https://pyproj4.github.io/pyproj/stable/advanced_examples.html#multithreading

Otherwise, you can get strange behavior..I ran into this earlier: #384

Are you by chance storing the CRS object itself on the xarray or dask object?

@djhoese
Copy link
Contributor Author

djhoese commented Apr 2, 2020

Yes. I mentioned in my wall of text above that we are putting the CRS in the .coords of our DataArrays. That said, coords shouldn't be passed around between threads should they? Do you have this problem with rioxarray putting the CRS in the .attrs?

@snowman2
Copy link
Member

snowman2 commented Apr 2, 2020

I mentioned in my wall of text above that we are putting the CRS in the .coords of our DataArrays. That said, coords shouldn't be passed around between threads should they?

I missed that. I don't know how dask handles that, but I would guess it does so it does not move as much memory around. If nothing else, I would guess it has multiple threads accessing the same data. That is probably the issue with all of the threads attempting to hash the CRS object at the same time.

Do you have this problem with rioxarray putting the CRS in the .attrs?

rioxarray stores the CRS information in the crs_wkt in the attrs of the coord variabe (helps persist the data). I haven't tested it multithreaded yet. But, you could give it a try and see if that helps.

@djhoese
Copy link
Contributor Author

djhoese commented Apr 2, 2020

Interesting. Satpy doesn't currently work with xarray Dataset objects and instead uses collections of DataArray objects. So we need all the CRS information in each DataArray. I assume that's what you meant by a coord variable (being placed in a Dataset). Or could you add it to x/y coordinate variables?

The thing about dask is that it should only be working with the underlying dask array underneath the DataArray. Another possibility I can see is that Satpy might be passing another type of object we use that has a CRS object as one of its instance attributes to a dask delayed function. This would require accessing the object from a dask worker thread.

@snowman2
Copy link
Member

snowman2 commented Apr 2, 2020

Here is an example:

>>> import xarray
>>> import rioxarray
>>> xds = xarray.DataArray(1)
>>> xds
<xarray.DataArray ()>
array(1)
>>> xds_crs = xds.rio.write_crs("EPSG:4326")
>>> xds_crs
<xarray.DataArray ()>
array(1)
Coordinates:
    spatial_ref  int64 0
Attributes:
    grid_mapping:  spatial_ref
>>> xds_crs.spatial_ref
<xarray.DataArray 'spatial_ref' ()>
array(0)
Coordinates:
    spatial_ref  int64 0
Attributes:
    spatial_ref:  GEOGCRS["WGS 84",DATUM["World Geodetic System 1984",ELLIPSO...
    crs_wkt:      GEOGCRS["WGS 84",DATUM["World Geodetic System 1984",ELLIPSO...

Got a little trigger happy. Updated ^^

@djhoese
Copy link
Contributor Author

djhoese commented Apr 2, 2020

Are the duplicate spatial_ref and crs_wkt attributes for backwards compatibility? Do you plan on only having crs_wkt in the future?

I looked through the code being run that produces these errors and we have three cases that I can see where a CRS object might cross thread boundaries:

  1. data_arr.coords['crs'] = pyproj.CRS(...). However, the CRS object is never directly passed to dask from what I can tell in any of the moments when we call it "manually". Xarray could be doing this for us though. Typically you have two ways of using dask arrays: either xarray handles passing operations down to the underlying dask array, or you create dask delayed/map_blocks functions that you pass your dask array to.
  2. data_arr.attrs['area'] = pyresample.geometry.AreaDefinition(..., pyproj.CRS(...), ...) where the CRS object is stored as an attribute in the AreaDefinition class.
  3. We might have cases where an AreaDefinition or Proj object is passed to a dask delayed/map_blocks function, but so far I haven't found any that are being called in the cases that run in to this error.

I suppose just to be safe I should switch to what you are doing in rioxarray for storing the coordinate and I should only store the wkt in our AreaDefinition objects.

Edit: Another case: We use rasterio CRS objects when saving geotiffs. Any idea if those are thread safe? I'm not sure we pass the CRS object between threads, but we do create a rasterio file object that is written to from multiple threads (but one at a time with a lock).

@snowman2
Copy link
Member

snowman2 commented Apr 2, 2020

Are the duplicate spatial_ref and crs_wkt attributes for backwards compatibility?

  • spatial_ref is because that is what GDAL looks for (was before crs_wkt was introduced and GDAL hasn't updated to use crs_wkt yet)
  • crs_wkt - official CF standard for storing the WKT

Do you plan on only having crs_wkt in the future?

Probably keep both for a while for backwards compatibility with GDAL.

We use rasterio CRS objects when saving geotiffs. Any idea if those are thread safe?

My guess is that it is similar to pyproj.CRS. The reason is because GDAL is using PROJ under the hood in GDAL 3+. But, you can give it a try.

@djhoese
Copy link
Contributor Author

djhoese commented Apr 2, 2020

Sounds good on the spatial_ref stuff.

For rasterio CRS, since we are creating a rasterio object, it must be handling it properly.

Also, I think I found the problem in our processing:

https://github.com/pytroll/pyresample/blob/6bf389fcb51ec8990ecc61bbe56a1dde45c37cf2/pyresample/geometry.py#L1880-L1886

We get the CRS object and pass it to dask's map_blocks function. This has been a difficult bug to produce at all so I might have to just hope it works.

If I were to implement the same strategy for spatial_ref in satpy and geoxarray, do you think that would be a good idea? Would you do anything differently if you could now?

@snowman2
Copy link
Member

snowman2 commented Apr 2, 2020

If I were to implement the same strategy for spatial_ref in satpy and geoxarray, do you think that would be a good idea?

I think that would be great 👍. The more consistency across libraries the better.

Would you do anything differently if you could now?

Nothing that I can think of at the moment.

@snowman2
Copy link
Member

snowman2 commented Apr 2, 2020

Also see: opendatacube/datacube-core#837

@djhoese
Copy link
Contributor Author

djhoese commented Apr 16, 2020

Sorry, I thought I closed this already. This was our fault for using a CRS object with a dask map_blocks function (passing CRS objects between threads).

@snowman2
Copy link
Member

@djhoese this may be useful for reference: geopandas/geopandas#1842

@snowman2
Copy link
Member

snowman2 commented Mar 6, 2021

Fix in #793 - I am unable to replicate your issue, but I think that it will fix this. If you are able to verify that it fixes the issue that would be helpful.

@djhoese
Copy link
Contributor Author

djhoese commented Mar 6, 2021

Even when this did happen, it was random and sometimes more common depending on the CPU and work load of the machine (it seemed). Our code no longer does what was being done to cause this issue (passing CRS objects between dask workers).

@snowman2
Copy link
Member

snowman2 commented Mar 6, 2021

That makes sense. No worries if you are unable to test it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants