load_ard() task graph creation limited by GeoBox/CRS hash function performance #1230
Comments
Related: pyproj4/pyproj#782
@csiro-dave thank you for this detailed analysis. I'm currently working on extracting that part of datacube into a separate library (https://github.com/opendatacube/odc-geo). There has been some significant clean-up and refactoring in there, so I will create an issue in that repo and address it there first. We are already caching the pyproj CRS object as well as the EPSG code, so adding the WKT to that cache should be relatively straightforward:

datacube-core/datacube/utils/geometry/_base.py, lines 119 to 122 in ac9a466
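For illustration only (this is not the contents of `_base.py`; the function name and return shape are assumptions), caching the WKT alongside the already-cached pyproj CRS object and EPSG code could look roughly like this:

```python
from functools import lru_cache
from pyproj import CRS as _CRS

@lru_cache(maxsize=None)
def _make_crs(crs_str):
    # Hypothetical cached constructor: build the pyproj CRS once per distinct
    # CRS string and keep the EPSG code and WKT with it, so hashing and
    # equality checks never have to call back into PROJ.
    crs = _CRS.from_user_input(crs_str)
    return crs, crs.to_epsg(), crs.to_wkt()
```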
Another quick-to-implement option is to modify the existing hash:

datacube-core/datacube/utils/geometry/_base.py, lines 259 to 260 in ac9a466

If you can change the above to something like:

```python
def __hash__(self) -> int:
    if self._epsg is not None:
        return self._epsg
    return hash(self.to_wkt())
```

then rerun your tests and see if that helps.

@snowman2 thanks for the links, I didn't realize that CRS objects and transformers had threading constraints. We should probably switch to thread-local caches then.
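As a rough illustration of that idea (not datacube-core code; the names `_tls` and `cached_crs` are made up), a per-thread CRS cache could look something like this:

```python
import threading
from pyproj import CRS

_tls = threading.local()

def cached_crs(crs_str):
    # Keep one cache per thread so pyproj/PROJ objects are never shared
    # across threads, while repeated lookups within a thread stay cheap.
    cache = getattr(_tls, "crs_cache", None)
    if cache is None:
        cache = _tls.crs_cache = {}
    crs = cache.get(crs_str)
    if crs is None:
        crs = cache[crs_str] = CRS.from_user_input(crs_str)
    return crs
```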
EPSG codes are likely worse than WKT, see #1223. I recommend using the …
Thanks for the info, but we always extract them at construction time and they are cached, unlike the WKT. In the Dask context there is really only one CRS object, so this can be solved very well with caching. I don't think scenarios with a large number of different CRS objects are common, so we should try to improve caching first and then look at the pyproj context.
They shouldn't have issues now: pyproj4/pyproj#782. It should create a new CRS object for each thread for you.
As in, one global …?
Correct: pyproj4/pyproj#793. EDIT: requires pyproj 3.1+
Thank you snowman2 and Kirill888! It looks like pyproj.set_use_global_context(True) makes a huge difference. There is an overhead to re-creating the database context at the PROJ level. At the pyproj level, the thread protections are perhaps a little aggressive if they need to create a new database context for every method call on an object. The example above that took 0.5 seconds is now 0.001 seconds, and the 16-second persist call now takes 0.7 seconds. Hooray!

```python
import pyproj
from pyproj import CRS
from pyproj.enums import WktVersion
import timeit

pyproj.set_use_global_context(True)

crs = CRS.from_epsg(4326)
timeit.timeit(lambda: crs.to_wkt(), number=100)
```

Some thoughts on the hash/equals of the CRS. I don't think GeoBox needs to include the CRS in its hash function. If it has the same shape and transform, it is most likely the same GeoBox; in the rare case they are different, it will just call equals a few extra times. So it might make sense to change

```python
def __hash__(self):
    return hash((*self._shape, self._crs, self._affine))
```

to

```python
def __hash__(self):
    return hash((*self._shape, self._affine))
```

I don't currently have an environment set up to test the datacube source changes, but they look like they could work too.
Regarding the hash change on the GeoBox class: good point, but maybe replace …
Thanks Alan, useful to know.
Thanks Kirill888, snowman2
Expected behaviour
When I run load_ard() on a large region (the Murray Darling Basin) using distributed dask, I would like to be able to create up to 500,000 tasks for my workflow. That is, fully utilise the dask scheduler’s capacity before I parallelise the problem across multiple dask clusters/tiles/time slices etc.
Actual behaviour
On my simplified problem, using dask chunking (1x10240x10240), building the task graph (6144 tasks) on the distributed dask cluster takes around 16 seconds. On a real problem, I don't have the patience to wait for it to finish.
As suggested in the dask manual, I ran %prun over the persist call to see what is consuming the resources; a sketch of that step is shown below. The step taking most of the time is the cull step of task-graph optimisation. Culling identifies any tasks that don't need to be executed and removes them from the task graph, and while culling, dask stores the tasks it identifies in a set. Python sets rely on object hashes to efficiently work out which objects are in the used set and which are not. However, in the case of the simplified problem, calculating the hash of the CRS consumes most of the processing time (96%).
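This is roughly what the profiling step looks like (a sketch only; `data` stands in for the lazy, dask-backed result of the load_ard() call and is not defined here):

```python
import cProfile

# `data` is assumed to be the lazy xarray.Dataset returned by load_ard()
# with dask chunking; in a notebook the equivalent is simply:
#   %prun data.persist()
cProfile.run("data.persist()", sort="cumulative")
```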
Why are CRS hashes part of the task graph? As children of the task dc_load_fmask-to_float, two lazy functions (datacube.api.core.fuse_lazy) take a GeoBox as an argument. The GeoBox is a composite object that includes a CRS, so hashing the GeoBox requires hashing the CRS.
Why is CRS hashing slow? The Open Data Cube CRS is a wrapper around the pyproj CRS, which in turn wraps PROJ. The ODC CRS treats a CRS as an immutable object defined by the value of its contents, and when calculating the hash it uses the hash of the WKT projection text. The underlying PROJ library can produce that text very quickly, but the pyproj call is much slower. I think it might be due to the overhead of transferring data from C++ to Python, or it could be the creation of a new proj_context for each new CRS object (to handle multi-threading issues).
The implementation seems well motivated, so I have not been able to come up with a clean solution. Perhaps, during the step that generates the many GeoBox objects, the CRS should be eagerly converted to WKT for the purposes of hashing (see the sketch below).
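Purely as an illustration of that idea (the class and attribute names here are made up, not the datacube-core internals):

```python
from affine import Affine
from pyproj import CRS

class GeoBoxLike:
    """Sketch of a GeoBox-like object that captures the WKT eagerly."""

    def __init__(self, shape, affine, crs):
        self._shape = shape
        self._affine = affine
        self._crs = crs
        # Computed once up front, so hashing never calls back into pyproj/PROJ.
        self._crs_wkt = crs.to_wkt()

    def __hash__(self):
        return hash((*self._shape, self._affine, self._crs_wkt))

# Example: repeated hashing only touches the pre-computed string.
gbox = GeoBoxLike((10240, 10240), Affine.identity(), CRS.from_epsg(3577))
hash(gbox)
```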
Steps to reproduce the behaviour
I have attached a Jupyter notebook.
geobox_hash_performance.zip
A collection of the major lines is here:
A small snippet to reproduce the pyproj performance (takes about 0.5 seconds):
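The code block itself did not survive on this page; based on the timing example later in the thread, it was presumably along these lines (a reconstruction, not the original snippet):

```python
from pyproj import CRS
import timeit

crs = CRS.from_epsg(4326)
# Each to_wkt() call goes through pyproj/PROJ; without a shared context
# 100 calls take on the order of 0.5 seconds.
timeit.timeit(lambda: crs.to_wkt(), number=100)
```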
A small snippet to eliminate the proj performance (takes 6 milliseconds):
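This code block is also missing from the page. One way to take the repeated pyproj/PROJ call out of the measurement (an assumption about what the original snippet showed, not a reconstruction of it) is to hash a pre-computed WKT string:

```python
from pyproj import CRS
import timeit

wkt = CRS.from_epsg(4326).to_wkt()  # computed once, outside the timed loop
# Hashing the already-computed WKT string avoids calling into PROJ at all.
timeit.timeit(lambda: hash(wkt), number=100)
```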
Environment information
Open Data Cube core, version 1.8.6
pyproj info:
pyproj: 3.2.1
PROJ: 7.2.1
data dir: /usr/local/share/proj
user_data_dir: /home/jovyan/.local/share/proj
System:
python: 3.8.10 (default, Jun 2 2021, 10:49:15) [GCC 9.4.0]
executable: /env/bin/python
machine: Linux-4.14.256-197.484.amzn2.x86_64-x86_64-with-glibc2.29
Python deps:
certifi: 2021.10.08
pip: 21.3.1
setuptools: 58.5.3
Cython: 0.29.24