Make small AuxCoords (including scalar) non-lazy, for efficiency #5069

Closed
wants to merge 1 commit into from

pp-mo
Member

@pp-mo pp-mo commented Nov 15, 2022

See #5053.
The tiny fix here seems to show that the slowness is largely due to creating lots of tiny lazy coords (for scalar coords).
It greatly speeds up the netCDF-load testcase: from ~150 s to ~4 s.
(The testcase is a file with ~200 variables, all of which have 2 scalar coords.)

Here we make any sufficiently small AuxCoord real: we fetch the variable data immediately when creating the coord, instead of giving it a "dask.Array wrapping a NetCDFDataProxy referring to a file variable".

In practice, if the size threshold is set right, this approach should save memory too.
But it's not trivial to determine what the typical minimal overhead of a "dask.Array wrapping a NetCDFDataProxy" actually is.
At least the NetCDFDataProxy object is pretty small and simple, containing some numbers and a couple of strings. The Dask array object is probably more costly.
TBD
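
As a rough illustration of the size-threshold idea described above, here is a minimal sketch using only the standard library. It is not the code in this PR: `FakeFileVariable`, `DeferredData`, `coord_points` and the value of `SMALL_COORD_MAX_POINTS` are all hypothetical stand-ins (for a netCDF variable, a lazy dask array, the coord-building step and the threshold, respectively).

```python
from math import prod

# Hypothetical threshold, in array elements. The PR text does not state a value.
SMALL_COORD_MAX_POINTS = 100


class FakeFileVariable:
    """Stand-in for a file variable; counts how often its data is read."""

    def __init__(self, shape, values):
        self.shape = shape
        self._values = values
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._values)


class DeferredData:
    """Stand-in for a lazy array: keeps a reference and reads only on demand."""

    def __init__(self, variable):
        self.variable = variable

    def compute(self):
        return self.variable.read()


def coord_points(variable, max_points=SMALL_COORD_MAX_POINTS):
    """Return real data for small variables, a deferred wrapper otherwise."""
    # prod(()) == 1, so a true scalar (shape == ()) counts as one point.
    if prod(variable.shape) <= max_points:
        return variable.read()
    return DeferredData(variable)


scalar_var = FakeFileVariable(shape=(1,), values=[273.15])
big_var = FakeFileVariable(shape=(1000,), values=range(1000))

small = coord_points(scalar_var)  # fetched immediately
large = coord_points(big_var)     # still deferred, nothing read yet
```

The trade-off is one extra file read per small coord at load time, in exchange for not building a lazy wrapper object for each of them.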

@pp-mo pp-mo changed the title Trial solution. Make small AuxCoords non-lazy, for efficiency Nov 15, 2022
@pp-mo
Member Author

pp-mo commented Nov 15, 2022

The test results are encouraging : nothing unexpected.

@pp-mo
Member Author

pp-mo commented Nov 30, 2022

At present, a NetCDFDataProxy has __slots__ = ("shape", "dtype", "path", "variable_name", "fill_value").
So, if we're interested in the smallest cases (like scalar coords), then using "sizeof" on simple python objects gives, typically ...

  • shape : (1,) ~32
  • dtype('f8') : ~96
  • path : str('X' * 40) : ~89
  • variable_name : str('X' * 10) : ~59
  • fill_value: 1.e30 : ~24

Plus "sizeof" of the NetCDFDataProxy object itself is ~56.
So, that's about 350 bytes in total.
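
The tally above can be reproduced roughly with a stand-in class, as sketched below. This assumes the figures came from `__sizeof__()` (note that `sys.getsizeof` also adds garbage-collector overhead for GC-tracked objects, so it can report larger numbers). `MiniProxy` is a hypothetical class with the same five slots, not the real proxy, and a plain string stands in for the numpy dtype, so that component will not match the ~96 quoted above. Exact values depend on the Python build.

```python
class MiniProxy:
    """Hypothetical stand-in with the same five slots as the data proxy."""

    __slots__ = ("shape", "dtype", "path", "variable_name", "fill_value")

    def __init__(self, shape, dtype, path, variable_name, fill_value):
        self.shape = shape
        self.dtype = dtype
        self.path = path
        self.variable_name = variable_name
        self.fill_value = fill_value


def shallow_size(proxy):
    """Sum __sizeof__() over the object and each directly referenced attribute."""
    own = proxy.__sizeof__()
    parts = {name: getattr(proxy, name).__sizeof__() for name in proxy.__slots__}
    return own, parts, own + sum(parts.values())


proxy = MiniProxy(
    shape=(1,),
    dtype="f8",  # assumption: a string stand-in, not a real numpy dtype
    path="X" * 40,
    variable_name="X" * 10,
    fill_value=1.0e30,
)
own, parts, total = shallow_size(proxy)
for name, size in parts.items():
    print(f"{name}: {size}")
print(f"object itself: {own}, total: {total}")
```

This counts each referenced object once and in full, which is exactly the kind of judgement call that makes such totals hard to interpret.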

However, it seems generally agreed that the question of how much memory a Python object "costs" has no definite, straightforward answer.
Not least because, if other objects are referenced, you need to judge whether they are unique to this structure, may be shared elsewhere, or should count as separate items in their own right.

Also, look at this ...

    >>> import numpy as np
    >>> miniarr = np.array([], dtype=np.float64)
    >>> miniarr.__sizeof__()
    112
    >>> miniarr.data.__sizeof__()
    168

So, it looks like anything beyond a Python primitive object is generally rather hard to assess for memory consumption.

So Instead

Using very simplistic measures, based on "tracemalloc", as explained in #4883

    >>> import dask.array as da
    >>> import numpy as np
    >>> # "mm" is presumably the tracemalloc-based measuring object described in #4883.
    >>> arrs = []
    >>> with mm.context():
    ...     for n in range(2000):
    ...         mindp = NetCDFDataProxy(shape=(1,), dtype=np.dtype('f8'), path=' ' * 40, variable_name=' ' * 10, fill_value=1.e30)
    ...         minarr = da.from_array(mindp, chunks=-1, meta=np.ndarray)
    ...         arrs.append(minarr)
    ...
    >>> mm.memory_mb * 2.**20 / 2000
    4883.27

This example is intended to mimic an absolutely minimal (1-point) lazy array, like that within a scalar coordinate.

The above "measures" 2000 of them, giving an average of ~4883.3 bytes each.
Likewise, runs creating 100 and 1 of them gave measured (average) sizes of 5144.45 and 5496.0 bytes.
So, 4800 - 5500 bytes seems like a reasonable minimum size ??

It seems a lot, doesn't it ?!?
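
For anyone wanting to try the same kind of averaged measurement without the helper from #4883, here is a minimal sketch using the standard-library `tracemalloc` module directly. The name `average_alloc_bytes` is hypothetical, and the `bytearray` factory is only a placeholder: to repeat the measurement above, the factory would instead build the proxy and wrap it with `da.from_array`.

```python
import tracemalloc


def average_alloc_bytes(factory, count):
    """Average traced memory growth per object, over `count` calls of `factory`."""
    kept = []  # hold references so nothing is freed while measuring
    tracemalloc.start()
    try:
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(count):
            kept.append(factory())
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return (after - before) / count


# Placeholder factory: each call allocates one small, distinct object.
per_item = average_alloc_bytes(lambda: bytearray(100), 2000)
print(f"average bytes per object: {per_item:.1f}")
```

The result includes the growth of the holding list as well as the objects themselves, so, as with the figures above, it is better read as an order-of-magnitude estimate than an exact size.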

@pp-mo pp-mo changed the title Make small AuxCoords non-lazy, for efficiency Make small AuxCoords (including scalar) non-lazy, for efficiency Feb 8, 2023
@ESadek-MO ESadek-MO self-assigned this Mar 23, 2023
@pp-mo
Member Author

pp-mo commented Apr 3, 2023

Replaced by #5229

@pp-mo pp-mo closed this Apr 3, 2023
@pp-mo pp-mo mentioned this pull request Apr 14, 2023
@pp-mo pp-mo deleted the nondask_aux branch April 28, 2023 13:51