Micro optimize dataset.isel for speed on large datasets #9003
base: main
Conversation
Force-pushed from 021ba45 to 9128c7c
I'm happy to add benchmarks for these if you think it would help. That said, I would love to leave that addition for future work; my time for playing with this kind of speedup is up for the week.
Thanks. Do you see any changes in our asv benchmarks? We'd be happy to take updates for those too :)
I didn't get to running those yet. The speedups here are more associated with:
I don't think this benchmark exists, at a quick glance. I could create one.
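For concreteness, a rough sketch of what such a benchmark could look like, following the usual asv setup/time_* convention (the class name and sizes below are hypothetical, not an existing xarray benchmark):

import numpy as np
import xarray as xr


class IselDatasetWithManyScalars:
    """Hypothetical asv benchmark: isel on a dataset with many 0-d variables."""

    def setup(self):
        # One long dimension plus many dimensionless "metadata" variables,
        # mimicking the datasets this PR is optimizing for.
        self.ds = xr.Dataset(
            {"data": ("x", np.random.randn(100_000))},
            coords={"x": np.arange(100_000)},
        )
        for i in range(80):
            self.ds[f"meta_{i:02d}"] = i

    def time_isel_integer(self):
        self.ds.isel(x=500)

    def time_isel_slice(self):
        self.ds.isel(x=slice(1_000, 2_000))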
xarray/core/dataset.py
# Fastpath, skip all of this for variables with no dimensions
# Keep the result cached for future dictionary update
elif var_dims := var.dims:
Suggested change:
-# Fastpath, skip all of this for variables with no dimensions
-# Keep the result cached for future dictionary update
-elif var_dims := var.dims:
+elif var.ndim == 0:
+    continue
+else:
Does this work?
No wait, I spoke too soon. I had a typo. Oddly, it is slower...
diff --git a/xarray/core/dataset.py b/xarray/core/dataset.py
index ec756176..4e8c31e5 100644
--- a/xarray/core/dataset.py
+++ b/xarray/core/dataset.py
@@ -2987,22 +2987,20 @@ class Dataset(
if name in index_variables:
var = index_variables[name]
dims.update(zip(var.dims, var.shape))
- # Fastpath, skip all of this for variables with no dimensions
- # Keep the result cached for future dictionary update
- elif var_dims := var.dims:
+ elif var.ndim == 0:
+ continue
+ else:
# Large datasets with alot of metadata may have many scalars
# without any relevant dimensions for slicing.
# Pick those out quickly and avoid paying the cost below
# of resolving the var_indexers variables
- if var_indexer_keys := all_keys.intersection(var_dims):
+ if var_indexer_keys := all_keys.intersection(var.dims):
var_indexers = {k: indexers[k] for k in var_indexer_keys}
var = var.isel(var_indexers)
if drop and var.ndim == 0 and name in coord_names:
coord_names.remove(name)
continue
- # Update our reference to `var_dims` after the call to isel
- var_dims = var.dims
- dims.update(zip(var_dims, var.shape))
+ dims.update(zip(var.dims, var.shape))
variables[name] = var
return self._construct_direct(
was slower... This is somewhat unexpected; ndim should be "instant".
Let me add a benchmark tonight to "show" explicitly that this is the better way; otherwise it will be too easy to undo.
My conclusion is that:
* len(tuple) seems to be pretty fast.
* But the .shape attribute is only resolved after 4-5 different Python indirections, going down through a LazilyIndexedArray, MemoryCachedArray, H5BackedArray (sorry, I'm not getting the class names right); ultimately it isn't "readily available" and needs to be resolved.
My little heuristic test, with my dataset (93 variables):
In [16]: %%timeit
...: for v in dataset._variables.values():
...: v.ndim
...:
119 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [17]: %%timeit
...: for v in dataset._variables.values():
...: v.shape
...:
105 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [18]: %%timeit
...: for v in dataset._variables.values():
...: v.dims
...:
7.66 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [19]: %%timeit
...: for v in dataset._variables.values():
...: v._dims
...:
3.1 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [20]: len(dataset._variables)
93
I mean, micro-optimizations are sometimes dumb. That is why I've been breaking them out into distinct ideas as I find them, but they can add up, especially when taken together.
So, in other words, my hypothesis is that the use of _dims is really helpful because it avoids the many indirections behind shape, since dims is a "cached" version of the shape (where every number is replaced with a string).
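As a rough illustration of that hypothesis, a toy example (the wrapper classes below are stand-ins, not xarray's actual internals) showing why an attribute stored directly on the object beats one resolved through a chain of lazy wrappers:

import timeit

# Toy stand-ins for a chain of lazy/cached array wrappers; the real xarray
# classes differ, but the point is the number of attribute hops per access.
class BackedArray:
    shape = (10_000,)

class CachedArray:
    def __init__(self, inner):
        self.inner = inner

    @property
    def shape(self):
        return self.inner.shape

class LazyArray:
    def __init__(self, inner):
        self.inner = inner

    @property
    def shape(self):
        return self.inner.shape

class ToyVariable:
    def __init__(self):
        self._data = LazyArray(CachedArray(BackedArray()))
        self._dims = ("x",)  # stored directly, like Variable._dims

    @property
    def dims(self):
        return self._dims

    @property
    def shape(self):
        # Each access walks the whole wrapper chain.
        return self._data.shape

v = ToyVariable()
print(timeit.timeit(lambda: v.shape, number=100_000))  # several property hops
print(timeit.timeit(lambda: v.dims, number=100_000))   # one property, one attribute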
len(v.dims) or len(v._dims) sounds OK to me. They're both readily understandable.
Just so I better understand the xarray style: the truthiness of tuples is not obvious enough, while len(tuple) is more obviously associated with a true/false statement?
Would a comment be OK if len(tuple) hurts performance?
It's not about style, but about readability and understandability.
I've read this snippet about 6 times now, but I still have to look at it closely to see what it does. The perf improvement is also sensitive to the order of iteration over variables (what if you alternated between 0D and 1D variables as you iterated through?).
This is why I'd prefer an explicit check for scalar variable. It's easy to see and reason about the special-case.
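For reference, a small sketch of the equivalent spellings of that scalar check on plain xarray Variables, to make the trade-off concrete:

import numpy as np
import xarray as xr

scalar = xr.Variable((), 1.0)               # 0-d "metadata" variable
vector = xr.Variable(("x",), np.arange(3))  # 1-d variable

for var in (scalar, vector):
    by_truthiness = not var.dims     # what the walrus version relies on
    by_len = len(var.dims) == 0      # explicit length check
    by_ndim = var.ndim == 0          # explicit scalar check preferred above
    assert by_truthiness == by_len == by_ndim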
what if you alternated between 0D and 1D variables as you iterated through?
You know, this is something I've thought about a lot.
I'm generally not too happy with this optimization.
This is why I'd prefer an explicit check for scalar variable. It's easy to see and reason about the special-case.
OK, understood. The challenge is that this PR doesn't do much on my benchmarks without #9002, and my current theory is that we are limited by calls to Python methods, so I feel like even len(tuple) will slow things down.
I'll try again, but if it's OK, I'm going to rebase onto #9002 until a resolution is found for those optimizations.
Force-pushed from ef95538 to 83b0599
On main:
On this branch:
On the combined #9002 + this branch:
Force-pushed from 5a7c903 to 2592076
I want to make sure I have a good reason for choosing the syntax I did. So, without taking dcherian's suggestions:
With the
Which seems to be the same, if not better.
Force-pushed from fadd876 to e865e4f
So, final numbers again for myself on my internal benchmark (I know it isn't of much use to others, but there are the other benchmarks for that).
A 42% increase in throughput. I'll take it!
Force-pushed from e865e4f to 314c72b
@hmaarrfk does the latest version see the same speedups?
Rerunning the benchmarks. Are the test failures due to this PR???
The improvement is not as significant, but I'm unsure if it is related to the test failures:
Force-pushed from 84ef547 to e0fb6ef
Sorry, I force-pushed. Let me run the benchmarks again; my computer is currently in use, so I'll run them a little bit later.
Force-pushed from e0fb6ef to f7945a3
This reverts commit f7945a3.
So on my personal benchmark:
Trying to rerun the benchmark with and without cleanups. I definitely see an improvement on my dataset (and I'm willing to "donate" it), but I just can't recreate the significant improvements that I saw on the benchmarking suite. Maybe a dependency changed in terms of performance, which is now hiding the effects? Nothing pops out in the only change I can find to the indexing benchmark: #9013
I suspect that caching
Can you upload a representative dataset somewhere? Or provide code to generate one, please?
@dcherian could you link the PR you are referring to? I would be very much interested in potential h5netcdf performance improvements (see my comments in h5netcdf/h5netcdf#195 and h5netcdf/h5netcdf#221 and Deltares/dfm_tools#484). There are many related issues and PRs open, and I have lost track of the potential causes and fixes. Are you saying that this PR is redundant after another PR was already merged, or is there still potential?
I think the problem is that I never got around to uploading my sample dataset. Do you have a sample one we can add to the benchmarks? If so, that might help speed things up.
I am afraid not; the datasets where I notice performance issues (time or memory wise) are all approx 3 GB. Most of the smaller test datasets I use do not show these issues. What I noticed is that the performance issues mainly happen for datasets with many variables, since these cause a significant increase in accesses of sizes/dims/properties of variables. So in that light, I have tested with this code (based on an xarray test):

import numpy as np
from xarray import Dataset
import xarray as xr

nx = 80
ny = 50
ntime = 100
file_nc = f"test_x{nx}_y{ny}_time{ntime}.nc"

write_dataset = False
if write_dataset:
    ds = Dataset(
        dict(
            z1=(["y", "x"], np.random.randn(ny, nx)),
            z2=(["time", "y"], np.random.randn(ntime, ny)),
            z3=(["y", "x", "time"], np.random.randn(ny, nx, ntime)),
        ),
        dict(
            x=("x", np.linspace(0, 1.0, nx)),
            time=("time", np.linspace(0, 1.0, ntime)),
            y=range(ny),
        ),
    )
    varnum = 1500
    for i in range(varnum):
        ds[f"var_{i:04d}"] = ds["z3"]
    ds.to_netcdf(file_nc)

print("opening file")
ds_in = xr.open_dataset(file_nc, engine="h5netcdf")

I ran this code with
Results in these timings/usage for
And these timings/usage for
This shows similar behaviour to what I documented in Deltares/dfm_tools#484, but simplified: h5netcdf consumes less memory but is significantly slower. I cannot judge if this test data and/or results add value to this PR though.
Let's try to keep this conversation focused on the performance of isel. Your benchmarks are interesting, but unrelated to this PR.
@dcherian please find a zip containing a file called
On the tag 2024.09.0 I get the following performance (
On this branch
On my threadripper
It has:
I acknowledge that I am the owner of this data and I give you permission (@dcherian) to use it how you see fit.
Now that the legalese is out of the way, let me know what you want me to do to get it integrated.
This targets optimization for datasets with many "scalar" variables (that is, variables without any dimensions). This can happen in the context where you have many pieces of small metadata that relate to various facts about an experimental condition.
For example, we have about 80 of these in our datasets (and I want to increase this number).
Our datasets are quite large (on the order of 1 TB uncompressed), so we often have one dimension that is in the tens of thousands.
However, it has become quite slow to index into the dataset.
We therefore often "carefully slice out the metadata we need" prior to doing anything with our dataset, but that isn't quite possible when you want to orchestrate things with a parent application.
These optimizations are likely "minor", but considering the results of the benchmark, I think they are quite worthwhile:
* main (as of #9001): 2.5k its/s
* With #9002: 4.2k its/s
* With this pull request (on top of #9002): 6.1k its/s
Thanks for considering.
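As a toy reproduction of the workload described above (sizes reduced here; the real datasets are far larger), the hot path is repeated isel calls on a dataset carrying many scalar variables:

import numpy as np
import xarray as xr

n = 50_000
ds = xr.Dataset(
    {"signal": ("time", np.random.randn(n))},
    coords={"time": np.arange(n)},
)
for i in range(80):  # many dimensionless metadata variables
    ds[f"meta_{i:02d}"] = float(i)

# Repeated indexing: the per-call cost of inspecting every scalar
# variable's dims/shape is what this PR trims.
for idx in range(1_000):
    _ = ds.isel(time=idx)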
whats-new.rst
api.rst
xref: #2799
xref: #7045