Performance degradation of repeated aggregations with distributed scheduler and client.persist() on a single machine #332
@martindurant I'm seeing the issue with repeated aggregations on the same dataset, both when the aggregation touches some of the data and when it touches all of it. It would be nice if someone could confirm; here are the commands I'm using (in the datashader/examples directory, after following the setup instructions).

With threaded scheduler +

With distributed scheduler +

Note that because both aggregations take so long in the distributed+persist configuration, overall runtime more than doubles.
@gbrener note that persist is non-blocking. If you are profiling, it might be worthwhile to call
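The exact call isn't quoted above, but given the "persisting/waiting" wording in the next comment, distributed's wait() is presumably what was meant. A minimal sketch of that pattern, with a hypothetical input file:

```python
from dask.distributed import Client, wait
import dask.dataframe as dd

client = Client()                      # local "distributed" scheduler
df = dd.read_parquet("census.parq")    # hypothetical input
df = client.persist(df)                # returns immediately; loading continues in the background
wait(df)                               # block until all partitions are in worker memory
```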
Thanks for the tip @mrocklin. I tried your suggestion and the issue remains. Here is more output comparing two runs with the distributed scheduler, one with persisting/waiting and one without.

With persisting/waiting:
Without persisting/waiting:
The odd thing (to me) is not necessarily that it takes more time to load (persist) the data initially, but that it takes more time to do aggregations after the data is persisted. Maybe I'm just not using the feature properly; here is the relevant code for anyone who's interested: https://github.com/bokeh/datashader/blob/dd_scheduler/examples/filetimes.py#L94-L98
This is mixing issues. The original issue above is load+persist+calculate being slower than load+calculate, even when there is enough memory to go around, which should be investigated. Aside from perhaps the time needed to allocate enough memory to hold the persisted values, they should be the same speed (and subsequent calculations on the same data should be much faster). The second issue is specific to the census/datashader pipeline, which almost certainly has to do with running out of physical memory, and it makes sense that persist makes a potentially big difference there. For reference, the in-memory data is over 5GB, but the aggregation takes up 2GB more (plus any intermediates) in the workers and is then copied to the client process. This is much bigger than it should be. I don't know what the peak memory during a pandas aggregation on the same data is; that would also be worth checking with memory_profiler.
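A minimal sketch of the memory_profiler check suggested here; the parquet path and column names are assumptions, not from the thread:

```python
import pandas as pd
from memory_profiler import memory_usage

def run_agg():
    df = pd.read_parquet("census.parq")                    # hypothetical ~5GB dataset
    return (df["easting"].min(), df["easting"].max(),
            df["northing"].min(), df["northing"].max())

# memory_usage samples RSS (in MiB) while run_agg executes.
peak = max(memory_usage((run_agg, (), {}), interval=0.1))
print("peak memory: %.0f MiB" % peak)
```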
@martindurant, I just updated the title. Hopefully that makes it clearer that this specific issue is about the second one you're referring to.
The aggregation should involve about 4MB of data, not 2GB, because it's just a 900x525 array of doubles. Something seems fishy.
Memory plot when running in pure-pandas mode (script extract below): the data size is ~5GB, but there are regular excursions to almost 8GB. Since there are four of these for each agg, I suspect they come from calculating max/min on the axes; snakeviz seems to confirm this. I note with interest the following:
I don't know what datashader has to say about null-handling, but it would be well worthwhile to determine the nulls only once up-front rather than pay the cost every time. Here is the script I used for pure-pandas:
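As a rough illustration of the "determine the nulls once up-front" idea (not the script from this comment; the parquet path and column names are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.read_parquet("census.parq")        # hypothetical ~5GB dataset
x = df["easting"].values                   # hypothetical column names
y = df["northing"].values

# Pay the NaN check once, up-front...
assert not np.isnan(x).any() and not np.isnan(y).any()

# ...then every subsequent bounds computation can use plain ndarray
# reductions, which skip pandas' per-call NaN-masking machinery.
x_range = (x.min(), x.max())
y_range = (y.min(), y.max())
```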
@martindurant, I appreciate the snippet. There is different behavior in datashader for handling dask dataframes versus pandas dataframes, and the benchmark code goes through the dask path here.

What are the specs of the machine you are using (OS, RAM, CPU)?
Well, the point is that the profiling is far, far easier to do, and it illustrates the important point: doing max/min on a pandas column comes at a memory cost. When we persist and then do max/min in dask, we pay that memory cost at the same time as some additional overhead to move data between workers and copy some of it to the client. Given that it takes 10s to do max/min four times on the whole pandas dataframe, and >>10s for either complete aggregation, I'd say ~4s for the dask aggregations is very reasonable indeed. I'm saying that instead of trying to tweak dask's memory usage for this specific data size and amount of physical memory, we should first look for the easy win, either by assuming a policy around NaNs or by using custom compiled code, something like the numba function above. Combining both max/min runs into the same function, I get ~500ms runtime, instead of 10s for pure pandas. You'll not get such a speedup from dask.
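A sketch of the kind of single-pass compiled min/max being described (not the actual function from the thread; names and column handling are illustrative):

```python
import numpy as np
from numba import njit

@njit(nogil=True)
def bounds_1pass(xs, ys):
    """Min/max of two coordinate arrays in one scan (NaNs compare False, so they are skipped)."""
    xmin, ymin = np.inf, np.inf
    xmax, ymax = -np.inf, -np.inf
    for i in range(xs.shape[0]):
        x = xs[i]
        y = ys[i]
        if x < xmin: xmin = x
        if x > xmax: xmax = x
        if y < ymin: ymin = y
        if y > ymax: ymax = y
    return xmin, xmax, ymin, ymax

# Usage (column names are assumptions):
# xmin, xmax, ymin, ymax = bounds_1pass(df["easting"].values, df["northing"].values)
```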
@martindurant it sounds like we're in agreement that the additional overhead after persist() is the part that needs explaining.
The extra time is due to running out of memory when forcing all data to be held in processes, and then some additional memory usage due to copying of intermediates and aggregate products. Some simple guidelines for dask:

Use persist when:
@martindurant, thanks for all that! I've just discussed some of this briefly with Greg, and I think there are several issues raised here that he'll be splitting out separately. For what it's worth, we do not need any special handling for NaNs; the original point data we are dealing with should never contain NaNs in the first place, so we can simply assert that there are none and use whatever function is fastest without NaN handling. That said, I'm still not getting why there is a non-trivial memory cost to doing max/min on a pandas column, so there is clearly something I'm missing. Why would computing the max involve anything other than scanning all the values and keeping the largest? Even if it's threaded, it seems like it shouldn't need more than one partial result per thread (e.g. keeping four partial maxes and comparing them at the end), so why would the memory required be in the gigabytes?
@martindurant, thank you for the clarifications. Can you please provide a working example showing the numba approach you're describing?
"Why would computing max involve anything other than scanning all the values and keeping the largest?" - honestly, I don't know what pandas can be doing during this time. From experience, pandas copies data at the most innocuous and inconvenient commands. btw: one obvious way to avoid the cost of calculating bounds is to provide them up-front. Then the benchmark if of something different, of course, but it would seem reasonable in the proposed situation where the bounds are the same for both aggs. |
Also, I think we probably have the answer for why cachey was doing so much better: it was caching not just the dataset itself, but the max/min values, for instant recall.
I think so. We're just calling the column min/max directly each time.
If that's true, then we can surely do something about that! But I'm not sure why that would have varied so much between different ways of loading the data, which was the original finding with cachey; surely the min and max would be preserved regardless of what else might have been cached.
Care to try this one? https://gist.github.com/martindurant/b23dcb8eb205e769707c1f58c46d13e0
...of course this isn't very flexible, but I thought you might appreciate it, @gbrener. This might give you some hints. (btw: not sure I have my x and y axes the right way around, but you know what I mean.) (Add nogil to the numba decorator, and it's equally fast with threads instead of processes.)
Thanks for the example @martindurant. I closed #336 (the min/max NaN-checking issue) after adding some speed improvements to the bounds-calculation code in datashader (#344), including
Yes, to use numba on the dask side, you would need to write code to combine the results for each partition. This is exactly what already happens for the actual aggregation, so similar code is there somewhere. |
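A rough sketch of what that per-partition combine might look like, reusing a numba kernel like the one above (this is not datashader's actual implementation; helper names are made up):

```python
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
from numba import njit

@njit(nogil=True)
def _minmax(arr):
    lo, hi = np.inf, -np.inf
    for v in arr:
        if v < lo: lo = v
        if v > hi: hi = v
    return lo, hi

def _partition_minmax(series):
    # Runs on one pandas partition (inside a worker when distributed).
    return _minmax(series.values)

def column_bounds(ddf, col):
    """(min, max) of one dask dataframe column via the numba kernel."""
    parts = ddf[col].to_delayed()            # one delayed pandas Series per partition
    pairs = dask.compute(*[dask.delayed(_partition_minmax)(p) for p in parts])
    los, his = zip(*pairs)
    return min(los), max(his)                # cheap final combine on the client

# Example on a small in-memory frame:
ddf = dd.from_pandas(pd.DataFrame({"x": np.random.rand(1000)}), npartitions=4)
print(column_bounds(ddf, "x"))
```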
The following code illustrates a performance issue with dask's distributed scheduler when combined with the client.persist(df) functionality on a single machine:

When we insert the following line before agg = cvs.points(dask_df, ...):

the runtime performance of cvs.points() gets much worse than when persist was not used.
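For context, a hedged sketch of the pattern the issue describes (not the actual filetimes.py benchmark; the file path and column names are assumptions):

```python
from dask.distributed import Client, wait
import dask.dataframe as dd
import datashader as ds

client = Client()                                   # distributed scheduler on a single machine
dask_df = dd.read_parquet("census.parq")            # hypothetical input

# The inserted line in question: pin the data in worker memory first.
dask_df = client.persist(dask_df)
wait(dask_df)

cvs = ds.Canvas(plot_width=900, plot_height=525)
agg = cvs.points(dask_df, "easting", "northing")    # reported to be much slower with persist than without
```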