performance notes with and without tree reductions etc. #34
1065s, RAM touched 300GB but I'll take it, it's a tough plot. |
Same settings but 16 colors. Note this is 4e+10 points!
707s (box was a bit competed for, though). RAM ~200GB. 2e+10 points. I think I broke core selection though, as it plotted all four (but why not 4e+10 then? Odd):
830s. Didn't notice the RAM. :) OK time to put a proper profiler in. |
I think it may be possible to recover a couple of factors in RAM usage. Running the numbers:
2 x 1000 x 4096 x 4 x 16 bytes (broadcasted MS data) + 1024 x 1024 x 64 x 8 bytes (image) ~ 1012MB per thread.
Multiply by 64 threads: ~64GB.
My instinct would be to check the broadcast. The fix @JSKenyon put into ragavi avoided a lot of extraneous communication in the graph. |
Updated estimate to include categories in image |
Updated again to cater for 64 categories |
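The per-thread estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function and parameter names are made up for illustration, and the inputs are the figures quoted in the estimate (1000 rows, 4096 channels, 4 correlations, 16-byte values, a factor of 2 for the broadcast, and a 1024 x 1024 image with 64 categories of 8-byte values).

```python
MiB = 1024 ** 2
GiB = 1024 ** 3


def per_thread_bytes(rows, chans, corrs, ncat, broadcast_factor=2,
                     vis_itemsize=16, image_size=1024, image_itemsize=8):
    """Rough per-thread footprint: broadcasted MS data plus one categorical image."""
    ms_data = broadcast_factor * rows * chans * corrs * vis_itemsize
    image = image_size * image_size * ncat * image_itemsize
    return ms_data + image


# Hypothetical helper applied to the numbers quoted in the thread.
estimate = per_thread_bytes(rows=1000, chans=4096, corrs=4, ncat=64)
print(f"per thread: {estimate / MiB:.0f} MiB")
print(f"64 threads: {64 * estimate / GiB:.2f} GiB")
```

Changing `ncat` or `rows` shows how quickly the image term or the visibility term comes to dominate the total.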
The task ordering looks pretty good, for 10,000 rows on minsize.ms. Colour indicates priority (in the round nodes), with red prioritised first and blue last. There are four independent colour streams, corresponding to four MS chunks. Here's another view of the same graph, with task names in the round nodes. I'm fiddling to try and get both names and priority in the same plot; it's a bit difficult to decipher both at the same time. Generated by calling the visualize method on a dask array/dataframe collection:
R.visualize("graph.pdf")
R.visualize("graph.png", color="order") |
Another realistic case, two fields, UV-plot coloured by phase.
Chunk size 10000, 1000, tree reduction.
Chunk size 10000, original reduction. Faster but hungrier:
Chunk size 1000, original reduction blew out my 512GB of RAM so I gave up. |
@o-smirnov These look really cool. Are these last ones also by the tree reduction? |
The top set is for tree reduction (sorry, editing the comment under you!), bottom set original reduction. Phase should be ==0 on properly corrected calibrator data, so a good plot of this kind is a bland plot. The stripy pattern in the left column suggests a problem in 0408 -- most likely an unmodelled field source contributing a fringe. |
Ooh I see now. Would you know what is causing that weird peak in memory towards the end in those original reductions? |
No, but @sjperkins has also been wondering... |
@Mulan-94 Are you referring to this kind of plot (taken from #34 (comment))? If so, the climb in memory at the end is almost certainly the Having said that, I'm bothered by this kind of pattern in the tree reduction (#34 (comment)). I would've hoped for a flatter heartbeat pattern, without those two peaks. I'll try block off some time next week to look at this. |
As a weird check, could someone try running a test using dask==2.9.1 as opposed to the latest version? While the ordering plot Simon included looks correct, I would be interested to see if the ordering changes introduced after 2.9.1 are affecting the results. |
Oh, and for ordering plots, I was informed of the following useful invocation:
dask.visualize(dsk,
               color='order',
               cmap='autumn',
               filename='output.pdf',
               node_attr={'penwidth': '10'})
It just makes the colours more obvious. |
So, in a weird coincidence, I was doing some profiling of CubiCalV2 this morning and noticed some very familiar features - specifically those beautiful mountainous ramps in memory usage followed by a precipitous drop. Here are some plots: |
Yes, I'd think the next step would be to try embedding @o-smirnov, or @Mulan-94 would you be willing to pursue the above? Otherwise I'll look at it next week when I've cleared my stack a bit. I dislike the idea of embedding
|
I have absolutely no idea what you just said, but that won't stop me from giving it a try anyway! @JSKenyon, garbage collection is really the answer to everything, isn't it. :) The only time I got a flat heartbeat was in this case, top left: #34 (comment) |
Argh apologies, was assuming familiarity on the subject matter. The generational garbage collection bits of this article might explain it quickly: https://stackify.com/python-garbage-collection/. |
I have played around with the threshold settings and I actually think that having manual GC calls is safer/better for applications which don't do much allocation. There are three threshold levels corresponding to different generations - let's call them |
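For reference, the two approaches discussed here (adjusting the generational thresholds versus calling the collector manually) both go through Python's standard `gc` module. The following is a minimal sketch, not code from shadems or CubiCalV2: `process_chunk` is a hypothetical stand-in for real work, and the factor of ten is an arbitrary value chosen for illustration.

```python
import gc

# One threshold per generation, youngest first.
original = gc.get_threshold()
print("current thresholds:", original)

# Assumption for illustration only: make automatic generation-0
# collections ten times rarer and rely on explicit collections instead.
gc.set_threshold(original[0] * 10, original[1], original[2])


def process_chunk(n):
    """Hypothetical stand-in for a chunk of work that allocates temporaries."""
    data = [[i] for i in range(n)]
    return len(data)


total = 0
for _ in range(4):
    total += process_chunk(10_000)
    gc.collect()  # manual full collection at a point we know is safe

# Restore the interpreter defaults so other code is unaffected.
gc.set_threshold(*original)
```

Manual calls give control over when the pause happens; thresholds are the less invasive option when the allocating code belongs to a dependency.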
Yes, I think applications such as cubicalv2 and shadems have more leeway in using the garbage collector as they wish. Unfortunately, in this case, I think the optimal place to put the collect calls is in the datashader tree reduction, which is an internal API. To publish code like that in an API is a hard no to me. Of course this is Python so we can monkeypatch everything as a workaround, within reason ;-) |
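A monkeypatch of the kind mentioned above could look roughly like this. It is a sketch under stated assumptions: `thirdparty` and its `combine` function are hypothetical stand-ins built here so the example runs on its own, and they are not datashader's real module or function names.

```python
import functools
import gc
import types


def collect_after(func):
    """Wrap func so that a garbage collection runs after every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        finally:
            gc.collect()
    return wrapper


# Hypothetical stand-in for a third-party module with an internal function.
thirdparty = types.SimpleNamespace(
    combine=lambda arrays: [sum(column) for column in zip(*arrays)]
)

# The monkeypatch: rebind the attribute on the module-like object.
thirdparty.combine = collect_after(thirdparty.combine)

result = thirdparty.combine([[1, 2], [3, 4], [5, 6]])
print(result)
```

One caveat with this approach: code that imported the function by name before the patch was applied keeps its reference to the unwrapped original.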
Ah that does make sense - if the goal is taming a dependency, then the thresholds are probably the way to go. |
I'm happy to monkeypatch it in for now, and if it works and solves the problem, then we discuss how and if to get it into datashader properly. @sjperkins, where is this stack() call you speak of happening? |
In datashader's The tree reduction still calls combine but in batches with far fewer arrays (roughly, the split_every parameter in dask.array.reduction). |
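To illustrate why batching matters, here is a small pure-Python sketch of a tree reduction. It is not datashader's or dask's implementation: `tree_reduce` and `combine` are illustrative names, plain lists stand in for images, and the `split_every` argument only mimics the role of the dask parameter of the same name.

```python
def tree_reduce(items, combine, split_every=4):
    """Combine items in batches of at most split_every until one remains."""
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    items = list(items)
    if not items:
        raise ValueError("nothing to reduce")
    while len(items) > 1:
        items = [combine(items[i:i + split_every])
                 for i in range(0, len(items), split_every)]
    return items[0]


batch_sizes = []  # records how many images each combine call received


def combine(images):
    """Sum equally sized images elementwise."""
    batch_sizes.append(len(images))
    return [sum(pixels) for pixels in zip(*images)]


images = [[1, 2, 3] for _ in range(64)]
total = tree_reduce(images, combine, split_every=4)
print(total, "largest batch:", max(batch_sizes))
```

A flat reduction would hand all 64 images to a single `combine` call, whereas here no call sees more than `split_every` of them. Note that this sketch still builds every intermediate level eagerly, so it shows the batch sizes only and not the scheduling behaviour of a real task graph.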
I added |
@sjperkins, if you've re-stocked your beer supplies, tonight would be a good night to open one. Here's a memory profile with dataframe_factory:
|
Ah great, that looks like a factor of 4 improvement
|
Also possibly: Big-O space complexity for the win: #34 (comment)
I'm not sure if the above figures are right for this:
but if they are it's pretty close. It looks like shadems is running 72 threads. On average they're using 500MB each (36GB total), but at peak memory usage they're using ~1GB (75GB total). @o-smirnov, was the above run done with 1000 rows, 4096 channels, a ~1024^2 image and 64 categories? That double-peak in the memory pattern is retained in the new version. I wonder what it is? Images being aggregated to form a final value? Does datashader run through the data twice? |
/cc'ing @rubyvanrooyen, who may also find #34 (comment) useful. |
@sjperkins: it was 722610 rows, 4k channels, 4 corrs, 16 categories. The double-peak is there because there's a first run-through to determine the data ranges. We can eliminate that if we fix the plot limits, but we don't always know what to fix them to. #55 will help. |
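The two-pass behaviour described above can be shown with a toy binning routine. This is a sketch and not the shadems or datashader code path: `load_chunks` is a hypothetical stand-in for re-reading the data, and `bin_data` simply counts how many times the data has to be traversed.

```python
def load_chunks():
    """Hypothetical stand-in for reading the data chunk by chunk."""
    yield [0.0, 1.0, 2.0]
    yield [3.0, 4.0]


def bin_data(nbins, limits=None):
    """Return (counts, passes), where passes is the number of data traversals."""
    passes = 0
    if limits is None:
        # Extra pass: scan everything just to discover the data range.
        lo, hi = float("inf"), float("-inf")
        for chunk in load_chunks():
            lo, hi = min(lo, min(chunk)), max(hi, max(chunk))
        passes += 1
    else:
        lo, hi = limits

    # Binning pass: needed in both cases.
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for chunk in load_chunks():
        for value in chunk:
            if lo <= value <= hi:
                counts[min(int((value - lo) / width), nbins - 1)] += 1
    passes += 1
    return counts, passes


auto_counts, auto_passes = bin_data(nbins=2)
fixed_counts, fixed_passes = bin_data(nbins=2, limits=(0.0, 4.0))
print(auto_passes, fixed_passes)
```

With limits supplied up front the range-finding traversal disappears, which is the saving being discussed; the cost is that the limits have to be known, or guessed, in advance.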
Ah, but what was the -z option, 1000, 10000? |
10000 |
Then maybe we're doing quite well. Let's remove the factor of 2 on the visibilities, because I think numpy broadcasting functionality doesn't actually expand the underlying array to full resolution: it uses 0 strides to give the impression of it. Then:
10,000 x 4,096 x 4 x 16 + 1024 x 1024 x 16 x 8 ~ max 2.5GB per thread.
The memory profile is suggesting ~36GB over 72 threads in the average case (i.e. an average of ~500MB per thread) and ~75GB over 72 threads in the peak case (i.e. an average of ~1GB per thread). I guess the visibility data doesn't stay in memory all that long -- it gets converted to an image before the tree reduction step. All speculation, but useful to start with some sort of model and refine it. |
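The comparison between this model and the profile can be written out as follows. This is a sketch using only the figures quoted in the comment; treating the quoted "GB" values as GiB is an assumption made here for simplicity, and the variable names are illustrative.

```python
GiB = 1024 ** 3

# Model: visibilities (no factor of 2) plus a 16-category image.
vis_bytes = 10_000 * 4_096 * 4 * 16
image_bytes = 1024 * 1024 * 16 * 8
model_peak = vis_bytes + image_bytes

# Observed totals from the memory profile, spread over 72 threads.
threads = 72
observed_avg = 36 * GiB / threads
observed_peak = 75 * GiB / threads

print(f"model peak per thread:    {model_peak / GiB:.2f} GiB")
print(f"observed avg per thread:  {observed_avg / GiB:.2f} GiB")
print(f"observed peak per thread: {observed_peak / GiB:.2f} GiB")
```

The observed per-thread peak sits well below the modelled maximum, which is consistent with the guess that the visibility data does not stay resident for long.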
I did a few more benchmarks with and without tree reduction ("tree" versus "master") for varying problem sizes and dask chunk sizes: https://www.dropbox.com/sh/m0fch390vliqkkf/AACBcAHkHCZyzsFW3U7dnXcOa?dl=0 Observations:
|
Continuing on from #29, just more systematically.
With tree reduction, 1e+10 points.
Tops out at ~250GB, runs in 145s.
Blows out my RAM.
Tops out at ~70GB, runs in 245s (but the run had competition on the box). Going to get more exact numbers from ResourceProfiler shortly.
So @sjperkins, first take: mission accomplished insofar as getting the RAM use under control. Memory-parched users can just dial their chunk size down.