array assignment slower than numpy #2095
Thanks for the detailed report! I'm a bit puzzled by the result. I set up the following test, inspired by your test case:
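(The original snippet did not survive the page capture; a hypothetical reconstruction of the benchmark, with buffer size and function names being my guesses, might look like this:)

```python
# Time both the slice-assign and the explicit-loop interleave on a chroma
# buffer comparable to Mark's. The resolution is assumed, not from the report.
import numpy as np
import timeit

n = 4096 * 2160 // 4                     # one chroma plane of a DCI-4K frame
uv = np.zeros(2 * n, dtype=np.uint8)
u, v = uv[:n].copy(), uv[n:].copy()

def assign():                            # slice-assignment variant
    uv[0::2] = u
    uv[1::2] = v

def loop():                              # explicit-loop variant
    for i in range(n):
        uv[2 * i] = u[i]
        uv[2 * i + 1] = v[i]

for name, fn in [("assign", assign), ("loop", loop)]:
    t = timeit.timeit(fn, number=3) / 3
    print(f"with numpy ({name}): {t * 1e3:.0f} msec per loop")
```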
with numpy (loop): 436 msec per loop. So I see an issue with the assign part (and I can work on it!) but none with the loop :-/
For full reproducibility, I probably owe you information about my conda environment and compiler versions. I was travelling when I wrote the report, so it wasn't easy for me to spin up new, clean environments. I'll try your little test on a few machines when I get back.
Environment before installing numba
I included the output of my conda environment listing.
I should also note that the "size" of the array here is kinda huge compared to what many benchmarks use. Often people benchmark with 256 x 256 or 512 x 512 images, which are less than 1 MB in size (at uint8 precision). This image, encoded in yuv420p, is more than 13 MB. It doesn't fit in my cache all at once (though the UV part might). It might fit in yours if you have a large machine, which may explain some differences in our results.
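For concreteness, here is the back-of-the-envelope arithmetic. The resolution is my assumption (DCI 4K happens to match the ~13 MB figure); the report doesn't state it:

```python
w, h = 4096, 2160        # assumed resolution, not stated in the report
y_bytes = w * h          # luma plane: 8,847,360 bytes
uv_bytes = w * h // 2    # U and V at quarter resolution each: 4,423,680 bytes
total = y_bytes + uv_bytes
print(total / 1e6)       # ~13.3 MB, far larger than a typical L3 cache slice
```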
> It's very rare to have non-constant stride, and using a constant unlocks various optimizations, including a potential improvement for #2095
Thanks for the extra piece of information. Does #2097 improve the situation? It does on my setup. EDIT: this PR doesn't pass full validation yet, but it should be OK for our concern.
Unfortunately, it does not resolve things. Is there any intermediate output I can give you to help debug? Compiler info and whatnot?
Hell! A funny side effect of this track is that I'm fixing a lot of performance issues (some very big ones) that show up in various numpy benchmarks :-)
Wow, that's great news!
@jeanlaroche: you can check PR #2096, #2097 and #2096, they are likely to be merged soon. @hmaarrfk: thanks for the archive - I can reproduce. An analysis of the generated assembly shows twice as many `mov` instructions in the pythran-generated code compared to numpy's; I'll investigate.
- #2096: no problem.
- #2097: no problem.
- #2098: not building:
```
In file included from /Users/jlaroche/packages/system/algo/Transients/transient_api.cpp:4:
/Users/jlaroche/packages/system/algo/build/darwin/Debug-Individual-Xcode/generated/transient.cpp:999:55:
error: no matching function for call to 'call'
typename pythonic::assignable_noescape<decltype(pythonic::types::call(transient_tf_bridge::tflite_bridge::runModel(),
0L, Input_x, std::get<1>(pythonic::types::as_const(self))))>::type y = pythonic::types::call(transient_tf_bridge::tflite_bridge::runM...
```
That last one isn't in your message above, where you repeat #2096... and the error could be caused by some of my code. Can you confirm you're planning on merging #2098 as well?
Jean
Short notice: the following numpy version achieves the same result with half the memory:
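(The snippet itself did not survive the page capture; the following is my reconstruction of one way to reach the same result with a single half-size temporary, not necessarily the exact code from the branch.)

```python
import numpy as np

def interleave_half_temp(uv):
    # [u_0..u_{n-1}, v_0..v_{n-1}] -> [u_0, v_0, u_1, v_1, ...] in place
    n = uv.shape[0] // 2
    u = uv[:n].copy()    # buffer only the U half instead of both halves
    uv[1::2] = uv[n:]    # self-overlapping assignment; numpy buffers it safely
    uv[0::2] = u
    return uv

# e.g. interleave_half_temp(np.arange(10, dtype=np.uint8))
# -> [0 5 1 6 2 7 3 8 4 9]
```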
No significant impact on runtime from the pythran perspective on my setup, though. EDIT: it actually brings an interesting speedup to the pythran-compiled version. Could you give it a try (from the feature/faster-gexpr branch)?
Away from my computer at the moment, but at some point I found mixed results with this approach. My hunch was that it was due to memory aliasing. Looking into things in the past, I found that x86 has some interesting instructions that skip the cache. I didn't know if pythran could be optimized for that, and I wanted to avoid speedups due to that optimization.
Looking into this now that I have a second: it seems that there is no action item on my part. Keep me posted.
Pythran is not doing anything specific related to cache. I can get a good speedup if I remove a runtime check for aliasing between lhs and rhs when assigning between slices, but I currently fail at doing so in an elegant way.
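To illustrate why such a check is needed at all (a toy numpy example, not pythran internals): the lhs and rhs of a slice assignment can share memory, and a naive element-by-element copy would read values it has already overwritten.

```python
import numpy as np

a = np.arange(6)
a[1:] = a[:-1]   # overlapping slices: numpy detects this and buffers the rhs
print(a)         # [0 0 1 2 3 4] -- a naive forward loop would give [0 0 0 0 0 0]
```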
As for the memory copy, I'm mostly talking about my results from: https://github.com/awreece/memory-bandwidth-demo
I feel like this is a "future improvement", maybe... I'm not a fan of using threading for speedups, but the non_temporal_avx variant shows an impressive speedup.
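For a crude, single-threaded point of comparison (my own snippet, not from that repo), a plain numpy copy already gives a bandwidth baseline to judge the iteration counts against:

```python
import numpy as np
import time

src = np.random.randint(0, 256, size=1 << 27, dtype=np.uint8)   # 128 MB
dst = np.empty_like(src)

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    dst[:] = src
dt = time.perf_counter() - t0
print(f"{2 * reps * src.nbytes / dt / 1e9:.1f} GB/s of read+write traffic")
```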
Hey, thanks for making this cool library. I really do believe that the advantages you outline in terms of ahead-of-time compilation are valuable to those building powerful scientific computation libraries.
I was trying my hand at doing a "simple" image analysis task: rgb -> nv12 conversion.
nv12 seems to be similar to yuv420 (I420) but with the U and V channels interleaved instead of on distinct planes.
This should be a simple transpose operation but, as is typical, it is easy to do it very slowly depending on how you go about it.
The operation amounts to something like transposing a 2D array: from
[u_0, u_1, ..., u_n, v_0, v_1, ..., v_n]
to
[u_0, v_0, u_1, v_1, ..., u_n, v_n].
To set up the problem, let's start with:
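(The setup snippet was lost in the capture; a plausible reconstruction follows, with the resolution and variable names being my assumptions.)

```python
import numpy as np

# Assume a DCI-4K frame; the chroma section of a yuv420p buffer holds the
# two quarter-resolution planes back to back: [u_0..u_{n-1}, v_0..v_{n-1}].
w, h = 4096, 2160
n = w * h // 4                                   # elements per chroma plane
uv = np.random.randint(0, 256, size=2 * n, dtype=np.uint8)
```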
For some reason, I need to do the operation in place. In numpy, this would amount to:
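(A reconstruction again; the copies of both halves are needed because the writes alias the reads.)

```python
def pack_assign(uv, n):
    # interleave in place: [u..., v...] -> [u0, v0, u1, v1, ...]
    u = uv[:n].copy()     # both halves must be buffered: the strided
    v = uv[n:].copy()     # writes below overlap the reads
    uv[0::2] = u
    uv[1::2] = v
```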
On my computer, I get about 700-750 iterations per second with this loop:
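(The timing harness was also lost; presumably something along these lines:)

```python
import time

t0 = time.perf_counter()
its = 0
while time.perf_counter() - t0 < 1.0:    # count iterations completed per second
    pack_assign(uv, n)
    its += 1
print(its, "iterations / sec")
```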
As a bound, I estimated that the "upper bound of performance" would be achieved with:
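(The snippet was lost; my guess at the bound is a straight contiguous copy of the full buffer, i.e. pure memory bandwidth with none of the interleaving logic.)

```python
scratch = np.empty_like(uv)

def upper_bound(uv):
    scratch[:] = uv      # one contiguous read and one contiguous write
```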
This achieves 1184 iterations / sec! Pretty good. I was hoping that pythran could help me get to that level.
I tried two different ways of doing this (sketched below):
- pack_my_pythran_assign achieves 315 its / second.
- pack_my_pythran_loop achieves 460 its / second.
For completeness, I tried with numba, which achieves pretty close to 600 its / second.
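Since the function bodies didn't survive the capture, here are sketches of what the two pythran kernels plausibly looked like; the export signatures and the in-place loop strategy are my guesses, not the original code:

```python
# pythran export pack_my_pythran_assign(uint8[:], int)
def pack_my_pythran_assign(uv, n):
    # slice-assignment variant (~315 its / sec in the numbers above)
    u = uv[:n].copy()
    v = uv[n:].copy()
    uv[0::2] = u
    uv[1::2] = v

# pythran export pack_my_pythran_loop(uint8[:], int)
def pack_my_pythran_loop(uv, n):
    # explicit-loop variant (~460 its / sec); walking the U half backwards
    # lets us buffer only the V half
    v = uv[n:].copy()
    for i in range(n - 1, -1, -1):
        uv[2 * i] = uv[i]
    for i in range(n):
        uv[2 * i + 1] = v[i]
```

And the numba comparison would be the same loop in a plain-Python file, e.g.:

```python
import numba

@numba.njit
def pack_numba_loop(uv, n):          # ~600 its / sec in the numbers above
    v = uv[n:].copy()
    for i in range(n - 1, -1, -1):
        uv[2 * i] = uv[i]
    for i in range(n):
        uv[2 * i + 1] = v[i]
```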
I mean, I'm not really expecting too much speedup here, but I figured that this was strange. I'm hoping that this little example can help improve this library.
Best.
Mark