-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix compilation warnings, decide_worker
now a C func, stealing improvements
#4375
Conversation
@quasiben, this should fix some of the compile warnings we were seeing earlier 🙂 |
Use Cython's `compiled` flag to check whether the code is being compiled. If `cython` can't even be `import`ed, then set the flag to `False`. Then put all Cython bits in the `True` branch and all the Python bits under the `False` branch. This should help Cython bracket the Python fallbacks under the `False` case to optimize them out during compilation (as opposed to leaving a bunch of unused fallback functions around). As a result this should get rid of the unused symbol warnings seen with the C compiler previously.
Unlike `ccall`, which adds Python + C APIs for functions, `cfunc` adds only a C API for functions. This can be handy for internal functions used in `scheduler.py` only.
Tells Cython to add a C API for calls to this function. Should speed up calls to this function where a Python API is not needed and would add otherwise unnecessary overhead.
CI failures seem to be the same as those in issue ( #4374 ). |
In Python 3.3+, `math` includes `log2`. So just use it directly instead of dividing by `log(2)`. This is twice as fast.
The default is already to round to `0` digits after the decimal point. It's also twice as fast to just leave the default as opposed to passing `0`.
decide_worker
a C funcdecide_worker
a C func, stealing math improvements
It's 5x faster to just compare and assign to `level` if needed. So just do that.
@@ -134,8 +133,10 @@ def steal_time_ratio(self, ts): | |||
if cost_multiplier > 100: | |||
return None, None | |||
|
|||
level = int(round(log(cost_multiplier) / log_2 + 6, 0)) | |||
level = max(1, level) | |||
level = int(round(log2(cost_multiplier) + 6)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log2
was added in Python 3.3+ and is faster than hand rolling by 2x
In [3]: %timeit log(3) / log(2)
401 ns ± 5.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit log(3, 2)
234 ns ± 2.72 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
round
already rounds to 0
digits after the decimal place. Using the default is 2x faster than specifying one
In [7]: %timeit round(1.5, 0)
393 ns ± 2.58 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit round(1.5)
195 ns ± 1.77 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
if level < 1: | ||
level = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using if
is roughly an order of magnitude faster than max
.
In [1]: %timeit max(1, 3)
222 ns ± 1.66 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [2]: %timeit max(1, 0)
222 ns ± 1.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [3]: %%timeit
...: L = 3
...: if L < 1:
...: L = 1
...:
25.5 ns ± 0.165 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [4]: %%timeit
...: L = 0
...: if L < 1:
...: L = 1
...:
29.7 ns ± 0.231 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No objection here, but I do want us to look at the full picture rather than just micro-optimizing. 200ns may not matter in the grand scheme. At some point it might make more sense to keep code around that is nanoseconds slower because it is more readable.
I'm not making a judgement in this case (I haven't looked at this much). I just want to avoid us going overboard here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly came here since put_key_in_stealable
takes as much runtime as transition_waiting_processing
and transition_processing_memory
. With the latter 2 we have already put a lot of time into optimizing them, but this part of the code has largely been untouched. So am trying to tackle some of the low hanging fruit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add that collectively these changes more than halve the time spent in put_key_in_stealable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad to hear it. My comment was just general. I have no particular concern with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. Just wanted to provide context :)
Besides we don't use any of the stuff until the 3rd case. So just move it later.
nbytes = ts.get_nbytes_deps() | ||
|
||
transfer_time = nbytes / self.scheduler.bandwidth + LATENCY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't even use these until the 3rd if
. So just move them later. We could already have return
ed before needing these.
decide_worker
a C func, stealing math improvementsdecide_worker
now a C func, stealing math improvements
This function gets called a lot and logging from makes it run kind of slow. So just drop it to remove this bottleneck.
decide_worker
now a C func, stealing math improvementsdecide_worker
now a C func, stealing improvements
Understand if this doesn't get looked at for a while. Just making sure you are aware 😉 |
Restarted CI now that issue ( #4374 ) is fixed |
Planning to merge tomorrow if no comments. |
Thanks @jakirkham -- looks good |
@@ -85,7 +84,6 @@ def put_key_in_stealable(self, ts): | |||
ws = ts.processing_on | |||
worker = ws.address | |||
cost_multiplier, level = self.steal_time_ratio(ts) | |||
self.log(("add-stealable", ts.key, worker, level)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing the same thing for remove_key_from_stealable
( #4609 )
Fixes #4359
cython.compiled
to skip pure Python fallback functions in Cython case (fixes unused symbol warnings from C compiler)decide_worker
function into C (removes overhead of Python API, which is unneeded)Note: There are still some warnings that predate this PR and are not handled currently. See issue ( cython/cython#3474 ) for details.