Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix compilation warnings, decide_worker now a C func, stealing improvements #4375

Merged
merged 9 commits into from
Jan 5, 2021

Conversation

jakirkham
Copy link
Member

@jakirkham jakirkham commented Dec 18, 2020

Fixes #4359

  • Use cython.compiled to skip pure Python fallback functions in Cython case (fixes unused symbol warnings from C compiler)
  • Move decide_worker function into C (removes overhead of Python API, which is unneeded)
  • Some optimization to stealing math and other misc. stealing improvements

Note: There are still some warnings that predate this PR and are not handled currently. See issue ( cython/cython#3474 ) for details.

@jakirkham
Copy link
Member Author

@quasiben, this should fix some of the compile warnings we were seeing earlier 🙂

Use Cython's `compiled` flag to check whether the code is being
compiled. If `cython` can't even be `import`ed, then set the flag to
`False`. Then put all Cython bits in the `True` branch and all the
Python bits under the `False` branch. This should help Cython bracket
the Python fallbacks under the `False` case to optimize them out during
compilation (as opposed to leaving a bunch of unused fallback functions
around). As a result this should get rid of the unused symbol warnings
seen with the C compiler previously.
Unlike `ccall`, which adds Python + C APIs for functions, `cfunc` adds
only a C API for functions. This can be handy for internal functions
used in `scheduler.py` only.
Tells Cython to add a C API for calls to this function. Should speed up
calls to this function where a Python API is not needed and would add
otherwise unnecessary overhead.
@jakirkham
Copy link
Member Author

CI failures seem to be the same as those in issue ( #4374 ).

In Python 3.3+, `math` includes `log2`. So just use it directly instead
of dividing by `log(2)`. This is twice as fast.
The default is already to round to `0` digits after the decimal point.
It's also twice as fast to just leave the default as opposed to passing
`0`.
@jakirkham jakirkham changed the title Address unused symbol warnings & make decide_worker a C func Address unused symbol warnings, make decide_worker a C func, stealing math improvements Dec 18, 2020
It's 5x faster to just compare and assign to `level` if needed. So just
do that.
@@ -134,8 +133,10 @@ def steal_time_ratio(self, ts):
if cost_multiplier > 100:
return None, None

level = int(round(log(cost_multiplier) / log_2 + 6, 0))
level = max(1, level)
level = int(round(log2(cost_multiplier) + 6))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log2 was added in Python 3.3+ and is faster than hand rolling by 2x

In [3]: %timeit log(3) / log(2)
401 ns ± 5.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit log(3, 2)
234 ns ± 2.72 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

round already rounds to 0 digits after the decimal place. Using the default is 2x faster than specifying one

In [7]: %timeit round(1.5, 0)
393 ns ± 2.58 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit round(1.5)
195 ns ± 1.77 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Comment on lines +137 to +138
if level < 1:
level = 1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using if is roughly an order of magnitude faster than max.

In [1]: %timeit max(1, 3)
222 ns ± 1.66 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [2]: %timeit max(1, 0)
222 ns ± 1.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [3]: %%timeit
   ...: L = 3
   ...: if L < 1:
   ...:     L = 1
   ...: 
25.5 ns ± 0.165 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [4]: %%timeit
   ...: L = 0
   ...: if L < 1:
   ...:     L = 1
   ...: 
29.7 ns ± 0.231 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objection here, but I do want us to look at the full picture rather than just micro-optimizing. 200ns may not matter in the grand scheme. At some point it might make more sense to keep code around that is nanoseconds slower because it is more readable.

I'm not making a judgement in this case (I haven't looked at this much). I just want to avoid us going overboard here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly came here since put_key_in_stealable takes as much runtime as transition_waiting_processing and transition_processing_memory. With the latter 2 we have already put a lot of time into optimizing them, but this part of the code has largely been untouched. So am trying to tackle some of the low hanging fruit here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add that collectively these changes more than halve the time spent in put_key_in_stealable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad to hear it. My comment was just general. I have no particular concern with this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. Just wanted to provide context :)

Besides we don't use any of the stuff until the 3rd case. So just move
it later.
Comment on lines -123 to -125
nbytes = ts.get_nbytes_deps()

transfer_time = nbytes / self.scheduler.bandwidth + LATENCY
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't even use these until the 3rd if. So just move them later. We could already have returned before needing these.

@jakirkham jakirkham changed the title Address unused symbol warnings, make decide_worker a C func, stealing math improvements Fix compilation warnings, decide_worker now a C func, stealing math improvements Dec 18, 2020
This function gets called a lot and logging from makes it run kind of
slow. So just drop it to remove this bottleneck.
@jakirkham jakirkham changed the title Fix compilation warnings, decide_worker now a C func, stealing math improvements Fix compilation warnings, decide_worker now a C func, stealing improvements Dec 18, 2020
@jakirkham jakirkham requested a review from mrocklin December 18, 2020 22:32
@jakirkham
Copy link
Member Author

Understand if this doesn't get looked at for a while. Just making sure you are aware 😉

@jakirkham
Copy link
Member Author

Restarted CI now that issue ( #4374 ) is fixed

@jakirkham jakirkham requested a review from jrbourbeau December 22, 2020 00:04
@jakirkham
Copy link
Member Author

Planning to merge tomorrow if no comments.

@quasiben
Copy link
Member

quasiben commented Jan 4, 2021

Thanks @jakirkham -- looks good

@jakirkham jakirkham merged commit 76ef459 into dask:master Jan 5, 2021
@jakirkham jakirkham deleted the misc_opt5 branch January 5, 2021 18:08
@@ -85,7 +84,6 @@ def put_key_in_stealable(self, ts):
ws = ts.processing_on
worker = ws.address
cost_multiplier, level = self.steal_time_ratio(ts)
self.log(("add-stealable", ts.key, worker, level))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing the same thing for remove_key_from_stealable ( #4609 )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Disabling logging in stealing
3 participants