avoid using @sync_add on remotecalls #44671

Merged: 7 commits merged into JuliaLang:master from mk-fix-serial-sync-add on Mar 23, 2022

Conversation

@exaexa (Contributor) commented Mar 18, 2022

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously and serially (here: https://github.com/JuliaLang/julia/blob/v1.7.2/base/task.jl#L358). Not only is that slightly detrimental for network operations (latencies add up), but in the case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. As a result, some operations took a great amount of "serial" processing time when executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows:

First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when the worker count gets high. The processing time on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s.
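To make the shape of the change concrete, here is a minimal sketch (illustrative only, not the actual patch, which also has to deal with exception unwrapping; see the discussion below): before, the Futures returned by remotecall were effectively waited for one after another; after, each wait runs in its own task, so the waits overlap across workers.

```
# Minimal sketch; run_serially/run_interleaved are illustrative names,
# not functions from Distributed or from this PR.
using Distributed

function run_serially(f, pids)
    futures = [remotecall(f, p) for p in pids]  # remotecall is cheap and returns a Future immediately
    foreach(wait, futures)                      # the waits (incl. remote compilation) add up serially
end

function run_interleaved(f, pids)
    @sync for p in pids
        @async wait(remotecall(f, p))           # each wait runs in its own task, so they overlap
    end
end
```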

Related issues:

- Probably fixes #39291.
- #42156 is kinda complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.
@giordano added the performance (Must go faster) and parallelism (Parallel or distributed computation) labels on Mar 18, 2022
@exaexa (Contributor, author) commented Mar 18, 2022

Side thought: the use of @sync_add in Distributed was actually the only use in the whole of Julia. Since it's also not exported from Base nor officially documented, it might be feasible to slash it.

```
 ~/work/julia $ grep sync_add * -r
base/task.jl:macro sync_add(expr)
 ~/work/julia $
```

@exaexa (Contributor, author) commented Mar 18, 2022

PS: an illustration of the relation to #42156:

```
 ~/work/julia $ time ./julia -p16 -e 'using Distributed; @everywhere 1+1'

# unpatched 1.7.2
real	0m27.972s
user	2m17.593s
sys	0m5.705s

# this PR
real	0m18.705s
user	1m38.069s
sys	0m4.570s

# 42156 only
real	0m10.502s
user	0m29.906s
sys	0m3.604s

# both PRs
real	0m6.800s
user	0m33.056s
sys	0m4.423s
```

@jpsamaroo (Member) commented

@exaexa since I know you have access to some large node counts, can you validate that this actually alleviates #39291 (and show timing results here)?

@exaexa (Contributor, author) commented Mar 18, 2022

> can you validate that this actually alleviates #39291 (and show timing results here)?

Sure, it's only going to take some time (I need to wrap the whole thing in Slurm and compile this branch on the HPC). Hopefully tomorrow. Feel free to suggest whatever benchmark in the meantime; for now I think I'm doing just `@everywhere 1+1` (because that's so simple that the slowdown hurts most) and `@everywhere using JuMP` (because that's literally what we need solved, ha).

@exaexa (Contributor, author) commented Mar 19, 2022

A slightly more rigorous benchmark

The script:

```
using Distributed, ClusterManagers
n_workers = parse(Int, ENV["SLURM_NTASKS"])
@info "starting" n_workers
t = @timed addprocs_slurm(n_workers, topology = :master_worker)
@info "finished!" t.time
t = @timed @everywhere 1+1     # alternatively: @timed @everywhere using JuMP
@info "1+1d" t.time
```

Slurm batch:

```
#!/bin/sh
#SBATCH -t 20
#SBATCH -c 1
#SBATCH -n 256   # or 1024
#SBATCH --mem-per-cpu 3G

JULIA=$HOME/work/julia-git/julia

time $JULIA testfile.jl
```

Hardware: CPU nodes on the Iris cluster of the Uni.Lu HPC, see https://hpc-docs.uni.lu/systems/iris/

Results:

| version | test | 256 workers, `@info` time | 256 workers, wallclock | 1024 workers, `@info` time | 1024 workers, wallclock |
|---|---|---|---|---|---|
| 1.7.2 | `1+1` | 2m14s | 2m38s | 9m43s | 10m28s |
| fda2c82 | `1+1` | 0.58s | 21s | 0.66s | 50s |
| 1.7.2 | `using JuMP` | 3m26s | 4m8s | 11m1s | 11m44s |
| fda2c82 | `using JuMP` | 11s | 33s | 13s | 1m13s |

I'd say this pretty much confirms the hypothesis.

I didn't measure #42156, but from what I've seen when testing locally, this roughly describes the differences:

- 1.7.2 takes roughly O(local_work + n_workers*(latency + long_precompile_time))
- #42156 (reenable the precompile generation for Distributed) takes just O(local_work + n_workers * small_roundtrip_latency) (waaaaay faster, but still not "scalable" per se)
- this PR takes O(local_work + n_workers * negligible_packet_send_time + long_precompile_time) (the Distributed precompilation is still present, but done in parallel so it doesn't hurt at all. Also, this is the most scalable we can get without reaching for complicated task spawning & result collection schemes)

@vtjnash (Member) commented Mar 21, 2022

SGTM. I think remotecall is supposed to be cheap, but makes sense we may need to launch a Task to handle the cases when it is not cheap.

Need to fix the error return type in the test though.

@exaexa (Contributor, author) commented Mar 21, 2022

> SGTM. I think remotecall is supposed to be cheap, but makes sense we may need to launch a Task to handle the cases when it is not cheap.

Yeah remotecall itself is super-fast (I'd say it wouldn't even need to be asynchronous, except for dodging an extra tiny piece of network latency); the "expensive" action that we're dodging here is what gets triggered on the remote from @sync after you use @sync_add on the Future returned from the remotecall.
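As a tiny illustration of that cost split (a sketch, assuming a fresh session where addprocs gives us worker id 2):

```
using Distributed
addprocs(1)

f = remotecall(x -> x + 1, 2, 41)  # ships the request and returns a Future almost instantly
wait(f)                            # this is where the remote-side compile/run time is actually spent
fetch(f)                           # 42
```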

> Need to fix the error return type in the test though.

You mean this one right? https://buildkite.com/julialang/julia-master/builds/10190#d6e3e440-c11b-46c5-9ab1-489cea189b10/361-912

(I got lost in which tests here are "expectedly failing" and which ones should not fail.)

Is there any good way to unwrap the right choice of TaskFailedExceptions? It seems to me that doing that properly would require adding some wrapper logic so that the Task gets started right away but it dodges the TaskFailedException wrap in sync_end... Alternatively I can catch the exceptions manually in remotecall_eval, mark them, and then rethrow them all correctly unwrapped after the sync is done.

@exaexa (Contributor, author) commented Mar 21, 2022

Re exceptions: it seems there's no way around it other than implementing a small unwrapping helper (I was pointed to a related issue here: #38931).

I'll push my attempt later today. :D

@exaexa (Contributor, author) commented Mar 21, 2022

What I pushed is a "slightly less painful" way to do that, with less code duplication (it still needs a bit of polishing). It is a bit suboptimal because it first creates the TaskFailedException and then forcibly unwraps it, while the "optimal" way would be to reimplement a lot of Task to completely avoid this wrap-unwrap. I guess having less code and an almost-negligible-cost wrap&unwrap is the better option. Opinions/recommendations welcome though.

It also seems to kinda solve #38931. Almost: we'd need the same wrapper for @spawn.
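For illustration, the kind of helper being discussed could look roughly like the sketch below (hypothetical name, not the code this PR actually adds): run the work in a task so the waits can interleave, but rethrow the original exception instead of the TaskFailedException wrapper.

```
# Hypothetical sketch; wait_unwrapped is an illustrative name only.
function wait_unwrapped(t::Task)
    try
        wait(t)                        # may throw a TaskFailedException
    catch err
        if err isa TaskFailedException
            throw(err.task.exception)  # rethrow the task's original exception, without the wrapper
        end
        rethrow()
    end
    return t
end

# usage sketch:
#   tasks = [@async remotecall_wait(f, p) for p in pids]
#   foreach(wait_unwrapped, tasks)
```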

(4 review threads on base/task.jl, outdated and resolved)
@tkf (Member) commented Mar 23, 2022

> Yeah remotecall itself is super-fast (I'd say it wouldn't even need to be asynchronous, except for dodging an extra tiny piece of network latency); the "expensive" action that we're dodging here is what gets triggered on the remote from @sync after you use @sync_add on the Future returned from the remotecall.

Maybe it's just me, but I find this description (and the one in the OP) confusing. Whatever happens with @sync_add remotecall(f, p) has to happen with @async unwrap remotecall_wait(f, p) for Distributed to function properly. Perhaps a more verbose way to describe this is that, since wait(::Future) requires several IOs (presumably), it is beneficial to interleave these IOs. For example, the primary process can start fetching from worker B before fetching from worker A completes. So, I find it easier to understand if the explanation is that we are increasing overall performance by interleaving expensive actions (IOs), and not by dodging anything. Does this make sense?

@exaexa (Contributor, author) commented Mar 23, 2022

@tkf yes, it seems that the second "dodging" in my explanation wasn't the best word choice. 😅

Just to make it perfectly clear,

- the actions performed with and without this PR are the same
- the main difference is that the calls to wait() here are forcibly interleaved (by @async), which is not the case for the naive wait()s in sync_end()
- the main source of slowness here was that the first wait() caused compilation on the remote side (>0.2s in my case). As a consequence, calling these serially really hurt. (If Distributed got precompiled (#42156, reenable the precompile generation for Distributed), the main source of slowness would mostly disappear, but the wait()s would still be serial (in my case ~0.05s per call) and thus still a scalability problem.)

@tkf (Member) commented Mar 23, 2022

Thanks for the clarification. I was also wondering whether the main benefits come from the "pre-computation" I/O in remotecall or the "post-computation" I/O in fetch(::Future).

@KristofferC merged commit 62e0729 into JuliaLang:master on Mar 23, 2022
@exaexa deleted the mk-fix-serial-sync-add branch on March 23, 2022 10:22
@exaexa (Contributor, author) commented Mar 23, 2022

🎉

Thanks everyone for the comments & hints!

@exaexa (Contributor, author) commented Mar 23, 2022

PS @KristofferC is there any chance for backporting this to 1.6 or 1.7 branches? If so, how do I start it? (open PRs against release-1.[67] ?)

@oscardssmith (Member) commented

imo, we don't really need to back-port to 1.7 since anyone using 1.7 will probably switch to 1.8 soon (and it's not a bugfix). That said, backporting to 1.6 probably makes sense since large clusters are more likely than most to stick to LTS.

@jishnub (Contributor) commented Mar 23, 2022

> anyone using 1.7 will probably switch to 1.8 soon (and it's not a bugfix).

Is this being backported to 1.8? That would be nice. I don't see a label though

@KristofferC added the "backport 1.8" and "backport 1.6" labels on Mar 23, 2022
@vtjnash (Member) commented Mar 23, 2022

Now that I understand better, could you try just using Experimental.@sync here? I think that is written to avoid the problems you described.

@exaexa (Contributor, author) commented Mar 23, 2022

> Now that I understand better, could you try just using Experimental.@sync here? I think that is written to avoid the problems you described.

Yeah, it seems like that should work too, although it clearly does some kind of event counting, which I'm not a fan of (it's breakable). Also, if I read correctly, it's only able to pick up a single exception from the failing tasks, right?
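For reference, the suggested usage would look roughly like this (a quick sketch with a trivial payload, not a drop-in replacement for the code in this PR); Base.Experimental.@sync waits for the lexically enclosed tasks, but rethrows only the first failure it sees:

```
using Distributed
addprocs(2)

# Experimental.@sync counts the lexically enclosed @async tasks and returns
# once all of them are done, or as soon as the first one fails.
Base.Experimental.@sync for p in workers()
    @async remotecall_wait(+, p, 1, 1)
end
```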

KristofferC pushed a commit that referenced this pull request Mar 25, 2022
* avoid using `@sync_add` on remotecalls
(cherry picked from commit 62e0729)
@KristofferC mentioned this pull request Mar 25, 2022
@KristofferC removed the "backport 1.8" label on Mar 29, 2022
@giordano (Contributor) commented

Since there may be a new version in the v1.7.x series, I added the backport 1.7 label.

@giordano mentioned this pull request Apr 20, 2022
KristofferC pushed a commit that referenced this pull request Apr 20, 2022
* avoid using `@sync_add` on remotecalls
(cherry picked from commit 62e0729)
@KristofferC mentioned this pull request May 16, 2022
KristofferC pushed a commit that referenced this pull request May 23, 2022
* avoid using `@sync_add` on remotecalls
(cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this pull request May 23, 2022
* avoid using `@sync_add` on remotecalls
(cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this pull request Jul 4, 2022
* avoid using `@sync_add` on remotecalls
(cherry picked from commit 62e0729)
@KristofferC removed the "backport 1.6" label on Jul 6, 2022
KristofferC pushed a commit that referenced this pull request Dec 21, 2022
* avoid using `@sync_add` on remotecalls
(cherry picked from commit 62e0729)
staticfloat pushed a commit that referenced this pull request Dec 23, 2022
* avoid using `@sync_add` on remotecalls
(cherry picked from commit 62e0729)
vchuravy pushed a commit to JuliaLang/Distributed.jl that referenced this pull request Oct 6, 2023
* avoid using `@sync_add` on remotecalls
(cherry picked from commit 3b57a49)
Keno pushed a commit that referenced this pull request Jun 5, 2024
* avoid using `@sync_add` on remotecalls
Labels: parallelism (Parallel or distributed computation), performance (Must go faster)
Projects: none yet
10 participants