Allow rethrowing exception from task with original backtrace #38931

Open
nalimilan opened this issue Dec 17, 2020 · 6 comments
Labels
error handling: Handling of exceptions by Julia or the user
multithreading: Base.Threads and related functionality
speculative: Whether the change will be implemented is speculative

Comments

@nalimilan (Member) commented Dec 17, 2020

When an exception happens in a task, a TaskFailedException is thrown when the task is fetched or waited on. But packages that accept a user-provided function and call it in a task (like DataFrames) generally don't want to throw this exception type; they want to rethrow the original exception raised in user code instead. Otherwise, changing the implementation (e.g. adding multithreading support) would change the user-visible behavior and break the API.

Currently there doesn't seem to be a good way to do this. The package ExceptionUnwrapping.jl has even been created to work around it.

The best mitigation I could find is to use try...catch to catch the TaskFailedException and then call throw on the task's exception. This throws the original exception type, which preserves the API. But the backtrace points at the place where the exception was rethrown rather than at the original place where the error happened, and users have to look at the third nested exception to see the most useful backtrace.

Would there be a way to preserve the original backtrace? If not, wouldn't it be useful to allow this kind of pattern?

# User passes custom function f to the package
julia> f(x) = nonexistent(x)
f (generic function with 1 method)

# Default: TaskFailedException
julia> fetch(Threads.@spawn f(1))
ERROR: 
TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] fetch(t::Task)
   @ Base ./task.jl:332
 [3] top-level scope
   @ threadingconstructs.jl:179

    nested task error: UndefVarError: nonexistent not defined
    Stacktrace:
     [1] f(x::Int64)
       @ Main ./REPL[1]:1
     [2] (::var"#1#2")()
       @ Main ./threadingconstructs.jl:169

# Best workaround I could find
# Note that user function f only appears in the third backtrace
julia> try
           t = Threads.@spawn f(1)
           fetch(t)
       catch
           throw(t.exception)
       end
ERROR: UndefVarError: t not defined
Stacktrace:
 [1] top-level scope
   @ REPL[3]:5

caused by: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] fetch(t::Task)
   @ Base ./task.jl:332
 [3] top-level scope
   @ REPL[3]:3

    nested task error: UndefVarError: nonexistent not defined
    Stacktrace:
     [1] f(x::Int64)
       @ Main ./REPL[1]:1
     [2] (::var"#3#4")()
       @ Main ./threadingconstructs.jl:169
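For reference, here is the same workaround with the scoping problem fixed (a minimal sketch, not an endorsed pattern): in the session above, `t` was assigned inside the `try` block and was therefore not visible in the `catch` clause, hence the extra `UndefVarError: t not defined`. Assigning `t` before the `try` avoids that, but the rethrown exception's backtrace still points at the `throw` call rather than at `f`, which is exactly the limitation this issue is about.

```julia
f(x) = nonexistent(x)      # user-provided function, same as above

t = Threads.@spawn f(1)    # spawn outside the try so `t` is visible in the catch clause
try
    fetch(t)
catch
    # Rethrows the original UndefVarError, but with a backtrace pointing here
    throw(t.exception)
end
```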
@nalimilan added the speculative, error handling, and multithreading labels on Dec 17, 2020
@bkamins (Member) commented Dec 17, 2020

This is inconvenient indeed. My only question would be whether:

try
    t = Threads.@spawn f(1)
    fetch(t)
catch
    throw(t.exception)
end

is guaranteed to throw the TaskFailedException, or whether we have to put an additional check for it in the catch clause.
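One way to make that explicit (an illustrative sketch only, building on the snippet from the issue body): keep `t` outside the `try` and unwrap only when the caught exception really is a `TaskFailedException` for that task, rethrowing anything else unchanged.

```julia
t = Threads.@spawn f(1)
try
    fetch(t)
catch e
    if e isa TaskFailedException && e.task === t
        throw(t.exception)   # unwrap the user-code exception
    else
        rethrow()            # e.g. an InterruptException hitting fetch itself
    end
end
```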

@JeffBezanson (Member) commented:

I see the issue, but I'm not sure we want libraries making piecemeal decisions about what exception information to hide. I would prefer something systematic, e.g. having @sync or @threads propagate the inner exception only.

@bkamins (Member) commented Dec 17, 2020

> I would prefer something systematic, e.g. having @sync or @threads propagate the inner exception only.

This is what we would prefer also 😄.

@nalimilan (Member, Author) commented:

Yes that would be even nicer for most cases. Though maybe being able to do that when you call fetch would also be useful -- not sure.
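As an illustration of what unwrapping at `fetch` time could look like on the package side today, here is a hypothetical helper (the name `fetch_unwrap` and the function itself are made up for this sketch, not an existing or proposed API); it still cannot attach the original backtrace to the rethrown exception:

```julia
# Hypothetical helper, not part of Base: fetch a task and, if it failed,
# rethrow the wrapped user-code exception instead of the TaskFailedException.
function fetch_unwrap(t::Task)
    try
        return fetch(t)
    catch e
        (e isa TaskFailedException && e.task === t) ? throw(t.exception) : rethrow()
    end
end

fetch_unwrap(Threads.@spawn f(1))   # throws UndefVarError directly (backtrace still points here)
```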

@bkamins (Member) commented Mar 27, 2021

bumping this issue. Just to show you a stack trace I got today:

julia> combine(d -> d.x == [1] ? d[1, [1, 2]] : d[1, [2, 1]], gdf)
ERROR: ArgumentError: return value must have the same column names for all groups (got (:x, :y) and [:y, :x])
Stacktrace:
 [1] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:573
 [2] _combine_prepare(gd::GroupedDataFrame{DataFrame}, cs::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:66
 [3] #combine#621
   @ ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:674 [inlined]
 [4] #combine#619
   @ ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:663 [inlined]
 [5] combine(f::Function, gd::GroupedDataFrame{DataFrame})
   @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:660
 [6] top-level scope
   @ REPL[8]:1

caused by: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:569
 [3] _combine_prepare(gd::GroupedDataFrame{DataFrame}, cs::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:66
 [4] #combine#621
   @ ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:674 [inlined]
 [5] #combine#619
   @ ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:663 [inlined]
 [6] combine(f::Function, gd::GroupedDataFrame{DataFrame})
   @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:660
 [7] top-level scope
   @ REPL[8]:1

    nested task error: ArgumentError: return value must have the same column names for all groups (got (:x, :y) and [:y, :x])
    Stacktrace:
     [1] _combine_rows_with_first!(firstrow::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, outcols::Tuple{Vector{Int64}, Vector{Int64}}, f::Function, gd::GroupedDataFrame{DataFrame}, incols::Nothing, colnames::Tuple{Symbol, Symbol}, firstmulticol::Val{true})
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:267
     [2] _combine_with_first(first::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, f::Function, gd::GroupedDataFrame{DataFrame}, incols::Nothing, firstmulticol::Val{true}, idx_agg::Vector{Int64})
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:63
     [3] _combine_multicol(firstres::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, fun::Function, gd::GroupedDataFrame{DataFrame}, incols::Nothing)
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:20
     [4] _combine_process_callable(cs_i::Union{Function, Type}, optional_i::Bool, parentdf::DataFrame, gd::GroupedDataFrame{DataFrame}, seen_cols::Dict{Symbol, Tuple{Bool, Int64}}, trans_res::Vector{DataFrames.TransformationResult}, idx_agg::Base.RefValue{Union{Nothing, Vector{Int64}}})
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:268
     [5] macro expansion
       @ ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:558 [inlined]
     [6] (::DataFrames.var"#609#615"{GroupedDataFrame{DataFrame}, Bool, Bool, DataFrame, Dict{Symbol, Tuple{Bool, Int64}}, Vector{DataFrames.TransformationResult}, Base.RefValue{Union{Nothing, Vector{Int64}}}, Bool, var"#5#6"})()
       @ DataFrames ./threadingconstructs.jl:169
    
    caused by: TaskFailedException
    Stacktrace:
     [1] wait
       @ ./task.jl:317 [inlined]
     [2] _combine_rows_with_first!(firstrow::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, outcols::Tuple{Vector{Int64}, Vector{Int64}}, f::Function, gd::GroupedDataFrame{DataFrame}, incols::Nothing, colnames::Tuple{Symbol, Symbol}, firstmulticol::Val{true})
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:265
     [3] _combine_with_first(first::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, f::Function, gd::GroupedDataFrame{DataFrame}, incols::Nothing, firstmulticol::Val{true}, idx_agg::Vector{Int64})
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:63
     [4] _combine_multicol(firstres::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, fun::Function, gd::GroupedDataFrame{DataFrame}, incols::Nothing)
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:20
     [5] _combine_process_callable(cs_i::Union{Function, Type}, optional_i::Bool, parentdf::DataFrame, gd::GroupedDataFrame{DataFrame}, seen_cols::Dict{Symbol, Tuple{Bool, Int64}}, trans_res::Vector{DataFrames.TransformationResult}, idx_agg::Base.RefValue{Union{Nothing, Vector{Int64}}})
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:268
     [6] macro expansion
       @ ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:558 [inlined]
     [7] (::DataFrames.var"#609#615"{GroupedDataFrame{DataFrame}, Bool, Bool, DataFrame, Dict{Symbol, Tuple{Bool, Int64}}, Vector{DataFrames.TransformationResult}, Base.RefValue{Union{Nothing, Vector{Int64}}}, Bool, var"#5#6"})()
       @ DataFrames ./threadingconstructs.jl:169
    
        nested task error: ArgumentError: return value must have the same column names for all groups (got (:x, :y) and [:y, :x])
        Stacktrace:
         [1] fill_row!(row::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, outcols::Tuple{Vector{Int64}, Vector{Int64}}, i::Int64, colstart::Int64, colnames::Tuple{Symbol, Symbol})
           @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:84
         [2] _combine_rows_with_first_task!(tid::Int64, rowstart::Int64, rowend::Int64, rownext::Int64, outcols::Tuple{Vector{Int64}, Vector{Int64}}, outcolsref::Base.RefValue{Tuple{Vararg{AbstractVector{T} where T, var"#s280"}} where var"#s280"}, type_widened::Vector{Bool}, widen_type_lock::ReentrantLock, f::var"#5#6", gd::GroupedDataFrame{DataFrame}, starts::Vector{Int64}, ends::Vector{Int64}, incols::Nothing, colnames::Tuple{Symbol, Symbol}, firstmulticol::Val{true})
           @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:122
         [3] (::DataFrames.var"#664#665"{var"#5#6", GroupedDataFrame{DataFrame}, Nothing, Tuple{Symbol, Symbol}, Val{true}, Vector{Bool}, Base.RefValue{Tuple{Vararg{AbstractVector{T} where T, var"#s280"}} where var"#s280"}, ReentrantLock, Vector{Int64}, Vector{Int64}, UnitRange{Int64}, Int64})()
           @ DataFrames ./threadingconstructs.jl:169

KristofferC pushed a commit that referenced this issue Mar 23, 2022
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() for all the futures synchronously. Not only is that
slightly detrimental for network operations (latencies add up), but in the case
of Distributed the call to wait() may actually trigger some compilation on
remote processes, which is also wait()ed for. As a result, some operations took
a great amount of "serial" processing time when executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when the worker count gets high. The processing time
on each worker is usually around 0.3s, so triggering this problem even on a
relatively small cluster (64 workers) causes a really annoying delay, and
running `@everywhere` for the first time on reasonably sized clusters (I tested
with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is kind of complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <[email protected]>
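To illustrate the serial-versus-concurrent waiting the commit message describes, here is a conceptual sketch only, not the actual Distributed.jl change (`do_work` is a placeholder for whatever runs on the workers):

```julia
using Distributed

futures = [remotecall(do_work, w) for w in workers()]

# Alternative 1 -- serial waiting: latencies (and any remote compilation)
# add up one worker at a time.
for fut in futures
    wait(fut)
end

# Alternative 2 -- concurrent waiting: each wait runs in its own task,
# so the total time is roughly that of the slowest worker.
@sync for fut in futures
    @async wait(fut)
end
```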
KristofferC pushed a commit that referenced this issue Mar 25, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue Apr 20, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue May 23, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue May 23, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue Jul 4, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue Dec 21, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
staticfloat pushed a commit that referenced this issue Dec 23, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
@nrontsis (Contributor) commented:

Have there been any developments to this? Thanks

vchuravy pushed a commit to JuliaLang/Distributed.jl that referenced this issue Oct 6, 2023
* avoid using `@sync_add` on remotecalls (cherry picked from commit 3b57a49)
Keno pushed a commit that referenced this issue Jun 5, 2024
* avoid using `@sync_add` on remotecalls