Allow rethrowing exception from task with original backtrace #38931

nalimilan · 2020-12-17T17:45:32Z

When an exception happens in a task, a task TaskFailedException is thrown. But packages that accept a user-provided function and call it in a task (like DataFrames) generally don't want to throw this exception type, but rethrow the original exception that happened in user code instead. Otherwise, changing the implementation (e.g. adding multithreading support) would change the user-visible behavior and break the API.

Currently there doesn't seem to be a good way to do this. The package ExceptionUnwrapping.jl has even been created to work around that.

The best mitigation I could find is to use try... catch to catch the TaskFailedException, and then call throw on the task's exception. This throws the original exception type, which allows preserving the API. But the backtrace refers to the place where the exception was rethrown rather than to the original place where the error happened, and users have to look at the third nested exception to see the most useful backtrace.

Would there be a way to preserve the original backtrace? If not, wouldn't it be useful to allow this kind of pattern?

# User passes custom function f to the package
julia> f(x) = nonexistent(x)
f (generic function with 1 method)

# Default: TaskFailedException
julia> fetch(Threads.@spawn f(1))
ERROR: 
TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] fetch(t::Task)
   @ Base ./task.jl:332
 [3] top-level scope
   @ threadingconstructs.jl:179

    nested task error: UndefVarError: nonexistent not defined
    Stacktrace:
     [1] f(x::Int64)
       @ Main ./REPL[1]:1
     [2] (::var"#1#2")()
       @ Main ./threadingconstructs.jl:169

# Best workaround I could find
# Note that user function f only appears in the third backtrace
julia> try
           t = Threads.@spawn f(1)
           fetch(t)
       catch
           throw(t.exception)
       end
ERROR: UndefVarError: t not defined
Stacktrace:
 [1] top-level scope
   @ REPL[3]:5

caused by: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] fetch(t::Task)
   @ Base ./task.jl:332
 [3] top-level scope
   @ REPL[3]:3

    nested task error: UndefVarError: nonexistent not defined
    Stacktrace:
     [1] f(x::Int64)
       @ Main ./REPL[1]:1
     [2] (::var"#3#4")()
       @ Main ./threadingconstructs.jl:169

The text was updated successfully, but these errors were encountered:

bkamins · 2020-12-17T18:12:52Z

This is inconvenient indeed. My only question would be if:

 try
           t = Threads.@spawn f(1)
           fetch(t)
       catch
           throw(t.exception)
       end

is guaranteed to throw the TaskFailedException or we have to put an additional check for it in the catch clause.

JeffBezanson · 2020-12-17T18:47:00Z

I see the issue but I'm not sure we want libraries making piece-meal decisions about what exception information to hide. I would prefer something systematic, e.g. having @sync or @threads propagate the inner exception only.

bkamins · 2020-12-17T18:57:03Z

I would prefer something systematic, e.g. having @sync or @threads propagate the inner exception only.

This is what we would prefer also 😄.

nalimilan · 2020-12-17T20:25:57Z

Yes that would be even nicer for most cases. Though maybe being able to do that when you call fetch would also be useful -- not sure.

bkamins · 2021-03-27T15:20:05Z

bumping this issue. Just to show you a stack trace I got today:

julia> combine(d -> d.x == [1] ? d[1, [1, 2]] : d[1, [2, 1]], gdf)
ERROR: ArgumentError: return value must have the same column names for all groups (got (:x, :y) and [:y, :x])
Stacktrace:
 [1] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:573
 [2] _combine_prepare(gd::GroupedDataFrame{DataFrame}, cs::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:66
 [3] #combine#621
   @ ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:674 [inlined]
 [4] #combine#619
   @ ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:663 [inlined]
 [5] combine(f::Function, gd::GroupedDataFrame{DataFrame})
   @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:660
 [6] top-level scope
   @ REPL[8]:1

caused by: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:569
 [3] _combine_prepare(gd::GroupedDataFrame{DataFrame}, cs::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:66
 [4] #combine#621
   @ ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:674 [inlined]
 [5] #combine#619
   @ ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:663 [inlined]
 [6] combine(f::Function, gd::GroupedDataFrame{DataFrame})
   @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:660
 [7] top-level scope
   @ REPL[8]:1

    nested task error: ArgumentError: return value must have the same column names for all groups (got (:x, :y) and [:y, :x])
    Stacktrace:
     [1] _combine_rows_with_first!(firstrow::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, outcols::Tuple{Vector{Int64}, Vector{Int64}}, f::Function, gd::GroupedDataFrame{DataFrame}, incols::Nothing, colnames::Tuple{Symbol, Symbol}, firstmulticol::Val{true})
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:267
     [2] _combine_with_first(first::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, f::Function, gd::GroupedDataFrame{DataFrame}, incols::Nothing, firstmulticol::Val{true}, idx_agg::Vector{Int64})
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:63
     [3] _combine_multicol(firstres::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, fun::Function, gd::GroupedDataFrame{DataFrame}, incols::Nothing)
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:20
     [4] _combine_process_callable(cs_i::Union{Function, Type}, optional_i::Bool, parentdf::DataFrame, gd::GroupedDataFrame{DataFrame}, seen_cols::Dict{Symbol, Tuple{Bool, Int64}}, trans_res::Vector{DataFrames.TransformationResult}, idx_agg::Base.RefValue{Union{Nothing, Vector{Int64}}})
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:268
     [5] macro expansion
       @ ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:558 [inlined]
     [6] (::DataFrames.var"#609#615"{GroupedDataFrame{DataFrame}, Bool, Bool, DataFrame, Dict{Symbol, Tuple{Bool, Int64}}, Vector{DataFrames.TransformationResult}, Base.RefValue{Union{Nothing, Vector{Int64}}}, Bool, var"#5#6"})()
       @ DataFrames ./threadingconstructs.jl:169
    
    caused by: TaskFailedException
    Stacktrace:
     [1] wait
       @ ./task.jl:317 [inlined]
     [2] _combine_rows_with_first!(firstrow::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, outcols::Tuple{Vector{Int64}, Vector{Int64}}, f::Function, gd::GroupedDataFrame{DataFrame}, incols::Nothing, colnames::Tuple{Symbol, Symbol}, firstmulticol::Val{true})
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:265
     [3] _combine_with_first(first::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, f::Function, gd::GroupedDataFrame{DataFrame}, incols::Nothing, firstmulticol::Val{true}, idx_agg::Vector{Int64})
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:63
     [4] _combine_multicol(firstres::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, fun::Function, gd::GroupedDataFrame{DataFrame}, incols::Nothing)
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:20
     [5] _combine_process_callable(cs_i::Union{Function, Type}, optional_i::Bool, parentdf::DataFrame, gd::GroupedDataFrame{DataFrame}, seen_cols::Dict{Symbol, Tuple{Bool, Int64}}, trans_res::Vector{DataFrames.TransformationResult}, idx_agg::Base.RefValue{Union{Nothing, Vector{Int64}}})
       @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:268
     [6] macro expansion
       @ ~/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:558 [inlined]
     [7] (::DataFrames.var"#609#615"{GroupedDataFrame{DataFrame}, Bool, Bool, DataFrame, Dict{Symbol, Tuple{Bool, Int64}}, Vector{DataFrames.TransformationResult}, Base.RefValue{Union{Nothing, Vector{Int64}}}, Bool, var"#5#6"})()
       @ DataFrames ./threadingconstructs.jl:169
    
        nested task error: ArgumentError: return value must have the same column names for all groups (got (:x, :y) and [:y, :x])
        Stacktrace:
         [1] fill_row!(row::DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}, outcols::Tuple{Vector{Int64}, Vector{Int64}}, i::Int64, colstart::Int64, colnames::Tuple{Symbol, Symbol})
           @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:84
         [2] _combine_rows_with_first_task!(tid::Int64, rowstart::Int64, rowend::Int64, rownext::Int64, outcols::Tuple{Vector{Int64}, Vector{Int64}}, outcolsref::Base.RefValue{Tuple{Vararg{AbstractVector{T} where T, var"#s280"}} where var"#s280"}, type_widened::Vector{Bool}, widen_type_lock::ReentrantLock, f::var"#5#6", gd::GroupedDataFrame{DataFrame}, starts::Vector{Int64}, ends::Vector{Int64}, incols::Nothing, colnames::Tuple{Symbol, Symbol}, firstmulticol::Val{true})
           @ DataFrames ~/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:122
         [3] (::DataFrames.var"#664#665"{var"#5#6", GroupedDataFrame{DataFrame}, Nothing, Tuple{Symbol, Symbol}, Val{true}, Vector{Bool}, Base.RefValue{Tuple{Vararg{AbstractVector{T} where T, var"#s280"}} where var"#s280"}, ReentrantLock, Vector{Int64}, Vector{Int64}, UnitRange{Int64}, Int64})()
           @ DataFrames ./threadingconstructs.jl:169

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <[email protected]>

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <[email protected]> (cherry picked from commit 62e0729)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <[email protected]> (cherry picked from commit 62e0729)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <[email protected]> (cherry picked from commit 62e0729)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <[email protected]> (cherry picked from commit 62e0729)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <[email protected]> (cherry picked from commit 62e0729)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <[email protected]> (cherry picked from commit 62e0729)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <[email protected]> (cherry picked from commit 62e0729)

nrontsis · 2023-02-15T11:52:10Z

Have there been any developments to this? Thanks

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes JuliaLang/julia#44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see JuliaLang/julia#44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes JuliaLang/julia#39291. - JuliaLang/julia#42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with JuliaLang/julia#38931 Co-authored-by: Valentin Churavy <[email protected]> (cherry picked from commit 3b57a49)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <[email protected]>

nalimilan added speculative Whether the change will be implemented is speculative error handling Handling of exceptions by Julia or the user multithreading Base.Threads and related functionality labels Dec 17, 2020

nalimilan mentioned this issue Jun 23, 2021

should @sync unwrap TaskFailedExceptions when constructing CompositeException? #41301

Closed

exaexa mentioned this issue Mar 21, 2022

avoid using @sync_add on remotecalls #44671

Merged

nalimilan mentioned this issue Jan 6, 2023

improve error message in operation specification syntax JuliaData/DataFrames.jl#3263

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow rethrowing exception from task with original backtrace #38931

Allow rethrowing exception from task with original backtrace #38931

nalimilan commented Dec 17, 2020 •

edited

Loading

bkamins commented Dec 17, 2020

JeffBezanson commented Dec 17, 2020

bkamins commented Dec 17, 2020

nalimilan commented Dec 17, 2020

bkamins commented Mar 27, 2021

nrontsis commented Feb 15, 2023

Allow rethrowing exception from task with original backtrace #38931

Allow rethrowing exception from task with original backtrace #38931

Comments

nalimilan commented Dec 17, 2020 • edited Loading

bkamins commented Dec 17, 2020

JeffBezanson commented Dec 17, 2020

bkamins commented Dec 17, 2020

nalimilan commented Dec 17, 2020

bkamins commented Mar 27, 2021

nrontsis commented Feb 15, 2023

nalimilan commented Dec 17, 2020 •

edited

Loading