In this issue, I will give a brief overview of the details from this Discourse topic.
First, the environment: a cluster with a head node and 18 compute nodes, managed by Slurm. The root cause of the issue that cost me two days was that the head node had Julia v1.0.1 while the compute nodes were still on v0.6. But I'll start from the beginning.
It all started when, on running addprocs(SlurmManager(1)), I got a cryptic exception. The worker julia binary launched on the compute node crashed with the following output (written to stdout/stderr):
julia_worker:9009#172.16.x.x
MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549)CapturedException(MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549), Any[(setindex!(::Array{Tuple,1}, ::Symbol, ::Int64) at array.jl:583, 1), ((::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at event.jl:73, 1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
After two days of debugging and a million println statements to trace the calls, I realized the problem was in the process_hdr function, which is called from a try/catch block in message_handler_loop. The process_hdr function verifies that the version of the launched worker matches that of the master process:
# excerpt from process_hdr (the println is my own debugging addition)
if length(version) < HDR_VERSION_LEN
    println("about to throw an error")
    error("Version read failed. Connection closed by peer.")
end
If the version check fails, an error is thrown with a meaningful message. If I had seen this error, it would have saved me quite a bit of time. Since this function is called from message_handler_loop, shouldn't this error message propagate to the catch block? I.e.:
function message_handler_loop(r_stream::IO, w_stream::IO, incoming::Bool)
    try
        version = process_hdr(r_stream, incoming)  ## ERROR IS THROWN HERE.
        ...
    catch e
        if wpid < 1
            println(stderr, e, CapturedException(e, catch_backtrace()))
            println(stderr, "Process($(myid())) - Unknown remote, closing connection.")
        elseif !(wpid in map_del_wrkr)
            ...
        end
        ...
    end
end
The e here should be the error thrown above. Instead, it's a cryptic MethodError about convert trying to put a Symbol into an array of Tuples, and I don't know where it comes from. I'm guessing the launched worker binary, which is v0.6, crashes while trying to communicate (likely due to a breaking change between 0.6 and 1.0). But even if the worker binary crashes, shouldn't the master process still be able to print the correct error? Since the master process couldn't connect within 60 seconds, it simply terminates the worker and prints Worker x terminated.
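For what it's worth, a toy sketch (plain Base Julia, not Distributed's actual code) shows the behaviour I expected: an error thrown inside the try block reaches the catch and is printed with its original message, which is why the unrelated MethodError surprised me.
# Toy sketch, not Distributed code: an error raised inside the try block
# is caught as-is and printed via CapturedException.
function toy_handler_loop()
    try
        error("Version read failed. Connection closed by peer.")
    catch e
        println(stderr, e, CapturedException(e, catch_backtrace()))
        println(stderr, "Unknown remote, closing connection.")
    end
end
toy_handler_loop()  # prints the ErrorException with its original message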
I can't really reproduce this anymore, since as of this morning the sysadmin has upgraded all nodes to 1.0.3. However, I hope I've provided enough information for someone who knows their way around Distributed to provide some input.
Edit: Maybe #28878 fixes this issue, but unfortunately I won't be able to test it (it would require someone to put 0.6 back on the compute nodes, which our administration won't allow).
Edit 2: A way to reproduce this is to deliberately throw an error from process_hdr without going through the if statement. I can do that if more information is needed; a rough sketch of what I mean is below.
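To be concrete, something along these lines on the master should force the handshake to fail with the meaningful error. This is only a sketch: it assumes process_hdr has the two-argument signature (s, validate_cookie) from the 1.0 sources, and the method overwrite via @eval is a quick hack rather than a proper patch.
using Distributed

# Sketch: overwrite process_hdr on the master so every worker handshake throws
# the meaningful error, then see whether that message (or the cryptic
# MethodError) reaches the catch block in message_handler_loop.
# Assumes the (s, validate_cookie) signature from the 1.0 sources.
@eval Distributed function process_hdr(s, validate_cookie)
    error("Version read failed. Connection closed by peer.")
end

addprocs(1)   # each incoming worker connection should now hit the forced error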