Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multithread issue(julia 1.71) #182

Closed
caseykneale opened this issue Mar 16, 2022 · 27 comments
Closed

Multithread issue(julia 1.71) #182

caseykneale opened this issue Mar 16, 2022 · 27 comments

Comments

@caseykneale
Copy link

Conditions: Ubuntu/MacOS, Julia 1.7.1/1.6.5
Related issues: #94 , #110

Note: the examples here are MRE's and are not what the troublesome code looks like. It seems like a multi-threaded julia application calling Azure/Downloads suffers sporadically from this issue.

The error looks like this(where the last 3 lines repeat seemingly forever):

     From worker 4:    ┌ Error: 2022-03-14T16:11:24.498 (txn_id: 1044876555423232961)
     From worker 4:    │ curl_multi_socket_action: 8
     From worker 4:    └ @ Downloads.Curl /build/source/usr/share/julia/stdlib/v1.7/Downloads/src/Curl/utils.jl:36

Can sometimes reproduce with: JULIA_NUM_THREADS=4 julia171

using Downloads

Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
    open("/tmp/cool$url", "w") do io
        resp = Downloads.request("http://google.com"; output=io, method="GET")
    end
end

If this does not reproduce the curl_multi_socket_action bug rerun the Threads.@threads loop until it happens (typically takes 1-5 executions).

Also sporadically reproduces without the explicit I/O scope/pattern with the same frequency, ie:

using Downloads
Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
    resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
end

Maybe more/less interesting: without the output field(ie: Downloads.request("http://google.com"; method="GET")) there is another nondeterministic error that infrequently occurs..:

ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:334 [inlined]
 [2] threading_run(func::Function)
   @ Base.Threads ./threadingconstructs.jl:38
 [3] top-level scope
   @ ./threadingconstructs.jl:97

    nested task error: val already in a list
    Stacktrace:
      [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
        @ Base ./task.jl:777
      [2] wait()
        @ Base ./task.jl:837
      [3] wait(c::Base.GenericCondition{ReentrantLock})
        @ Base ./condition.jl:123
      [4] take_buffered(c::Channel{Any})
        @ Base ./channels.jl:389
      [5] take!
        @ ./channels.jl:383 [inlined]
      [6] sync_end(c::Channel{Any})
        @ Base.Experimental ./experimental.jl:64
      [7] macro expansion
        @ ./experimental.jl:102 [inlined]
      [8] (::Downloads.var"#9#18"{Base.DevNull, Base.DevNull, String, Vector{Pair{String, String}}, Float64, Nothing, Bool, Bool, String, Int64, Bool, Bool})(easy::Downloads.Curl.Easy)
        @ Downloads ~/julia-1.7.1/share/julia/stdlib/v1.7/Downloads/src/Downloads.jl:347
      [9] with_handle(f::Downloads.var"#9#18"{Base.DevNull, Base.DevNull, String, Vector{Pair{String, String}}, Float64, Nothing, Bool, Bool, String, Int64, Bool, Bool}, handle::Downloads.Curl.Easy)
        @ Downloads.Curl ~/julia-1.7.1/share/julia/stdlib/v1.7/Downloads/src/Curl/Curl.jl:64
     [10] #8
        @ ~/julia-1.7.1/share/julia/stdlib/v1.7/Downloads/src/Downloads.jl:311 [inlined]
     [11] arg_write(f::Downloads.var"#8#17"{Base.DevNull, String, Vector{Pair{String, String}}, Float64, Nothing, Bool, Bool, String, Int64, Bool, Bool}, arg::Base.DevNull)
        @ ArgTools ~/julia-1.7.1/share/julia/stdlib/v1.7/ArgTools/src/ArgTools.jl:112
     [12] #7
        @ ~/julia-1.7.1/share/julia/stdlib/v1.7/Downloads/src/Downloads.jl:310 [inlined]
     [13] arg_read
        @ ~/julia-1.7.1/share/julia/stdlib/v1.7/ArgTools/src/ArgTools.jl:61 [inlined]
     [14] request(url::String; input::Nothing, output::Nothing, method::String, headers::Vector{Pair{String, String}}, timeout::Float64, progress::Nothing, verbose::Bool, throw::Bool, downloader::Nothing)
        @ Downloads ~/julia-1.7.1/share/julia/stdlib/v1.7/Downloads/src/Downloads.jl:309
     [15] macro expansion
        @ ./REPL[2]:2 [inlined]
     [16] (::var"#124#threadsfor_fun#9"{Vector{String}})(onethread::Bool)
        @ Main ./threadingconstructs.jl:85
     [17] (::var"#124#threadsfor_fun#9"{Vector{String}})()
        @ Main ./threadingconstructs.jl:52

Any help or guidance appreciated

@StefanKarpinski
Copy link
Member

As noted in the issues you linked, Downloads was not threadsafe in Julia 1.7; the Julia 1.8 release is the first one that is threadsafe. As also noted in those issues, you don't need to use threads to do concurrent downloads, you can just uses tasks.

@caseykneale
Copy link
Author

Thanks for the quick reply.

FWIW: I ran the MRE on Julia 1.8beta. The error seems to be gone but - unfortunately if you run the loop maybe 10 times or so (YMMV) you'll get an infinite loop/hanging execution. So the failure mode is different, but it does still seem to fail, unfortunately.

One thing I would like to mention is: this problem appears to happen in programs which are not just threaded loops downloading files. For example, someone(our situation) could have a multithreaded client (say Azure.jl) where downloading files is a only a small "part" of what is happening. So the Task suggestion doesn't work as well there.

@StefanKarpinski
Copy link
Member

I cannot reproduce that on Julia 1.8-beta1 or Julia master on macOS. What operating system are you trying this on and what is the exact MWE?

@caseykneale
Copy link
Author

ubuntu@ip-172-31-11-132:~/dev/raicode$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal
Version 1.8.0-beta1 (2022-02-23)
JULIA_NUM_THREADS=4 julia18b
              _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.0-beta1 (2022-02-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Downloads

julia> Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
           resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
       end

julia> Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
           resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
       end

julia> Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
           resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
       end

julia> Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
           resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
       end

julia> Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
           resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
       end
this hangs forever

@StefanKarpinski
Copy link
Member

I just tried it on the exact same LSB without issues. Not sure what to do given the inability to repro.

@giordano
Copy link
Contributor

giordano commented Mar 16, 2022

For what is worth, that works for me with Julia master on Linux:

julia> @time Threads.@threads for url in 1:400
           resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
       end
 40.885960 seconds (229.03 k allocations: 13.134 MiB, 0.13% compilation time)

julia> Threads.nthreads()
4

@ericphanson
Copy link
Contributor

the Julia 1.8 release is the first one that is threadsafe

1.7.3 should be too, since we backported the fixes (#175)

@caseykneale
Copy link
Author

For what is worth, that works for me with Julia master on Linux:

julia> @time Threads.@threads for url in 1:400
           resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
       end
 40.885960 seconds (229.03 k allocations: 13.134 MiB, 0.13% compilation time)

julia> Threads.nthreads()
4

it works the first randn times that it's run, but if I run it consecutively in the REPL (not hammering it either - ie: waiting for cursor to come back), I get intermittent infinite loop conditions. This is similar to the curl bug - in that I could run the code some number of times before seg faulting. REPL is still frozen from when I made the last post.

I could make a little video but its pretty boring to watch. Maybe it's nothing or linked only to the REPL, but I dunno, it seems like undesirable behavior.

@StefanKarpinski
Copy link
Member

I did the same thing, hitting up in the REPL twenty or so times, running the threaded loop repeatedly. No hangs. I think to debug this further we need some deeper debugging on a system that exhibits the behavior.

@caseykneale
Copy link
Author

Could run an RR trace on it if I could reproduce it in a script. I'll let you know how that goes - maybe tomorrow.

@caseykneale
Copy link
Author

caseykneale commented Mar 17, 2022

I was able to reproduce it in a script, but it is intermittant(IE: may require 3-10 attempts)

Script:

using Downloads

Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
           resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
end

Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
           resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
end

Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
           resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
end

Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
           resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
end

Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
           resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
end
println("did not hang")

Script is called via: JULIA_NUM_THREADS=4 julia18b repro.jl

Please try running this a few times in succession (ideally on linux/ubuntu), if that fails I will try to submit an RR.

Attempts from the console look like this:

ubuntu@ip-:~/dev/mre_downloads$ JULIA_NUM_THREADS=4 julia18b repro.jl 
did not hang
ubuntu@ip-:~/dev/mre_downloads$ JULIA_NUM_THREADS=4 julia18b repro.jl 
did not hang
ubuntu@ip-:~/dev/mre_downloads$ JULIA_NUM_THREADS=4 julia18b repro.jl 
did not hang
ubuntu@ip-:~/dev/mre_downloads$ JULIA_NUM_THREADS=4 julia18b repro.jl 
did not hang
ubuntu@ip-:~/dev/mre_downloads$ JULIA_NUM_THREADS=4 julia18b repro.jl 

(where the last execution is stuck)

@ericphanson
Copy link
Contributor

I get a hang too:

julia> using Downloads

julia> for i = 1:100
       Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
                  resp = Downloads.request("http://google.com"; output="/tmp/cool_$url", method="GET")
       end
       println("$i did not hang")
       end
1 did not hang
2 did not hang
3 did not hang
4 did not hang
5 did not hang
6 did not hang
7 did not hang
8 did not hang
9 did not hang
10 did not hang
11 did not hang
12 did not hang
13 did not hang
14 did not hang
15 did not hang
16 did not hang
17 did not hang
18 did not hang
19 did not hang
20 did not hang
21 did not hang
22 did not hang
23 did not hang
24 did not hang
25 did not hang
26 did not hang
27 did not hang
28 did not hang
29 did not hang
30 did not hang
31 did not hang
32 did not hang
33 did not hang
34 did not hang
35 did not hang
36 did not hang
37 did not hang
38 did not hang
39 did not hang
40 did not hang
41 did not hang
42 did not hang
43 did not hang
44 did not hang

(Then it hung on 45). When I ctrl-c out:

^C^C^C^C^C^CWARNING: Force throwing a SIGINT
^C^CERROR: InterruptException:
Stacktrace:
 [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
   @ Base ./task.jl:834
 [2] wait()
   @ Base ./task.jl:894
 [3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
   @ Base ./condition.jl:124
 [4] _wait(t::Task)
   @ Base ./task.jl:304
 [5] wait
   @ ./task.jl:343 [inlined]
 [6] threading_run(fun::var"#307#threadsfor_fun#62"{var"#307#threadsfor_fun#61#63"{Vector{String}}}, static::Bool)
   @ Base.Threads ./threadingconstructs.jl:38
 [7] macro expansion
   @ ./threadingconstructs.jl:89 [inlined]
 [8] top-level scope
   @ ./REPL[4]:2

On

julia> versioninfo()
Julia Version 1.8.0-beta1
Commit 7b711ce699 (2022-02-23 15:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake-avx512)
  Threads: 4 on 4 virtual cores

shell> lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:        20.04
Codename:       focal

@caseykneale
Copy link
Author

Nice change to the MRE :)

I wouldn't be surprised if this issue is different from the original one presented at the start of this (might be google thinking we're doing a DoS), but it does seem to be unintended behavior that could break something in a hard way to diagnose while in production.

@ericphanson
Copy link
Contributor

ericphanson commented Mar 17, 2022

Hm, well if I change it to

for i = 1:100
       Threads.@threads for url in ["1","2","3","4","5","6","7","8"]
                  resp = Downloads.request("https://httpbingo.julialang.org"; output="/tmp/cool_$url", method="GET")
       end
       sleep(1)
       println("$i did not hang")
end

I get to 46 and then it hangs, same stacktrace.

Would hope httpbingo.julialang.org doesn't mind lots of requests.

@caseykneale
Copy link
Author

Likely not a DoS issue then. Interesting that yours occurs ~40 some odd iterations and on my machine it's happening much sooner.

@StefanKarpinski
Copy link
Member

I can only reproduce the google version of this, which seems like it might be them throttling us.

@caseykneale
Copy link
Author

When I run eric's httpbingo example I stop at roughly the same number of iterations as the google version
ex:

ubuntu@ip-:~/dev/mre_downloads$ JULIA_NUM_THREADS=4 julia18b repro.jl 
1 did not hang
2 did not hang
3 did not hang

(but its stuck on iteration #4)

So its reproducible for me.

@ericphanson
Copy link
Contributor

On my m1 laptop on rosetta, with

❯ julia-beta
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.0-beta1 (2022-02-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Downloads

julia> versioninfo()
Julia Version 1.8.0-beta1
Commit 7b711ce699 (2022-02-23 15:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, westmere)
  Threads: 4 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager

the httpbingo.julialang.org one did not hang, with two tries (200 total runs).

@caseykneale
Copy link
Author

caseykneale commented Mar 18, 2022

On my m1 laptop on rosetta, with

...
the httpbingo.julialang.org one did not hang, with two tries (200 total runs).

I ran both trials on a MacOS rosetta and also did not reproduce the bug(neither on Google nor httpbingo). So - I would suggest this is not network throttling, unless common VM IP addresses are flagged on both httpbingo &b google. Even if it is a network thing - I am not sure if an infinite loop is the correct outcome?

So it seems like its happening on sporadically on Linux for sure. Don't have access to a windows machine, but that would be interesting too.

@StefanKarpinski
Copy link
Member

#188 is interesting: can folks who are able to reproduce this indicate whether they have ~/.julia/conda or not, and also try running:

using Libdl
filter!(contains("curl"), dllist())

@ericphanson
Copy link
Contributor

I've got a ~/.julia/conda, and

julia> using Libdl

julia> filter!(contains("curl"), dllist())
1-element Vector{String}:
 "/home/ubuntu/.asdf/installs/julia/1.8.0-beta3/bin/../lib/julia/libcurl.so"

Then I did mv ~/.julia/conda ~/.julia/conda-backup, and restarted my session, and got the same Libdl output.

However... now I cannot reproduce the original issue, even with the conda folder. (At first that was because I forgot to pass -t 4, but then even with it, on the same 1.8.0-beta1 I was using originally, it doesn't hang).

@caseykneale
Copy link
Author

caseykneale commented Apr 13, 2022

I do have a ~/.julia/conda/

julia> using Libdl

julia> filter!(contains("curl"), dllist())
1-element Vector{String}:
 "/home/ubuntu/julia-1.8.0-beta1/bin/../lib/julia/libcurl.so"

Was able to reproduce the issue.

I did the mv ~/.julia/conda ~/.julia/conda-backup and got the same result as before.

Was able to reproduce the issue again.

Weird that you can't reproduce it anymore @ericphanson
Making an RR now...edit - soon

@caseykneale
Copy link
Author

caseykneale commented Apr 13, 2022

Just tested on beta3. Interestingly, while running an rr on the httpbingo site I could not reproduce it via the REPL. Leading us back to throttling, however, I was able to get an RR using the google example by executing a script instead. @StefanKarpinski where do we submit public RR's too? I can also send it over RAI slack if that's preferable.

ubuntu@ip-172-31-84-172:~$ rr record julia-1.8.0-beta3/bin/julia script.jl -t 4 
rr: Saving execution to trace directory `/home/ubuntu/.local/share/rr/julia-15'.
1 did not hang
2 did not hang
3 did not hang
4 did not hang
5 did not hang
6 did not hang
7 did not hang
8 did not hang
9 did not hang
10 did not hang
11 did not hang
12 did not hang
13 did not hang
14 did not hang
15 did not hang
16 did not hang
17 did not hang
18 did not hang
19 did not hang
20 did not hang
21 did not hang
22 did not hang
23 did not hang
24 did not hang
25 did not hang
26 did not hang
27 did not hang

signal (15): Terminated
in expression starting at /home/ubuntu/script.jl:3
unknown function (ip: 0x70000002)
unknown function (ip: 0x7f00b2bff1a3)
traced_raw_syscall at /home/roc/rr/rr/src/preload/syscallbuf.c:278
sys_epoll_wait at /home/roc/rr/rr/src/preload/syscallbuf.c:2278 [inlined]
syscall_hook_internal at /home/roc/rr/rr/src/preload/syscallbuf.c:3287 [inlined]
syscall_hook at /home/roc/rr/rr/src/preload/syscallbuf.c:3454
unknown function (ip: 0x7f00b2bfa0af)
unknown function (ip: 0x7f00b2bfa10e)
unknown function (ip: 0x7f00b2bfa12a)
epoll_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
uv__io_poll at /workspace/srcdir/libuv/src/unix/epoll.c:240
uv_run at /workspace/srcdir/libuv/src/unix/core.c:383
ijl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:556
poptask at ./task.jl:919
wait at ./task.jl:928
task_done_hook at ./task.jl:632
jfptr_task_done_hook_57395.clone_1 at /home/ubuntu/julia-1.8.0-beta3/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2358 [inlined]
ijl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2540
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1831 [inlined]
jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:254
start_task at /buildworker/worker/package_linux64/build/src/task.c:942
unknown function (ip: (nil))
Allocations: 2901 (Pool: 2890; Big: 11); GC: 0

@StefanKarpinski
Copy link
Member

Having an rr trace is excellent! RAI Slack would be easiest.

@StefanKarpinski
Copy link
Member

According to @vtjnash, this may have been solved by JuliaLang/julia#44956. If this cannot be reproduced on Julia master (or the next version of Julia 1.8 when it comes out), then we can consider this fixed.

@caseykneale
Copy link
Author

Just compiled and ran master on the same machine/configuration experiencing the issue. The issue appears resolved in that release.

@StefanKarpinski
Copy link
Member

Great. I'm gonna close this then. Thanks for the debugging work!

arnaudh added a commit to arnaudh/AWS.jl that referenced this issue Dec 13, 2022
The mentioned issues are closed:
- JuliaLang/Downloads.jl#110 (closed in Nov 2021
- JuliaLang/Downloads.jl#182 (closed in Apr 2022)
mattBrzezinski added a commit to JuliaCloud/AWS.jl that referenced this issue Mar 21, 2023
* Remove mention of Downloads.jl having threading issues

The mentioned issues are closed:
- JuliaLang/Downloads.jl#110 (closed in Nov 2021
- JuliaLang/Downloads.jl#182 (closed in Apr 2022)

* Update src/utilities/downloads_backend.jl

Co-authored-by: arnaudh <[email protected]>

---------

Co-authored-by: mattBrzezinski <[email protected]>
Co-authored-by: mattBrzezinski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants