Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

download() multi-thread error #110

Closed
Moelf opened this issue Apr 11, 2021 · 16 comments
Closed

download() multi-thread error #110

Moelf opened this issue Apr 11, 2021 · 16 comments

Comments

@Moelf
Copy link

Moelf commented Apr 11, 2021

on:

Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
julia> imgs = [
       "https://images.unsplash.com/photo-1554266183-2696fdafe3ff?ixlib=rb-1.2.1&auto=format&fit=crop&w=564&q=80"
       "https://images.unsplash.com/photo-1527026950045-9e066846bae4?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1419&q=80"
       "https://images.unsplash.com/photo-1549287748-f095932c9f81?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=606&q=80"
       "https://images.unsplash.com/photo-1437448317784-3a480be9571e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=634&q=80"
       ]

julia> imgs = repeat(imgs, 4)

julia> Threads.nthreads()
4

julia> Threads.@threads for url in imgs
           download(url)
       end

results in

ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] threading_run(func::Function)
   @ Base.Threads ./threadingconstructs.jl:34
 [3] top-level scope
   @ ./threadingconstructs.jl:93

    nested task error: TaskFailedException
    Stacktrace:
      [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))

also seen in another occation:

Error: curl_multi_socket_action: 8
@fredrikekre fredrikekre transferred this issue from JuliaLang/julia Apr 12, 2021
@StefanKarpinski
Copy link
Member

Yeah, something definitely isn't threadsafe. Will look into it.

@fredrikekre
Copy link
Member

Here is the nested task error which might shed some more light:

        nested task error: val already in a list
        Stacktrace:
          [1] error(s::String)
            @ Base ./error.jl:33
          [2] push!
            @ ./linked_list.jl:53 [inlined]
          [3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
            @ Base ./condition.jl:103
          [4] lock(rl::ReentrantLock)
            @ Base ./lock.jl:100
          [5] lock(f::Downloads.Curl.var"#39#41"{Int32, Downloads.Curl.Multi}, l::ReentrantLock)
            @ Base ./lock.jl:185
          [6] event_callback(uv_poll_p::Ptr{Nothing}, status::Int32, events::Int32)
            @ Downloads.Curl ~/julia16/usr/share/julia/stdlib/v1.6/Downloads/src/Curl/Multi.jl:120

@anandijain
Copy link

anandijain commented Apr 13, 2021

I have little to add, but that I am also getting this. Is the temporary solution to just never call @threads ... download(...) end?

https://github.com/anandijain/SBMLBioModelsRepository.jl/runs/2335985628?check_suite_focus=true#step:5:2330 Is a link to a full trace/crash happening on CI.

Edit: I think threading run(curl...) is still okay

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Apr 13, 2021

This should be fixed but there's also no need to do multithreaded downloading — you can just use tasks:

@sync for url in imgs
    @async @show download(url)
end

In general, unless you're doing concurrent CPU-bound compute, tasks are preferable to threads.

@c42f
Copy link
Member

c42f commented Aug 20, 2021

Reading https://curl.se/libcurl/c/threadsafe.html it seems there could be several things we're not doing:

  • We should call curl_global_init explicitly as part of libcurl __init__
  • We probably need to set locking and unlocking callbacks using curl_share_setopt in order that libcurl itself can protect its internal connection pool and DNS cache. See https://curl.se/libcurl/c/curl_share_setopt.html
  • The page says "You must never share the same handle in multiple threads", but we share the multi handle. It's behind our own lock so that might be ok, but multiple easy handles are created from that multi handle — it's unclear from the docs whether they will then point to shared multi handle data structures and implicitly need locking as well.

@c42f
Copy link
Member

c42f commented Aug 20, 2021

To add to this, libcurl has a known bug where it can't share the connection pool between multiple threads. See

The following PR removed the libcurl example which showed how to use shared connection pools (due to this bug) but it's interesting nonetheless: curl/curl#6795

Unfortunately I think this libcurl bug combines quite badly with the task migration introduced in julia-1.7. A set of tasks which share a connection pool should also be running on the same OS thread, but the Julia runtime can now migrate those tasks at any time. We can't disable sharing of the connection pool either - that would be a performance disaster in the normal @async case.

@StefanKarpinski
Copy link
Member

We could potentially do all libcurl calls from a single thread, but I'm not sure that we have a mechanism to do that anymore since all tasks can be migrated now? The API calls could just dispatch to a single thread that does the actual libcurl stuff. Since libcurl doesn't actually benefit from multithreading at all, it would be fine.

@c42f
Copy link
Member

c42f commented Aug 27, 2021

We could potentially do all libcurl calls from a single thread, but I'm not sure that we have a mechanism to do that anymore since all tasks can be migrated now?

At first I thought this was true, but there's a way around it:

  • @async still pins tasks to the thread of the parent task. (Side note — unpinned tasks become pinned in julia-1.7 if they launch children with @async to preserve coscheduling on the same OS thread, see [Tasks] Mark parent as sticky if we use @async julia#41334)
  • So we just need to launch all curl-calling tasks from a particular well-defined parent task using @async. This could be a worker pool or we could just launch a task per download.
  • User tasks can then run on any thread, but we can insulate libcurl from that with a Channel.

Possible caveats:

  • Limiting to a particular OS thread might make I/O resource starvation worse if a compute-bound task ends up on that thread.
  • How do curl callbacks relate to libuv in terms of which thread do they end up scheduled on?

libcurl doesn't actually benefit from multithreading at all

This is interesting. There must be some crossover point of CPU memory bandwidth vs network speed. But it seems CPUs should have of order ~10GiB/s memory bandwidth for a single core which naively needs 100Gigabit network to overwhelm a core. AWS networking tops out around there, except for special exceptions like the P4.

@ericphanson
Copy link
Contributor

ericphanson commented Aug 27, 2021

For what it's worth, @kolia and I have both been using Downloads.jl with AWS.jl to asyncmap (with the default 100 tasks) pull down a bunch of bytes from s3, and getting single-threaded throughput of only ~0.15 Gbps (in network between s3 and an ec2 instance in the same region), and seeing 100% CPU utilization on one core (with htop). By throwing a pmap on top with 16 cores, Kolia was able to hit 2.4 Gbps, which is at least the right order of magnitude (depending on the instance you should be able to get at least 5-10 Gbps, and way more for the super network optimized ones). So naively at least it seems like something where there is a CPU bottleneck and threading could really help.

@c42f
Copy link
Member

c42f commented Aug 30, 2021

Excellent, I've been meaning to do some performance tests of this kind of thing.

What was the typical file size for the data you're requesting, and what kind of bandwidth can you hit if you use a standalone downloader like rclone or s5cmd?

@ericphanson
Copy link
Contributor

I think Kolia was working with larger files, but I had ~380k 20 kB files. I put 100k of them in a prefix test_cache_path to make it easier to use s5cmd.

I get (using Downloads.jl)

using AWSS3, AWS
AWS.DEFAULT_BACKEND[] = AWS.DownloadsBackend()
paths = readdir(test_cache_path)
all_bytes, t = @timed asyncmap(read, paths)
throughput = (sum(length, all_bytes)*8)/(t*1e9)
@info "Pulled down $(Base.format_bytes(sum(length, all_bytes))) bytes in $(round(t; digits=1)) seconds for a throughput of $(round(throughput; sigdigits=3)) Gbps (gigabits per second)"

using (bits/ns == gigabits/s), I get

[ Info: Pulled down 1.837 GiB bytes in 145.0 seconds for a throughput of 0.109 Gbps (gigabits per second)

I couldn't figure out a good way to read them in bulk into memory with s5cmd (seems like cat doesn't support wildcards, so I guess the right way is to write out a file with my 100k paths and use a run?) so I tried just writing them to disk. With 100k such files in a prefix test_cache_path, just doing

_, s5cmd_time = @timed read(`s5cmd cp $(test_cache_path)\* /tmp/s5cmd`)
throughput = (sum(length, all_bytes)*8)/(s5cmd_time*1e9)
@info "Pulled down $(Base.format_bytes(sum(length, all_bytes))) bytes in $(round(s5cmd_time; digits=1)) seconds for a throughput of $(round(throughput; sigdigits=3)) Gbps (gigabits per second)"

gives

[ Info: Pulled down 1.837 GiB bytes in 97.9 seconds for a throughput of 0.161 Gbps (gigabits per second)

So faster, but also not good. But I wonder how much is the cost of the disk IO -- since we skip that in the Julia case we should be able to be pretty fast. (That's also my actual use-case-- pulling cached data directly into ram, skipping disk IO).

@kolia
Copy link

kolia commented Aug 30, 2021

I was getting 2.4Gps downloading 23k files of size ~2 - 4Mb each, using pmap from a single ec2 instance.

@c42f
Copy link
Member

c42f commented Aug 31, 2021

I couldn't figure out a good way to read them in bulk into memory with s5cmd

You could use a ramdisk (mount type tmpfs) to avoid any disk overhead. /dev/shm is usually mounted as tmpfs so you could create the files in there.

So faster, but also not good.

But if Downloads.jl can come close to the performance of s5cmd (or whatever else is regarded to be the fastest concurrent downloader) then we should be in reasonable shape.

@mkschulze
Copy link

Hello, just leaving a note here, that we would also like to see threadsafety with libcurl/downloads.jl. We use it extensively within our TypeDBClient.jl (gRPCClient.jl as transaction layer) and it would be great to have it being able to use multithreading. Shall we post an example of our usecase and where we would need it?

@StefanKarpinski
Copy link
Member

I haven't tested beyond the original example here, but #151 seems to have fixed this. Are others able to test Downloads master in multithreaded use cases and confirm that it works now? If it does, then I'll tag a new release since thread safety seems like a significant feature.

@StefanKarpinski
Copy link
Member

After some back and forth, the version on master here as well as Julia master should be threadsafe. Closing since that's the case and we can open specific issues about more fine-grained threading problems as they're encountered.

arnaudh added a commit to arnaudh/AWS.jl that referenced this issue Dec 13, 2022
The mentioned issues are closed:
- JuliaLang/Downloads.jl#110 (closed in Nov 2021
- JuliaLang/Downloads.jl#182 (closed in Apr 2022)
mattBrzezinski added a commit to JuliaCloud/AWS.jl that referenced this issue Mar 21, 2023
* Remove mention of Downloads.jl having threading issues

The mentioned issues are closed:
- JuliaLang/Downloads.jl#110 (closed in Nov 2021
- JuliaLang/Downloads.jl#182 (closed in Apr 2022)

* Update src/utilities/downloads_backend.jl

Co-authored-by: arnaudh <[email protected]>

---------

Co-authored-by: mattBrzezinski <[email protected]>
Co-authored-by: mattBrzezinski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants