download() multi-thread error #110
Yeah, something definitely isn't threadsafe. Will look into it.

Here is the nested task error, which might shed some more light:
I have little to add, except that I am also getting this. Is the temporary solution to just never call it? https://github.com/anandijain/SBMLBioModelsRepository.jl/runs/2335985628?check_suite_focus=true#step:5:2330 is a link to a full trace/crash happening on CI. Edit: I think threading run(
This should be fixed, but there's also no need to do multithreaded downloading; you can just use tasks:

```julia
@sync for url in imgs
    @async @show download(url)
end
```

In general, unless you're doing concurrent CPU-bound compute, tasks are preferable to threads.
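A bounded variant of the same task-based pattern can be sketched with `asyncmap`, which caps how many downloads are in flight at once. The `imgs` URL list here is hypothetical, standing in for the variable in the snippet above:

```julia
using Downloads

# Hypothetical list of URLs (assumption: `imgs` comes from the caller's code).
imgs = ["https://example.com/a.png", "https://example.com/b.png"]

# asyncmap runs up to `ntasks` concurrent tasks on the calling thread, so
# downloads overlap in time without libcurl being driven from two threads.
files = asyncmap(url -> Downloads.download(url), imgs; ntasks = 10)
```

`ntasks` is worth tuning: unbounded concurrency over hundreds of thousands of URLs can exhaust file descriptors or connections.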
Reading https://curl.se/libcurl/c/threadsafe.html, it seems there could be several things we're not doing:
To add to this, libcurl has a known bug where it can't share the connection pool between multiple threads. The following PR removed the libcurl example which showed how to use shared connection pools (due to this bug), but it's interesting nonetheless: curl/curl#6795. Unfortunately, I think this libcurl bug combines quite badly with the task migration introduced in Julia 1.7. A set of tasks which share a connection pool should also be running on the same OS thread, but the Julia runtime can now migrate those tasks at any time. We can't disable sharing of the connection pool either; that would be a performance disaster in the normal case.
We could potentially do all libcurl calls from a single thread, but I'm not sure that we have a mechanism to do that anymore, since all tasks can be migrated now? The API calls could just dispatch to a single thread that does the actual libcurl stuff. Since libcurl doesn't actually benefit from multithreading at all, it would be fine.
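One way that dispatch could look, as a minimal sketch: funnel every request through a `Channel` drained by a single long-lived task. The names (`REQUESTS`, `serialized_download`) are hypothetical, not Downloads.jl API; note that plain `@async` tasks are sticky to their spawning thread, unlike `Threads.@spawn`:

```julia
using Downloads

# All download requests go through one unbounded channel of (url, reply) pairs.
const REQUESTS = Channel{Tuple{String,Channel{Any}}}(Inf)

# A single sticky task drains the channel, so libcurl is only ever driven
# from one task (and hence one OS thread).
const WORKER = @async for (url, reply) in REQUESTS
    try
        put!(reply, Downloads.download(url, IOBuffer()))
    catch err
        put!(reply, err)  # ship failures back to the caller
    end
end

# Public entry point: enqueue a request and block until the worker replies.
function serialized_download(url::AbstractString)
    reply = Channel{Any}(1)
    put!(REQUESTS, (url, reply))
    result = take!(reply)
    result isa Exception && throw(result)
    return result
end
```

The obvious cost is that this fully serializes transfers rather than merely pinning them; a real implementation would keep a multi-handle alive on the worker so transfers still overlap.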
At first I thought this was true, but there's a way around it:
Possible caveats:
This is interesting. There must be some crossover point of CPU memory bandwidth vs. network speed. But it seems CPUs should have on the order of ~10 GiB/s memory bandwidth for a single core, which naively needs a 100 gigabit network to overwhelm a core. AWS networking tops out around there, except for special exceptions like the P4.
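A quick sanity check on that arithmetic (both figures are the assumed round numbers from the comment, not measurements):

```julia
core_bw = 10 * 2^30   # ~10 GiB/s single-core memory bandwidth (assumed)
net_bw  = 100e9 / 8   # 100 Gbit/s network = 12.5 GB/s
ratio   = core_bw / net_bw  # ≈ 0.86: same order of magnitude
```

So a 100 gigabit link delivers slightly more bytes per second than one core can even copy, which is why that is roughly the crossover point.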
For what it's worth, @kolia and I have both been using Downloads.jl with AWS.jl to download data.
Excellent, I've been meaning to do some performance tests of this kind of thing. What was the typical file size for the data you're requesting, and what kind of bandwidth can you hit if you use a standalone downloader?
I think Kolia was working with larger files, but I had ~380k 20 kB files. I put 100k of them in a prefix. Using Downloads.jl,

```julia
using AWSS3, AWS
AWS.DEFAULT_BACKEND[] = AWS.DownloadsBackend()
paths = readdir(test_cache_path)
all_bytes, t = @timed asyncmap(read, paths)
throughput = (sum(length, all_bytes)*8)/(t*1e9)
@info "Pulled down $(Base.format_bytes(sum(length, all_bytes))) bytes in $(round(t; digits=1)) seconds for a throughput of $(round(throughput; sigdigits=3)) Gbps (gigabits per second)"
```

using (bits/ns == gigabits/s), I get

```
[ Info: Pulled down 1.837 GiB bytes in 145.0 seconds for a throughput of 0.109 Gbps (gigabits per second)
```

I couldn't figure out a good way to read them in bulk into memory with s5cmd, but copying them to disk with

```julia
_, s5cmd_time = @timed read(`s5cmd cp $(test_cache_path)\* /tmp/s5cmd`)
throughput = (sum(length, all_bytes)*8)/(s5cmd_time*1e9)
@info "Pulled down $(Base.format_bytes(sum(length, all_bytes))) bytes in $(round(s5cmd_time; digits=1)) seconds for a throughput of $(round(throughput; sigdigits=3)) Gbps (gigabits per second)"
```

gives

```
[ Info: Pulled down 1.837 GiB bytes in 97.9 seconds for a throughput of 0.161 Gbps (gigabits per second)
```

So faster, but also not good. But I wonder how much of that is the cost of the disk IO; since we skip that in the Julia case, we should be able to be pretty fast. (That's also my actual use case: pulling cached data directly into RAM, skipping disk IO.)
I was getting 2.4 Gbps downloading 23k files of ~2 to 4 MB each.
You could use a ramdisk (mount type tmpfs) to rule out disk IO.
But it would be better if Downloads.jl could come close to that performance directly.
Hello, just leaving a note here that we would also like to see thread safety with libcurl/Downloads.jl. We use it extensively within our TypeDBClient.jl (with gRPCClient.jl as the transaction layer), and it would be great for it to be able to use multithreading. Shall we post an example of our use case and where we would need it?
I haven't tested beyond the original example here, but #151 seems to have fixed this. Are others able to test Downloads master in multithreaded use cases and confirm that it works now? If it does, then I'll tag a new release, since thread safety seems like a significant feature.

After some back and forth, the version on master here, as well as Julia master, should be threadsafe. Closing since that's the case; we can open specific issues about more fine-grained threading problems as they're encountered.
The mentioned issues are closed:
- JuliaLang/Downloads.jl#110 (closed in Nov 2021)
- JuliaLang/Downloads.jl#182 (closed in Apr 2022)