
Simplify lock(DOWNLOAD_LOCK) #136

Merged: 4 commits into JuliaLang:master on May 17, 2022

Conversation

@tkf (Member) commented Aug 7, 2021

This snippet was brought up in https://julialang.zulipchat.com/#narrow/stream/236830-concurrency/topic/Concurrency.20question.20in.20Downloads.2Ejl.

I only looked at the code locally, so I may be missing something, but the while loop seems to be unnecessary since no other part of the code sets DOWNLOADER[] (aside from the tests). Also, I couldn't find the reason for calling yield while holding a lock (it could just increase the number of tasks in the lock's waiter list), so I suggest calling it outside.

(Aside: this code style as-is decreases the inferrability of the downloader variable, but I chose it to match other parts of the code and to emphasize clarity.)
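
For readers without the diff open, here is a minimal sketch of the accessor under discussion (a hedged reconstruction, not the exact source; the name get_default_downloader and the simplified control flow are mine):

```julia
using Downloads: Downloader

const DOWNLOAD_LOCK = ReentrantLock()
const DOWNLOADER = Ref{Union{Downloader, Nothing}}(nothing)

# With the while loop removed: since no other code sets DOWNLOADER[]
# (aside from the tests), a single check under the lock suffices.
function get_default_downloader()
    lock(DOWNLOAD_LOCK) do
        yield() # let other downloads finish (see the discussion below)
        downloader = DOWNLOADER[]
        if downloader === nothing
            downloader = DOWNLOADER[] = Downloader()
        end
        return downloader
    end
end
```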

@codecov (bot) commented Aug 7, 2021

Codecov Report

Merging #136 (3df806b) into master (20e0990) will decrease coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #136      +/-   ##
==========================================
- Coverage   92.24%   92.20%   -0.05%     
==========================================
  Files           5        5              
  Lines         529      526       -3     
==========================================
- Hits          488      485       -3     
  Misses         41       41              
Impacted Files     Coverage Δ
src/Downloads.jl   85.04% <100.00%> (-0.41%) ⬇️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@StefanKarpinski (Member) commented Aug 8, 2021

The message for the commit that added this code explains it:

> 3. It addresses one other issue by adding a global DOWNLOAD lock and calling yield() while the lock is held. This allows any downloads already in progress to process events before starting any new ones. Just calling yield() without the lock doesn't fix this issue since if there are new downloads that are trying to start it will allow them to start first, piling on even more work. Locking first prevents that from happening since any additional downloads have to wait for the lock, so the yield() actually allows older downloads to finish. Without this, highly concurrent downloads, as in the tests, can cause "could not connect to server" errors, that I suspect are due to events not being processed fast enough.

Perhaps this should be added to the code as a comment. Or, to put it another way: in the simplified version with the yield on the outside, the comment would have to read "let other downloads start" rather than "let other downloads finish", which is the opposite of what we want. Putting the yield before the lock means that if a lot of downloads are started at the same time, they actually start in reverse order, rather than starting in sequential order and letting the earlier ones handle events before new ones start.

The while loop elimination is probably fine. I was trying to make sure that the downloader object that we use is actually the one that's saved to the DOWNLOADER global, but that should be guaranteed by the fact that the code is holding a lock anyway.
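
To make the ordering argument concrete, here is a schematic contrast of the two placements (illustrative only; reusing the names from the sketch above):

```julia
# yield() while holding the lock: tasks that want to start new downloads
# are parked in the lock's wait queue, so the scheduler can only run tasks
# for downloads already in flight, letting them handle their events first.
lock(DOWNLOAD_LOCK) do
    yield() # let other downloads finish
    # ... start this download ...
end

# yield() before taking the lock: the scheduler may instead run other
# tasks that are about to start downloads, so downloads start in reverse
# order and pending events still pile up.
yield() # let other downloads start (the opposite of what we want)
lock(DOWNLOAD_LOCK) do
    # ... start this download ...
end
```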

@tkf (Member, Author) commented Aug 8, 2021

Thanks, I think I now understand the intention. I should've done a proper git archaeology.

But I still don't understand whether the yield() is only a hint or a requirement for correctness. For example, if other downloads also yield multiple times (or there is some shuffling due to multi-threading), wouldn't it be necessary to yield() multiple times here (which is rather ugly and unreliable)? So, if you want to make sure all in-flight downloads are finished (or drop below some threshold), wouldn't it make more sense to use a condition variable?

Anyway, I moved yield back inside the lock to make the PR mergeable. It seems the logic behind this is deeper than I thought.
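
For what it's worth, the kind of explicit Julia-side bound alluded to above could look roughly like this (a hypothetical sketch using Base.Semaphore; bounded_download and the limit of 10 are made up, and this is not what the package does):

```julia
using Downloads

const MAX_INFLIGHT = Base.Semaphore(10) # assumed limit, tune per server

function bounded_download(url::AbstractString, output)
    Base.acquire(MAX_INFLIGHT)  # block here instead of relying on yield()
    try
        return Downloads.download(url, output)
    finally
        Base.release(MAX_INFLIGHT)
    end
end
```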

@StefanKarpinski (Member):

Perhaps, but there isn't really a fixed maximum number of outstanding requests — it's the server that gets overloaded if you open too many concurrent requests, causing it to drop some of them, and the limit depends on the server. You also don't want to block until downloads are complete; you just want any events for in-progress downloads to be handled before starting a new download.

The issue is that starting downloads happens in the main task and doesn't require yielding, but making progress on handling a download does require yielding. So if some code starts a lot of downloads without yielding, no progress is ever made on any of the downloads. There's an (unknown) limit to how many connections a server will leave in an outstanding state before dropping them. By yielding at the start of each download, you're making sure that before you start a new download you at least handle the events from the downloads you've already started, even if they're not done. That typically entails processing some data and sending another packet to the server to get more data or end the connection. As long as you keep replying to the server in a timely fashion, you can have a lot of concurrent connections going, but if you stop replying, they get dropped. So that's why we yield.

Yielding outside the lock would address that, but if a number of tasks are starting downloads, it allows other tasks to start their downloads before this one, effectively reversing the order in which downloads start, while still not letting download events be handled, which makes the problem worse.

Yielding inside the lock prevents later downloads from starting, since they don't hold the lock, but still allows events from in-progress downloads to be handled, letting those downloads make progress before new ones start. In practice, this throttles the number of concurrent connections to match what the server can handle.

@tkf (Member, Author) commented Aug 9, 2021

> By yielding at the start of each download, you're making sure that before you start a new download you at least handle the events from the downloads you've already started, even if they're not done.

If you don't wait for downloads until completion, I'm still puzzled how a single yield helps limit concurrent requests to a server. Doesn't a single request have many yield points? My impression is that the yield call in the lock implements some loose bound on concurrent requests. If it is known to work with a wide range of servers, I think that's good, but then I wonder if you can use, say, CURLMOPT_MAX_TOTAL_CONNECTIONS for more direct control. It's also easy to add a worker pool-based concurrency bound on the Julia side.
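
For reference, that option is set on libcurl's multi handle; a minimal sketch via LibCURL.jl, using the standard C API names (Downloads.jl manages its own multi handle internally, so the actual integration point would differ):

```julia
using LibCURL

multi = curl_multi_init()
# Cap the total number of concurrent connections at 10; libcurl's
# default of 0 means unlimited.
curl_multi_setopt(multi, CURLMOPT_MAX_TOTAL_CONNECTIONS, 10)
# ... add easy handles and drive transfers as usual ...
curl_multi_cleanup(multi)
```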

@c42f (Member) commented Aug 9, 2021

> I think that's good, but then I wonder if you can use, say, CURLMOPT_MAX_TOTAL_CONNECTIONS for more direct control.

Yes, this seems like the right idea. The default is 0 (unlimited). The HTTP RFC previously had a prescribed maximum, but now just says "A client ought to limit the number of simultaneous open connections that it maintains to a given server." (https://datatracker.ietf.org/doc/html/rfc7230#section-6.4)

It looks like browsers have settled on something on the order of ~10 concurrent connections per host:
https://stackoverflow.com/questions/985431/max-parallel-http-connections-in-a-browser

Side note - there's an interesting interaction between multiple connections and curl's HTTP/2 multiplexing. To quote their docs:

> While libcurl sets up a connection to a HTTP server there is a period during which it doesn't know if it can pipeline or do multiplexing and if you add new transfers in that period, libcurl will default to start new connections for those transfers. With the new option CURLOPT_PIPEWAIT (added in 7.43.0), you can ask that a transfer should rather wait and see in case there's a connection for the same host in progress that might end up being possible to multiplex on.
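
In LibCURL.jl terms, opting into that behavior would be a per-transfer setting, roughly (a sketch based only on the docs quoted above; the example URL is a placeholder):

```julia
using LibCURL

easy = curl_easy_init()
curl_easy_setopt(easy, CURLOPT_URL, "https://example.com")
# Wait for an existing connection to the same host that may turn out to
# support multiplexing, rather than opening a new connection right away.
curl_easy_setopt(easy, CURLOPT_PIPEWAIT, 1)
```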

@StefanKarpinski (Member) commented Aug 9, 2021

> If you don't wait for downloads until completion, I'm still puzzled how a single yield helps limit concurrent requests to a server.

By forcing new downloads to wait to start until events for all other downloads have been processed. Otherwise, events related to ongoing downloads get queued by libcurl and may not get processed by Julia if there are no yield points. If we don't yield, Julia can just start more downloads, which causes more events to pile up, and eventually you get dropped connections.

> Doesn't a single request have many yield points?

Maybe this is what's unclear: processing a request doesn't necessarily involve any Julia-side yield points, because events are put directly into the libuv event loop by libcurl. The problem this addresses is those events piling up without getting a chance to be processed.

I'm open to other ways to do this, but I determined this approach experimentally and we know it works in practice. I tried this without the yield, and it causes a lot of timed-out connections. I also tried it with the yield before the lock, and that also produced timed-out connections. And I tried it with a max connection limit, which did help, but not as reliably as the current approach — with some servers, any particular fixed connection limit can still cause too many events to pile up, whereas the current approach seems to adapt well to different servers.

So we can change this, but before we do, someone needs to do some serious testing with lots of concurrent requests and prove that it works as reliably as the current pattern does.

@tkf (Member, Author) commented Aug 9, 2021

I'm not lightly suggesting we throw away a solution that is known to work in practice; that's why I've already moved the yield call back. I think it's better to merge the patch as-is, which seems non-controversial.

> Maybe this is what's unclear: processing a request doesn't necessarily involve any Julia-side yield points, because events are put directly into the libuv event loop by libcurl. The problem this addresses is those events piling up without getting a chance to be processed.

Thanks, this clarifies a lot. But then it sounds like calling Base.process_events() at the beginning of a request (outside the lock) would be a more direct implementation of this approach? IIUC, there's no reason to special-case downloader === nothing?
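
Schematically, the suggestion is something like the following (hypothetical placement; Base.process_events is internal and its exact signature has varied across Julia versions):

```julia
function start_download(url::AbstractString)
    # Drain pending libuv events without switching tasks; more direct
    # than yield(), which also lets other tasks run.
    Base.process_events()
    lock(DOWNLOAD_LOCK) do
        # ... fetch or create the Downloader and begin the request ...
    end
end
```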

@StefanKarpinski (Member):

I wanted to use official APIs and do something that works across all versions of Julia, but looking through the history, Base.process_events has existed since 2012, so I'd be happy with that change: it solves the same problem more directly and without letting some other task take over. It might also be possible to eliminate the DOWNLOAD lock altogether, or to lock only when the global default downloader needs to be set.

It might independently make sense to have a limit on the number of concurrent connections to each server, but in practice it has been fine to make this unlimited and just let the speed with which the client-server route can establish connections implicitly limit things.

The reason for the dropped connections when libcurl events don't get processed is a little subtle, and I may still not understand it fully, but my understanding is as follows. It's not about overwhelming the server — nginx (for example) is quite capable of handling as many concurrent client connections as we can throw at it. The server gets our initial requests just fine no matter what, and it replies. The libcurl callbacks get those responses and dispatch libuv events for each of them, which then need to be processed by some Julia task. If we don't allow that to happen — and here's where I'm a little fuzzy on what exactly goes wrong — connections get dropped. I don't know if that's because libuv takes longer than libcurl allows to handle events, causing libcurl to decide the connection is dead, or because we're too slow to respond to the server and the server resets the delayed connections. But if we don't process libuv events before starting new connections, connections get dropped once we make enough concurrent connections. Limiting the number of concurrent connections that libcurl will start just papers over this, which is why, while I'm fine with adding it, I'm not fine with it as the only way to address this problem. The process_events approach is good, though.

@StefanKarpinski (Member):

Are you planning on continuing this, @tkf? Or should I make the change myself?

@tkf (Member, Author) commented Aug 19, 2021

I don't know how to reproduce the original issue, so I'm confident only in the current change in the PR (= just removing the while loop). But if you want to switch to the Base.process_events approach, please go ahead and don't wait for me.

(I was postponing my reply because I wanted to dig into how the Julia runtime usually calls jl_process_events and how it interacts with Downloads.jl. I'm guessing that the yield-in-lock strategy helps the scheduler invoke jl_process_events by emptying the task queue. Observing this might be tricky, but maybe I can use JuliaLang/julia#41685.)

@StefanKarpinski (Member):

I asked @JeffBezanson about this yesterday, and he said that Base.process_events will allow libuv events to be processed but won't allow other Julia tasks to make progress, since the current task doesn't yield. So the question is whether that's good enough to prevent connection drops. To determine that, we'd need a test scenario that reliably exhibits the connection drops — the CI system used to, but doesn't seem to anymore, as demonstrated by the green CI here. But it's possible that something in Julia itself or in the CI setup has changed and it's no longer a concern.

@ericphanson (Contributor):

Now that we aren't interacting with the libuv event loop in the same way (#157), is this worth revisiting?

@StefanKarpinski (Member):

I've modified this to not yield at all, which doesn't hang in tests and may be good now. Let's merge it, bump Downloads in Julia, and see what happens.

@StefanKarpinski merged commit 78255d4 into JuliaLang:master on May 17, 2022
@ericphanson (Contributor):

Awesome! If it works, I will copy it for AWS.jl (since we use our own global downloader there too) 😄
