EOFError when parsing headers (dupe of #214) #220
This is probably #214. Which, IIUC, means we just need to put in appropriate idle/connect timeouts and things should work fine. @samoconnor can you confirm? Nanosoldier is another heavy github-api hitter.
@ararslan What versions of HTTP.jl and MbedTLS.jl are you using? Are you using retry_non_idempotent=true?

@quinnj, that seems probable, but I don't think we should rush to setting a connection idle timeout. We should think of that as a performance optimisation and try to ensure that the default case is handled as smoothly as possible.

@ararslan, please keep in mind that the HTTP.jl layer cannot completely hide the fact that the network is imperfect. There are some cases where it can transparently retry requests, but there are others where it cannot (e.g. POST), so robust applications will always need application-logic-aware retry loops (even if they are as crude as crashing and restarting an entire process or container). In the case of POST requests we attempt to retry in as many cases as possible, but if the error happens (from the client's point of view) after the body of the request has been sent, we cannot retry (to avoid double execution of a non-idempotent operation).
HTTP 0.6.4 and MbedTLS 0.5.6
I don't know what that is, so probably not
Sure, of course. But it's important to note that Nanosoldier wasn't having this trouble with HTTP prior to the HTTP 0.6.x releases.

Unfortunately HTTP.jl used to retry much more aggressively than it should have, at the risk of applications potentially automatically retrying POST requests where that may not be entirely safe/wise (think financial transactions and such). We're now on the "safe/legal" side of what can be automatically retried, which is causing problems in applications that were relying on this automatic retry functionality, probably unknowingly. Luckily, it's pretty easy to manage: you can use retry_non_idempotent=true to opt back in to automatic retries of non-idempotent requests.
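A minimal sketch of what that looks like (the URL and payload here are placeholders, and the call assumes the positional headers/body form of HTTP.post):

```julia
using HTTP

# Hypothetical endpoint and payload, for illustration only.
url  = "https://api.github.com/repos/owner/repo/statuses/0123abc"
body = """{"state": "pending", "context": "example"}"""

# retry_non_idempotent=true opts back in to automatic retries even for POST;
# only use it when re-executing the request is known to be safe.
HTTP.post(url, ["Content-Type" => "application/json"], body;
          retry_non_idempotent = true)
```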
The behaviour you're seeing is a known issue with 0.6.4. Also, please try MbedTLS v0.5.7 (this includes a deadlock fix that can interact with HTTP.jl's retry-post-whenever-possible logic: JuliaLang/MbedTLS.jl@e6b366c).

Understood. As @quinnj says, we've removed a lot of catch-all exception handling in HTTP.jl and instead try to detect and handle specific situations in line with the semantics defined in the RFC. The GitHub doc says that they have been careful to use HTTP methods with semantics appropriate to the API operations. In the case of the operation in this issue, …
Thanks so much for the info and the quick responses, both of you!
Ah, Nanosoldier didn't upgrade to that automatically because it's running Julia 0.6.0 and HTTP 0.6.6 requires Julia 0.6.2 at a minimum. I had to take Nanosoldier offline, so I'll upgrade Julia while I'm at it.

No problem. Please let us know how it goes...

Looks like I can't update to MbedTLS 0.5.7 yet: JuliaLang/METADATA.jl#13725

IMHO the METADATA PR should be merged as is.

@ararslan it is still worth trying MbedTLS v0.5.6; that version has a fix for the deadlock condition, and v0.5.7 is a cleaned-up/refined version of that fix.

I merged it. I imagine we'll have an issue or two come in, but better to get things ironed out sooner rather than later w/ bindeps2.
Great, thank you both. I'll update to 0.5.7. |
Nanosoldier is running Julia 0.6.2, HTTP 0.6.4, and MbedTLS 0.5.7. It ran successfully a couple of times, but a couple of hours ago it got this same error: JuliaLang/julia#26435 (comment).
Perhaps GitHub.jl needs to be updated for the recent HTTP changes? |
This looks like the same POST new status API call as before. So, on one hand you should expect that it will fail sometimes, because it cannot be automatically retried if the request body was sent. On the other hand, I'm willing to dive down the rabbit hole in case there is something else going on here. Please post the logs: set verbose=3 and set DEBUG_LEVEL=2 in HTTP.jl/src/HTTP.jl. Note that verbose=3 inserts a debug layer into the HTTP stack, so it may change behaviour. If so, try verbose=1 or verbose=2.
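For example, a request made with the verbose option might look like this (the URL here is just an illustration):

```julia
using HTTP

# verbose=3 inserts a debug layer into the request stack and echoes the raw
# request/response traffic; as noted above it can perturb behaviour, so drop
# back to verbose=1 or verbose=2 if the problem stops reproducing.
HTTP.get("https://api.github.com/rate_limit"; verbose = 3)
```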
Sorry, just read that in more detail. Please use HTTP 0.6.6 per #220 (comment).
Sorry, I meant I was using HTTP 0.6.6, that was a typo. |
We need to see the logs with debug info for a couple of hours of use (these will include information on which requests were retried or not and why). Hopefully we would see that …
Where do I set verbose=3?
So I would need to set it in GitHub.jl, which is what's actually making the request. |
yes, and "set DEBUG_LEVEL=2 in HTTP.jl/src/HTTP.jl", |
|
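Concretely, that means editing the constant near the top of your local copy of src/HTTP.jl (assuming it is defined as a plain const, as the instruction above implies) so that it reads:

```julia
# In HTTP.jl/src/HTTP.jl (local checkout): raise the debug level so the
# package's internal debug logging is emitted.
const DEBUG_LEVEL = 2
```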
Is that the whole log for 5 hours? (I only see ~400 lines, and ~18 GitHub API requests.) There are no timestamps and very little application context in the log. Can you increase the log level in your application so that it records some information about when/what requests are being processed?

I don't see any Requests or Responses in the log above; are you sure you've enabled the verbose option? e.g.

```julia
julia> HTTP.get("http://httpbin.org/ip"); nothing

julia> HTTP.get("http://httpbin.org/ip", verbose=2); nothing
HTTP.Messages.Request:
"""
GET /ip HTTP/1.1
Host: httpbin.org
Content-Length: 0
"""
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Connection: keep-alive
---snip---
Via: 1.1 vegur
{
"origin": "60.224.142.5"
}
"""
```

Are you still using Julia 0.6.2, HTTP 0.6.6, and MbedTLS 0.5.7 per your note above? Please double-check that your copy of HTTP.jl has the following two lines:

- Line 98 in 2f4acd2
- Line 58 in 2f4acd2
Here is the relevant piece of the log:
We can see that …

The "holding write lock" message comes from the …

So, at this point the …

It is worth considering the fact that at the time we logged "holding write lock" the socket was known to be in "closing" (🔜💀) state. The … (Lines 94 to 110 in b602c4e)

Perhaps if we query the … (Lines 518 to 529 in b602c4e)

Can you try putting …? It is also worth considering the …
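Purely as an illustration of the idea being floated here (not HTTP.jl's actual code): the client already knows locally when a socket has entered the closing state, so checking that before taking the write lock could let the pool discard an obviously dead connection instead of writing the request body into it, although it can never detect a peer that vanished silently.

```julia
# Sketch only: check the locally known state of a pooled socket before
# committing to writing a request body on it. This catches the "closing"
# state seen in the log above, but not a peer that dropped the connection
# without the client noticing.
function usable_for_write(sock::Base.TCPSocket)
    return isopen(sock)
end
```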
Another thought: it's hard to tell from these logs because there are no timestamps, but if we assume that the root cause is that we're trying to reuse a connection that the other end has given up on, then setting … See: #214 (comment)
Nanosoldier is down for some large-scale on-site maintenance project at MIT, with no estimate for when it will be back up. So for now I unfortunately can't try out any other suggestions or debugging configurations.
It is indeed. That is the contents of the log from when I started the server to when I had to shut it down. The API requests are from GitHub events on the Julia repo that Nanosoldier doesn't ignore.
Improving Nanosoldier's logging has been on my to-do list for a while, but I have no estimate for when I'll have time to work on that, as I'm no longer working full time on Julia.
All I did was change the default value of …
On one hand I'd like to dig deeper here to see if we can learn something that will benefit HTTP.jl. The logs above show a situation where it becomes clear that the connection is dead shortly after the body is sent. It is tempting to try to find a way to pre-empt this situation and avoid sending the body. But I have a feeling that whatever way we come up with to do this will in effect involve waiting for a network round-trip delay, and waiting for a round trip before sending every POST request is not desirable in general. It seems much better to buffer/send requests with the assumption that the connection is OK (which it is most of the time) and deal with the occasional state-aware retry at the application layer (where it must be done anyway, because we can never know if the connection fails for other reasons after the body transmission).

To deal with the immediate practical problem in Nanosoldier's use case, I recommend modifying GitHub.jl to: …
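A rough sketch of the kind of application-level, state-aware retry loop described above (illustrative only; gh_create_status is a placeholder, not an actual GitHub.jl function):

```julia
# Illustrative application-level retry for a call that HTTP.jl itself will not
# retry (a POST whose body has already been sent). gh_create_status stands in
# for whatever GitHub.jl call Nanosoldier actually makes.
function post_status_with_retry(args...; attempts = 3)
    for attempt in 1:attempts
        try
            return gh_create_status(args...)
        catch err
            # Only retry errors that look like a dropped or stale connection;
            # anything else is rethrown so genuine failures are not masked.
            if err isa EOFError && attempt < attempts
                sleep(2^attempt)   # simple backoff before the next attempt
                continue
            end
            rethrow()
        end
    end
end
```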
Closing. See JuliaWeb/GitHub.jl#109 (comment) |
Was removed in b602c4e:

> Remove yield() after @async writebody. This should increase the chance
> that the startread() realises that the connection is dead before the
> body is written.

This change appears to have caused problems in other situations: JuliaCloud/AWSS3.jl#26 (comment). The removal was driven by nanosoldier/GitHub.jl issues with POST requests on long-held connections: #220 (comment). GitHub.jl now handles this with the `idle_timeout=` option: JuliaWeb/GitHub.jl#109
When a connection is returned to the (read) pool, add a monitor to it for receiving unexpected data (or EOF), and kill/close the Connection object if any activity occurs before the next write (when it should have simply been waiting idle in the pool), per JuliaLang/MbedTLS.jl#145 (comment)

closes #214, closes #199, closes #220, closes JuliaWeb/GitHub.jl#106
* ConnectionPool: monitor idle connections

  When a connection is returned to the (read) pool, add a monitor to it for receiving unexpected data (or EOF), and kill/close the Connection object if any activity occurs before the next write (when it should have simply been waiting idle in the pool), per JuliaLang/MbedTLS.jl#145 (comment)

  closes #214, closes #199, closes #220, closes JuliaWeb/GitHub.jl#106

* Encapsulate read|writebusy/sequence/count logic in new isbusy function.
  Move close() on eof() || !isbusy() to new monitor_idle_connection function.
  Make monitor_idle_connection() a noop for ::Connection{SSLContext}

* require Julia 0.6.3 #236 (comment)
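A simplified sketch of what that monitoring might look like (illustrative only, not the actual ConnectionPool implementation):

```julia
# Simplified sketch of the monitoring idea above, not HTTP.jl's actual code.
# When a connection goes back into the read pool, watch it in the background:
# EOF or any unsolicited bytes while it is supposed to be idle means the peer
# has given up on it, so close it instead of reusing it. (A real implementation
# also has to stand the monitor down once the connection is checked out again
# for the next request.)
function monitor_idle_connection(io)
    @async try
        eof(io)        # blocks until the peer closes or sends unexpected data
        close(io)      # either way, the pooled connection is no longer trustworthy
    catch
        isopen(io) && close(io)
    end
end
```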
Seen on Nanosoldier: