
Improve HTTP request tracing #866

Open
na-- opened this issue Dec 14, 2018 · 0 comments
Labels
enhancement · evaluation needed · new-http · refactor

Comments

@na--
Member

na-- commented Dec 14, 2018

Currently we use Go's net/http/httptrace package to save nanosecond timestamps of the different events that happen during an HTTP request. This mostly works, but due to quirks of the Go standard library (mostly related to HTTP/2 requests), some corner cases aren't handled especially well by the current nanosec-timestamps-with-atomics Tracer implementation, even after the fixes (1, 2) we recently made.

Go's standard library sometimes tries to retry HTTP/2 requests: stdlib code, golang.org/x/net/http2 (which we need for its ConfigureTransport() method, because we have to be able to support a custom TLS configuration in the Transport). To me this seems like strange behavior to force (it doesn't look configurable at all, so we can't disable it), and it is seemingly at odds with some of the principles of net/http/httptrace ("ClientTrace currently traces a single HTTP request & response during a single round trip and has no hooks that span a series of redirected requests"), so we'll have to figure out a way to deal with it. When there are retries, we will likely even have to emit more than one set of http_req_* k6 metrics.

One potential way to do that would be to implement a more efficient way to save all of the events that the new Tracer (i.e. a new httptrace.ClientTrace implementation) receives. We did that in the debug branch we used to trace different corner cases, but it was very inefficient. It has to be benchmarked, but a simple slice with a lock will likely be fast enough; alternatively, more complicated options like lock-free buffers also exist if we need more performance. In any case, once we have all of the events saved in a list, we can sort them by their time and build a state machine that goes backwards from the end of the request to the first event and emits 1 or more (or 0, if we ignore incomplete requests?) sets of http_req_{blocked,connecting,tls_handshaking,sending,waiting,receiving,...} metrics.

Here are some dumps from the aforementioned tracing in the debug branch that illustrate why the current simple calculations with atomics aren't sufficient:

GotConn fired before GetConn/ConnectStart, i.e. we got a connection without ever making one, yet it wasn't marked as reused 😕... Upcoming Golang issue about this behavior...

[2018-12-05 14:53:48.380464338] GotConn ("Reused: false, WasIdle: false, IdleTime: 0s, Type: *tls.Conn, LocalAddr: 192.168.121.108:52166, RemoteAddr: 192.168.121.133:443"),
[2018-12-05 14:53:48.380991897] WroteHeaders (%!q(<nil>)),
[2018-12-05 14:53:48.381239436] WroteRequest ({<nil>}),
[2018-12-05 14:53:57.006467470] GetConn ("abctaylor.com:443"),
[2018-12-05 14:53:57.006627604] ConnectStart ("tcp 192.168.121.133:443"),
[2018-12-05 14:54:27.009040503] ConnectDone ("tcp: 192.168.121.133:443 (err: \"dial tcp 192.168.121.133:443: i/o timeout\")"),
[2018-12-05 14:54:27.009241178] addError ("dial tcp 192.168.121.133:443: i/o timeout"),
[2018-12-05 14:54:27.009492813] Done (%!q(<nil>))

Again:

[2018-12-05 18:45:49.883142128] GotConn ("Reused: false, WasIdle: false, IdleTime: 0s, Type: *tls.Conn, LocalAddr: 192.168.121.108:43134, RemoteAddr: 192.168.121.133:443"),
[2018-12-05 18:45:49.883208106] WroteHeaders (%!q(<nil>)),
[2018-12-05 18:45:49.883209055] WroteRequest ({<nil>}),
[2018-12-05 18:46:21.755232372] GetConn ("abctaylor.com:443"),
[2018-12-05 18:46:21.755286328] ConnectStart ("tcp 192.168.121.133:443"),
[2018-12-05 18:46:33.949411226] ConnectDone ("tcp: 192.168.121.133:443 (err: %!q(<nil>))"),
[2018-12-05 18:46:33.949436019] TLSHandshakeStart (%!q(<nil>)),
[2018-12-05 18:46:48.318323122] Done ("resp is nil, err: context canceled")

Quite a few events there...

[2018-12-04 15:00:00.609935619] GetConn ("fonts.googleapis.com:443"),
[2018-12-04 15:00:00.610622248] ConnectStart ("tcp 192.168.121.133:443"),
[2018-12-04 15:00:03.782462235] ConnectDone ("tcp: 192.168.121.133:443 (err: %!q(<nil>))"),
[2018-12-04 15:00:03.782599864] TLSHandshakeStart (%!q(<nil>)),
[2018-12-04 15:00:08.962812789] TLSHandshakeDone ("Version: 771, HandshakeComplete: true, DidResume: false, NegotiatedProtocol: h2, ServerName: , error: %!s(<nil>)"),
[2018-12-04 15:00:08.962983544] GotConn ("Reused: false, WasIdle: false, IdleTime: 0s, Type: *tls.Conn, LocalAddr: 192.168.121.108:53786, RemoteAddr: 192.168.121.133:443"),
[2018-12-04 15:00:08.963254399] WroteHeaders (%!q(<nil>)),
[2018-12-04 15:00:08.963431396] WroteRequest ({<nil>}),
[2018-12-04 15:00:26.386350521] GetConn ("fonts.googleapis.com:443"),
[2018-12-04 15:00:26.386518593] ConnectStart ("tcp 192.168.121.133:443"),
[2018-12-04 15:00:28.414929317] ConnectDone ("tcp: 192.168.121.133:443 (err: %!q(<nil>))"),
[2018-12-04 15:00:28.415150349] TLSHandshakeStart (%!q(<nil>)),
[2018-12-04 15:00:31.446235803] TLSHandshakeDone ("Version: 771, HandshakeComplete: true, DidResume: false, NegotiatedProtocol: h2, ServerName: , error: %!s(<nil>)"),
[2018-12-04 15:00:31.446355116] GotConn ("Reused: false, WasIdle: false, IdleTime: 0s, Type: *tls.Conn, LocalAddr: 192.168.121.108:59024, RemoteAddr: 192.168.121.133:443"),
[2018-12-04 15:00:31.446547655] WroteHeaders (%!q(<nil>)),
[2018-12-04 15:00:31.446676741] WroteRequest ({<nil>}),
[2018-12-04 15:01:00.610263588] Done (%!q(<nil>))

Started making a connection, but an already established one got freed and we'll reuse it instead!

[2018-12-04 14:59:38.350629010] GetConn ("u.heatmap.it:443"),
[2018-12-04 14:59:38.350939550] ConnectStart ("tcp 192.168.121.133:443"),
[2018-12-04 14:59:38.809262825] ConnectDone ("tcp: 192.168.121.133:443 (err: %!q(<nil>))"),
[2018-12-04 14:59:38.809453059] TLSHandshakeStart (%!q(<nil>)),
[2018-12-04 14:59:41.340380849] TLSHandshakeDone ("Version: 771, HandshakeComplete: true, DidResume: false, NegotiatedProtocol: h2, ServerName: , error: %!s(<nil>)"),
[2018-12-04 14:59:41.340510721] GetConn ("u.heatmap.it:443"),
[2018-12-04 14:59:41.340733656] GotConn ("Reused: true, WasIdle: false, IdleTime: 0s, Type: *tls.Conn, LocalAddr: 192.168.121.108:48264, RemoteAddr: 192.168.121.133:443"),
[2018-12-04 14:59:41.341037530] WroteHeaders (%!q(<nil>)),
[2018-12-04 14:59:41.341522056] WroteRequest ({<nil>}),
[2018-12-04 14:59:42.440481804] GotFirstResponseByte (%!q(<nil>)),
[2018-12-04 14:59:42.440568055] Done (%!q(<nil>))

Once you intentionally break and/or stress your network, you start to see all sorts of strange combinations of events like these...

@na-- na-- added the enhancement, refactor and evaluation needed labels Dec 14, 2018
@mstoykov mstoykov added the new-http label Oct 13, 2023