Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

$upstream_response_time might contain multiple, comma-separated values #217

Closed
saz opened this issue Oct 27, 2021 · 3 comments · Fixed by #221
Closed

$upstream_response_time might contain multiple, comma-separated values #217

saz opened this issue Oct 27, 2021 · 3 comments · Fixed by #221
Labels

Comments

@saz
Copy link

saz commented Oct 27, 2021

Describe the bug
If there are multiple upstream backends and more than one will be tried by nginx, $upstream_response_time will contain a comma-separated list of times.
If this is happening, the log line will be parsed incorrectly

nginx documentation for $upstream_response_time:

keeps time spent on receiving the response from the upstream server; the time is kept in seconds with millisecond resolution. Times of several responses are separated by commas and colons like addresses in the $upstream_addr variable.

To Reproduce
Add multiple upstream backends (e.g. PHP-FPM) to nginx, break the first one and request a site using this backend.

Expected behavior
Log line is parsed properly and proper metrics are available

Configuration file (remove section, if not applicable):

namespaces:
  - name: nginxlog
    format: "$server_name $remote_addr - $remote_user [$time_local] \"$request\" $status $body_bytes_sent \"$http_referer\" \"$http_user_agent\" $request_time $upstream_response_time $pipe $ssl_protocol/$ssl_cipher $request_id $host"
    source:
      syslog:
        listen_address: "udp://127.0.0.1:8514"
        format: "rfc3164"
        tags: ["nginx"]
    relabel_configs:
      - target_label: "vhost"
        from: "server_name"
      - target_label: "ssl_protocol"
        from: "ssl_protocol"
      - target_label: "user_agent"
        from: "http_user_agent"
        whitelist: ["KeepAliveClient"]

Example log file (remove section, if not applicable):

www.example.com 10.0.0.1 - - [27/Oct/2021:06:00:14 +0200] "GET /websocket HTTP/1.1" 502 552 "-" "SomeUserAgentString" 0.000 0.000, 0.000, 0.000, 0.000 . TLSv1.3/TLS_AES_256_GCM_SHA384 1234567890 www.example.com
www.example.com 10.0.0.2 - - [27/Oct/2021:06:00:15 +0200] "GET /websocket HTTP/1.1" 502 552 "-" "AnotherUserAgent" 0.000 0.000, 0.000, 0.000, 0.000 . TLSv1.3/TLS_AES_256_GCM_SHA384 1234567890 www.example.com
www.example.com 10.0.0.3 - - [27/Oct/2021:06:00:15 +0200] "GET /websocket HTTP/1.1" 502 150 "-" "SomeUserAgentString" 0.001 0.000, 0.000, 0.000, 0.000 . TLSv1.3/TLS_AES_256_GCM_SHA384 1234567890 www.example.com

Metrics output (remove section, if not applicable):
Not exactly matching the log lines above, but the error is clearly visible, looking at ssl_protocol, which should have a value like TLSv1.3 only

nginxlog_http_response_time_seconds_hist_bucket{method="GET",ssl_protocol="0.000, 0.000 . TLSv1.3",status="502",user_agent="other",vhost="www.example.com",le="0.1"} 1116
nginxlog_http_response_time_seconds_hist_bucket{method="GET",ssl_protocol="0.000, 0.000 . TLSv1.3",status="502",user_agent="other",vhost="www.example.com",le="0.25"} 1116
nginxlog_http_response_time_seconds_hist_bucket{method="GET",ssl_protocol="0.000, 0.000 . TLSv1.3",status="502",user_agent="other",vhost="www.example.com",le="0.5"} 1116
nginxlog_http_response_time_seconds_hist_bucket{method="GET",ssl_protocol="0.000, 0.000 . TLSv1.3",status="502",user_agent="other",vhost="www.example.com",le="1"} 1116

Environment:

  • Exporter version: 1.9.0
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.3 LTS
  • Deployment method (e.g. Docker image, deb/rpm package, self-compiled, ...): binary
  • Others:
@saz saz added the bug label Oct 27, 2021
@martin-helmich
Copy link
Owner

Thanks for the report. I've been wondering that the appropriate behaviour would be in this case? Different solutions that come to mind:

  • Use only the last value (which should be the upstream request that actually succeeded)
  • Aggregate all values using some aggregation function (sum/min/max/avg)

Any thoughts?

Apart from that, the $upstream_response_time should probably be quoted in the log format (i.e. "$upstream_response_time") for the parser to pick up on the space-separated values correctly (in your example, it can be seen that the additional upstream response times actually bleed over into the ssl_protocol variable).

@lhanjian
Copy link

@martin-helmich I think use sum is a good idea.
We often need do "$request_time - $upstream_response_time_total" to calculate latency of nginx.

martin-helmich added a commit that referenced this issue Nov 20, 2021
martin-helmich added a commit that referenced this issue Nov 20, 2021
…ments (#221)

* Add failing test case for #217

* Correctly handle $upstream_response_time values with multiple measurements

* Refactor to reduce duplication
@martin-helmich
Copy link
Owner

@lhanjian Thanks for your suggestion, which I tend to agree with. #221 adds support for multiple values, which are summed into a single observation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants