$upstream_response_time might contain multiple, comma-separated values #217

saz · 2021-10-27T09:18:57Z

Describe the bug
If there are multiple upstream backends and more than one will be tried by nginx, $upstream_response_time will contain a comma-separated list of times.
If this is happening, the log line will be parsed incorrectly

nginx documentation for $upstream_response_time:

keeps time spent on receiving the response from the upstream server; the time is kept in seconds with millisecond resolution. Times of several responses are separated by commas and colons like addresses in the $upstream_addr variable.

To Reproduce
Add multiple upstream backends (e.g. PHP-FPM) to nginx, break the first one and request a site using this backend.

Expected behavior
Log line is parsed properly and proper metrics are available

Configuration file (remove section, if not applicable):

namespaces:
  - name: nginxlog
    format: "$server_name $remote_addr - $remote_user [$time_local] \"$request\" $status $body_bytes_sent \"$http_referer\" \"$http_user_agent\" $request_time $upstream_response_time $pipe $ssl_protocol/$ssl_cipher $request_id $host"
    source:
      syslog:
        listen_address: "udp://127.0.0.1:8514"
        format: "rfc3164"
        tags: ["nginx"]
    relabel_configs:
      - target_label: "vhost"
        from: "server_name"
      - target_label: "ssl_protocol"
        from: "ssl_protocol"
      - target_label: "user_agent"
        from: "http_user_agent"
        whitelist: ["KeepAliveClient"]

Example log file (remove section, if not applicable):

www.example.com 10.0.0.1 - - [27/Oct/2021:06:00:14 +0200] "GET /websocket HTTP/1.1" 502 552 "-" "SomeUserAgentString" 0.000 0.000, 0.000, 0.000, 0.000 . TLSv1.3/TLS_AES_256_GCM_SHA384 1234567890 www.example.com
www.example.com 10.0.0.2 - - [27/Oct/2021:06:00:15 +0200] "GET /websocket HTTP/1.1" 502 552 "-" "AnotherUserAgent" 0.000 0.000, 0.000, 0.000, 0.000 . TLSv1.3/TLS_AES_256_GCM_SHA384 1234567890 www.example.com
www.example.com 10.0.0.3 - - [27/Oct/2021:06:00:15 +0200] "GET /websocket HTTP/1.1" 502 150 "-" "SomeUserAgentString" 0.001 0.000, 0.000, 0.000, 0.000 . TLSv1.3/TLS_AES_256_GCM_SHA384 1234567890 www.example.com

Metrics output (remove section, if not applicable):
Not exactly matching the log lines above, but the error is clearly visible, looking at ssl_protocol, which should have a value like TLSv1.3 only

nginxlog_http_response_time_seconds_hist_bucket{method="GET",ssl_protocol="0.000, 0.000 . TLSv1.3",status="502",user_agent="other",vhost="www.example.com",le="0.1"} 1116
nginxlog_http_response_time_seconds_hist_bucket{method="GET",ssl_protocol="0.000, 0.000 . TLSv1.3",status="502",user_agent="other",vhost="www.example.com",le="0.25"} 1116
nginxlog_http_response_time_seconds_hist_bucket{method="GET",ssl_protocol="0.000, 0.000 . TLSv1.3",status="502",user_agent="other",vhost="www.example.com",le="0.5"} 1116
nginxlog_http_response_time_seconds_hist_bucket{method="GET",ssl_protocol="0.000, 0.000 . TLSv1.3",status="502",user_agent="other",vhost="www.example.com",le="1"} 1116

Environment:

Exporter version: 1.9.0
OS (e.g. from /etc/os-release): Ubuntu 20.04.3 LTS
Deployment method (e.g. Docker image, deb/rpm package, self-compiled, ...): binary
Others:

The text was updated successfully, but these errors were encountered:

martin-helmich · 2021-10-27T16:25:56Z

Thanks for the report. I've been wondering that the appropriate behaviour would be in this case? Different solutions that come to mind:

Use only the last value (which should be the upstream request that actually succeeded)
Aggregate all values using some aggregation function (sum/min/max/avg)

Any thoughts?

Apart from that, the $upstream_response_time should probably be quoted in the log format (i.e. "$upstream_response_time") for the parser to pick up on the space-separated values correctly (in your example, it can be seen that the additional upstream response times actually bleed over into the ssl_protocol variable).

lhanjian · 2021-11-17T15:05:16Z

@martin-helmich I think use sum is a good idea.
We often need do "$request_time - $upstream_response_time_total" to calculate latency of nginx.

…ments (#221) * Add failing test case for #217 * Correctly handle $upstream_response_time values with multiple measurements * Refactor to reduce duplication

martin-helmich · 2021-11-20T12:22:48Z

@lhanjian Thanks for your suggestion, which I tend to agree with. #221 adds support for multiple values, which are summed into a single observation.

saz added the bug label Oct 27, 2021

martin-helmich added a commit that referenced this issue Nov 20, 2021

Add failing test case for #217

71a82c2

martin-helmich mentioned this issue Nov 20, 2021

Correctly handle $upstream_response_time values with multiple measurements #221

Merged

martin-helmich closed this as completed in #221 Nov 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

$upstream_response_time might contain multiple, comma-separated values #217

$upstream_response_time might contain multiple, comma-separated values #217

saz commented Oct 27, 2021 •

edited

Loading

martin-helmich commented Oct 27, 2021

lhanjian commented Nov 17, 2021

martin-helmich commented Nov 20, 2021

$upstream_response_time might contain multiple, comma-separated values #217

$upstream_response_time might contain multiple, comma-separated values #217

Comments

saz commented Oct 27, 2021 • edited Loading

martin-helmich commented Oct 27, 2021

lhanjian commented Nov 17, 2021

martin-helmich commented Nov 20, 2021

saz commented Oct 27, 2021 •

edited

Loading