
Metrics does not count requests on ingresses without host #3713

Closed
jonaz opened this issue Jan 31, 2019 · 29 comments
Comments

@jonaz

jonaz commented Jan 31, 2019

Is this a request for help? Yes, a bug with a possible solution.

What keywords did you search in NGINX Ingress controller issues before filing this one? missing metrics (#3053 looked promising, but it concerns a different version; according to the git history this bug was introduced in 0.20.0)


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

NGINX Ingress controller version: 0.22.0 (bug introduced in 0.20.0)

Kubernetes version (use kubectl version): 1.7.14

Environment:

  • Cloud provider or hardware configuration: bare metal
  • OS (e.g. from /etc/os-release): coreos 1967.4.0
  • Kernel (e.g. uname -a): 4.14.96-coreos
  • Install tools:
  • Others:

What happened: Ingresses without a specific host do not report metrics.

What you expected to happen: All ingresses should have metrics.

How to reproduce it (as minimally and precisely as possible):
Create an ingress without host set.

Anything else we need to know:

Hi, we just upgraded from 0.15.0 to 0.22.0 and now have this issue with disappearing metrics.

I looked at the code and found the commit that introduces the problem:
9766ad8

If you have an ingress that does not specify a host, it will never find a match and the metric counter is never increased.

So all ingresses without a host are now without metrics :(

I'm not 100% familiar with the codebase and not quite sure how to solve it, since at that point we cannot know whether the ingress is missing a host. Perhaps the hosts field on SocketCollector could be a

map[string]struct { // keyed by ingress name
    Wildcard bool
    Hosts    sets.String
}

That way we would know not to skip the request when the ingress is a wildcard.
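To make the idea concrete, here is a rough, self-contained sketch of how such a wildcard-aware lookup could behave. All names (hostEntry, shouldCount) are hypothetical, not the real SocketCollector API, and sets.String from the Kubernetes libraries is replaced by a plain map to keep the example standalone:

```go
package main

import "fmt"

// hostEntry is a sketch of the per-ingress record proposed above: the
// served hostnames plus whether the ingress uses a wildcard host.
type hostEntry struct {
	Wildcard bool
	Hosts    map[string]struct{}
}

// shouldCount decides whether to record a metric: exact host matches count,
// and so does any host when the ingress is known to be a wildcard, while
// traffic for unknown ingresses is still dropped.
func shouldCount(entries map[string]hostEntry, ingress, host string) bool {
	e, ok := entries[ingress]
	if !ok {
		return false // unknown ingress: treat as junk traffic and skip
	}
	if _, ok := e.Hosts[host]; ok {
		return true // exact host match: count as before
	}
	return e.Wildcard // wildcard ingress: count even though the host varies
}

func main() {
	entries := map[string]hostEntry{
		"kafka-http-ingress": {Wildcard: true, Hosts: map[string]struct{}{}},
	}
	fmt.Println(shouldCount(entries, "kafka-http-ingress", "route-utv.fnox.se")) // true
	fmt.Println(shouldCount(entries, "unknown-ingress", "evil.example"))         // false
}
```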

This is what msg in handleMessage in SocketCollector looks like for a request to an ingress that does not have a host. It looks correct:

[{
    "requestLength": 1190,
    "ingress": "kafka-http-ingress",
    "status": "201",
    "service": "kafka-http",
    "requestTime": 0.024,
    "namespace": "default",
    "host": "route-utv.fnox.se",
    "method": "POST",
    "upstreamResponseTime": 0.024,
    "upstreamResponseLength": 4,
    "upstreamLatency": 0.014,
    "path": "\/internalapi\/kafka",
    "responseLength": 213
}]
@aledbf
Member

aledbf commented Jan 31, 2019

@jonaz actually this change was intentional. Please check #3116. How can we differentiate a valid request from a DoS?

@jonaz
Author

jonaz commented Jan 31, 2019

@aledbf Perhaps we could match the prefix of the request path against the paths on the ingress, in addition to the host filtering? In that case we would never have more metric series than the host+path combinations across all ingresses.
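A minimal sketch of that path-prefix check (illustrative names, not the actual controller code):

```go
package main

import (
	"fmt"
	"strings"
)

// matchesIngressPath sketches the suggestion above: in addition to host
// filtering, accept a request for metrics only if its path is prefixed by
// one of the paths defined on the ingress. This bounds the label set to the
// host+path combinations that actually exist in the cluster.
func matchesIngressPath(ingressPaths []string, requestPath string) bool {
	for _, p := range ingressPaths {
		if strings.HasPrefix(requestPath, p) {
			return true
		}
	}
	return false
}

func main() {
	paths := []string{"/internalapi/kafka", "/api"}
	fmt.Println(matchesIngressPath(paths, "/internalapi/kafka/messages")) // true
	fmt.Println(matchesIngressPath(paths, "/random-dos-path"))            // false
}
```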

Right now we have lost the ability to monitor/alert on the request failure rate of a majority of our services... which is really bad for us in the ops department. We use a single domain (for CORS reasons) and just have ingresses routing on path for about 120 microservices.

@jonaz
Author

jonaz commented Feb 4, 2019

@aledbf would you accept a PR with the proposed fix, or do we need to maintain a fork or move to another ingress controller?

@komljen

komljen commented Mar 8, 2019

@jonaz Did you find a solution for this?

@jonaz
Author

jonaz commented Mar 9, 2019

@komljen no, still waiting for an answer from @aledbf on whether my proposed solution would be accepted.

I guess we have to go back to Traefik in the meantime.

@adamelliotfields

Just wanted to thank @jonaz for opening this issue, as this was the reason Grafana wasn't showing anything for me.

It might be worth mentioning this in the monitoring section of the docs. I'd imagine others might assume a fan-out without a host would generate metrics.

@dotLou

dotLou commented May 2, 2019

Would it make sense to change the code here to export all ingress metrics when per-host metrics are disabled, though? We're losing a lot of metrics that are essential for monitoring in production, because they are simply not passed on.

Here's the flag I'm talking about: #3594

Here's the code in question I'm suggesting we change:

if !sc.hosts.Has(stats.Host) {
klog.V(3).Infof("skiping metric for host %v that is not being served", stats.Host)
continue
}
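A sketch of the suggested change, assuming a hypothetical metricsPerHost boolean mirroring the --metrics-per-host flag (#3594); the function and parameter names are illustrative, not the real collector code:

```go
package main

import "fmt"

// keepMetric sketches the idea discussed above: only apply the known-host
// filter when per-host labels are enabled. With metrics-per-host=false no
// host label is emitted on the series, so there is no cardinality risk and
// every observed request can be exported.
func keepMetric(metricsPerHost bool, knownHosts map[string]struct{}, host string) bool {
	if !metricsPerHost {
		return true // no host label on the series: export everything
	}
	_, ok := knownHosts[host]
	return ok // with per-host labels, keep only hosts actually served
}

func main() {
	known := map[string]struct{}{"prod.example.com": {}}
	fmt.Println(keepMetric(false, known, "anything.example.com")) // true
	fmt.Println(keepMetric(true, known, "anything.example.com"))  // false
	fmt.Println(keepMetric(true, known, "prod.example.com"))      // true
}
```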

@jonaz
Author

jonaz commented May 2, 2019

My solution above would protect against DDOS and also support metrics on all paths+hosts. But no word from maintainers here yet.

@dotLou

dotLou commented May 7, 2019

My solution above would protect against DDOS and also support metrics on all paths+hosts. But no word from maintainers here yet.

I like your solution in terms of host labels being present on the metrics, as it definitely protects against high cardinality in case of a DDOS. Having said that, when metrics-per-host=false the benefit IMO is that there is no longer a DDOS risk in terms of metric label cardinality. At least from my point of view, in that case I would like to get all metrics, including those from any potential DDOS. In fact, a DDOS would become identifiable by a big increase in 404s in the metrics with this capability.

My thinking is that maybe we need both; if metrics-per-host=true, regex check to determine whether to export or not. If metrics-per-host=false, export everything. Alternatively, even if the path doesn't match and metrics-per-host=true, maybe we could remove the host label for things that don't match a known host (this way, we get all the metrics in all cases). Thoughts? Did I understand your suggestion properly?

It would certainly be useful to get some thoughts from some of the maintainers here; I'm interested in submitting at least one PR to do one or the other approach, but I don't want to do that work if there's no interest in accepting such a change.

@choffmeister
Contributor

Are there any concrete plans or a timeline for this? We also have this issue: we are basically blind in terms of nginx metrics, as we give every customer a custom subdomain (so we use a wildcard domain and hence have no metrics). IMHO a good solution would be to just record the wildcard host. For example:

We have an ingress for host *.domain.com. If a customer accesses our system via customer1.domain.com, then nginx could just record this request under *.domain.com, and the same for every other xxx.domain.com. All requests handled by the one wildcard ingress go into one bucket. Hence there would be no problem with unbounded cardinality in the metrics, and so no DDOS problem, but we could still distinguish requests for the different ingresses.
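A rough sketch of that bucketing idea (the function name and inputs are hypothetical, not ingress-nginx code):

```go
package main

import (
	"fmt"
	"strings"
)

// bucketHost collapses any hostname served by a wildcard ingress into the
// wildcard pattern itself, so the host label keeps a bounded cardinality:
// customer1.domain.com and customer2.domain.com both land in *.domain.com.
func bucketHost(wildcards []string, host string) string {
	for _, w := range wildcards {
		// "*.domain.com" matches any host ending in ".domain.com"
		if strings.HasPrefix(w, "*.") && strings.HasSuffix(host, w[1:]) {
			return w
		}
	}
	return host // no wildcard matched: record the host as-is
}

func main() {
	wc := []string{"*.domain.com"}
	fmt.Println(bucketHost(wc, "customer1.domain.com")) // *.domain.com
	fmt.Println(bucketHost(wc, "other.example.org"))    // other.example.org
}
```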

I would try to help improve this, but I have no experience with Go :(

@choffmeister
Contributor

I created PR #4139 so that one at least gets metrics when running with --metrics-per-host=false.

@alexgavrisco

I totally agree with @choffmeister - there are quite a few use cases where you have to use wildcards and still need metrics (without tagging them with the host).
There must be a way to get metrics even for an ingress with a wildcard, without DoS-ing Prometheus etc.

However, I think this behavior should be controlled via a separate option/flag, for several reasons:

  • for folks that use metrics-per-host, changing its behavior is a breaking change
  • there are use cases where you might want to get metrics for the wildcard and still tag them with the hostname (e.g. another LB in front of nginx)

My suggestion would be a flag that disables metric filtering for ingresses with a wildcard, keeping metrics-per-host with its current behavior.

@aledbf What do you think about this approach?

@mosheBelostotsky

Hi @jonaz Any news on this issue?
It's very important for our use case...

@rlees85

rlees85 commented Jul 1, 2019

Shame the PR seems to have gone quiet... it's also blocking us from upgrading from 0.18, so I'm looking forward to it being merged.

@hairyhenderson

@jonaz @aledbf any progress on this? We're essentially flying blind with our ingress right now because of this bug...

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 2, 2019
@essh

essh commented Oct 2, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 2, 2019
@jonaz
Author

jonaz commented Oct 3, 2019

@jonaz @aledbf any progress on this? We're essentially flying blind with our ingress right now because of this bug...

@hairyhenderson
since no reply from @aledbf regarding my implementation suggestion we are migrating to traefik 2.0 instead.

@jacksontj
Contributor

@aledbf I also just ran into this issue (today, as a matter of fact). IMO the linked PR (#4139) is a reasonable way to go. In the case that someone has a wildcard ingress, IMO it makes sense to just exclude the host label -- since then I can still get metrics based on the ingress name (which also has fixed cardinality).

Is there anything remaining to get the PR merged in?

@rlees85

rlees85 commented Nov 17, 2019

@aledbf I also just ran into this issue (today, as a matter of fact). IMO the linked PR (#4139) is a reasonable way to go. In the case that someone has a wildcard ingress, IMO it makes sense to just exclude the host label -- since then I can still get metrics based on the ingress name (which also has fixed cardinality).

Is there anything remaining to get the PR merged in?

Very unlikely to ever happen. I'd recommend you give all of your ingresses a host (this is the approach I took to get around this issue). It isn't that hard, especially when using some form of automation to deploy Kubernetes apps (Helm, or in my case Terraform). Sticky sessions also start working again when doing this :)

Long term, Traefik might be the way to go, depending on whether ingress-nginx picks up a new maintainer now that aledbf is stepping down.

Disclaimer: this isn't a dig; ingress-nginx is a great piece of work and I highly appreciate the work aledbf has done on it. Just trying to point out a workaround, and the fact that things are bleak right now, but hopefully someone will step up.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 15, 2020
@jacksontj
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 15, 2020
@shivpathak

shivpathak commented Apr 25, 2020

I am also experiencing the same issue: I don't get nginx_ingress_controller_requests metrics if I don't define the host.
For example:

This doesn't generate metrics for nginx_ingress_controller_requests:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: service-nginx
  name: observability-es
  namespace: capabilities
spec:
  rules:
  - http:
      paths:
      - backend:
          serviceName: elasticsearch-es-http
          servicePort: 9200
        path: /

This generates metrics for nginx_ingress_controller_requests:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: service-nginx
  name: observability-es
  namespace: capabilities
spec:
  rules:
  - host: dev.microservices.test.test
    http:
      paths:
      - backend:
          serviceName: elasticsearch-es-http
          servicePort: 9200
        path: /

It would be great if we could get a fix for this.

@moljor

moljor commented Apr 28, 2020

I even lose my metrics when I make my host a wildcard.
So the host exists; it just changed from e.g. "prod.something.com" to "*.something.com".

@DanOfir

DanOfir commented Apr 29, 2020

I also cannot see metrics for hosts with a wildcard. Is there a workaround?

@hairyhenderson

@DanOfir no, this has effectively stalled, and caused more than a few people to simply switch to a different ingress controller. See #4139 (comment) for some related conversation.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2020
@essh

essh commented Jul 29, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2020
@aledbf
Member

aledbf commented Aug 27, 2020

Closing. Fixed in master #4139
