[Istio][Istiod Metrics]: Metrics are incorrectly dropped because of TSDS Dimension issue #11513

Open
BenB196 opened this issue Oct 24, 2024 · 2 comments
Labels
Integration:istio Istio needs:triage Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team [elastic/obs-cloudnative-monitoring]

Comments

BenB196 commented Oct 24, 2024

Integration Name

Istio [istio]

Dataset Name

istio.istiod_metrics

Integration Version

0.6.0

Agent Version

8.14.3

Agent Output Type

logstash

Elasticsearch Version

8.15.3

OS Version and Architecture

Container

Software/API Version

Istio 1.23.1

Error Message

No response

Event Original

No response

What did you do?

I was recently looking into an issue and noticed that Logstash was reporting a high number of document conflicts for the Istio integration.

What did you see?

The Istio metrics ingest pipeline overrides the istio.istiod.labels.job value, which causes a high number of document conflicts.

Here are 2 events that were rejected as "duplicates". In reality, no existing events in Elasticsearch would have matched them if the job label hadn't been overwritten from its original value.

[2024-10-24T20:55:10,747][WARN ][logstash.outputs.elasticsearch][elastic-agent][elastic_agent_elasticsearch_output] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-istio.istiod_metrics-private.default.production", :routing=>nil}, {"metricset"=>{"name"=>"collector", "period"=>10000}, "elastic_agent"=>{"version"=>"8.14.3", "id"=>"eb215fef-a3be-4c7e-99dd-4ef14fa09422", "snapshot"=>false}, "prometheus"=>{"pilot_k8s_reg_events"=>{"rate"=>0, "counter"=>122}, "labels"=>{"type"=>"EndpointSlice", "event"=>"add", "instance"=>"istiod.istio-system:15014", "job"=>"prometheus"}}, "@timestamp"=>2024-10-24T20:55:09.762Z, "service"=>{"type"=>"prometheus", "address"=>"http://istiod.istio-system:15014/metrics"}, "agent"=>{"type"=>"metricbeat", "version"=>"8.14.3", "id"=>"eb215fef-a3be-4c7e-99dd-4ef14fa09422", "name"=>"monitoring-cwzmb", "ephemeral_id"=>"d0144499-90a2-4b62-a372-5d764ac3b1bf"}, "ecs"=>{"version"=>"8.0.0"}, "tags"=>["beats_input_raw_event"], "@version"=>"1", "event"=>{"module"=>"prometheus", "dataset"=>"istio.istiod_metrics", "duration"=>7346460}, "data_stream"=>{"type"=>"metrics", "dataset"=>"istio.istiod_metrics", "namespace"=>"private.default.production"}, "host"=>{"hostname"=>"monitoring-cwzmb", "os"=>{"type"=>"linux", "version"=>"20.04.6 LTS (Focal Fossa)", "name"=>"Ubuntu", "platform"=>"ubuntu", "family"=>"debian", "kernel"=>"5.4.0-137-generic", "codename"=>"focal"}, "architecture"=>"x86_64", "containerized"=>true, "id"=>"96912ebd3bd4409194c45e17fda36045", "name"=>"monitoring-cwzmb"}}], :response=>{"create"=>{"status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[khWTx421ZP8kG17gAAABksBP0sI][LEIgNgvoLGR0pnOGLdGZAmS6WlVcdve3os_9icbU8DnL5pm1z9wY3r_si41D@2024-10-24T20:55:09.762Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"_UXCMxmYQSmzU4EePR_Gkw", "shard"=>"0", "index"=>".ds-metrics-istio.istiod_metrics-private.default.production-2024.10.24-000107"}}}}
[2024-10-24T20:55:10,747][WARN ][logstash.outputs.elasticsearch][elastic-agent][elastic_agent_elasticsearch_output] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-istio.istiod_metrics-private.default.production", :routing=>nil}, {"metricset"=>{"name"=>"collector", "period"=>10000}, "elastic_agent"=>{"version"=>"8.14.3", "id"=>"eb215fef-a3be-4c7e-99dd-4ef14fa09422", "snapshot"=>false}, "prometheus"=>{"pilot_k8s_reg_events"=>{"rate"=>0, "counter"=>36}, "labels"=>{"type"=>"Services", "event"=>"update", "instance"=>"istiod.istio-system:15014", "job"=>"prometheus"}}, "@timestamp"=>2024-10-24T20:55:09.762Z, "service"=>{"type"=>"prometheus", "address"=>"http://istiod.istio-system:15014/metrics"}, "agent"=>{"type"=>"metricbeat", "version"=>"8.14.3", "id"=>"eb215fef-a3be-4c7e-99dd-4ef14fa09422", "name"=>"monitoring-cwzmb", "ephemeral_id"=>"d0144499-90a2-4b62-a372-5d764ac3b1bf"}, "ecs"=>{"version"=>"8.0.0"}, "tags"=>["beats_input_raw_event"], "@version"=>"1", "event"=>{"module"=>"prometheus", "dataset"=>"istio.istiod_metrics", "duration"=>7341131}, "data_stream"=>{"type"=>"metrics", "dataset"=>"istio.istiod_metrics", "namespace"=>"private.default.production"}, "host"=>{"hostname"=>"monitoring-cwzmb", "os"=>{"type"=>"linux", "version"=>"20.04.6 LTS (Focal Fossa)", "name"=>"Ubuntu", "kernel"=>"5.4.0-137-generic", "codename"=>"focal", "platform"=>"ubuntu", "family"=>"debian"}, "architecture"=>"x86_64", "containerized"=>true, "id"=>"96912ebd3bd4409194c45e17fda36045", "name"=>"monitoring-cwzmb"}}], :response=>{"create"=>{"status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[0bDHwckAnL0f-jhCAAABksBP0sI][LEIgNgvoLGR0pnOGLdGZAmS6WlVcdve3okkBF1VJ6DEQ7exqjjeLtCZ7P4jb@2024-10-24T20:55:09.762Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"_UXCMxmYQSmzU4EePR_Gkw", "shard"=>"0", "index"=>".ds-metrics-istio.istiod_metrics-private.default.production-2024.10.24-000107"}}}}
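A minimal Python sketch for summarizing these warnings, assuming the two bracketed values in the `reason` string are the document ID and a series identifier followed by `@` and the timestamp. That reading of the format is an assumption, and the `REASON` pattern and `conflict_keys` helper are hypothetical names. The sample strings are only the `reason` fragments of the two log lines above.

```python
import re
from collections import Counter

# Assumed layout of the reason string: [doc_id][series@timestamp]: version conflict ...
REASON = re.compile(
    r'"reason"=>"\[(?P<doc_id>[^\]]+)\]'
    r'\[(?P<series>[^@\]]+)@(?P<timestamp>[^\]]+)\]'
)


def conflict_keys(lines):
    """Return (series, timestamp) pairs for every 409 reason found in the lines."""
    keys = []
    for line in lines:
        match = REASON.search(line)
        if match:
            keys.append((match["series"], match["timestamp"]))
    return keys


# Reason fragments copied from the two warnings above (rest of each line omitted).
sample = [
    '"reason"=>"[khWTx421ZP8kG17gAAABksBP0sI]'
    '[LEIgNgvoLGR0pnOGLdGZAmS6WlVcdve3os_9icbU8DnL5pm1z9wY3r_si41D'
    '@2024-10-24T20:55:09.762Z]: version conflict, document already exists"',
    '"reason"=>"[0bDHwckAnL0f-jhCAAABksBP0sI]'
    '[LEIgNgvoLGR0pnOGLdGZAmS6WlVcdve3okkBF1VJ6DEQ7exqjjeLtCZ7P4jb'
    '@2024-10-24T20:55:09.762Z]: version conflict, document already exists"',
]

keys = conflict_keys(sample)
counts = Counter(keys)  # how many rejected documents share each (series, timestamp)
```

Run over a full Logstash log, `counts` would show whether many rejected documents share a key or whether each conflict has its own. For the two fragments here, the extracted keys share a timestamp but differ in the series part.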

What did you expect to see?

I expect to see these documents properly ingested.

Anything else?

The issue appears to be that the Istio labels are used to generate a fingerprint:

- fingerprint:
    fields: [ "istio.istiod.labels" ]
    target_field: "istio.istiod.labels_id"
    ignore_missing: true

Which is then used as a TSDS dimension:

- name: labels_id
  type: keyword
  dimension: true
  description: Fingerprint generated by the labels.
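
To make the conflict mechanism concrete, here is a minimal Python sketch of how a time series data stream treats documents that share the same dimension values and `@timestamp`. It is a simplified model, not Elasticsearch's actual implementation: the `TimeSeriesIndex` class, the `series_key` helper, the use of SHA-256, and the placeholder `labels_id` values are all assumptions made for illustration.

```python
import hashlib
import json


def series_key(doc, dimensions):
    """Hash a document's dimension values (simplified stand-in for a series ID)."""
    values = {name: doc.get(name) for name in dimensions}
    payload = json.dumps(values, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class TimeSeriesIndex:
    """Toy index: at most one document per (series key, @timestamp) pair."""

    def __init__(self, dimensions):
        self.dimensions = dimensions
        self.docs = {}

    def create(self, doc):
        """Return 201 if the document is stored, 409 if its ID already exists."""
        doc_id = (series_key(doc, self.dimensions), doc["@timestamp"])
        if doc_id in self.docs:
            return 409
        self.docs[doc_id] = doc
        return 201


index = TimeSeriesIndex(dimensions=["istio.istiod.labels_id"])
timestamp = "2024-10-24T20:55:09.762Z"

# The labels_id values below are made-up placeholders.
first = {"@timestamp": timestamp, "istio.istiod.labels_id": "id-A", "counter": 122}
same_series = {"@timestamp": timestamp, "istio.istiod.labels_id": "id-A", "counter": 36}
other_series = {"@timestamp": timestamp, "istio.istiod.labels_id": "id-B", "counter": 36}

statuses = [index.create(first), index.create(same_series), index.create(other_series)]
print(statuses)  # [201, 409, 201]
```

In this model the metric value plays no part in the document ID, so a second document with the same dimensions and timestamp is rejected even when its payload differs.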

The problem is that the job label, one of the key "dimension" labels, is always overwritten before the fingerprint is generated:

- set:
    field: istio.istiod.labels.job
    value: istio
    override: true

It's not clear why this value is overwritten in the first place, but with the change to TSDS and dimensions, it now seems to cause a high number of Istiod Metrics to be dropped.
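
The suspected effect can be sketched in Python, assuming the set processor really does run before the fingerprint. The `fingerprint` function below is illustrative only (SHA-1 over sorted JSON, base64 encoded) and is not byte-compatible with Elasticsearch's fingerprint processor; `run_pipeline` and the `other-job` label value are hypothetical.

```python
import base64
import hashlib
import json


def fingerprint(labels):
    """Illustrative fingerprint: SHA-1 over the sorted labels, base64 encoded."""
    payload = json.dumps(labels, sort_keys=True).encode("utf-8")
    return base64.b64encode(hashlib.sha1(payload).digest()).decode("ascii")


def run_pipeline(labels):
    """Mimic the order described above: overwrite job, then fingerprint."""
    processed = dict(labels)
    processed["job"] = "istio"  # the unconditional override
    return {"labels": processed, "labels_id": fingerprint(processed)}


# Two hypothetical label sets that differ only in their original job label.
a = {"instance": "istiod.istio-system:15014", "job": "prometheus", "version": "1.23.1"}
b = {"instance": "istiod.istio-system:15014", "job": "other-job", "version": "1.23.1"}
# A label set that differs in another label keeps its own fingerprint.
c = {"instance": "istiod.istio-system:15014", "job": "prometheus", "type": "Services"}

same_before = fingerprint(a) == fingerprint(b)
same_after = run_pipeline(a)["labels_id"] == run_pipeline(b)["labels_id"]
distinct_other = run_pipeline(a)["labels_id"] != run_pipeline(c)["labels_id"]
```

Here `same_before` is false and `same_after` is true: once job is forced to a constant, it can no longer distinguish two series. Label sets that differ in any other label, such as `c`, still get distinct fingerprints.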

@BenB196 BenB196 changed the title [Integration Name]: Brief description of the issue [Istio]: Metrics are incorrectly dropped because of TSDS Dimension issue Oct 24, 2024
@BenB196 BenB196 changed the title [Istio]: Metrics are incorrectly dropped because of TSDS Dimension issue [Istio][Istiod Metrics]: Metrics are incorrectly dropped because of TSDS Dimension issue Oct 24, 2024
@andrewkroh andrewkroh added Integration:istio Istio Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team [elastic/obs-cloudnative-monitoring] labels Oct 24, 2024
BenB196 commented Oct 25, 2024

Looking at the history of this processor:

- set:
    field: istio.istiod.labels.job
    value: istio
    override: true

It was added as part of the original PR #4253.

It's not clear why this was added. I suspect it could be removed, which would resolve this issue.

BenB196 commented Oct 26, 2024

Looking at this a bit more closely, I'm not actually sure if this is a "bug" or intended.

Using a more specific example:

"_source": {
  "@timestamp": "2024-10-26T12:50:14.571Z",
  "@version": "1",
  "agent": {
    "ephemeral_id": "012b62b9-8748-4257-b148-4e82191cfdd8",
    "id": "d3c8a4ad-d4c1-41a6-bc4d-32942f79f522",
    "name": "monitoring-fq875",
    "type": "metricbeat",
    "version": "8.14.3"
  },
  "data_stream": {
    "dataset": "istio.istiod_metrics",
    "namespace": "private.default.production",
    "type": "metrics"
  },
  "ecs": {
    "version": "8.6.0"
  },
  "elastic_agent": {
    "id": "d3c8a4ad-d4c1-41a6-bc4d-32942f79f522",
    "snapshot": false,
    "version": "8.14.3"
  },
  "event": {
    "agent_id_status": "auth_metadata_missing",
    "dataset": "istio.istiod_metrics",
    "duration": 7565596,
    "ingested": "2024-10-26T12:50:25Z",
    "kind": "metric",
    "module": "istio"
  },
  "host": {
    "architecture": "x86_64",
    "containerized": true,
    "hostname": "monitoring-fq875",
    "id": "047f4adf0d834eaa883d97a880781760",
    "name": "monitoring-fq875",
    "os": {
      "codename": "focal",
      "family": "debian",
      "kernel": "5.4.0-137-generic",
      "name": "Ubuntu",
      "platform": "ubuntu",
      "type": "linux",
      "version": "20.04.6 LTS (Focal Fossa)"
    }
  },
  "istio": {
    "istiod": {
      "labels": {
        "instance": "istiod.istio-system:15014",
        "job": "prometheus",
        "version": "1.23.1"
      },
      "labels_id": "rhdvqrHt7hTr7GH5lFq2mD31JGA=",
      "metrics": {
        "pilot_xds": {
          "value": 5
        }
      }
    }
  },
  "metricset": {
    "period": 10000
  },
  "tags": "beats_input_raw_event"
}
[2024-10-26T12:50:25,557][WARN ][logstash.outputs.elasticsearch][elastic-agent][elastic_agent_elasticsearch_output] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-istio.istiod_metrics-private.default.production", :routing=>nil}, {"prometheus"=>{"labels"=>{"version"=>"1.23.1", "instance"=>"istiod.istio-system:15014", "job"=>"prometheus"}, "pilot_xds"=>{"value"=>5}}, "event"=>{"module"=>"prometheus", "dataset"=>"istio.istiod_metrics", "duration"=>8044930}, "tags"=>["beats_input_raw_event"], "@timestamp"=>2024-10-26T12:50:14.571Z, "ecs"=>{"version"=>"8.0.0"}, "@version"=>"1", "agent"=>{"ephemeral_id"=>"012b62b9-8748-4257-b148-4e82191cfdd8", "version"=>"8.14.3", "id"=>"d3c8a4ad-d4c1-41a6-bc4d-32942f79f522", "name"=>"monitoring-fq875", "type"=>"metricbeat"}, "metricset"=>{"name"=>"collector", "period"=>10000}, "data_stream"=>{"namespace"=>"private.default.production", "dataset"=>"istio.istiod_metrics", "type"=>"metrics"}, "service"=>{"type"=>"prometheus", "address"=>"http://istiod.istio-system:15014/metrics"}, "host"=>{"hostname"=>"monitoring-fq875", "containerized"=>true, "architecture"=>"x86_64", "id"=>"047f4adf0d834eaa883d97a880781760", "name"=>"monitoring-fq875", "os"=>{"version"=>"20.04.6 LTS (Focal Fossa)", "name"=>"Ubuntu", "codename"=>"focal", "type"=>"linux", "platform"=>"ubuntu", "family"=>"debian", "kernel"=>"5.4.0-137-generic"}}, "elastic_agent"=>{"version"=>"8.14.3", "id"=>"d3c8a4ad-d4c1-41a6-bc4d-32942f79f522", "snapshot"=>false}}], :response=>{"create"=>{"status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[yFVUdPrnE3bEh5JiAAABksjglas][LEIgNgvoLGR0pnOGLdGZAmTvTk69cY8zMZwD0_9-b9Zq6XJvE2Y_RiZOv_8u@2024-10-26T12:50:14.571Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"_UXCMxmYQSmzU4EePR_Gkw", "shard"=>"0", "index"=>".ds-metrics-istio.istiod_metrics-private.default.production-2024.10.24-000107"}}}}

These 2 events are almost identical; the only difference is the event.duration value:

"duration": 7565596, -> "duration"=>8044930

It'd seem really weird to add duration as a TSDS dimension, but that seems to be the only difference between these 2 events. I'm not sure if this should really be a "bug" that gets fixed, or left as is.
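
The comparison above can be reproduced with a small Python sketch. The two dictionaries are trimmed copies of the events, normalized to the same field names (the rejected event actually arrives under the `prometheus` namespace before the pipeline renames it), so treat them as an assumption. The `flatten` and `differing_fields` helpers are hypothetical.

```python
def flatten(doc, prefix=""):
    """Flatten nested dictionaries into dotted field paths."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def differing_fields(left, right):
    """Return the dotted paths present in both documents whose values differ."""
    flat_left, flat_right = flatten(left), flatten(right)
    shared = flat_left.keys() & flat_right.keys()
    return sorted(path for path in shared if flat_left[path] != flat_right[path])


labels = {"instance": "istiod.istio-system:15014", "job": "prometheus", "version": "1.23.1"}

indexed = {
    "@timestamp": "2024-10-26T12:50:14.571Z",
    "event": {"duration": 7565596},
    "labels": dict(labels),
    "pilot_xds": {"value": 5},
}
rejected = {
    "@timestamp": "2024-10-26T12:50:14.571Z",
    "event": {"duration": 8044930},
    "labels": dict(labels),
    "pilot_xds": {"value": 5},
}

changed = differing_fields(indexed, rejected)
print(changed)  # ['event.duration']
```

Since event.duration is not a dimension, the two documents resolve to the same series and timestamp, which is consistent with the 409 response.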
