Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INCIDENT-23437 Always resolve hostname regardless of how dbm or disable_generic_tags are set #16201

Merged
merged 2 commits into from
Nov 20, 2023

Conversation

alexandre-normand
Copy link
Contributor

What does this PR do?

This change makes sure that the check's resolved_hostname always represents the database instance. The previous behavior of the resolved_hostname property was to use the agent hostname as the host when dbm was disabled or disable_generic_tags was enabled. This caused issues when, with agent 7.48.0, we started sending database_instance metadata. With the agent_hostname being used instance of each instance's host value, agents with multiple databases being monitored would end up creating and clobbering the same database_instance which, in turn, caused tag interference between database instances monitored by the same agent.

Very importantly, that problem surfaced the broken experience when dbm is disabled in regards to the host used on integration metrics: instead of having the postgres metrics tagged by the database instance's host, they were tagged by the agent hostname which would only be correct when the agent is running on the same host as the database.

Here is an example showing this deployment of a GCP cloud sql database instance with dbm disabled (stolen from the sister PR to the postgres integration but this applies to mysql as well):

  - dbms: postgres
    dbms_version: cloudsql-postgres-15-sandbox
    cloud: true
    instances:
      - host: '10.x.x.x'
        dbm: false
        port: 5432
        ssl: "allow"   
        autodiscovery_exclude:
          - cloudsqladmin

This, running in kubernetes, was deployed on a pod / host named postgres-f06e6baf-7cb8dcd888-tmds4.
Looking at the metrics emitted by the integration, we can see that the host tag is postgres-f06e6baf-7cb8dcd888-tmds4, the agent host, rather than 10.x.x.x`.
Screenshot 2023-11-13 at 3 53 47 PM

Note that this is going to be a breaking change but given the currently broken experience, this is something we want to do in order to improve the experience and to avoid potential future incidents resulting in the confusing implementation of resolved_hostname.

Motivation

Additional Notes

  • We considered an alternative of creating a new property (database_instance_name) that would be used for everything except the core integration metrics. It would essentially be the same implementation as the resolved_hostname property from this PR but would leave resolved_hostname as-is. We decided to go with the better approach of fixing the broken experience at the risk of causing friction for users relying on that broken behavior.

  • It's also worth noting that the experience for aws rds instances was less broken than others because of some custom inference of additional host, hostname and aws tags here. The metric points were still also tagged with the monitoring agent's hostname which was not correct but easier to deal with when filtering on aws rds instance was possible.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Changelog entries must be created for modifications to shipped code
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.

@ghost ghost added the integration/mysql label Nov 14, 2023
Copy link

github-actions bot commented Nov 14, 2023

Test Results

     16 files       16 suites   15m 0s ⏱️
   138 tests    138 ✔️ 0 💤 0
1 104 runs  1 095 ✔️ 9 💤 0

Results for commit 7862764.

♻️ This comment has been updated with latest results.

Copy link

codecov bot commented Nov 14, 2023

Codecov Report

Merging #16201 (7862764) into master (305f739) will increase coverage by 0.10%.
Report is 11 commits behind head on master.
The diff coverage is 100.00%.

Additional details and impacted files
Flag Coverage Δ
activemq ?
cassandra ?
confluent_platform ?
hive ?
hivemq ?
hudi ?
ignite ?
jboss_wildfly ?
kafka ?
mysql 87.39% <100.00%> (-0.04%) ⬇️
presto ?
solr ?
tomcat ?
weblogic ?

Flags with carried forward coverage won't be shown. Click here to find out more.

@alexandre-normand alexandre-normand marked this pull request as ready for review November 14, 2023 03:28
@alexandre-normand alexandre-normand requested review from a team as code owners November 14, 2023 03:28
@alexandre-normand alexandre-normand merged commit 890c145 into master Nov 20, 2023
35 checks passed
@alexandre-normand alexandre-normand deleted the alex.normand/INCIDENT-23437-mysql branch November 20, 2023 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants