Add xDS-related metrics #4634

Merged (2 commits) on Oct 10, 2019

@csssuf csssuf commented Sep 25, 2019

What does this PR do?

The xDS/config update-related metrics in the Envoy integration are
currently partially out-of-sync with what Envoy reports. Shore up these
metrics to be accurate.

Motivation

Monitoring communication between Envoys and their control plane.

Review checklist (to be filled by reviewers)

  • PR title must be written as a CHANGELOG entry (see why)
  • File changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have changelog/ and integration/ labels attached
  • Feature or bugfix must have tests
  • Git history must be clean
  • If PR adds a configuration option, it must be added to the configuration file.

(),
),
'method': 'monotonic_count',
},
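
For readers unfamiliar with this kind of definition, here is a minimal sketch of how a table entry with a `'method'` key could drive metric submission. It is not the integration's actual code: the `METRICS` entries, the `RecordingSubmitter` class, and the `submit` helper are all hypothetical, and the two stat names are used only as illustrative xDS-related examples.

```python
# Hypothetical metric table, shaped like the fragment above.
# Each entry names the submission method to use for that Envoy stat.
METRICS = {
    'cluster_manager.cds.update_attempt': {
        'tags': ((), (), ()),
        'method': 'monotonic_count',
    },
    'control_plane.connected_state': {
        'tags': ((), ()),
        'method': 'gauge',
    },
}


class RecordingSubmitter:
    """Stand-in for a metrics client; it only records what it is asked to send."""

    def __init__(self):
        self.calls = []

    def gauge(self, name, value):
        self.calls.append(('gauge', name, value))

    def monotonic_count(self, name, value):
        self.calls.append(('monotonic_count', name, value))


def submit(submitter, stat_name, value, prefix='envoy.'):
    """Look up the stat and dispatch to the method named in its definition.

    Returns False for stats that have no definition, True otherwise.
    """
    definition = METRICS.get(stat_name)
    if definition is None:
        return False
    getattr(submitter, definition['method'])(prefix + stat_name, value)
    return True
```

In this sketch, a stat missing from the table is silently skipped, which is one way a metric list can drift out of sync with what Envoy reports.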
Member

Thanks for this PR. Looks good to me overall.

Could you test all or some of these new metrics in e2e, by adding them to the list here?

'envoy.cluster.bind_errors',
'envoy.cluster.lb_healthy_panic',
'envoy.cluster.lb_local_cluster_not_ok',
'envoy.cluster.lb_recalculate_zone_structures',
'envoy.cluster.lb_subsets_active',
'envoy.cluster.lb_subsets_created',
'envoy.cluster.lb_subsets_fallback',
'envoy.cluster.lb_subsets_removed',
'envoy.cluster.lb_subsets_selected',
'envoy.cluster.lb_zone_cluster_too_small',
'envoy.cluster.lb_zone_no_capacity_left',
'envoy.cluster.lb_zone_number_differs',
'envoy.cluster.lb_zone_routing_all_directly',
'envoy.cluster.lb_zone_routing_cross_zone',
'envoy.cluster.lb_zone_routing_sampled',
'envoy.cluster.max_host_weight',
'envoy.cluster.membership_change',
'envoy.cluster.membership_healthy',
'envoy.cluster.membership_total',
'envoy.cluster.retry_or_shadow_abandoned',
'envoy.cluster.update_attempt',
'envoy.cluster.update_empty',
'envoy.cluster.update_failure',
'envoy.cluster.update_success',
'envoy.cluster.upstream_cx_active',
'envoy.cluster.upstream_cx_close_notify',
'envoy.cluster.upstream_cx_connect_attempts_exceeded',
'envoy.cluster.upstream_cx_connect_fail',
'envoy.cluster.upstream_cx_connect_timeout',
'envoy.cluster.upstream_cx_destroy',
'envoy.cluster.upstream_cx_destroy_local',
'envoy.cluster.upstream_cx_destroy_local_with_active_rq',
'envoy.cluster.upstream_cx_destroy_remote',
'envoy.cluster.upstream_cx_destroy_remote_with_active_rq',
'envoy.cluster.upstream_cx_destroy_with_active_rq',
'envoy.cluster.upstream_cx_http1_total',
'envoy.cluster.upstream_cx_http2_total',
'envoy.cluster.upstream_cx_max_requests',
'envoy.cluster.upstream_cx_none_healthy',
'envoy.cluster.upstream_cx_overflow',
'envoy.cluster.upstream_cx_protocol_error',
'envoy.cluster.upstream_cx_rx_bytes_buffered',
'envoy.cluster.upstream_cx_rx_bytes_total',
'envoy.cluster.upstream_cx_total',
'envoy.cluster.upstream_cx_tx_bytes_buffered',
'envoy.cluster.upstream_cx_tx_bytes_total',
'envoy.cluster.upstream_flow_control_backed_up_total',
'envoy.cluster.upstream_flow_control_drained_total',
'envoy.cluster.upstream_flow_control_paused_reading_total',
'envoy.cluster.upstream_flow_control_resumed_reading_total',
'envoy.cluster.upstream_rq_active',
'envoy.cluster.upstream_rq_cancelled',
'envoy.cluster.upstream_rq_completed',
'envoy.cluster.upstream_rq_maintenance_mode',
'envoy.cluster.upstream_rq_pending_active',
'envoy.cluster.upstream_rq_pending_failure_eject',
'envoy.cluster.upstream_rq_pending_overflow',
'envoy.cluster.upstream_rq_pending_total',
'envoy.cluster.upstream_rq_per_try_timeout',
'envoy.cluster.upstream_rq_retry',
'envoy.cluster.upstream_rq_retry_overflow',
'envoy.cluster.upstream_rq_retry_success',
'envoy.cluster.upstream_rq_rx_reset',
'envoy.cluster.upstream_rq_timeout',
'envoy.cluster.upstream_rq_total',
'envoy.cluster.upstream_rq_tx_reset',
'envoy.cluster.version',
'envoy.cluster_manager.active_clusters',
'envoy.cluster_manager.cluster_added',
'envoy.cluster_manager.cluster_modified',
'envoy.cluster_manager.cluster_removed',
'envoy.cluster_manager.warming_clusters',
'envoy.http.downstream_cx_active',
'envoy.http.downstream_cx_destroy',
'envoy.http.downstream_cx_destroy_active_rq',
'envoy.http.downstream_cx_destroy_local',
'envoy.http.downstream_cx_destroy_local_active_rq',
'envoy.http.downstream_cx_destroy_remote',
'envoy.http.downstream_cx_destroy_remote_active_rq',
'envoy.http.downstream_cx_drain_close',
'envoy.http.downstream_cx_http1_active',
'envoy.http.downstream_cx_http1_total',
'envoy.http.downstream_cx_http2_active',
'envoy.http.downstream_cx_http2_total',
'envoy.http.downstream_cx_idle_timeout',
'envoy.http.downstream_cx_protocol_error',
'envoy.http.downstream_cx_rx_bytes_buffered',
'envoy.http.downstream_cx_rx_bytes_total',
'envoy.http.downstream_cx_ssl_active',
'envoy.http.downstream_cx_ssl_total',
'envoy.http.downstream_cx_total',
'envoy.http.downstream_cx_tx_bytes_buffered',
'envoy.http.downstream_cx_tx_bytes_total',
'envoy.http.downstream_flow_control_paused_reading_total',
'envoy.http.downstream_flow_control_resumed_reading_total',
'envoy.http.downstream_rq_1xx',
'envoy.http.downstream_rq_2xx',
'envoy.http.downstream_rq_3xx',
'envoy.http.downstream_rq_4xx',
'envoy.http.downstream_rq_5xx',
'envoy.http.downstream_rq_active',
'envoy.http.downstream_rq_http1_total',
'envoy.http.downstream_rq_http2_total',
'envoy.http.downstream_rq_non_relative_path',
'envoy.http.downstream_rq_response_before_rq_complete',
'envoy.http.downstream_rq_rx_reset',
'envoy.http.downstream_rq_too_large',
'envoy.http.downstream_rq_total',
'envoy.http.downstream_rq_tx_reset',
'envoy.http.downstream_rq_ws_on_non_ws_route',
'envoy.http.no_cluster',
'envoy.http.no_route',
'envoy.http.rq_direct_response',
'envoy.http.rq_redirect',
'envoy.http.rq_total',
'envoy.http.rs_too_large',
'envoy.http.tracing.client_enabled',
'envoy.http.tracing.health_check',
'envoy.http.tracing.not_traceable',
'envoy.http.tracing.random_sampling',
'envoy.http.tracing.service_forced',
'envoy.listener.downstream_cx_active',
'envoy.listener.downstream_cx_destroy',
'envoy.listener.downstream_cx_total',
'envoy.listener.http.downstream_rq_1xx',
'envoy.listener.http.downstream_rq_2xx',
'envoy.listener.http.downstream_rq_3xx',
'envoy.listener.http.downstream_rq_4xx',
'envoy.listener.http.downstream_rq_5xx',
'envoy.listener_manager.listener_added',
'envoy.listener_manager.listener_create_failure',
'envoy.listener_manager.listener_create_success',
'envoy.listener_manager.listener_modified',
'envoy.listener_manager.listener_removed',
'envoy.listener_manager.total_listeners_active',
'envoy.listener_manager.total_listeners_draining',
'envoy.listener_manager.total_listeners_warming',
'envoy.runtime.load_error',
'envoy.runtime.load_success',
'envoy.runtime.num_keys',
'envoy.runtime.override_dir_exists',
'envoy.runtime.override_dir_not_exists',
'envoy.server.days_until_first_cert_expiring',
'envoy.server.live',
'envoy.server.memory_allocated',
'envoy.server.memory_heap_size',
'envoy.server.parent_connections',
'envoy.server.total_connections',
'envoy.server.uptime',
'envoy.server.version',

That would probably need some changes in setup files here: https://github.com/DataDog/integrations-core/tree/81792d0e48f8083fb288a411662ba4f1a39ba894/envoy/tests/docker/default
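
As a rough sketch of what such an e2e check amounts to, the snippet below compares a list of expected metric names against the names actually collected and reports any that are missing. The `missing_metrics` helper and the sample `collected` data are hypothetical; the real test suite has its own assertion helpers.

```python
def missing_metrics(expected, collected):
    """Return the expected metric names that were not collected, in order."""
    collected_set = set(collected)
    return [name for name in expected if name not in collected_set]


# A few names taken from the list above.
expected = [
    'envoy.cluster.update_attempt',
    'envoy.cluster.update_success',
    'envoy.cluster_manager.active_clusters',
]

# Hypothetical collection result from a test run.
collected = [
    'envoy.cluster.update_attempt',
    'envoy.cluster_manager.active_clusters',
    'envoy.server.uptime',
]

missing = missing_metrics(expected, collected)
```

An empty result would mean every expected metric was reported by the test environment.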

Contributor Author

@csssuf csssuf Oct 4, 2019

I've pushed another commit which adds this, but I'm seeing test failures with the newly-added metrics despite verifying that those metrics are properly parsed in the unit tests from my last commit, and verifying that the Envoy instance is now reporting those metrics with the added controlplane implementation. Do you know what might be causing that?

EDIT: Huh, looks like they passed just fine in CI. Guess it was a quirk on my machine 😄

@csssuf csssuf force-pushed the envoy-add-control_plane-metrics branch from 3b62c50 to dc92e48 Compare October 4, 2019 20:30
Contributor

@ofek ofek left a comment

@csssuf You did an absolutely excellent job here, thanks!!!

@ofek ofek changed the title Add xDS-related metrics to Envoy integration Add xDS-related metrics Oct 10, 2019
@ofek ofek merged commit 64266f9 into DataDog:master Oct 10, 2019
@csssuf csssuf deleted the envoy-add-control_plane-metrics branch October 11, 2019 01:27