Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document Antrea component metrics #726

Merged
merged 1 commit into from
Oct 7, 2020

Conversation

ksamoray
Copy link
Contributor

Add the list of metrics which are exposed by Antrea agent and controller
in the integration doc for reference.
The metric list and descriptions added were pulled out of the metrics help strings from the code.

@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-all: to trigger all tests.
  • /skip-all: to skip all tests.

These commands can only be run by members of the vmware-tanzu organization.

@ksamoray ksamoray linked an issue May 21, 2020 that may be closed by this pull request
@ksamoray
Copy link
Contributor Author

/test-all

@ksamoray ksamoray force-pushed the prometheus-metric-doc branch from 0212cba to 5d1fb09 Compare May 22, 2020 17:45
by the Antrea components.

Below is a list of metrics, provided by the components and by 3rd parties.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Kobi, These help strings are visible through API right? Are these not visible from Prometheus server/Grafana? In addition, some of them are standard metrics and not Antrea specific.
Overall question: Is it useful to add these in Antrea documentation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srikartati Hi,
You're correct: you could obtain that info via API and they're visible via Prometheus/Grafana.
However, from a user/admin perspective, there's no difference between Antrea and 3rd party metrics: they're all used for monitoring the Antrea solution. When a certain metric is required for monitoring nobody cares if it was provided by the Antrea codebase, Go, or something else. And it's much easier to look these up within a doc than browsing through listboxes in Prometheus.
Listing metrics within a document is very common, and as there were multiple questions regarding supported metrics, I think that it's a good idea to have such doc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ksamoray I see a couple of issues:

  1. Every time a metric is added. This has to be continuously extended. This document may blow up in content and size.
  2. I feel most of the metrics "help" field is almost similar to what name is offering and not offering anything beyond.

I think the questions on slack came up because of missing network policy documentation, which will be added in the near future.
IMO, a statement about looking for "help" field through API/listboxes in Prometheus is probably good enough. I will leave it you and others to decide.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @srikartati,
Yeah that is correct. Documentation has to be maintained - and maybe organized differently as here or here but that doesn't make it less needed.
About (2): that only implies that our "help" requires more attention :)
Can you imagine, when we will hopefully have a whole lot of metrics, that a user will have to scroll a listbox through the help text of hundreds of metrics, just to find how the CPU utilization metric is called? That is not useful at all IMO.
I think that the fact that every product in the market offers such doc speaks for itself...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the links. They may also have the additional purpose of showcasing the metrics they are offering :)
Do you think putting metrics in tables is a better representation?
I haven't tried this representation myself but the example instruction seems not so difficult: https://help.github.com/en/github/writing-on-github/organizing-information-with-tables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srikartati tried a table... tables look awkward in a markdown doc due to the 80 char limit.
Maybe if we ever move this info into the website :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird that multiple lines in the description column will look awkward. Table format would have been nice!
NIT: I looked at the rich diff. Bullet points can be removed as metrics are anyway in bold.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the metrics names are long, e.g antrea_controller_network_policy_sync_duration_milliseconds
which doesn't leave much room for description under 80 char limit...
I'll remove the bullets.

@ksamoray ksamoray force-pushed the prometheus-metric-doc branch from 5d1fb09 to 635657c Compare May 24, 2020 11:50
@ksamoray
Copy link
Contributor Author

Also removed a section from the doc which is redundant (ClusterRole has been removed from the yml)

@ksamoray ksamoray force-pushed the prometheus-metric-doc branch from 635657c to 8c5ce3a Compare May 28, 2020 11:47
@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-all: to trigger all tests.
  • /skip-all: to skip all tests.

These commands can only be run by members of the vmware-tanzu organization.

srikartati
srikartati previously approved these changes May 28, 2020
@ksamoray
Copy link
Contributor Author

/test-all

@ksamoray
Copy link
Contributor Author

/test-e2e

1 similar comment
@ksamoray
Copy link
Contributor Author

/test-e2e

@ksamoray
Copy link
Contributor Author

ksamoray commented Jun 1, 2020

/test-all

@ksamoray ksamoray force-pushed the prometheus-metric-doc branch from 4a7264e to 28327fc Compare July 2, 2020 11:24
@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-whole-conformance: to trigger all conformance tests on linux.
  • /skip-whole-conformance: to skip all conformance tests on linux.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-all: to trigger all tests (except whole conformance).
  • /skip-all: to skip all tests (except whole conformance).

These commands can only be run by members of the vmware-tanzu organization.

@ksamoray ksamoray force-pushed the prometheus-metric-doc branch from 28327fc to 83cb311 Compare July 2, 2020 11:29
@ksamoray
Copy link
Contributor Author

ksamoray commented Jul 2, 2020

@antoninbas @jianjuns @McCodeman
I have updated to the current metrics which are exposed by the agent and controller.
Is it worthwhile to automate the creation of this doc section so it'll be easier to refresh it whenever needed?

jianjuns
jianjuns previously approved these changes Jul 2, 2020
@jianjuns
Copy link
Contributor

jianjuns commented Jul 2, 2020

@antoninbas @jianjuns @McCodeman
I have updated to the current metrics which are exposed by the agent and controller.
Is it worthwhile to automate the creation of this doc section so it'll be easier to refresh it whenever needed?

Thanks @ksamoray! Yes it would be ideal to automate generate this doc. Do you already have an idea?

The value of the gauge is always set to 1.

## Common Metrics Provided by Infrastructure
### Apiserver Metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was chatting with @antoninbas on this PR. He has a good point it is even harder to maintain 3rd party metrics here. If we can not point to any links for these metrics, we should consider auto-generating the list.
At least we can mention here, the list can change as the 3rd party components versions change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jianjuns, this list has been mostly autogenerated :)
I can complete it with a few things and commit the generation script into the repo (not sure where this belong though)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hack/

There should also be a Github CI job to make sure that the list is up-to-date.

@antrea-bot
Copy link
Collaborator

Can one of the admins verify this patch?

@ksamoray ksamoray force-pushed the prometheus-metric-doc branch from 45589a8 to 0b67800 Compare August 4, 2020 10:33
@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-whole-conformance: to trigger all conformance tests on linux.
  • /skip-whole-conformance: to skip all conformance tests on linux.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-windows-networkpolicy: to trigger windows networkpolicy tests.
  • /skip-windows-networkpolicy: to skip windows networkpolicy tests.
  • /test-hw-offload: to trigger ovs hardware offload test.
  • /skip-hw-offload: to skip ovs hardware offload test.
  • /test-all: to trigger all tests (except whole conformance).
  • /skip-all: to skip all tests (except whole conformance).

These commands can only be run by members of the vmware-tanzu organization.

@antoninbas
Copy link
Contributor

@ksamoray could we revive this PR? We need a way to make sure that the doc is always up-to-date.

@ksamoray ksamoray force-pushed the prometheus-metric-doc branch from 0b67800 to 5205d86 Compare September 26, 2020 15:34
@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-whole-conformance: to trigger all conformance tests on linux.
  • /skip-whole-conformance: to skip all conformance tests on linux.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-windows-networkpolicy: to trigger windows networkpolicy tests.
  • /skip-windows-networkpolicy: to skip windows networkpolicy tests.
  • /test-hw-offload: to trigger ovs hardware offload test.
  • /skip-hw-offload: to skip ovs hardware offload test.
  • /test-all: to trigger all tests (except whole conformance).
  • /skip-all: to skip all tests (except whole conformance).

@ksamoray
Copy link
Contributor Author

@ksamoray could we revive this PR? We need a way to make sure that the doc is always up-to-date.

@antoninbas I've rebased the patch. I do need to add some test to validate that metrics are still valid.
I'm not sure how I could implement this as the test will need access to the document: in the make script I worked around this by adding it as a parameter, and I'm not sure that I like it.
Maybe you have a suggestion?

@codecov-commenter
Copy link

codecov-commenter commented Sep 26, 2020

Codecov Report

Merging #726 into master will decrease coverage by 0.15%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #726      +/-   ##
==========================================
- Coverage   64.63%   64.48%   -0.16%     
==========================================
  Files         157      159       +2     
  Lines       12626    12657      +31     
==========================================
+ Hits         8161     8162       +1     
- Misses       3620     3644      +24     
- Partials      845      851       +6     
Flag Coverage Δ
#integration-tests 44.83% <ø> (ø)
#kind-e2e-tests 50.55% <ø> (-0.26%) ⬇️
#unit-tests 42.14% <ø> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/controller/networkpolicy/tier.go 90.00% <0.00%> (-10.00%) ⬇️
pkg/apiserver/certificate/certificate.go 71.05% <0.00%> (-6.58%) ⬇️
pkg/monitor/controller.go 48.52% <0.00%> (-2.95%) ⬇️
pkg/antctl/command_list.go 31.48% <0.00%> (-1.22%) ⬇️
pkg/antctl/command_definition.go 41.96% <0.00%> (-0.44%) ⬇️
pkg/antctl/antctl.go 100.00% <0.00%> (ø)
pkg/apiserver/handlers/loglevel/handler.go 38.46% <0.00%> (ø)
pkg/log/log_level.go 10.00% <0.00%> (ø)
pkg/apiserver/apiserver.go 84.33% <0.00%> (+0.19%) ⬆️
pkg/agent/apiserver/apiserver.go 66.12% <0.00%> (+0.55%) ⬆️
... and 1 more

@ksamoray ksamoray force-pushed the prometheus-metric-doc branch from 5205d86 to acba009 Compare September 27, 2020 12:55
@ksamoray
Copy link
Contributor Author

OK added some validation for the doc content, LMK what you think.

@ksamoray ksamoray force-pushed the prometheus-metric-doc branch from acba009 to bc67564 Compare September 27, 2020 13:19
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding your earlier question, I personally would recommend a "self-contained" script that takes care of creating a Kind cluster (using kind-setup.sh), deploys Antrea with metrics enabled and gets the list of metrics. I like it better than integrating the docs verification in the e2e tests and using an environment variable to access the file.

I understand that actually running Antrea in a cluster is pretty much the only way to determine the list of metrics. I hope that moving forward we won't run into issues where we have Antrea metrics which are platform specific.

Comment on lines 121 to 123
Antrea Controller and Agents expose various metrics, some are provided by the
Antrea components and others are provided by 3rd party components which are used
by the Antrea components.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Antrea Controller and Agents expose various metrics, some are provided by the
Antrea components and others are provided by 3rd party components which are used
by the Antrea components.
Antrea Controller and Agents expose various metrics, some of which are provided by the
Antrea components and others which are provided by 3rd party components used
by the Antrea components.

@@ -0,0 +1,86 @@
#!/bin/bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits:

  • please consider this shebang instead: #!/usr/bin/env bash
  • other scripts in the hack folder tend to use dashes in their name instead of underscores

Comment on lines 5 to 7
pod_yaml=$(kubectl get pod -n kube-system $pod_name -o yaml)
host_ip=$(awk '/[[:space:]]hostIP:/{print $2}' <<< "$pod_yaml")
host_port=$(awk '/[[:space:]]hostPort:/{print $2}' <<< "$pod_yaml")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubectl supports:

kubectl get pod -n kube-system $pod_name --template={{.status.hostIP}}

which is nicer than using awk from a readability / maintainability perspective IMO. Same for hostPort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will use -o jsonpath. Would like to use this for the token/cert retrieval, and --template doesn't handle ca.crt very well (dot in the key).
So I'll use the same flag for all kubectl calls.

@ksamoray ksamoray force-pushed the prometheus-metric-doc branch 5 times, most recently from fd55e3a to 25977de Compare October 1, 2020 16:34
@ksamoray ksamoray requested a review from antoninbas October 1, 2020 17:16
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some pretty minor comments
approach looks good to me now, I am pleased to see it only takes 3 minutes to update the doc!

# See the License for the specific language governing permissions and
# limitations under the License.

# The script runs kind e2e tests with different traffic encapsulation modes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment inaccurate

Comment on lines 32 to 33
MAKE_CMD=$(dirname $0)"/../../hack/make-metrics-doc.sh"
METRICS_DOC=$(dirname $0)"/../../docs/prometheus-integration.md"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend THIS_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )" instead of dirname $0, which IIRC does not work well in some scenarios.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. $(dirname $0) is used in other places as well, maybe worths changing as well.

cmp -s $METRICS_DOC $METRICS_TMP_DOC
result=$?
if [ $result -ne 0 ]; then
echo "Error: Prometheus metrics document should be updated"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message will be displayed by CI in case of failure and I think we should make it more useful for users. For example:

    echo "Error: Prometheus metrics document should be updated"
    echo "You can update it by building the Antrea Docker image locally (with `make`), running ./hack/make-metrics-doc.sh and committing the changes"

@@ -0,0 +1,45 @@
#!/usr/bin/env bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "validate-metrics-doc.sh" for the script name?

@@ -165,3 +165,34 @@ jobs:
- name: Run test
run: |
./ci/kind/test-upgrade-antrea.sh --from-version-n-minus 2 --controller-only

test-prometheus-metrics-doc:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "validate-prometheus-metrics-doc"?

hack/make-metrics-doc.sh Show resolved Hide resolved
@@ -116,3 +116,165 @@ The configuration file above can be used to deploy Prometheus Server with
scraping configuration for Antrea services.
To deploy this configuration use
`kubectl apply -f build/yamls/antrea-prometheus.yml`

# Antrea Prometheus Metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is the ideal place, but the doc should mention how devs who add metrics can re-generate the list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc is very user-oriented, while doc generation is targeted to developers.
As we have a bunch of other generatables, maybe it's worthwhile to create a dev-oriented code enlisting all of these (e.g yamls via make manifest, testing mocks etc). I can create an issue for this and then handle it separately.
BTW do we need a make target for the metrics generation as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have a doc that may work already: https://github.com/vmware-tanzu/antrea/blob/master/docs/code-generation.md, although the title may need to be updated

I don't think a make target is needed right now, but we can add one in the future

.github/workflows/kind_upgrade.yml Outdated Show resolved Hide resolved
docs/prometheus-integration.md Show resolved Hide resolved
@ksamoray ksamoray force-pushed the prometheus-metric-doc branch 3 times, most recently from b8cbc85 to 76bd3b9 Compare October 4, 2020 07:34
antoninbas
antoninbas previously approved these changes Oct 5, 2020
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

# See the License for the specific language governing permissions and
# limitations under the License.

# The script validated the Prometheus metrics list within docs/prometheus-integration.md againt a Kind cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/validated/validates

@@ -116,3 +116,165 @@ The configuration file above can be used to deploy Prometheus Server with
scraping configuration for Antrea services.
To deploy this configuration use
`kubectl apply -f build/yamls/antrea-prometheus.yml`

# Antrea Prometheus Metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have a doc that may work already: https://github.com/vmware-tanzu/antrea/blob/master/docs/code-generation.md, although the title may need to be updated

I don't think a make target is needed right now, but we can add one in the future

@ksamoray ksamoray force-pushed the prometheus-metric-doc branch 2 times, most recently from 4d68225 to 5da161c Compare October 6, 2020 10:18
Add the list of metrics which are exposed by Antrea agent and controller
in the integration doc for reference.
@ksamoray ksamoray force-pushed the prometheus-metric-doc branch from 5da161c to da52dd8 Compare October 6, 2020 13:25
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas
Copy link
Contributor

/skip-all

@ksamoray ksamoray merged commit fe6e7f6 into antrea-io:master Oct 7, 2020
@ksamoray ksamoray deleted the prometheus-metric-doc branch November 16, 2020 14:42
Copy link

@ckfwer ckfwer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MMH GmbH

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Document Antrea Prometheus metrics
8 participants