-
Notifications
You must be signed in to change notification settings - Fork 386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document Antrea component metrics #726
Conversation
Thanks for your PR. The following commands are available:
These commands can only be run by members of the vmware-tanzu organization. |
/test-all |
0212cba
to
5d1fb09
Compare
by the Antrea components. | ||
|
||
Below is a list of metrics, provided by the components and by 3rd parties. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Kobi, These help strings are visible through API right? Are these not visible from Prometheus server/Grafana? In addition, some of them are standard metrics and not Antrea specific.
Overall question: Is it useful to add these in Antrea documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srikartati Hi,
You're correct: you could obtain that info via API and they're visible via Prometheus/Grafana.
However, from a user/admin perspective, there's no difference between Antrea and 3rd party metrics: they're all used for monitoring the Antrea solution. When a certain metric is required for monitoring nobody cares if it was provided by the Antrea codebase, Go, or something else. And it's much easier to look these up within a doc than browsing through listboxes in Prometheus.
Listing metrics within a document is very common, and as there were multiple questions regarding supported metrics, I think that it's a good idea to have such doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ksamoray I see a couple of issues:
- Every time a metric is added. This has to be continuously extended. This document may blow up in content and size.
- I feel most of the metrics "help" field is almost similar to what name is offering and not offering anything beyond.
I think the questions on slack came up because of missing network policy documentation, which will be added in the near future.
IMO, a statement about looking for "help" field through API/listboxes in Prometheus is probably good enough. I will leave it you and others to decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @srikartati,
Yeah that is correct. Documentation has to be maintained - and maybe organized differently as here or here but that doesn't make it less needed.
About (2): that only implies that our "help" requires more attention :)
Can you imagine, when we will hopefully have a whole lot of metrics, that a user will have to scroll a listbox through the help text of hundreds of metrics, just to find how the CPU utilization metric is called? That is not useful at all IMO.
I think that the fact that every product in the market offers such doc speaks for itself...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the links. They may also have the additional purpose of showcasing the metrics they are offering :)
Do you think putting metrics in tables is a better representation?
I haven't tried this representation myself but the example instruction seems not so difficult: https://help.github.com/en/github/writing-on-github/organizing-information-with-tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srikartati tried a table... tables look awkward in a markdown doc due to the 80 char limit.
Maybe if we ever move this info into the website :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird that multiple lines in the description column will look awkward. Table format would have been nice!
NIT: I looked at the rich diff. Bullet points can be removed as metrics are anyway in bold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the metrics names are long, e.g antrea_controller_network_policy_sync_duration_milliseconds
which doesn't leave much room for description under 80 char limit...
I'll remove the bullets.
5d1fb09
to
635657c
Compare
Also removed a section from the doc which is redundant (ClusterRole has been removed from the yml) |
635657c
to
8c5ce3a
Compare
Thanks for your PR. The following commands are available:
These commands can only be run by members of the vmware-tanzu organization. |
/test-all |
/test-e2e |
1 similar comment
/test-e2e |
8c5ce3a
to
4a7264e
Compare
/test-all |
4a7264e
to
28327fc
Compare
Thanks for your PR. The following commands are available:
These commands can only be run by members of the vmware-tanzu organization. |
28327fc
to
83cb311
Compare
@antoninbas @jianjuns @McCodeman |
Thanks @ksamoray! Yes it would be ideal to automate generate this doc. Do you already have an idea? |
docs/prometheus-integration.md
Outdated
The value of the gauge is always set to 1. | ||
|
||
## Common Metrics Provided by Infrastructure | ||
### Apiserver Metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was chatting with @antoninbas on this PR. He has a good point it is even harder to maintain 3rd party metrics here. If we can not point to any links for these metrics, we should consider auto-generating the list.
At least we can mention here, the list can change as the 3rd party components versions change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jianjuns, this list has been mostly autogenerated :)
I can complete it with a few things and commit the generation script into the repo (not sure where this belong though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hack/
There should also be a Github CI job to make sure that the list is up-to-date.
83cb311
to
45589a8
Compare
Can one of the admins verify this patch? |
45589a8
to
0b67800
Compare
Thanks for your PR. The following commands are available:
These commands can only be run by members of the vmware-tanzu organization. |
@ksamoray could we revive this PR? We need a way to make sure that the doc is always up-to-date. |
0b67800
to
5205d86
Compare
Thanks for your PR. The following commands are available:
|
@antoninbas I've rebased the patch. I do need to add some test to validate that metrics are still valid. |
Codecov Report
@@ Coverage Diff @@
## master #726 +/- ##
==========================================
- Coverage 64.63% 64.48% -0.16%
==========================================
Files 157 159 +2
Lines 12626 12657 +31
==========================================
+ Hits 8161 8162 +1
- Misses 3620 3644 +24
- Partials 845 851 +6
Flags with carried forward coverage won't be shown. Click here to find out more.
|
5205d86
to
acba009
Compare
OK added some validation for the doc content, LMK what you think. |
acba009
to
bc67564
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding your earlier question, I personally would recommend a "self-contained" script that takes care of creating a Kind cluster (using kind-setup.sh
), deploys Antrea with metrics enabled and gets the list of metrics. I like it better than integrating the docs verification in the e2e tests and using an environment variable to access the file.
I understand that actually running Antrea in a cluster is pretty much the only way to determine the list of metrics. I hope that moving forward we won't run into issues where we have Antrea metrics which are platform specific.
docs/prometheus-integration.md
Outdated
Antrea Controller and Agents expose various metrics, some are provided by the | ||
Antrea components and others are provided by 3rd party components which are used | ||
by the Antrea components. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Antrea Controller and Agents expose various metrics, some are provided by the | |
Antrea components and others are provided by 3rd party components which are used | |
by the Antrea components. | |
Antrea Controller and Agents expose various metrics, some of which are provided by the | |
Antrea components and others which are provided by 3rd party components used | |
by the Antrea components. |
hack/make_metricsdoc.sh
Outdated
@@ -0,0 +1,86 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nits:
- please consider this shebang instead:
#!/usr/bin/env bash
- other scripts in the hack folder tend to use dashes in their name instead of underscores
hack/make_metricsdoc.sh
Outdated
pod_yaml=$(kubectl get pod -n kube-system $pod_name -o yaml) | ||
host_ip=$(awk '/[[:space:]]hostIP:/{print $2}' <<< "$pod_yaml") | ||
host_port=$(awk '/[[:space:]]hostPort:/{print $2}' <<< "$pod_yaml") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kubectl supports:
kubectl get pod -n kube-system $pod_name --template={{.status.hostIP}}
which is nicer than using awk
from a readability / maintainability perspective IMO. Same for hostPort
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will use -o jsonpath. Would like to use this for the token/cert retrieval, and --template doesn't handle ca.crt very well (dot in the key).
So I'll use the same flag for all kubectl calls.
fd55e3a
to
25977de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some pretty minor comments
approach looks good to me now, I am pleased to see it only takes 3 minutes to update the doc!
ci/kind/test-metrics-doc-kind.sh
Outdated
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# The script runs kind e2e tests with different traffic encapsulation modes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment inaccurate
ci/kind/test-metrics-doc-kind.sh
Outdated
MAKE_CMD=$(dirname $0)"/../../hack/make-metrics-doc.sh" | ||
METRICS_DOC=$(dirname $0)"/../../docs/prometheus-integration.md" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend THIS_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
instead of dirname $0
, which IIRC does not work well in some scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. $(dirname $0) is used in other places as well, maybe worths changing as well.
ci/kind/test-metrics-doc-kind.sh
Outdated
cmp -s $METRICS_DOC $METRICS_TMP_DOC | ||
result=$? | ||
if [ $result -ne 0 ]; then | ||
echo "Error: Prometheus metrics document should be updated" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message will be displayed by CI in case of failure and I think we should make it more useful for users. For example:
echo "Error: Prometheus metrics document should be updated"
echo "You can update it by building the Antrea Docker image locally (with `make`), running ./hack/make-metrics-doc.sh and committing the changes"
ci/kind/test-metrics-doc-kind.sh
Outdated
@@ -0,0 +1,45 @@ | |||
#!/usr/bin/env bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe "validate-metrics-doc.sh" for the script name?
.github/workflows/kind_upgrade.yml
Outdated
@@ -165,3 +165,34 @@ jobs: | |||
- name: Run test | |||
run: | | |||
./ci/kind/test-upgrade-antrea.sh --from-version-n-minus 2 --controller-only | |||
|
|||
test-prometheus-metrics-doc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe "validate-prometheus-metrics-doc"?
@@ -116,3 +116,165 @@ The configuration file above can be used to deploy Prometheus Server with | |||
scraping configuration for Antrea services. | |||
To deploy this configuration use | |||
`kubectl apply -f build/yamls/antrea-prometheus.yml` | |||
|
|||
# Antrea Prometheus Metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this is the ideal place, but the doc should mention how devs who add metrics can re-generate the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc is very user-oriented, while doc generation is targeted to developers.
As we have a bunch of other generatables, maybe it's worthwhile to create a dev-oriented code enlisting all of these (e.g yamls via make manifest, testing mocks etc). I can create an issue for this and then handle it separately.
BTW do we need a make target for the metrics generation as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have a doc that may work already: https://github.com/vmware-tanzu/antrea/blob/master/docs/code-generation.md, although the title may need to be updated
I don't think a make target is needed right now, but we can add one in the future
b8cbc85
to
76bd3b9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ci/kind/validate-metrics-doc.sh
Outdated
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# The script validated the Prometheus metrics list within docs/prometheus-integration.md againt a Kind cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/validated/validates
@@ -116,3 +116,165 @@ The configuration file above can be used to deploy Prometheus Server with | |||
scraping configuration for Antrea services. | |||
To deploy this configuration use | |||
`kubectl apply -f build/yamls/antrea-prometheus.yml` | |||
|
|||
# Antrea Prometheus Metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have a doc that may work already: https://github.com/vmware-tanzu/antrea/blob/master/docs/code-generation.md, although the title may need to be updated
I don't think a make target is needed right now, but we can add one in the future
4d68225
to
5da161c
Compare
Add the list of metrics which are exposed by Antrea agent and controller in the integration doc for reference.
5da161c
to
da52dd8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/skip-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MMH GmbH
Add the list of metrics which are exposed by Antrea agent and controller
in the integration doc for reference.
The metric list and descriptions added were pulled out of the metrics help strings from the code.