"wake up" internal prometheus scrapper metrics (up / scrape_xxxx) #3116

Merged: 22 commits, Jun 21, 2021


@gillg gillg commented May 6, 2021

Description:
This targets the same problem as #2918, but with a completely different approach.
It solves at least #3089, as well as other related bugs concerning the metrics that the Prometheus scraper generates automatically: https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series

Link to tracking Issue:
#3089

Resolves:
open-telemetry/prometheus-interoperability-spec#8
Maybe open-telemetry/prometheus-interoperability-spec#41 ?

Testing:
No new tests, because these metrics are handled the same way as all other metrics.
The change only "wakes up" dormant metrics that have no metadata.

Documentation:
Nothing more. The new metrics up and scrape_xxxx will be available internally, and therefore on the exporter side.

@gillg gillg requested a review from a team May 6, 2021 07:26
@linux-foundation-easycla

linux-foundation-easycla bot commented May 6, 2021

CLA Signed

The committers are authorized under a signed CLA.

@gillg
Contributor Author

gillg commented May 6, 2021

@odeke-em I was not able to contribute to your PR, and because the approach is very different I preferred to create a new one to discuss it with the maintainers.

@gillg
Contributor Author

gillg commented May 6, 2021

Metrics on the exporter side after my second commit:

# HELP scrape_duration_seconds Duration of the scrape
# TYPE scrape_duration_seconds gauge
scrape_duration_seconds{otel_job="grafana"} 0.022050369 1620294333695
scrape_duration_seconds{otel_job="otel-collector"} 0.001391824 1620294335369
scrape_duration_seconds{otel_job="thanos-compactor"} 0.000312828 1620294327957
# HELP scrape_samples_post_metric_relabeling The number of samples remaining after metric relabeling was applied
# TYPE scrape_samples_post_metric_relabeling gauge
scrape_samples_post_metric_relabeling{otel_job="grafana"} 414 1620294333695
scrape_samples_post_metric_relabeling{otel_job="otel-collector"} 40 1620294335369
scrape_samples_post_metric_relabeling{otel_job="thanos-compactor"} 0 1620294327957
# HELP scrape_samples_scraped The number of samples the target exposed
# TYPE scrape_samples_scraped gauge
scrape_samples_scraped{otel_job="grafana"} 414 1620294333695
scrape_samples_scraped{otel_job="otel-collector"} 40 1620294335369
scrape_samples_scraped{otel_job="thanos-compactor"} 0 1620294327957
# HELP scrape_series_added The approximate number of new series in this scrape
# TYPE scrape_series_added gauge
scrape_series_added{otel_job="grafana"} 414 1620294333695
scrape_series_added{otel_job="otel-collector"} 40 1620294335369
scrape_series_added{otel_job="thanos-compactor"} 0 1620294327957
# HELP up The scraping was sucessful
# TYPE up gauge
up{otel_job="grafana"} 1 1620294333695
up{otel_job="otel-collector"} 1 1620294335369
up{otel_job="thanos-compactor"} 0 1620294327957

Scrape config used:

      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 5s
          static_configs:
            - targets: ['otel-collector:8888']
          relabel_configs:
          # Workaround: the otel collector does not expose the job label; this also avoids "honor_labels" on the Prometheus side
          - action: replace
            replacement: otel-collector
            target_label: otel_job
        - job_name: thanos-compactor
          static_configs:
            - targets: ['172.17.0.1:10942']
          relabel_configs:
          # Workaround: the otel collector does not expose the job label; this also avoids "honor_labels" on the Prometheus side
          - action: replace
            replacement: thanos-compactor
            target_label: otel_job
        - job_name: grafana
          static_configs:
            - targets: ['172.17.0.1:3000']
          relabel_configs:
          # Workaround: the otel collector does not expose the job label; this also avoids "honor_labels" on the Prometheus side
          - action: replace
            replacement: grafana
            target_label: otel_job

Contributor

@dashpole dashpole left a comment

This is the approach I think we should take. Just clean up some of the extra debug code you added.

@gillg
Contributor Author

gillg commented May 6, 2021

OK, in fact there is no "useless" code, but I would like to uncomment my commented-out lines to enable a logger at these points.
These logs could be very useful for understanding what happens. Maybe "Debug" could become "Trace".

I would welcome any help instantiating a logger.

@gillg
Contributor Author

gillg commented May 6, 2021

Fixing the tests has become a nightmare... I need help with receiver/prometheusreceiver/metrics_receiver_test.go! 🆘 🙏 😅

@rakyll
Contributor

rakyll commented May 6, 2021

There are too many tests to fix there, @gillg. Been there, done that. Let's help you if this is the way to go.

Contributor Author

@gillg gillg left a comment

There are too many tests to fix there, @gillg. Been there, done that. Let's help you if this is the way to go.

Thanks @rakyll!
It works perfectly for now. I fixed a Prometheus exporter test and some other internals, but I'm a bit confused by the current test logic. It's complicated because the newly introduced metrics should not be visible in the fake metrics documents, yet they count toward the internal metrics. So I have the feeling that we should slightly change the doCompare method, but I'm not sure.

I also need a little help to instantiate a logger and uncomment my log lines.

receiver/prometheusreceiver/internal/metricsbuilder.go (outdated)
@@ -133,6 +132,42 @@ func (b *metricBuilder) AddDataPoint(ls labels.Labels, t int64, v float64) error
return b.currentMf.Add(metricName, ls, t, v)
}

func (b *metricBuilder) defineInternalMetric(metricName string) {
metadata, ok := b.mc.Metadata(metricName)
Contributor Author

In fact, internal metrics have empty metadata, but they are recorded correctly.
A simple solution is to provide the metadata manually. Even if we uncomment return nil above, they will be dropped later because they match the "unspecified" type here: https://github.com/open-telemetry/opentelemetry-collector/blob/main/receiver/prometheusreceiver/internal/metricsbuilder.go#L244

I got this approach working perfectly when I rewrote the metadata in the newMetricFamily constructor, but here I am trying to fix all internal metrics before sending them to the metric family.
Here, probably because the metadata object is modified by value rather than by reference, the metadata is always empty when it reaches the metric family.
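
The idea of supplying metadata manually for scraper-generated series can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: the `metadata`, `metricType`, and `metadataCache` types and the `lookup` function are hypothetical stand-ins for the Prometheus and receiver types, and the help strings are modeled on the exporter output shown earlier.

```go
package main

import "fmt"

// metricType and metadata are hypothetical stand-ins for the Prometheus
// metadata types used by the receiver.
type metricType string

const (
	typeGauge   metricType = "gauge"
	typeUnknown metricType = "unknown"
)

type metadata struct {
	Metric string
	Type   metricType
	Help   string
}

// metadataCache models a lookup that only knows metrics exposed by the target.
type metadataCache map[string]metadata

// internalHelp lists the series generated by the scraper itself.
var internalHelp = map[string]string{
	"up":                                    "The scraping was successful",
	"scrape_duration_seconds":               "Duration of the scrape",
	"scrape_samples_scraped":                "The number of samples the target exposed",
	"scrape_samples_post_metric_relabeling": "The number of samples remaining after metric relabeling was applied",
	"scrape_series_added":                   "The approximate number of new series in this scrape",
}

func isInternalMetric(name string) bool {
	_, ok := internalHelp[name]
	return ok
}

// lookup returns cached metadata when present. For internal metrics missing
// from the cache it synthesizes gauge metadata instead of leaving the type empty.
func lookup(cache metadataCache, name string) metadata {
	if md, ok := cache[name]; ok {
		return md
	}
	if isInternalMetric(name) {
		return metadata{Metric: name, Type: typeGauge, Help: internalHelp[name]}
	}
	return metadata{Metric: name, Type: typeUnknown}
}

func main() {
	cache := metadataCache{} // the target exposes no metadata for "up"
	fmt.Println(lookup(cache, "up").Type)
	fmt.Println(lookup(cache, "some_other_metric").Type)
}
```

Returning the metadata by value from a single lookup function avoids the situation described above, where a modified copy never reaches the metric family.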

} else if !ok && isInternalMetric(metricName) {
metadata = defineInternalMetric(metricName, metadata)
}
//TODO convert it to OtelMetrics ?
Contributor Author

What should we do here?

receiver/prometheusreceiver/internal/metricfamily.go (outdated)
receiver/prometheusreceiver/internal/metricfamily.go (outdated)
receiver/prometheusreceiver/metrics_receiver_test.go (outdated)
`test_scrape_series_added 13`,
`. HELP test_up The scraping was sucessful`,
`. TYPE test_up gauge`,
`test_up 1`,
Contributor Author

Implicitly, the Prometheus exporter exposes the new metrics by default (which makes sense; otherwise a Prometheus server on top of it cannot know whether a "sub-job" has failed).


switch metricName {
case scrapeUpMetricName:
metadata.Unit = "bool"
Contributor Author

Is there a convention for this?

Contributor

Units appear to be specific to OpenMetrics, rather than prometheus text format. See prometheus/prometheus/pkg/textparse/promparse.go#L195. For OpenMetrics, here is the documentation for units: OpenObservability/OpenMetrics/specification/OpenMetrics.md#units-and-base-units. The units we should stick to include (from the OpenMetrics link): seconds, bytes, joules, grams, meters, ratios, volts, amperes, and celsius. The "up" metric probably shouldn't have units in that case.

metadata.Type = textparse.MetricTypeGauge
metadata.Help = "The approximate number of new series in this scrape"
case "scrape_samples_post_metric_relabeling":
metadata.Unit = "count"
Contributor Author

Is there a convention for this?

Contributor

no unit here. See the above comment.

metadata.Type = textparse.MetricTypeGauge
metadata.Help = "The scraping was sucessful"
case "scrape_duration_seconds":
metadata.Unit = "seconds"
Contributor Author

Is there a convention for this? I saw some test scenarios use s instead of the full word seconds.

Contributor

See the above. We should use "seconds" here.
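
The unit guidance in this review thread (use "seconds" for the duration metric, no unit for up or the sample counts) can be illustrated with a small sketch. The suffix-matching heuristic and the `unitFor` helper are assumptions made for illustration only; they are not how the receiver assigns units. The unit list is the one quoted in the review comment above.

```go
package main

import (
	"fmt"
	"strings"
)

// openMetricsUnits holds the units quoted in the review comment above.
var openMetricsUnits = []string{
	"seconds", "bytes", "joules", "grams", "meters",
	"ratios", "volts", "amperes", "celsius",
}

// unitFor is a hypothetical helper: it returns the unit named by the metric's
// suffix, or an empty string when the name carries no recognized unit.
func unitFor(metricName string) string {
	for _, u := range openMetricsUnits {
		if strings.HasSuffix(metricName, "_"+u) {
			return u
		}
	}
	return ""
}

func main() {
	for _, name := range []string{
		"up",
		"scrape_duration_seconds",
		"scrape_samples_scraped",
		"scrape_samples_post_metric_relabeling",
		"scrape_series_added",
	} {
		fmt.Printf("%s unit=%q\n", name, unitFor(name))
	}
}
```

Under this sketch only scrape_duration_seconds receives a unit; "bool" and "count" never appear because they are not in the quoted list.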

@rakyll
Contributor

rakyll commented May 11, 2021

cc @odeke-em

@gillg gillg changed the title Try to make internal prometheus scrapper metrics working "wake up" internal prometheus scrapper metrics (up / scrape_xxxx) May 17, 2021
@dashpole
Contributor

Discussed at the prometheus wg meeting today. One request is to keep this PR narrowly tailored to the issue at hand. Can we only add the "up" metric for now, and defer on the other ones? We weren't able to reach consensus as to whether the other metrics should be added during the meeting.

Until open-telemetry/prometheus-interoperability-spec#52 is resolved, we should only add the 'up' metric.

@gillg
Contributor Author

gillg commented May 19, 2021

Discussed at the prometheus wg meeting today. One request is to keep this PR narrowly tailored to the issue at hand. Can we only add the "up" metric for now, and defer on the other ones? We weren't able to reach consensus as to whether the other metrics should be added during the meeting.

Until open-telemetry/wg-prometheus#52 is resolved, we should only add the 'up' metric.

Thank you for this information! I don't understand why we would not implement all the native metrics, but yes, it is really simple to fix only "up". We just need to remove the unwanted cases here https://github.com/open-telemetry/opentelemetry-collector/pull/3116/files#diff-a71211e5426c3d12d9c3c0b5991e4b284568b76310438f884a37ce20655327f4R88 and it will work as before: those metrics will have no type and will be rejected later in the chain.

Today the main problem is fixing the test cases, and I need help instantiating a logger where it is useful (I committed commented-out lines with a fake logger).
Last question: is metadata.Unit = "bool" OK as an OTEL unit for "up"?
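
The "no type, rejected later in the chain" behavior mentioned above can be sketched like this. It is a simplified illustration under stated assumptions: the `sample` type and the `keepTyped` function are hypothetical, and an empty `Type` stands in for missing metadata.

```go
package main

import "fmt"

// sample is a hypothetical, simplified scraped series.
// An empty Type models a metric for which no metadata was defined.
type sample struct {
	Name string
	Type string
}

// keepTyped drops every sample without a type, which is the effect described
// above for internal metrics whose case has been removed.
func keepTyped(in []sample) []sample {
	out := make([]sample, 0, len(in))
	for _, s := range in {
		if s.Type == "" {
			continue // no metadata: rejected
		}
		out = append(out, s)
	}
	return out
}

func main() {
	scraped := []sample{
		{Name: "http_requests_total", Type: "counter"},
		{Name: "up", Type: "gauge"},       // metadata supplied manually
		{Name: "scrape_duration_seconds"}, // case removed, so no type
	}
	for _, s := range keepTyped(scraped) {
		fmt.Println(s.Name)
	}
}
```

With only the "up" case kept, the scrape_xxxx series fall through without a type and disappear from the output, which matches the behavior before this PR.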

@bogdandrutu
Member

@rakyll @Aneurysm9 @dashpole as Prometheus experts please review and help this person.

@gillg
Contributor Author

gillg commented May 19, 2021

@rakyll @Aneurysm9 @dashpole as Prometheus experts please review and help this person.

Thank you @bogdandrutu (and the whole team 😅). Tell me if you need access to my fork, or just guide me in the comments. I can do it myself, but I don't know a good approach for injecting the logger cleanly.

About the metrics: should I remove the cases other than "up" and prepare a new PR in parallel for the "scrape_xxx" metrics?

@alolita alolita added the duplicate This issue or pull request already exists label May 19, 2021
@Aneurysm9
Member

I also think this is the correct approach to take. It fully resolves the compliance test issues related to the missing up metric and appears to be a simple solution, even if the existing tests make it somewhat painful.

I managed to successfully wrangle these tests when fixing the receiver's start time adjustment logic and expect it will require similar changes (hence the current conflict with the main branch since that PR was merged). I've blocked off some time tomorrow to try to work on making them less awful.

As for whether it makes sense to only handle up in this PR, I think the work is already done to handle all of these internal metrics so we might as well carry on with that. No point adding more work down the road when we'd undoubtedly still have to do something to deal with testing those new metrics.

@dashpole
Contributor

As for whether it makes sense to only handle up in this PR, I think the work is already done to handle all of these internal metrics so we might as well carry on with that. No point adding more work down the road when we'd undoubtedly still have to do something to deal with testing those new metrics.

I'm also OK with that. I figured it would be easier to have to only wrangle the tests for one metric, but if it isn't much more effort, then we can do them all and just be done.

@Aneurysm9
Member

I've hammered the receiver's e2e tests into a much more manageable shape. I was going to make a PR to the source branch of this PR, but that repo doesn't seem to be accepting PRs. For now the changes can be viewed at https://github.com/Aneurysm9/opentelemetry-collector/tree/feat/add-internal-prom-metrics. @gillg can you pull in those changes to this branch so we can get this PR unstuck?

@gillg gillg requested a review from alolita as a code owner May 27, 2021 21:58
@gillg
Contributor Author

gillg commented May 27, 2021

Thanks a lot @Aneurysm9! Good job on the tests, I was losing my hair just reading them 😅

So the last remaining items:

  • Settle the units question for the up metric ("bool"?) and scrape_xxx ("seconds" vs "s", and "count")
  • Add a logger usable in metricfamily to add some debugging logs for the future
  • Answer the //TODO convert it to OtelMetrics ? on line 68

After that we are probably OK to merge.

@dashpole
Contributor

I was able to rebase this, and get tests working: main...dashpole:internal_metrics_second. The additional changes I had to make are in the last commit, but the rebase was the harder part...

@gillg gillg force-pushed the feat/add-internal-prom-metrics branch from 8481a50 to e2c4d6d Compare June 15, 2021 16:33
Contributor Author

@gillg gillg left a comment

I was able to rebase this, and get tests working: main...dashpole:internal_metrics_second. The additional changes I had to make are in the last commit, but the rebase was the harder part...

Good job on the rebase! I updated my branch. I will check whether anything else needs to be changed.

@gillg
Contributor Author

gillg commented Jun 15, 2021

OK, without further changes it seems to work as before.
What introduced all the new Prometheus / OTel conversion code? I thought it would have an impact.

@dashpole
Contributor

They are introducing the change slowly over a few PRs. The ones they added aren't used yet. They may need to make some changes when they rebase on this change (assuming it merges).

@Aneurysm9
Member

It appears that the test issues have been resolved and the current failure is a (potentially flaky) load test. Can @open-telemetry/collector-maintainers confirm this and land this PR?

@gillg
Contributor Author

gillg commented Jun 15, 2021

Is data being dropped due to high memory usage common with load tests?
I have run into this kind of thing pretty often with my custom OTEL build. See https://github.com/open-telemetry/opentelemetry-collector/issues/3250
I think it's not related, but just in case.

@alolita alolita added the ready-to-merge Code review completed; ready to merge by maintainers label Jun 17, 2021
@alolita
Member

alolita commented Jun 17, 2021

Thanks @gillg. All tests are finally passing. 🎉

@bogdandrutu can you please merge?

@alolita alolita added release:required-for-ga Must be resolved before GA release and removed duplicate This issue or pull request already exists waiting-for-author labels Jun 17, 2021
@bogdandrutu bogdandrutu merged commit 329285d into open-telemetry:main Jun 21, 2021
odeke-em added a commit to orijtech/opentelemetry-collector-contrib that referenced this pull request Sep 13, 2021
This change updates internal code and is meant to alleviate the
massive PR #5184 which is our eventual end-goal. No need for tests
because the code path isn't used so we have a license to update
it towards the end goal.

Particularly it:
* adds open-telemetry/opentelemetry-collector#3116 to the otlp version.
* copies the sortPoints method over verbatim.
* sets DataType and Name on pdata metrics.

Updates PR #5184