Instrument instances types #1484

Merged
merged 18 commits into from
Jul 13, 2021

Conversation

rubenvp8510
Collaborator

@rubenvp8510 rubenvp8510 commented Jun 24, 2021

Signed-off-by: Ruben Vargas [email protected]

Which problem is this PR solving?

  • A first approach to instrumenting the Jaeger operator to learn more about its operands.
  • This reports the number of instances and uses labels for strategy, storage and agentMode.

Short description of the changes

  • Use the Prometheus exporter and the OpenTelemetry metrics API. Created a valueObserver that queries for the instances managed by the operator in every callback. The reason for using a valueObserver instead of an UpDownCounter instrument is that I cannot rely on the reconciliation process: it can be executed multiple times for creation/update/delete, and I can't distinguish an update from a creation.
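
The trade-off described above can be illustrated with a minimal, library-free sketch. It does not use the OpenTelemetry API; the `store` and `counter` types and the `reconcile` and `observe` functions are hypothetical names introduced only for this illustration. A counter bumped from the reconcile loop over-counts when the same instance is reconciled repeatedly, while an observer-style callback recounts the current state on every collection.

```go
package main

import "fmt"

// store is a hypothetical stand-in for the cluster state: instance name -> present.
type store map[string]bool

// counter mimics an UpDownCounter that is bumped from the reconcile loop.
type counter struct{ value int64 }

// reconcile is called for create and update events alike; the loop cannot
// tell them apart, so the counter is incremented on every call.
func (c *counter) reconcile(s store, name string) {
	s[name] = true
	c.value++
}

// observe mimics a value-observer callback: it recounts the current state
// on every collection instead of accumulating deltas.
func observe(s store) int64 {
	return int64(len(s))
}

func main() {
	s := store{}
	c := &counter{}
	// One instance, reconciled three times (one create, two updates).
	for i := 0; i < 3; i++ {
		c.reconcile(s, "simplest")
	}
	fmt.Println("counter:", c.value)     // counter: 3
	fmt.Println("observer:", observe(s)) // observer: 1
}
```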

Screenshot from 2021-06-24 00-06-33

@rubenvp8510 rubenvp8510 force-pushed the instrument-op branch 2 times, most recently from fc24db9 to 36fdb62 Compare June 24, 2021 05:34
// Bootstrap prepares a new tracer to be used by the operator
func Bootstrap(ctx context.Context, namespace string, client client.Client) {
	tracer := otel.GetTracerProvider().Tracer(v1.CustomMetricsTracer)
	ctx, span := tracer.Start(ctx, "bootstrap")
Member

Is this span recording events and being reported?

If Bootstrap creates the tracer, then the span might be a noop/default span if it is created before initialization.

Collaborator Author

The tracer is already created when this bootstrapping is called; this bootstraps the metrics for OpenTelemetry.

const meterName = "jaegertracing.io/jaeger"

// Bootstrap prepares a new tracer to be used by the operator
func Bootstrap(ctx context.Context, namespace string, client client.Client) {
Member

Should this return some kind of closer interface to flush the spans on a termination event?

Member

When is this method called? Periodically or just once?

Collaborator Author

This is called just once.

func (i *instancesMetric) strategy(jaeger v1.Jaeger) string {
	strategy := string(jaeger.Spec.Strategy)
	if strategy == "" {
		return "allinone"
Member

If the convention is that an empty strategy means allinone, then we should define a method on the Spec object that returns the strategy, with allinone as the default.

Member

The same for memory

Member

and agent sidecar

Contributor

They are set via Defaulters, but might be empty when running locally (make run) and/or in tests.

Collaborator Author

Are we using Defaulters in this version of the operator? I started to do that in v2, but IIRC we don't use them here.

Contributor

True. We have a normalizer in this operator.
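
The suggestion earlier in this thread, a method on the Spec that returns the strategy with allinone as the default, could look roughly like the sketch below. The types are simplified stand-ins rather than the operator's actual v1 API types, and `StrategyOrDefault` and the constant names are hypothetical.

```go
package main

import "fmt"

// DeploymentStrategy and JaegerSpec are simplified, hypothetical stand-ins
// for the operator's v1 API types.
type DeploymentStrategy string

const (
	StrategyAllInOne   DeploymentStrategy = "allinone"
	StrategyProduction DeploymentStrategy = "production"
)

type JaegerSpec struct {
	Strategy DeploymentStrategy
}

// StrategyOrDefault returns the configured strategy, falling back to
// allinone when the field was left empty (for example in tests or local runs).
func (s JaegerSpec) StrategyOrDefault() DeploymentStrategy {
	if s.Strategy == "" {
		return StrategyAllInOne
	}
	return s.Strategy
}

func main() {
	fmt.Println(JaegerSpec{}.StrategyOrDefault())
	fmt.Println(JaegerSpec{Strategy: StrategyProduction}.StrategyOrDefault())
}
```

With an accessor like this, the metrics code would not need its own empty-string check, and the same pattern could apply to the storage type and agent strategy mentioned above.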

	ctx, span := tracer.Start(ctx, "setup-jaeger-instances")
	defer span.End()
	meter := global.Meter(meterName)
	_, err := meter.NewInt64ValueObserver("operator_jaeger_instances", i.callback,
Contributor

How frequently is the callback called?

Collaborator Author

IIRC this is called per collection interval.

@codecov

codecov bot commented Jun 24, 2021

Codecov Report

Merging #1484 (36877a0) into master (b27ec96) will decrease coverage by 0.37%.
The diff coverage is 67.92%.

@@            Coverage Diff             @@
##           master    #1484      +/-   ##
==========================================
- Coverage   88.28%   87.91%   -0.38%     
==========================================
  Files          90       92       +2     
  Lines        5660     5766     +106     
==========================================
+ Hits         4997     5069      +72     
- Misses        503      532      +29     
- Partials      160      165       +5     
Impacted Files              Coverage Δ
pkg/metrics/bootstrap.go    0.00% <0.00%> (ø)
pkg/metrics/instances.go    87.80% <87.80%> (ø)

@rubenvp8510 rubenvp8510 force-pushed the instrument-op branch 2 times, most recently from 083af9a to a5c078a Compare June 25, 2021 05:33
@rubenvp8510
Collaborator Author

Addressed all comments and added a screenshot with the visualization of the metrics.

@pavolloffay
Member

> Addressed all comments and added a screenshot with the visualization of the metrics.

If the metrics are exposed on the Prometheus endpoint, it might be better to paste them here in text form. At least for me, that is the best way to understand what is being exposed.

@rubenvp8510
Collaborator Author

rubenvp8510 commented Jun 29, 2021

> > Addressed all comments and added a screenshot with the visualization of the metrics.
>
> If the metrics are exposed on the Prometheus endpoint, it might be better to paste them here in text form. At least for me, that is the best way to understand what is being exposed.

# HELP operator_jaeger_instances Number of jaeger instances in cluster
# TYPE operator_jaeger_instances gauge
operator_jaeger_instances{agent="sidecar",service_instance_id="default.jaeger-operator",service_name="jaeger-operator",service_namespace="default",service_version="v1.23.0.redhat1-7-g97c68db8",storage="elasticsearch",strategy="production"} 1
operator_jaeger_instances{agent="sidecar",service_instance_id="default.jaeger-operator",service_name="jaeger-operator",service_namespace="default",service_version="v1.23.0.redhat1-7-g97c68db8",storage="memory",strategy="allinone"} 1

@pavolloffay
Member

The metrics look good. I would rename agent to agent_strategy

Are the following labels needed? What do they provide us?

service_instance_id="default.jaeger-operator",service_name="jaeger-operator",service_namespace="default"

@rubenvp8510
Collaborator Author

> The metrics look good. I would rename agent to agent_strategy
>
> Are the following labels needed? What do they provide us?
>
> service_instance_id="default.jaeger-operator",service_name="jaeger-operator",service_namespace="default"

I don't think we need them; we can get rid of them.

WDYT? @jpkrohling

@jpkrohling
Contributor

jpkrohling commented Jun 30, 2021

If we keep the name/namespace labels, we will have high cardinality here. In fact, I would even split the metrics into this:

jaeger_operator_instances_agent_modes{mode=sidecar} 1
jaeger_operator_instances_agent_modes{mode=daemonset} 1
jaeger_operator_instances_storage_types{type=memory} 1
jaeger_operator_instances_storage_types{type=elasticsearch} 1
jaeger_operator_instances_strategies{type=production} 1
jaeger_operator_instances_strategies{type=allinone} 1
jaeger_operator_instances_versions{version=1.23.0} 1
jaeger_operator_instances_versions{version=1.24.0} 1

I'm not familiar with the OpenTelemetry Metrics API, but OpenCensus had a separation between counters and views: you'd have only one counter here, reporting what the instance looks like (agent mode, storage, strategy, version), and four different views, one for each metric I listed above.

With that, we only have a limited number of time series, the last one being the only one that can potentially grow "indefinitely".

Note that my proposal also changes the prefix, to be in line with the other Jaeger components ("jaeger_collector_...", "jaeger_agent_...").
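
The split proposed above can be sketched without any metrics library: each instance is counted once per dimension, so the number of time series grows with the sum of the distinct values per dimension rather than with their combinations. This is a rough illustration only; the `instance` type and `countBy` helper are hypothetical and are not part of the operator or of the OpenTelemetry API.

```go
package main

import "fmt"

// instance is a hypothetical, flattened view of one Jaeger custom resource.
type instance struct {
	agentMode, storage, strategy string
}

// countBy returns how many instances share each value of one dimension.
func countBy(instances []instance, key func(instance) string) map[string]int64 {
	counts := make(map[string]int64)
	for _, in := range instances {
		counts[key(in)]++
	}
	return counts
}

func main() {
	instances := []instance{
		{agentMode: "sidecar", storage: "elasticsearch", strategy: "production"},
		{agentMode: "sidecar", storage: "memory", strategy: "allinone"},
		{agentMode: "daemonset", storage: "memory", strategy: "allinone"},
	}
	// One small set of series per dimension, instead of one series per
	// (agent, storage, strategy, name, namespace) combination.
	fmt.Println(countBy(instances, func(i instance) string { return i.agentMode }))
	fmt.Println(countBy(instances, func(i instance) string { return i.storage }))
	fmt.Println(countBy(instances, func(i instance) string { return i.strategy }))
}
```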

@rubenvp8510
Collaborator Author

rubenvp8510 commented Jul 4, 2021

I changed the metrics using the suggested way:

# HELP jaeger_operator_instances_agent_strategies Number of instances per agent strategy
# TYPE jaeger_operator_instances_agent_strategies gauge
jaeger_operator_instances_agent_strategies{type="sidecar"} 1
# HELP jaeger_operator_instances_storage_types Number of instances per storage type
# TYPE jaeger_operator_instances_storage_types gauge
jaeger_operator_instances_storage_types{type="memory"} 1
# HELP jaeger_operator_instances_strategies Number of instances per strategy type
# TYPE jaeger_operator_instances_strategies gauge
jaeger_operator_instances_strategies{type="allinone"} 1
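
For readers less familiar with the text format shown above, the sketch below renders one gauge family with a single `type` label in the same shape. It is a simplified illustration, not the Prometheus exporter used in this PR; `renderGauge` is a hypothetical helper, and `%q` is used as an approximation of the label-value escaping rules.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderGauge formats one gauge family with a single "type" label.
// Label values are sorted so the output is deterministic.
func renderGauge(name, help string, counts map[string]int64) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s gauge\n", name)
	values := make([]string, 0, len(counts))
	for v := range counts {
		values = append(values, v)
	}
	sort.Strings(values)
	for _, v := range values {
		// %q quotes the label value; this approximates the real escaping rules.
		fmt.Fprintf(&b, "%s{type=%q} %d\n", name, v, counts[v])
	}
	return b.String()
}

func main() {
	fmt.Print(renderGauge(
		"jaeger_operator_instances_storage_types",
		"Number of instances per storage type",
		map[string]int64{"memory": 1},
	))
}
```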

@rubenvp8510 rubenvp8510 force-pushed the instrument-op branch 2 times, most recently from 68c4a90 to eec4c90 Compare July 10, 2021 05:18
rubenvp8510 and others added 9 commits July 10, 2021 00:20
Signed-off-by: Ruben Vargas <[email protected]>
Co-authored-by: Juraci Paixão Kröhling <[email protected]>
Signed-off-by: Ruben Vargas <[email protected]>
- Fixed import orders
- Better error logging and handling
- Some code linting fixes

Signed-off-by: Ruben Vargas <[email protected]>
@rubenvp8510
Collaborator Author

rubenvp8510 commented Jul 10, 2021

@jpkrohling @pavolloffay Please give it another review; I added tests and a flag to enable/disable this feature.

@rubenvp8510
Collaborator Author

I addressed all comments and removed the flag. I think this is ready for another review and hopefully a merge =)

c := controller.New(
	processor.New(
		selector.NewWithHistogramDistribution(
			histogram.WithExplicitBoundaries(config.DefaultHistogramBoundaries),
Contributor

What are the boundaries defined here? Do they make sense for us?

Collaborator Author

None of that is applicable to us for now. That is why I just set the default values.

Contributor

What are the default values? If we don't have histograms, why do we configure them? Can't we just leave this out?

Collaborator Author

@rubenvp8510 rubenvp8510 Jul 13, 2021

> What are the default values? If we don't have histograms, why do we configure them? Can't we just leave this out?

I don't know; we are not using them. I can't leave it out: I need to pass an aggregation selector to the New method. :/

Collaborator Author

// NewWithHistogramDistribution returns a simple aggregator selector
// that uses histogram aggregators for `ValueRecorder` instruments.
// This selector is a good default choice for most metric exporters.

From this comment I'm assuming this is the best "default" configuration.

@jpkrohling jpkrohling enabled auto-merge (squash) July 13, 2021 15:42
@jpkrohling jpkrohling merged commit e6b4930 into jaegertracing:master Jul 13, 2021
3 participants