Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/util/log: introduce fluent.sink.conn.errors metric to log package #106607

Merged
merged 4 commits into from
Jul 24, 2023

Conversation

abarganier
Copy link
Contributor

@abarganier abarganier commented Jul 11, 2023

Logging is a critical subsystem within CRDB, but as things
are today, we have very little observability into logging
itself. For starters, we have no metrics in the logging
package at all!

This makes it difficult to observe things within the log
package. For example, if a fluent-server log sink fails
to connect to FluentBit, how can we tell? We get some STDOUT
message, but that's about it.

It's time to get some metrics into the log package.

Doing so is a bit of a tricky dance, because pretty much every
package in CRDB imports the logging package, meaning you
almost always get a circular dependency when trying to make
use of any library within pkg/util/log. Therefore, this PR
injects metrics functionality into the logging package.
It does so via a new interface called LogMetrics with
functions that enable its users to modify metrics.

The implementation of the interface can live elsewhere,
such as the metrics package itself, whe circular dependencies
aren't such a pain. We can then inject the implementation
into the log package.

With all that plumbing done, the PR also introduces a new metric,
fluent.sink.conn.errors representing fluentbit connection errors
whenever a fluent-server log sink fails to establish a connection.

We can see the metric represented below in a test, where no
fluent-server was running for a moment, before starting it.
Note that the connection errors level off once the server was
started:
Screenshot 2023-07-11 at 1 32 57 PM

Release note (ops change): This patch introduces the counter
metric fluent.sink.conn.errors to the CockroachDB tsdb,
which is exposed to /_status/vars clients as
fluent_sink_conn_errors. The metric is incremented whenever
a fluent-server log sink fails to establish a connection to
the log sink pointed to by the address for the sink in the
provided log config.

Epic: CC-9681

Addresses: #102753

@abarganier abarganier requested review from knz, dhartunian and a team July 11, 2023 17:47
@abarganier abarganier requested review from a team as code owners July 11, 2023 17:47
@abarganier abarganier requested review from herkolategan and smg260 and removed request for a team, herkolategan and smg260 July 11, 2023 17:47
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@abarganier abarganier force-pushed the fluent-server-conn-metric branch from 4eb0a10 to 0e59caf Compare July 11, 2023 17:51
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main feedback concerns the structure of the new package and the ownership of the registry (1st and 2nd commit). The rest checks out.

Reviewed 5 of 5 files at r1, 12 of 12 files at r2, 5 of 5 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)


pkg/cli/demo.go line 235 at r3 (raw file):

			// decide how to enable logging.
			serverCfg.Stores.Specs = nil
			logMetrics := metric.NewLogMetricsRegistry()

I believe this change should be moved form the 3rd commit to the 2nd commit.


pkg/server/tenant.go line 651 at r3 (raw file):

		lr, ok := r.(*metric.Registry)
		if !ok {
			panic(fmt.Sprintf("unexpected type for LogMetrics registry: %T", r))

panic(errors.AssertionFailedf(...))


pkg/util/log/flags.go line 93 at r2 (raw file):

//
// The returned logShutdownFn can be used to gracefully shut down logging facilities.
func ApplyConfig(config logconfig.Config, metrics LogMetrics) (logShutdownFn func(), err error) {

I don't love this change.

It seems to me that the (new) logmetrics package should instantiate the registry in a init function as a singleton, inject it into the log package only once (with a specific function just for that purpose).

Then the cli/start code would retrieve a reference to that singleton for passing to the server initialization.

(I am especially concerned by CLI unit tests that call start repeatedly. I don't think we want each iteration of the unit test to create a new metric registry.)


pkg/util/metric/log_metrics.go line 11 at r1 (raw file):

// licenses/APL.txt.

package metric

my preference would be to have the log metrics defined in a sub-package of log, e.g. util/log/logmetrics.
This way someone who is exploring the code around logging gets a "full picture" from the sub-directory tree.


pkg/util/metric/log_metrics.go line 68 at r1 (raw file):

	l.mu.Lock()
	defer l.mu.Unlock()
	if counter, ok := l.mu.counters[metric]; ok {

It seems to me that if the metric does not exist, the program should crash -- it would be an implementation bug.
Maybe a panic? panic(errors.AssertionFailedf(...))

@abarganier abarganier force-pushed the fluent-server-conn-metric branch from 0e59caf to ea63366 Compare July 12, 2023 21:52
Copy link
Contributor Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback, it was very helpful. RFAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/server/tenant.go line 651 at r3 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

panic(errors.AssertionFailedf(...))

Ack, but no longer needed as we don't need to worry about this case anymore 😌


pkg/cli/demo.go line 235 at r3 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I believe this change should be moved form the 3rd commit to the 2nd commit.

Ack - no longer relevant as this entire init process has been greatly simplified thanks to your other comment 🙂


pkg/util/log/flags.go line 93 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I don't love this change.

It seems to me that the (new) logmetrics package should instantiate the registry in a init function as a singleton, inject it into the log package only once (with a specific function just for that purpose).

Then the cli/start code would retrieve a reference to that singleton for passing to the server initialization.

(I am especially concerned by CLI unit tests that call start repeatedly. I don't think we want each iteration of the unit test to create a new metric registry.)

Thanks for this advice, I somehow felt the entire thing was still too brittle.

Using an init function made everything much simpler & we no longer have to worry about nil log metrics.

I ended up adding an init to the MetricsRecorder file for this, since the new logmetrics package isn't imported many places otherwise. LMK if you think there's a better place for this.


pkg/util/metric/log_metrics.go line 11 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

my preference would be to have the log metrics defined in a sub-package of log, e.g. util/log/logmetrics.
This way someone who is exploring the code around logging gets a "full picture" from the sub-directory tree.

Done.


pkg/util/metric/log_metrics.go line 68 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

It seems to me that if the metric does not exist, the program should crash -- it would be an implementation bug.
Maybe a panic? panic(errors.AssertionFailedf(...))

Agreed & Done.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 23 files at r5, 2 of 16 files at r6, 4 of 17 files at r7, 14 of 14 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)


pkg/server/server.go line 1844 at r8 (raw file):

	// Grab the log metrics registry.
	r := log.LogMetricsRegistry()

nit: you could retrieve the registry directly with the right type (*metric.Registry) if you put the accessor in package logmetrics instead.


pkg/server/tenant.go line 646 at r8 (raw file):

	// Grab the log metrics registry.
	r := log.LogMetricsRegistry()

ditto


pkg/server/status/recorder.go line 134 at r8 (raw file):

		nodeRegistry *metric.Registry
		// logRegistry contains the global metrics registry used by the logging
		// package. NB: code paths exist where this registry is never initialized,

nit: pls provide an example of such code path.


pkg/server/status/recorder.go line 159 at r8 (raw file):

	// The logmetrics package is not imported by much, meaning
	// that its own init function is not a reliable place to initialize
	// log metrics. Being that the recorder is the primary user of

I disagree. you can use

import (
   _ ".../util/log/logmetrics"
)

at the top of this file, and you'd get the right behavior.


pkg/server/status/recorder.go line 254 at r8 (raw file):

			nodeReg.AddLabel("node_id", strconv.Itoa(int(desc.NodeID)))
			if !mr.hasLabel(logReg.GetLabels(), "node_id") {
				logReg.AddLabel("node_id", strconv.Itoa(int(desc.NodeID)))

This does not work for me. The log metric registry should be read-only.

If we have multiple in-memory servers running one after the other (e.g. think about multiple go tests in one package), the logic here would make the first one "win" the right to get its node_id in there. The others that then access the log package afterwards would not have "their" node_id in there and the person looking at the log output would be utterly confused

Is there a way to have the counters in the singleton in logmetrics, and the labels in an object that's different in each serveR?


pkg/util/log/clog.go line 115 at r8 (raw file):

}

// LogMetricsRegistry returns the metrics registry used by

As explained above, you can put this accessor in logmetrics. The Set function below can remain here.

@abarganier abarganier force-pushed the fluent-server-conn-metric branch from ea63366 to 9a8bf20 Compare July 20, 2023 19:37
Copy link
Contributor Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, great feedback. I've made some changes, PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/server/server.go line 1844 at r8 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: you could retrieve the registry directly with the right type (*metric.Registry) if you put the accessor in package logmetrics instead.

Thanks for the tip, this worked well and cleaned up the interface. Done.


pkg/server/tenant.go line 646 at r8 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

ditto

Done.


pkg/server/status/recorder.go line 134 at r8 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: pls provide an example of such code path.

Ah, this was leftovers that I missed/forgot to remove. No such case exists anymore, since the recorder imports pkg/util/log/logmetrics, which ensures initialization.

Removed.


pkg/server/status/recorder.go line 159 at r8 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I disagree. you can use

import (
   _ ".../util/log/logmetrics"
)

at the top of this file, and you'd get the right behavior.

Thanks for this, it worked nicely. Done.


pkg/server/status/recorder.go line 254 at r8 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

This does not work for me. The log metric registry should be read-only.

If we have multiple in-memory servers running one after the other (e.g. think about multiple go tests in one package), the logic here would make the first one "win" the right to get its node_id in there. The others that then access the log package afterwards would not have "their" node_id in there and the person looking at the log output would be utterly confused

Is there a way to have the counters in the singleton in logmetrics, and the labels in an object that's different in each serveR?

Good point, thank you for bringing this up.

What we can do is export a function from the package, logmetrics.NewRegistry(), that adds the same singleton counters to a fresh registry per-server. This way, each server can add its own labels without anything colliding.

I've implemented this and found it worked nicely during testing. LMK your thoughts.


pkg/util/log/clog.go line 115 at r8 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

As explained above, you can put this accessor in logmetrics. The Set function below can remain here.

Done, great tip! Thanks.

@abarganier abarganier requested a review from knz July 20, 2023 19:57
@abarganier abarganier force-pushed the fluent-server-conn-metric branch 2 times, most recently from 8aadbf0 to d3e490a Compare July 20, 2023 21:01
@abarganier abarganier force-pushed the fluent-server-conn-metric branch 2 times, most recently from 43ee2f8 to 4be4af6 Compare July 20, 2023 21:33
Logging is a critical subsystem within CRDB, but as things
are today, we have very little observability into logging
itself. For starters, we have no metrics in the logging
package at all!

This makes it difficult to observe things within the log
package. For example, if a fluent-server log sink fails
to connect to FluentBit, how can we tell? We get some STDOUT
message, but that's about it.

It's time to get some metrics into the log package.

Doing so is a bit of a tricky dance, because pretty much every
package in CRDB imports the logging package, meaning you
almost always get a circular dependency when trying to make
use of any library within pkg/util/log.

Therefore, we must inject metrics functionality into the
logging package.

This patch provides a means of doing so. Within the log
package, we add a new interface called `LogMetrics` with
functions that enable its users to modify metrics.

The implementation of the interface can live elsewhere,
where circular dependencies aren't such a pain. We can
then inject the implementation into the log package.

This patch also provides a basic implementation to be used.
Future patches will plumb the implementation into the log
package and actually modify the supported metrics.

Release note: none
This patch plumbs the LogMetricRegistry into the log package
as part of the initialization of the package where the
implementation lives.

With the log metrics plumbed into the log package and
made available in the server, we can now add them to the
metrics recorder used by the server, as well as begin using
metrics within the log package.

Release note: none
This patch adds the necessary plumbing to inject the metrics
used by the logging package into the MetricsRecorders used by
servers in CockroachDB.

With this plumbing complete, we're ready to begin using metrics
within the log package.

Release note: none
Previous patches introduced the infrastructure necessary to record
metrics within the log package without introducing circular
dependencies.

With all that plumbing done, we can now increment the metric
representing fluentbit connection errors whenever a fluent-server
log sink fails to establish a connection.

Release note (ops change): This patch introduces the counter
metric `fluent.sink.conn.errors` to the CockroachDB tsdb,
which is exposed to `/_status/vars` clients as
`fluent_sink_conn_errors`. The metric is incremented whenever
a `fluent-server` log sink fails to establish a connection to
the log sink pointed to by the `address` for the sink in the
provided log config.
@abarganier abarganier force-pushed the fluent-server-conn-metric branch from 4be4af6 to abd9c99 Compare July 20, 2023 21:36
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 14 files at r9, 2 of 3 files at r10, 5 of 5 files at r11, 1 of 1 files at r12, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)

@abarganier
Copy link
Contributor Author

TFTR!

bors r=knz

@craig
Copy link
Contributor

craig bot commented Jul 24, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jul 24, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jul 24, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants