[WIP] Add protocol for dynamic configuration service #155

wtong98 · 2020-06-05T22:25:52Z

Changes:

New protocol for dynamic config service

Goal

We suggest a new protocol for a dynamic configuration service that enables updating metric collection schedules, trace sampling parameters, and other configurations without having to restart an instrumented application. An early draft of the proposal is attached:
OpenTelemetry Configuration Service.pdf

We'd love to hear your thoughts! An example implementation is currently underway, implemented as an extension on the collector and modifications to the Go SDK. Our goal is to deliver a functioning prototype in the upcoming weeks to give an example of what this system may look like.

About

We're an intern group at Google. Members include:

Ming Chen @vmingchen (host)
Josh Montana @jdmontana (co-host)
Churan Chen @churanchen (intern)
William Tong @wtong98 (me! also an intern)

linux-foundation-easycla · 2020-06-05T22:25:57Z

The committers are authorized under a signed CLA.

✅ William Tong (f2aed90, 2010ce4, b024fe4, aa516d0, ab778ca, 2c2220f, cb12eea, b6fc2e1, 42a0f05)

jmacd · 2020-06-05T22:52:27Z

I think we will probably want to coordinate this with the Metric Views API. This looks good as a start, for configuring collection periods, but I think people are even more interested in configuring the labels and aggregations used. Thanks!

SergeyKanzhelev

I agree with @jmacd that tighter integration with views API would be beneficial. Left a few comments. Indeed a good start.

@open-telemetry/technical-committee should this be developed as another experimental feature?

SergeyKanzhelev · 2020-06-05T22:56:34Z

opentelemetry/proto/collector/README.md

-3. `metrics` package contains the Metrics Service protos.
+1. `trace` package contains the Trace Service protos.
+2. `metrics` package contains the Metrics Service protos.
+3. `dynamicconfig`package contains the Dynamic Config Service protos


if you adding a note here, also add a it to the maturity matrix: https://github.com/open-telemetry/opentelemetry-proto#maturity-level

SergeyKanzhelev · 2020-06-05T23:00:12Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+
+message ConfigRequest{
+  // Describes the version of the config service protocol to use. Required.
+  int32 proto_version = 1;


proto_version may be misunderstood as a proto2 or proto3. Perhaps a better name can be used here.

Suggested change

int32 proto_version = 1;

int32 config_format_version = 1;

SergeyKanzhelev · 2020-06-05T23:00:34Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+
+message ConfigRequest{
+  // Describes the version of the config service protocol to use. Required.
+  int32 proto_version = 1;


would uint32 be a better type?

How will this be used? Protocol Buffers by nature handle versioning, and if a drastic version change was needed you'd probably use a v2 package instead. If it's for the opaque metadata, it would be better to have a struct to contains metadata and metadata_format_version, but it also seems reasonable to leave that versioning to the user.

great point, proto_version removed

SergeyKanzhelev · 2020-06-05T23:03:45Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+  // whether an update to the configs had occurred -- if the client request's
+  // fingerprint differs from the server's, then the server knows that a change
+  // had occurred and an updated configuration should be sent.
+  string last_fingerprint = 3;


should we also consider adding some error reporting from SDK to collector when it failed to apply the config?

SergeyKanzhelev · 2020-06-05T23:07:23Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+
+  // Optional. The client is suggested to wait this long (in seconds) before
+  // pinging the configuration service again.
+  int32 suggested_wait_time = 5;


Suggested change

int32 suggested_wait_time = 5;

int32 suggested_wait_time_sec = 5;

SergeyKanzhelev · 2020-06-05T23:11:03Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+  // whether an update to the configs had occurred -- if the client request's
+  // fingerprint differs from the server's, then the server knows that a change
+  // had occurred and an updated configuration should be sent.
+  string last_fingerprint = 3;


Suggested change

string last_fingerprint = 3;

string last_known_fingerprint = 3;

just a suggestion. I think it make the name more descriptive.

SergeyKanzhelev · 2020-06-05T23:17:03Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+  // Describes the version of the config service protocol to use. Required.
+  int32 proto_version = 1;
+
+  // The target resource. This resource pings the config server requesting


required or optional?

Since this is ConfigRequest, target is confusing, since target is something you usually send a request to. So maybe

// The resource for which configuration should be returned.

Also don't think we have to talk about pings / responses since those are implementation details and not part of the protocol.

Good point, changed

SergeyKanzhelev · 2020-06-05T23:25:23Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+      // protocol, but still desire to communicate these settings to
+      // instrumented applications. An application may in turn piggyback
+      // metadata on a vendor's metric exporter to communicate information back
+      // to its metric backend. In this way, metadata offers a channel to


how exporter will know this metadata?

SergeyKanzhelev · 2020-06-05T23:26:15Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+      }
+
+      // Metrics with names that match a rule in the inclusion_patterns are
+      // targeted by this schedule. Metrics that match the exclusion_patterns


what about labels matching?

anuraaga

Thanks! It'll be cool to have a configuration service if we can work out good semantics that apply to many systems. For reference, here's X-Ray's tracing sapling rules. They're fairly simple and probably similar to others, giving the link in case it provides information about filling out the trace config section

https://docs.aws.amazon.com/xray/latest/devguide/xray-api-sampling.html

anuraaga · 2020-06-06T02:21:16Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+
+message ConfigRequest{
+  // Describes the version of the config service protocol to use. Required.
+  int32 proto_version = 1;


How will this be used? Protocol Buffers by nature handle versioning, and if a drastic version change was needed you'd probably use a v2 package instead. If it's for the opaque metadata, it would be better to have a struct to contains metadata and metadata_format_version, but it also seems reasonable to leave that versioning to the user.

anuraaga · 2020-06-06T02:24:04Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+  // Describes the version of the config service protocol to use. Required.
+  int32 proto_version = 1;
+
+  // The target resource. This resource pings the config server requesting


Since this is ConfigRequest, target is confusing, since target is something you usually send a request to. So maybe

// The resource for which configuration should be returned.

Also don't think we have to talk about pings / responses since those are implementation details and not part of the protocol.

anuraaga · 2020-06-06T02:25:55Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+  // updated configs, and the server responds with the config information.
+  opentelemetry.proto.resource.v1.Resource resource = 2;
+
+  // Optional. Identifies the last configuration fingerprint / version number.


// Optional. The value of ConfigResponse.fingerprint for the last configuration a resource received.

And don't think we need details about how the fingerprint is used in this protocol, it's up to the server whether it uses it or not.

anuraaga · 2020-06-06T02:26:10Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+}
+
+message ConfigResponse {
+  // Describes the version of the config service protocol to use. Should match


anuraaga · 2020-06-06T02:26:29Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+  // the request’s version.
+  int32 proto_version = 1;
+
+  // Optional. The fingerprint associated with the current config settings. Each


The fingerprint associated with this ConfigResponse

anuraaga · 2020-06-06T02:30:52Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+  // change in configs yields a different fingerprint. The client may cache a
+  // fingerprint, then return it on the next request to enable the server to
+  // quickly check whether a change in the configs occurred. If no changes
+  // occurred, then all other fields in the response are optional.


It's not clear, especially with the unimplemented blocks, whether it's trivial for a client to understand the difference between a response containing no fields because it didn't change or containing default fields. How about adding a boolean indicating whether this response is valid or not?

Not sure if I follow. The original intention was that should nothing change, then the response will have only fingerprint, which equals the client's last_known_fingerprint, by which the client knows that no change happened. Does that sound like a practical outcome? I'll also update the comments to make the process more clear

Since a fingerprint is like a hash I don't think comparison with a previous value can guarantee nothing has changed due to hash collisions - it's more a hint which is more useful in the server which actually can compute the fingerprints.

So I think here for the client to know no change happened, it seems it needs to check the actual content. This would probably be many hasX checks using the generated code for proto and might work but may be indirect.

Another thought is perhaps this fingerprint is overoptimization, at least for now so maybe ok to remove for now. The configs don't seem so large and comparing fingerprints (etag + 204 responses in the http world) may not be worth it in practice.

Great points, thanks for the quick response! I'll bring it up with the rest of the group, and we'll update our proposal (which will hopefully be made available soon)

anuraaga · 2020-06-06T02:32:36Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+  // occurred, then all other fields in the response are optional.
+  string fingerprint = 2;
+
+  // Dynamic configs specific to metrics, namely the metric collection schedule


Might add different configs later, don't think there's value in restricting to , namely the ...

anuraaga · 2020-06-06T02:33:00Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+    // adopts the configuration specified by the schedule.
+    message Schedule {
+
+      // Pattern dictates a light-weight pattern that can match 1 or more


A light-weight pattern that can...

anuraaga · 2020-06-06T02:37:09Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+      repeated Pattern exclusion_patterns = 2;
+
+      // CollectionPeriod describes the sampling period for each metric. All
+      // larger units are divisible by all smaller ones.


Adding some background on why this should be an enum and not just arbitrary uint32 would make it clearer

The original intent was to enforce the property that all larger units be divisible by smaller ones (which makes the implementation much cleaner), though we're thinking of altering this -- for example in case a vendor has an extremely good reason to use a collection period of, say, 2 days

anuraaga · 2020-06-06T02:39:18Z

opentelemetry/proto/collector/dynamicconfig/v1/dynamicconfig_service.proto

+      //  * Using alternate representations for collection schedules, matching
+      //    metrics, resources, etc.
+      //  * Enabling other optimizations
+      string metadata = 4;


bytes is better for opaque data, easy to store a string in bytes field, impossible to store arbitrary binary in a string field

great idea, changed

tigrannajaryan · 2020-06-08T12:49:24Z

This is an interesting proposal that I think warrants an OTEP where it can be discussed. The attached proposal in PDF format is not suitable for commenting.

I would like to see a few things addressed that I don't see in the proposal:

What are the security implications of remote configuration. So far OpenTelemetry's operation mode is that instrumented applications voluntarily emit telemetry and the applications and instrumenting libraries control what information to include in the telemetry and when to emit. Remote configuration capabilities create a new attack vector via which applications may be compelled to expose sensitive information (such attack vector perhaps is not viable based on current Config Request/Response definition but may become a possibility once it evolves to include more configuration options). I would like to see this discussed.
What are performance implications of mass remote configuration. A malicious or accidental incorrect configuration that increases sampling and collection rates dramatically may result in a storm of telemetry data generation and bring down the collection network and/or the application. I would like to understand if we need safeguards that mitigate this.
It is not clear how Collector should proxy config requests. Collector operation is currently based on the notion of data pipelines that are of specific types (traces or metrics). Unclear if config requests should be part of the pipelines, are a new pipeline type or are a separate concept that needs to passthrough.

I think it is best to discuss these as part of an OTEP before discussing the protocol since protocol design/schema will be likely driven by the choices made in that OTEP.

wtong98 · 2020-06-09T14:23:08Z

Thanks for all the comments! Really appreciate the interest. We'll finalize the proposal and open an OTEP as soon as possible

wtong98 · 2020-07-29T17:17:21Z

An updated PR has been made with suggestions from here and the OTEP: #183

wtong98 added 9 commits May 29, 2020 15:36

Add dynamic config protos

f2aed90

Refactor dyconfig to dynamicconfig

2010ce4

Polish dynamicconfig proto

b024fe4

Add additional documentation

aa516d0

Modify inclusion-exclusion pattern type

ab778ca

Tidy message definitions

2c2220f

Elaborate Schedule documentation

cb12eea

Remove metadata field from request

b6fc2e1

Final polish

42a0f05

jmacd mentioned this pull request Jun 5, 2020

Metrics default SDK configuration CFP open-telemetry/opentelemetry-specification#382

Open

SergeyKanzhelev reviewed Jun 5, 2020

View reviewed changes

anuraaga reviewed Jun 6, 2020

View reviewed changes

Tidy documentation comments

5f67555

churanchen mentioned this pull request Jun 10, 2020

OTEP for dynamic configuration vmingchen/oteps#1

Closed

wtong98 added 2 commits June 11, 2020 10:43

Clarify fingerprint documentation

5406581

Restart field numbering

c4b3c95

churanchen mentioned this pull request Jun 25, 2020

Remote configuration for the SDK proposal open-telemetry/oteps#121

Merged

Merge with master

b686427

wtong98 mentioned this pull request Jul 29, 2020

Add protocol to Metric Configuration Service #183

Merged

wtong98 closed this Jul 29, 2020

	int32 suggested_wait_time = 5;
	int32 suggested_wait_time_sec = 5;

	string last_fingerprint = 3;
	string last_known_fingerprint = 3;

[WIP] Add protocol for dynamic configuration service #155

[WIP] Add protocol for dynamic configuration service #155

Conversation

wtong98 commented Jun 5, 2020

Changes:

Goal

About

linux-foundation-easycla bot commented Jun 5, 2020 • edited Loading

jmacd commented Jun 5, 2020

SergeyKanzhelev left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

anuraaga left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wtong98 Jun 9, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tigrannajaryan commented Jun 8, 2020 • edited Loading

wtong98 commented Jun 9, 2020

wtong98 commented Jul 29, 2020

linux-foundation-easycla bot commented Jun 5, 2020 •

edited

Loading

wtong98 Jun 9, 2020 •

edited

Loading

tigrannajaryan commented Jun 8, 2020 •

edited

Loading