This repository has been archived by the owner on Oct 3, 2023. It is now read-only.

Add resource specs #162

Merged
merged 6 commits into from
Oct 30, 2018

Conversation

fabxc
Contributor

@fabxc fabxc commented Aug 23, 2018

This spec aims to cover my proposal for a resource package in the core library (Go reference implementation).
I dropped the existing spec that was covering specific cloud vendor auto-detection since part of the proposal is that those live independently in census-ecosystem or entirely elsewhere.

I took my best guesses on MAY vs. SHOULD vs. MUST based on existing specs but don't feel strongly in most cases. Any guidance is appreciated.

```go
type Resource struct {
	Type   string
	Labels map[string]string
}
```
Contributor

Your implementation had Labels as Tags. Are we going with Labels?

Contributor Author

Yes, we changed it from labels to tags. @Ramonza asked elsewhere yesterday whether "labels" or "properties" would make more sense.
Personally I don't feel strongly – there are arguments for either. Happy to change it to whatever you decide on.

The key MUST be separated from the value by a single `=`. Values MAY be quoted with a single
leading and trailing `"`. If a value contains whitespace, `=`, or `"` characters it MUST be
quoted and `"` characters MUST be escaped.

Contributor

Can you please give examples of OC_RESOURCE_TYPE and OC_RESOURCE_LABELS to demonstrate the format?

Contributor Author

Done.
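For reference, values consistent with the format quoted above could look like this (illustrative only, not necessarily the exact examples added to the spec):

OC_RESOURCE_TYPE="k8s.io/container"
OC_RESOURCE_LABELS=k8s.io/namespace="default",k8s.io/pod_name="pod-xyz-123"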


Exporter libraries MAY provide a default translation for well-known input resource types and labels.
Contributor

It might be good to add a list of all known label keys here in the future.

Then, implementors can look up the whole list and decide what they can map.

Contributor Author

I think the known label keys should be specific to the auto-detection implementations. The ones living in census-ecosystem can be documented there.
In general users may implement their own detection with their own set of keys – thus maintaining an authoritative set in the core spec seems problematic. Hence the proposal to use domain namespacing.

@@ -0,0 +1,10 @@
# OpenCensus Library Resource Package
This documentation serves to document the "look and feel" of the open source resource package.
It describes they key types and overall behavior.
Contributor

nit: s/they/their


The primary purpose of resources as a first-class concept in the core library is decoupling
of discovery of resource information from exporters. This allows for independent development
of either and easy customization for users that need to integrate with closed source environments.
Contributor

Remove "either and"?

The resource library primarily defines a type that captures about information about the entity
for which stats or traces are recorded. It further provides a framework for detection of
resource information from the environment and progressive population as signals propagate
from the core instrumentation library to the a backend's exporter.
Contributor

"the a" -> "a"

@@ -0,0 +1,10 @@
# OpenCensus Library Resource Package
This documentation serves to document the "look and feel" of the open source resource package.
Contributor

nit: s/open source/OpenCensus

unicode character.

Implementations MAY define a `Resource` data type, constructed from the parameters above.
Resource MAY have getters for retrieving all the information used in `Resource` definition.
Contributor

I think getters is a "must".

@@ -1,84 +0,0 @@
# Monitored Resource
Contributor

I think this page can be left as additional information for the "Auto-detection" section.

Contributor Author

My thinking was that it's more scalable to keep these docs close to the code they belong to. OTOH, with it spread across languages, this repo may indeed still be the better option – maybe with a separate file per detection mechanism then, as those will probably grow.

@@ -0,0 +1,133 @@
# Resource API Overview
The resource library primarily defines a type that captures about information about the entity
Contributor

nit; s/about information/the information/


## Populating resources
Resource information MAY be populated at any point between startup of the instrumented
application and passing it the a backend-specific exporter. This explicitly includes
Contributor

nit; s/passing it the a/passing it to a/


Additionally, exporters SHOULD provide configuration hooks for users to provide their own
translation unless the exporter's backend does not support resources at all. For such backends,
exporters SHOULD allow attaching converting resource labels to metric labels.
Contributor

not sure what you mean. what are metric labels?

Contributor Author

That should be tags probably.

@nfisher

nfisher commented Aug 24, 2018

My assumptions about Resource.md:

  • this document is to outline how to associate environment related labels to spans and metrics (e.g. zone, region, pod, etc).

  • The primary goal in doing so is to provide flexibility for drill-down, and fine-grained groupings in downstream tools.

  • There appear to be 3 approaches being considered for associating the labels:

    1. metadata API end-points.
    2. environment variables.
    3. config file(s).

Thoughts on these approaches:

i. metadata

I see two mechanisms for this (although there might be others):

  1. probing.
  2. import at build time.

Probing doesn't feel like a great option because it makes a lot of assumptions that are liable to break (e.g. what if I'm using a custom image, what if I have a default deny policy on my machine, etc).

Importing also doesn't feel like a great option, as it requires a recompilation to deploy to a new environment; I assume it would have an import that looks like this:

import _ "oc/metadata/aws"

I'm sure they can be made to work well but I don't think they're ideal as they feel too much like implicit magic. I can't deny their convenience in the contexts where it would just work.

ii. environment variables

I think you could iterate over variables with a given prefix (e.g. OC_LABEL_) and use them as labels. The terms region, zone, host, application, application_group can map to most cloud providers:

AWS

OC_LABEL_REGION=ca-central-1
OC_LABEL_ZONE=ca-central-1a
OC_LABEL_HOST=hostname.aws
OC_LABEL_APPLICATION=my-app
OC_LABEL_APPLICATION_GROUP=my-app-asg

k8s

OC_LABEL_REGION=ca-central-1
OC_LABEL_ZONE=ca-central-1a
OC_LABEL_HOST=pod-instance-id1234
OC_LABEL_APPLICATION=my-app
OC_LABEL_APPLICATION_GROUP=pod

My assumption with the above is that OC would truncate the prefix and down-case the remaining variable.
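A minimal Go sketch of the prefix-scanning idea described above (the OC_LABEL_ prefix and the down-casing are this comment's proposal, not part of the spec):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// labelsFromEnv collects all environment variables with the given prefix,
// strips the prefix, and lower-cases the remaining name, as proposed above.
// Sketch only; values are used verbatim.
func labelsFromEnv(prefix string) map[string]string {
	labels := map[string]string{}
	for _, kv := range os.Environ() {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 || !strings.HasPrefix(parts[0], prefix) {
			continue
		}
		key := strings.ToLower(strings.TrimPrefix(parts[0], prefix))
		labels[key] = parts[1]
	}
	return labels
}

func main() {
	fmt.Println(labelsFromEnv("OC_LABEL_"))
}
```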

iii. config file(s)

I think simple property files could be used here with a default path relative to the binary and optionally overridden by environment variable or config option.

It's late where I am; I'll come back to this.

This documentation serves to document the "look and feel" of the open source resource package.
It describes they key types and overall behavior.

The primary purpose of resources as a first-class concept in the core library is decoupling


It would be useful to start with a definition of what a resource is.

Contributor Author

@fabxc fabxc Aug 24, 2018

The other file starts with the definition

The resource library primarily defines a type that captures
information about the entity for which stats or traces are recorded.

If that sounds reasonable, I can move it to the readme as well.

Contributor

Please resolve this. I think it is important to specify what a "resource" is, maybe include examples.

@yurishkuro

I saw that def but it is not precise. Specifically, is the use of the term "entity" intended to generalize to more than just an application process / workload? I.e. what else could an entity be, an endpoint?

@fabxc
Contributor Author

fabxc commented Aug 27, 2018

I saw that def but it is not precise. Specifically, is the use of the term "entity" intended to generalize to more than just an application process / workload? I.e. what else could an entity be, an endpoint?

I agree that "entity" is as generic as it gets and I believe that's the best we can do in the end. Signals can refer to pretty much anything (consider IoT for example – "toothbrush" could be perfectly valid) and I don't think an instrumentation library should constrain that set artificially, especially since we can't draw any benefits from it.

@fabxc
Contributor Author

fabxc commented Aug 27, 2018

@nfisher regarding the approaches: I think the user should be enabled to use what works for their environment. Thus the goal is to not enforce a single specific solution but to provide a thin integration package for resources – with envvars being the fallthrough, that can always be relied on.

In some environments that can mean automatic detection of resource information, which should be provided through libraries. As you said, this is not ideal in the sense that it adds dependencies to the code.
That may be perfectly fine for some users – after all they'd be doing the same for the exporters they use. Both could eventually be offloaded to an OC agent/proxy that includes those dependencies once, while application binaries only send their data through a generic protocol.
The alternative is that the user has to perform the same actions the library does as a deployment step. That's certainly less magical, but also more error prone in total and a worse user experience.
Client libraries for cloud vendors also use what seems like magic to determine credentials so the user doesn't have to care about that. The auto-detection of resource information hooks into the exact same mechanisms.
In general, dumping the gathered resource information should be a relatively straightforward way to debug what information was or wasn't gathered.

For environments where auto-detection is not possible, runtime configuration is ultimately required. Envvars seemed like an easy way to integrate.
Using prefix-based variables is indeed an option. However, the character set becomes automatically limited and namespaced label keys such as k8s.io/pod_name aren't possible. This is rather important as the goal is to detach gathering of resource information from interpreting it, which happens in the exporters. That ensures that exporters are truly swappable.

Config files are also a way to integrate of course. My main concern is that they make deployment more tedious (generate file, get file into the right place, permissions, ...) whereas envvars are easy to inject in most systems I'm aware of.
I think both could exist, but maybe it's worth waiting until a need for files arises?

Does that provide some background on motivation? Anything you'd like to see in the spec to clarify things?

@nfisher

nfisher commented Aug 27, 2018

@fabxc agree on flexibility especially if controlled by the user as to when and how it happens. Will place a review for any further comments.

To clarify for my own understanding where would you see these labels being attached during the initialisation process?

func initTracer() error {
    sdexporter, err := stackdriver.NewExporter(stackdriver.Options{ProjectID: "testing"})
    if err != nil {
        return err
    }
    trace.RegisterExporter(sdexporter)
    trace.ApplyConfig(trace.Config{DefaultSampler: trace.ProbabilitySampler(0.01)})
    return http.ListenAndServe(":50030", &ochttp.Handler{})
}

The resource library primarily defines a type that captures information about the entity
for which stats or traces are recorded. It further provides a framework for detection of
resource information from the environment and progressive population as signals propagate
from the core instrumentation library to a backend's exporter.

Is there a difference between core library and core instrumentation library? Core library seems to be most consistently used throughout.

Is it worth giving an explicit definition of what the core library is? I assume it is any libraries provided under the Github census-instrumentation org which provide a common abstraction to backend exporters?

Contributor Author

I'd agree with the definition you gave, especially as exporters are moved to census-ecosystem.
What's the official take @bogdandrutu?

Contributor

I think core library is better.

an agent attaches further labels about the underlying VM, the cluster, or geo-location.

### From environment variables
Population of resource information from environment varibales MUST be provided by the

s/varibales/variables


For example, process-identifying information may be populated through the library while
an agent attaches further labels about the underlying VM, the cluster, or geo-location.


Is it worth adding a From OC data structure section?

Contributor Author

Mh, what do you mean specifically? Like when a resource is just passed around via internal OC APIs?


Labels defined in files. The main reason I arrived here to burden you with my nonsense was a tweet by @rakyll about the difficulty of label discovery in k8s.

The downward API that you referenced in the deleted document seems the most reasonable/consistent approach within k8s to get those details but I'm not sure how well its constraints map to the previously outlined mechanisms (e.g. OC_RESOURCE_LABELS env variable and auto-detection).

Contributor Author

The issue with the downward API is that it can be used to set arbitrary environment variables. Currently, the auto-detection that OC's Stackdriver exporter does relies on a documentation example, but it's not standardized.
We could define a standard of course, which users have to adhere to. But in practice no one wants to configure every single deployment/pod with the downward API or something else. Using Kubernetes' admission control to inject this information automatically is much more feasible and powerful. And when using that, one can simply resort to the generic OC envvars proposed here without adding yet another one-off standard.

I have a working implementation for this and can hopefully release a design soon.

That said, I'd like to understand a use case where a file with information is easier than an environment variable. Even with admission control in Kubernetes as described above, implementation is a lot simpler with environment variables and a lot less likely to collide with other admission restrictions, e.g. "pods in this namespace must not mount volumes".

A resource object MUST NOT be mutated further once it is passed to a backend-specific exporter.
From the provided resource information, the exporter MAY transform, drop, or add information
to build the resource identifying data type specific to its backend.
If the passed resource does not contain sufficient information, an exporter MAY drop

What is the feedback loop on dropped/ignored data? (e.g. is it a silent error, printout to stderr, returned error, etc)

My preference would be towards failing fast with a validation error at the point of library re/initialisation. Alternatively but less preferred would be a returned error at the point of capture.

Contributor Author

Yes, as this would happen during initialization, fail-fast would be best. Will add that.
For the update case described above, emitting an error on update failures and continuing best effort with the old resource? The risk is data being a bit skewed. That seems better though than dropping all data. Especially since only small pieces of information are likely to change, which may not even be used by the exporter.

@fabxc
Contributor Author

fabxc commented Aug 29, 2018

@nfisher thanks for all your comments. This is really helpful and I added some clarifications.

@fabxc
Contributor Author

fabxc commented Aug 31, 2018

Added back utils/MonitoredResource.md since this will only correspond to the independent new resource package. It can be replaced with new docs in parallel when the corresponding resource auto-detection mechanisms are added.

@nfisher I take it we largely agree modulo potentially adding a file-based generic path? That could be added without conflicts at a later point.

@jbd @bogdandrutu could you agree on whether you'd prefer to go with resource labels or tags?

@rakyll
Contributor

rakyll commented Aug 31, 2018

I have no strong opinions on tags vs labels. But you should consider that there is a cost of introducing new terminology. We are working on this field full time and can automatically associate them. It is not true for an average user. @bogdandrutu, any more to add?

@nfisher

nfisher commented Aug 31, 2018

@fabxc hrm, a common auto-detected path; I'd need to think about that, but I could see it working with JSON files in the format you've outlined. Generally speaking, the more I think about files the more I think it might be best left as an exercise for the user that doesn't need to be encoded in the spec. It would be nice to have a reference implementation elsewhere though.

RE: downward API vs admission control: I think users are likely to arrive at a solution from a number of directions, as I think "best practises" are still emerging with k8s. Also, what you might do with a small number of services (e.g. 10's) is likely to be different to 100's or 1,000's, so I'm not sure there's a one-size-fits-all there. With the lack of a common standard/best practise I'm not sure auto-detection will work with k8s. In general my leaning is towards explicit over magic. I'm OK with a little explicit wire-up, especially if there's a mechanism to "verify" my work. Assuming it doesn't exist already, it would be good to have a func ResolvedLabels(w io.Writer) or ResolvedLabels() []Labels that would allow the user to inspect what OpenCensus views as its label set for the process, which could also be added to a pagez end-point. I would probably use it to print the resolved labels as part of my start-up process. Not sure if it fits this spec though.
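For illustration, the kind of debug helper described above might look roughly like this (the name and signature are hypothetical, not part of the proposed API):

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sort"
)

// resolvedLabels writes the resource type and labels the library has resolved,
// so a user can verify their wire-up at start-up or from a zPages-style
// endpoint. Sketch only.
func resolvedLabels(w io.Writer, resType string, labels map[string]string) {
	fmt.Fprintf(w, "resource type: %s\n", resType)
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable output makes comparing restarts easy
	for _, k := range keys {
		fmt.Fprintf(w, "  %s=%q\n", k, labels[k])
	}
}

func main() {
	resolvedLabels(os.Stdout, "k8s.io/container", map[string]string{
		"k8s.io/pod_name": "pod-xyz-123",
	})
}
```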

@nfisher

nfisher commented Aug 31, 2018

With respect to @rakyll's comment, I would opt to use whatever OpenTracing and the Dapper paper use.

@fabxc
Contributor Author

fabxc commented Sep 6, 2018

@nfisher Good point on providing the labels as debug output to a zpage. That should definitely be possible with the proposed APIs, i.e. the recommended detector functionality; will provide a function as you suggested.

Contributor

@bogdandrutu bogdandrutu left a comment

Some high-level comments. I think we have 2 things to solve here: uniqueness for timeseries, and usability (query capability) for metrics (show me latency per container). This document kind of tries to address both, but I struggle to understand how we solve the first part.


Type strings and label keys MAY start with a domain name and MAY be further namespaced through
single slashes. All other characters MUST be alphanumeric. Label values MAY contain any valid
unicode character.
Contributor

Because you are using strings you need to define an encoding for them, otherwise use bytes. We prefer to support a subset of characters; maybe consider printable ASCII.

Contributor Author

Right, that should've just said "utf8" probably. But I'm fine with a subset, which it will be anyway in practice.
The only thing I'd be worried about is that this might not line up with backends and thus make OC restrictive without much gain.

Prometheus supports utf8 strings for label values and it hasn't been an issue yet.


We (ab)use prometheus at work and have a much wider range in use than ASCII. 👍 to UTF-8.

application and passing it to a backend-specific exporter. This explicitly includes
the path through future OpenCensus components such as agents or services.

For example, process-identifying information may be populated through the library while
Contributor

I have trouble understanding what a resource identifies vs. what identifies a task/process. It seems that a resource can identify a VM instance or container, which means the process identity is lacking.

Do we have to ensure uniqueness of timeseries exported via the resource? What if in a GCE VM we have multiple Linux tasks, how does the gce_instance resource uniquely identify a timeseries?

Not sure I understand how you envision this to be used. We currently also added in our exporters a notion of Node (see https://github.com/census-instrumentation/opencensus-proto/blob/master/src/opencensus/proto/agent/common/v1/common.proto#L33); does this completely replace that? Not sure that the current proposal can completely replace that unless we try to add some mandatory fields that can get us some guarantees about uniqueness of generated timeseries.

Seems to me that this is a bit too fragile for users (the out-of-the-box experience will not work properly), unless they set the env variables or they run the agent (may work) or they link one of the libraries that we have for auto-detection of the resources.

Contributor Author

It seems that a resource can identify a VM instance or container, which means the process identity is lacking.

The resource can be as granular as the OC user likes. I don't believe this is something we should or even can solve generically.
If the user wants to export with resource labels granular enough down to a process, they can. But this will not scale for many backends. Imagine something spawns a new process very often but never runs it concurrently. If each spawning of the process created a new set of series, virtually all backends would break under the load.
I believe at some point the user has to check at which level they can ensure no collisions take place and go with that – e.g. if they know their containers only run a single process exposing signals.

Note that this proposal does not amend anything about the <language>-<pid>@<hostname> label attached to all metrics right now, which theoretically guarantees what you are concerned about.
Though I think it raises the problems mentioned above and that should be revisited eventually.

What if in a GCE VM we have multiple Linux tasks, how does the gce_instance resource uniquely identify a timeseries?

Then a pure VM-based resource is the wrong choice (it almost always is). More granular labels must be provided.

Not sure that the current proposal can completely replace that unless we try to add some mandatory fields that can get us some guarantees about uniqueness of generated timeseries.

Sounds like valuable information that exporters can make use of, in addition to the resource info, in building the final samples they send.
Is this intended to replace the fixed <language>-<pid>@<hostname> metric label?

* `type`: a string which describes a well-known type of entity. It SHOULD be namespaced
to avoid collisions across different environments, e.g. `k8s.io/container`,
`cloud.google.com/gce/instance`.
* `labels`: a dictionary of labels with string keys and values that provide information
Contributor

Same comment on the strings here. Strings in Go, for example, are encoded as UTF-8, but in Java the default is UTF-16; what do we do here? I think it is always better to start with printable characters or a set smaller than "everything". Also, if we support any character we need to do sanitization for almost all the backends, and the biggest problem is that after sanitization the keys may no longer be unique, for example, and hence you break the protocol.

Contributor Author

Mh, does the in-memory encoding really matter? Sounds like this is up to the exporter to convert/normalize. Even if we restricted to alphanum, dashes and underscores, some backend may not like dashes and the exporter has to mutate before sending.

Contributor

I think the in-memory representation matters for one reason: we are giving people an option to set these strings via environment variables, so we have to define what a valid input is for the key/value strings.

Member

@SergeyKanzhelev SergeyKanzhelev left a comment

I need more details on the intended use of the API. If it's just a local endpoint scenario, I'd suggest naming it as such. Resource is a very rich term, let's not waste it.

@@ -0,0 +1,154 @@
# Resource API Overview
Member

Does this API also address this issue: #135? Does it cover only a local endpoint or also a remote endpoint?

Contributor Author

The resource does not aim to describe individual service endpoints. In particular for remote endpoints, that's outright impossible to do in the same way, I'd say.

Member

But it does try to describe a local endpoint, right? Should it be called local endpoint then?

Contributor Author

You mean renaming resource to local endpoint? That possibly makes sense from a tracing perspective but not so much for other signals I believe.

Member

What other signals wouldn't it make sense for?

Contributor Author

A process doing file processing can still expose metrics about its internal state without doing any request/response flows at all. Same for logs/events, should they become part of OC at some point.


## Resource type
A `Resource` describes the entity for which a signal was collected through two fields:
* `type`: a string which describes a well-known type of entity. It SHOULD be namespaced
Member

It would be useful to have some semantic, predefined fields on top of labels, like name and instanceId. Those fields can be extracted from labels to provide some out-of-the-box experience.

Contributor Author

I'd only consider doing this in the form of standardized tag keys potentially, which can be done later on when we have a better grasp of how people are using resources in different environments.
Dedicated fields will inevitably make user-side configuration more complex and the overall system harder to understand.

Member

I'd imagine systems will want to draw an application map with the names taken from the resource identity. In this case, having name or role will be helpful as the UI shouldn't have to worry about knowing all the different "providers" of such information.

An alternative may be to define a well-known postfix for those properties so the exporter can do a generic translation of labels to strongly-defined fields.

Contributor Author

@fabxc fabxc Sep 24, 2018

I see what you are getting at. I agree that in the system where the data ends up, one wants to have more structure to enable certain workflows.

This proposal does not extend as far as telling backend systems to have a resource representation in the form of a type string and tag map. Instead, the specific exporters should convert the rather structure-less resource notion proposed here into something more structured.
For Stackdriver for example, the exporter is configured with a mapping that converts known resource type and tag keys to Stackdriver Monitored Resources (e.g. k8s_container) which have fixed fields.
Exporters will by default do a mapping of tags set by well-known resource detectors.

So ultimately, what's proposed here is flexible enough to accommodate the fact that backends may handle this very differently. For example, some backends may rather convert the OC resource into a single string like container.k8s.io/namespace/ns1/pod/nginx-1234/container/nginx.
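A sketch of the kind of per-backend mapping described above (not the actual Stackdriver exporter code; the k8s label key names used here are assumptions):

```go
// toMonitoredResource illustrates how an exporter might translate well-known
// resource types and label keys into a backend's fixed-schema monitored
// resource. Sketch only.
func toMonitoredResource(resType string, labels map[string]string) (string, map[string]string) {
	switch resType {
	case "k8s.io/container":
		return "k8s_container", map[string]string{
			"namespace_name": labels["k8s.io/namespace_name"],
			"pod_name":       labels["k8s.io/pod_name"],
			"container_name": labels["k8s.io/container_name"],
		}
	default:
		// Unknown resources fall back to a generic type; their labels could
		// instead be attached as metric tags, as discussed elsewhere in this PR.
		return "global", map[string]string{}
	}
}
```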

Member

Makes sense. What I'm trying to accomplish is some hint from the resource auto-discovery logic to the exporter on what is what. So perhaps some postfix in names that helps determine whether a tag is the unique ID of an instance or a logical group of instances (role). Or have those two on the same level as type. Otherwise every exporter will need to know about every resource auto-detector.

Contributor Author

I'd expect the number of active detectors in a given environment to be very low so that configuration isn't a particularly big problem. Plus, if most common environments are supported by default, only a small fraction of users has to do configuration at all.

Coming up with a canonical set of schemas (in the fashion of schema.org) will eat lots of time and is hard to get right. If it only relieves a few lines of configuration for a minority of users, I'm not sure that's worth it.

Again Prometheus as a data point – the provided data has even weaker structure and no default behavior at all, i.e. all users have to write rather awkward configuration. Regardless, they largely consider this to be a feature.

@@ -0,0 +1,154 @@
# Resource API Overview
Member

When reading "resource" there is also a scenario when you want to record operation on certain resource. Like query all traces that created, updated, read properties of, etc of a certain resource. Where resource may be some business entity. Is this an intendent use of an API?

Contributor Author

Resource is really about identifying the source of signal data and with that would generally be static. I didn't really see individual signals mutating information about their own source.
Could you give an example?

Member

When Azure creates, modifies, etc. VMs it marks all traces performing operations on them with the VM ID. We call it resource ID =)

I can imagine the same thing being applicable everywhere the business defines some resources that have a lifetime.

Contributor Author

So the traces get tagged with the VM they are coming from?
That would essentially be exactly what this proposal aims at – just that machine identifiers are often too coarse, and in the case of metrics, basically always.

Member

No, with the VMs they operate on. VM here is a customer VM created via a call to Azure. Sorry for the confusion.

Contributor Author

Oh I see, the Azure API itself here being what's instrumented.

I think in that case the VM ID would just be part of the signal tag itself, i.e. a tag of the trace rather than a tag of the resource of the trace.

Member

This comment comes from the very short intro of this document. Maybe make it more explicit what a resource is.

Contributor Author

👍

## Populating resources
Resource information MAY be populated at any point between startup of the instrumented
application and passing it to a backend-specific exporter. This explicitly includes
the path through future OpenCensus components such as agents or services.
Member

Would a property like "health" be a valid field for it, if one wants to mark the node as unhealthy and then healthy again? Or is it a set-once, never-change kind of property?

Contributor Author

In general, volatile metadata isn't helpful in identifying a resource – so I'd say no. At the same time, such data could be useful for filtering noisy/spammy signals before ingesting them.

So I think it should be allowed but exporters wouldn't actually make this data part of signal data itself.
Basically, for the near future, only static properties are of relevance IMO.

Member

So a resource may have a property for whether it's the "master", or for the configuration "version" that is currently being used.

Since this PR defines resources, I'd suggest defining resource lifetime and health as well.

Mostly my comment comes from the term "resource". It is so wide and general that I naturally see the desire to use it to design health and lifetime management of that resource.

Contributor Author

I think whether a resource has health or lifetime info to begin with really depends on the resource. If it has this information, a library implementing a resource detector can easily provide this info through dedicated labels.

Though, as hinted at in other comments, I think things should mostly be constrained to static resource information. Backends can pull in volatile metadata at query time from other sources.

Member

Health and lifetime tracking requires a more responsive mechanism to detect changes than polling. If this is a scenario, it should be taken into consideration.

core library. It provides the user with a ubiquitous way to manually provide information
that may not be detectable automatically through available integration libraries.

Two environment variables are used:
Member

So the assumption is that there is a single resource type per process? I'm still trying to wrap my head around what a resource is, thus the question. Will there be an option to have the admin part of a site report its resource separately from the user part?

Contributor Author

No, that information should generally remain part of the signal tags/labels. The resource is about identifying the signal source itself. In 90% of cases, that just means being able to identify the process through a set of tags that are meaningful within the context of the environment later, e.g. the k8s_container resource in Stackdriver.
Currently, this is only possible through the <language>-<pid>@<hostname> label on the signal itself, which rarely makes sense to an end user later.

I get that the mental line can get kind of blurry though.
For example, if the actual source of signals isn't able to export data itself (e.g. some embedded device) but instead a third process exports signals about N sources by proxy. Then being able to set different resources explicitly is valuable.
That would generally be an exception case that can be added by further API extensions. I'd defer that though so the design of such an extension can be guided by real-world use cases.

Member

So does the process define the boundary, or the DI container? It needs to be clearly outlined in the spec.

Contributor Author

My opinion differs on this a bit. I think OC should provide clear docs around best practices – we can probably also add some SHOULD/SHOULD NOT sentences to the spec.

But a clear line is not possible with this in my experience. There'll always be things we'd clearly see as "MUST NOT do this" that people might have a valid one-off use case for. If we don't enable those, even if they don't yield ideal results, OC becomes significantly less valuable for those users.
If 1 out of 20 teams in an org cannot use the instrumentation system of choice, the negative impact is far greater than 5%.


Two environment variables are used:
* `OC_RESOURCE_TYPE`: defines the resource type. Leading and trailing whitespaces are trimmed.
* `OC_RESOURCE_LABELS`: defines resource labels as a comma-separated list of key/value pairs.
Member

BNF definition would be useful here
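As an illustration only (not a normative grammar), the format quoted above could be parsed roughly like this in Go; the sketch assumes commas do not appear inside values, a case the quoted text does not settle:

```go
package main

import (
	"fmt"
	"strings"
)

// parseResourceLabels is a rough sketch of parsing the OC_RESOURCE_LABELS
// format: comma-separated key=value pairs, with values optionally wrapped in
// double quotes and `"` escaped as `\"`.
func parseResourceLabels(s string) (map[string]string, error) {
	labels := map[string]string{}
	for _, pair := range strings.Split(s, ",") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("invalid label pair %q", pair)
		}
		key := strings.TrimSpace(kv[0])
		val := strings.TrimSpace(kv[1])
		if len(val) >= 2 && strings.HasPrefix(val, `"`) && strings.HasSuffix(val, `"`) {
			val = strings.ReplaceAll(val[1:len(val)-1], `\"`, `"`)
		}
		labels[key] = val
	}
	return labels, nil
}

func main() {
	fmt.Println(parseResourceLabels(`k8s.io/namespace="default",k8s.io/pod_name="pod-xyz-123"`))
}
```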


### Merging
As different mechanisms are run to gain information about a resource, their information
has to be merged into a single resulting resource.
Member

Automatically or on demand? It feels like there is a singleton for a resource (to my previous question). Would this singleton be per tracer? Or should it be global?

Contributor Author

That would happen explicitly during initialization/setup.

See this PR for the core library extension I proposed (Go).

With that, the only change in general would be an option when setting up the exporter, e.g.:

exporter, err := stackdriver.NewExporter(stackdriver.Options{
    ResourceDetector: resource.ChainedDetector(
        resource.FromEnv,
        aws.Detect,
        azure.Detect,
        gcp.Detect,
    ),
})

Or, with a helper package that chains the common auto-detectors together:

exporter, err := stackdriver.NewExporter(stackdriver.Options{
    ResourceDetector: auto.Detect,
})
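For illustration only (not the actual census-ecosystem implementation), a chained detector honoring the merge semantics discussed in this thread (already set type and label values are not overwritten) could look roughly like this:

```go
package resource

import "context"

// Types from this spec: the Resource struct and the Detector signature.
type Resource struct {
	Type   string
	Labels map[string]string
}

type Detector func(context.Context) (*Resource, error)

// ChainedDetector runs the given detectors in order and merges their results.
// Values detected earlier win, i.e. already set type/label values are kept.
func ChainedDetector(detectors ...Detector) Detector {
	return func(ctx context.Context) (*Resource, error) {
		res := &Resource{Labels: map[string]string{}}
		for _, detect := range detectors {
			r, err := detect(ctx)
			if err != nil {
				return nil, err
			}
			if r == nil {
				continue
			}
			if res.Type == "" {
				res.Type = r.Type
			}
			for k, v := range r.Labels {
				if _, ok := res.Labels[k]; !ok {
					res.Labels[k] = v
				}
			}
		}
		return res, nil
	}
}
```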

### Updates
Over the runtime of an application, resource information may change in some cases. To
allow exporters to incorporate those changes, they SHOULD take a resource getter as a
configuration parameter. The exporter SHOULD periodically rerun the getter and incorporate
Member

Should there be a notification mechanism for subscribers on updates? It would be useful to have if you have an exporter that needs to react to these changes. Otherwise there will be a mess of threads checking for updates from each other.

Contributor Author

Right now I think a periodic rerun of the detector and comparing with the previous result to detect changes will be sufficient. But there may very well be other use cases I didn't think of.

Note that the exporter-specific part of this spec is largely SHOULD, so there's enough flexibility to deviate/extend where necessary.
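Building on the Detector and Resource types sketched earlier, the periodic rerun could look roughly like this (names and behavior are illustrative, not part of the proposed spec):

```go
package resource

import (
	"context"
	"reflect"
	"time"
)

// watchResource periodically reruns a detector and calls onChange when the
// result differs from the previous one. Sketch only.
func watchResource(ctx context.Context, detect Detector, interval time.Duration, onChange func(*Resource)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var prev *Resource
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			res, err := detect(ctx)
			if err != nil {
				continue // a real implementation would surface the error
			}
			if !reflect.DeepEqual(res, prev) {
				prev = res
				onChange(res)
			}
		}
	}
}
```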

Member

Health and lifetime management are scenarios that might require a more immediate update of properties.

Contributor

@SergeyKanzhelev I would like to understand the "health and lifetime management" scenarios. Can you give me more details about what you have in mind?

My personal opinion is to keep simplicity and actually not support runtime config for these in the library at all, unless we have a clear use-case which I miss here, but I can be convinced if you have one in mind.

Member

The key to health and lifetime management is that different modules may contribute to a common state and make decisions based on it. Like, if CPU is too high, make sampling more aggressive and report a yellow state of the app on this node. Or if the exporter failed to upload telemetry after many retries, switch the app to a red state and potentially take the node out of rotation. Or when the app was swapped into production from staging, start collecting more metrics.

I agree with you that the resource concept as proposed can be implemented if there is a way to set global attributes as proposed in #167. Maybe make it more elaborate and implement it via callbacks to resource auto-discovery modules (without calling them that).

So static and almost-static information is not that hard to implement without introducing new concepts (except global attributes or attribute callbacks).

```

### Updates
Over the runtime of an application, resource information may change in some cases. To
Member

Some words on how often this should happen are required. Should health change or config update notifications be using this API?

Contributor Author

Per above, I'd see whether users report use cases.

Prometheus has a similar resource info model (called target info there) – different integrations update via watches/notifications or just via periodic refreshes. I cannot recall a single use case where users complained about periodic refreshes being too laggy.
It may also be interesting to see what (meta)data we added to different integrations over time based on user demand.

Most non-identifying information was added a) for filtering or b) because people wanted to attach it to the signal directly.
While b) should be discouraged, I think OpenCensus must allow for odd one-off use cases since it becomes most valuable when users can use it for everything.

vendors, MUST be implemented outside of the core libraries in third party or
[census-ecosystem][census-ecosystem] repositories.

### Merging
Member

How will the type property be merged?

Contributor Author

Already set labels or type fields MUST NOT be overwritten.

@fabxc
Contributor Author

fabxc commented Sep 24, 2018

Thanks for the detailed feedback @SergeyKanzhelev. I tried to answer all your questions and concerns with some context.

@SergeyKanzhelev
Member

@fabxc thank you! I understand it much better now. Since it's all in-proc, I believe a more responsive update notification mechanism may be beneficial for health and lifetime tracking of the resource. Also, some strongly typed fields might be helpful to simplify writing exporters. Do you think this is something this spec can accommodate?

@bogdandrutu
Contributor

I personally struggle a bit on this so I would like to clarify the overall picture as I understand it. I think we need two different concepts (we may have overlap between them, that's probably the cause of confusion for me at least):

  • A notion of a Node (windows/linux process) that tries to uniquely identify the process instrumented with OpenCensus that produces signals. A Node is not required to be human readable (it can contain ip address, pid, etc.).
  • A notion of a Resource that tries to describe the "owner" of the signal. To clarify this we can say that the producer is always a process (windows/linux) but the owner can be a container/vm/service etc (e.g. vm_cpu_usage metric, this metric is produced by a process but it belongs to the vm resource). A Resource is human readable.

I think we can conclude that one process can produce signals that belong to different resources, so the resource is a property of the signal not a property of the producer task.

There are different use-cases for Resource and Node, here are some examples:

  1. Signal aggregation by Resource. We believe that it is very useful to aggregate signals by the Resource (e.g. show me all traces for this k8s_container, or show me vm_cpu_usage breakdown by vm_instance, or show me all logs for this k8s_container). Conclusion: this information is useful for the back-end to know about every exported signal.
  2. Use Node to uniquely identify signals that belong to the same Resource. An example can be a k8s_container that runs multiple processes and they export the same signal (e.g. http_server_latency metric). Conclusion: this information is required when the Resource does not uniquely identify a signal.
  3. Use Node for configuration push. If we develop a central service that allows users to configure all running processes instrumented with OpenCensus then Node can be used to identify all these tasks. I think here is where my confusion is: if the Node has only host, pid, etc. it is not easy to be consumed by the users of this service; it would be nice to allow them to say "set this sampling probability for this service". Conclusion: We may be tempted to say that Node also belongs to a Resource.

Questions:

  • Should we have these as separate concepts? Based on the use-cases that I thought of they are always used together.
  • If they are separate concepts (which kind of I like because they describe two different things about the signal), should the Node include a Resource to facilitate use-case 3?

I really hope that I did not confuse people more, and I would like to hear others' opinions about this.

@SergeyKanzhelev
Member

@bogdandrutu it makes sense. If you think of resources as a way to design this type of hierarchy, there are also layers like role (a set of endpoints, like "health", "admin") and the endpoint itself.

It also calls for associating an individual span with the resource, not for the exporter to pull data from resources. Is that what you are thinking about?

Fabian Reinartz added 4 commits October 9, 2018 13:44
Signed-off-by: Fabian Reinartz <[email protected]>
Signed-off-by: Fabian Reinartz <[email protected]>
Signed-off-by: Fabian Reinartz <[email protected]>
@fabxc
Contributor Author

fabxc commented Oct 9, 2018

Added some clarifications around the allowed character set (it's more restrictive now and matches tags).

@SergeyKanzhelev I changed the section on updates following a discussion with @mayurkale22 and @bogdandrutu to discourage usage of any mutable attributes in resource labels. On the one hand this is definitely the safe route and we can always lift restrictions based on future feedback. On the other hand, it seems like mutable runtime information like health serves quite a different purpose from the resource feature discussed here.
Thus it's probably a good idea to think about it as an orthogonal feature in OC as well. Would you like to create a separate feature request issue so we can discuss use cases and potential implementation there?

This documentation serves to document the "look and feel" of the open source resource package.
It describes they key types and overall behavior.

The primary purpose of resources as a first-class concept in the core library is decoupling
Contributor

Please resolve this. I think it is important to specify what a "resource" is, maybe include examples.

`Resource` MUST have getters for retrieving all the information used in `Resource` definition.

Example in Go:
```go
Contributor

Maybe add a TODO to add a link to the proto definition.

type Detector func(context.Context) (*Resource, error)

// Returns a detector that always returns a specific resource.
func NewDetectorFromResource(*Resource) Detector
Contributor

I think this is more like a helper method. Not necessary to have it in the specs.

func NewDetectorFromResource(*Resource) Detector

// Returns a detector that runs all input detectors sequentially and merges their results.
func ChainedDetector(...Detector) Detector
Contributor

Chained does not sound very good to me. Maybe ResolveResourceFromDetector or maybe you have better suggestions.

Contributor Author

Mh, it doesn't resolve a resource though but rather just turns a list of Detectors into a single Detector again that has to be called first.

Best alternative I can think of is CombinedDetector.

```

### Updates
Over the runtime of an application, attributes of a resource may change in some cases.
Contributor

Here this sentence implies we support dynamic changes of the resource fields. Or maybe I am just wrong in understanding this.

Contributor

@bogdandrutu bogdandrutu left a comment

@SergeyKanzhelev what is your opinion on this?

LGTM for me with small comments sent earlier.

Signed-off-by: Fabian Reinartz <[email protected]>
@bogdandrutu
Contributor

@SergeyKanzhelev I would like you to take a final look at this and also here where we define some examples:
https://github.com/census-ecosystem/opencensus-go-resource/pull/1/files#diff-53eec5f6ec725746ddb5c8b1c394865d

@fabxc
Contributor Author

fabxc commented Oct 23, 2018

@SergeyKanzhelev did you have a chance to take a look at this?

@SergeyKanzhelev
Member

@fabxc thank you for addressing feedback. Looks good to me. Sorry for delay

@fabxc fabxc merged commit ed99dcb into census-instrumentation:master Oct 30, 2018
@fabxc fabxc deleted the resource branch October 30, 2018 10:45