-
Notifications
You must be signed in to change notification settings - Fork 187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add process.entity_id
to attribute registry
#983
Changes from 5 commits
0e94fd5
15c57e2
398e142
9ee5887
f92e96c
469653d
2ff65cc
07a739b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
change_type: enhancement | ||
component: process | ||
note: Add `process.entity_id` attribute | ||
issues: [983] |
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -12,6 +12,7 @@ | |||||
| `process.command_args` | string[] | All the command arguments (including the command/executable itself) as received by the process. On Linux-based systems (and some other Unixoid systems supporting procfs), can be set according to the list of null-delimited strings extracted from `proc/[pid]/cmdline`. For libc-based executables, this would be the full argv vector passed to `main`. | `[cmd/otecol, --config=config.yaml]` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||||||
| `process.command_line` | string | The full command used to launch the process as a single string representing the full command. On Windows, can be set to the result of `GetCommandLineW`. Do not set this if you have to assemble it just for monitoring; use `process.command_args` instead. | `C:\cmd\otecol --config="my directory\config.yaml"` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||||||
| `process.creation.time` | string | The date and time the process was created, in ISO 8601 format. | `2023-11-21T09:25:34.853Z` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||||||
| `process.entity_id` | string | A unique identifier of the process. [1] | `9f26bf6c-2508-43e7-a048-ed1822d9ff52`; `903b0daafe95e8ed9c9c0c9070259c29` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||||||
mjwolf marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
| `process.executable.name` | string | The name of the process executable. On Linux based systems, can be set to the `Name` in `proc/[pid]/status`. On Windows, can be set to the base name of `GetProcessImageFileNameW`. | `otelcol` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||||||
| `process.executable.path` | string | The full path to the process executable. On Linux based systems, can be set to the target of `proc/[pid]/exe`. On Windows, can be set to the result of `GetProcessImageFileNameW`. | `/usr/bin/cmd/otelcol` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||||||
| `process.exit.code` | int | The exit code of the process. | `127` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||||||
|
@@ -31,7 +32,26 @@ | |||||
| `process.session_leader.pid` | int | The PID of the process's session leader. This is also the session ID (SID) of the process. | `14` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||||||
| `process.user.id` | int | The effective user ID (EUID) of the process. | `1001` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||||||
| `process.user.name` | string | The username of the effective user of the process. | `root` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||||||
| `process.vpid` | int | Virtual process identifier. [1] | `12` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||||||
| `process.vpid` | int | Virtual process identifier. [2] | `12` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||||||
|
||||||
**[1]:** The process ID within a PID namespace. This is not necessarily unique across all processes on the host but it is unique within the process namespace that the process exists within. | ||||||
**[1]:** A globally unique identifier of the process. This can be any format | ||||||
of unique indentifier. | ||||||
|
||||||
On Linux, where possible, it is RECOMMENDED that this is calculated | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Other other limitations except Linux processes? Can we do either "On Linux it is RECOMMENDED" or "Where possible it is RECOMMENDED"? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since all OSs should have PID and process start time, the only thing that's needed for this to work is a unique boot id which can be read by all collectors. AFAIK, only Linux has a boot UUID, but there could be other ways to generate one. Maybe some sort of IPC, where the first running collector on the system generates a boot UUID, and then the others read it. But that will start to get more complicated, and I think less likely that the recommended method is adopted. I also don't know Windows or macos very well, if anyone knows of a way to get a boot UUID on those systems, I could add it to the note, and this should work for those too. For now, I've changed this to "Where possible it is RECOMMENDED..." |
||||||
MD5 hash of the string `<pid>_<starttime>_<bootid>`. | ||||||
pyohannes marked this conversation as resolved.
Show resolved
Hide resolved
mjwolf marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this the first time we define an attribute value as a hash? I couldn't find any other instance at this point but if we do have a similar definition elsewhere, we should consistently stick to one algorithm so that instrumentation libraries can use the same hashing library and don't have to resort to multiple different ones. If it's the first time we define one, we should be fine with sticking to MD5 also for all other such hash-derived values. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. service.instance.id is kind of a hash. It's not MD5 though. Personally I'd prefer NOT to specify which hash algorithm is used, instead requirements (e.g. <N% collision over X cardinality). OR - we can agree that recommended algorithms can be updated as non-breaking changes - otherwise I think it risks change in the future. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's important to specify what hash algorithm to use, otherwise different collectors could use different algorithms, and the resulting values wouldn't match for the same process. The whole intention of this recommended method is to have the identical values generated for a process. The algorithm used doesn't matter as much. Maybe it should be something other than MD5, if we use only one, and it can be used for other things that could be more affected by collisions. Is there a place we can specify that entity_id might not match between versions? If so, then I think we can say that changing the algorithm between versions is a non-breaking change. Since this recommended algorithm can't be used in all situations (or some implementors could choose not to use it), it won't be possible for anything to rely on the entity_id being identical between collectors. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Makes sense. What would you think of taking the tuple string directly instead of using a hash? Since attribute value size can be of concern, we could consider using the last fragment of boot_id only, which would yield a string shorter than the MD5 hash would be. Not sure if the format is well-defined and portable, though.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There isn't a syntax in YAML to define it like that but we could consider putting that in a note. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm in favor of adding a note to the field describing hashing algorithm There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Some Elastic products are actually currently using a tuple, but we'd like to move away from it. Since systems can generate a very large number of process events, there's value in trying to keep event logs as small as possible, and we haven't ever really had a use for the data within the entity_id tuple, so we think it's better to use a hash. I think using a tuple with only partial uuid segments would increase the chances of collision, although I'm not sure by how much. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With tuple I wasn't actually referring to a semantic combination of multiple attributes but actually using a single attribute with a single string value that's composed of the individual components (by concatenating them with some character, just like the input used for the hash function, i.e.. |
||||||
|
||||||
e.g. | ||||||
mjwolf marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
| Key | Description | Example Value | | ||||||
| --- | --- | --- | | ||||||
| `pid` | PID | "1201" | | ||||||
| `starttime` | start time in seconds since boot | "3029" | | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've change this to |
||||||
| `bootid` | boot ID, as read from `/proc/sys/kernel/random/boot_id` | "16b6f4a4-179e-4480-9561-0df471d8c6d4" | | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
||||||
md5("1201_3029_16b6f4a4-179e-4480-9561-0df471d8c6d4") -> "230d8ee6147ded8e31dcef082bdbc383" | ||||||
|
||||||
By following this method, the same entity_id will be generated by all | ||||||
observers of a given process. | ||||||
|
||||||
**[2]:** The process ID within a PID namespace. This is not necessarily unique across all processes on the host but it is unique within the process namespace that the process exists within. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
<!-- endsemconv --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we give it a more descriptive name? E.g.
process.global.id
? Entity id seems to be very broad and does not seem to match the descriptionThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could rename this, but do you think we should make a namespace for one field? Or can you think of other things we might want to put in
global
in the future?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not global, it should be unique for given process but I don't think we can claim it is global
Also this field is used in 23 elastic integrations as well as internally. I wouldn't change it without sufficient solid grounds