Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add process.entity_id to attribute registry #983

Closed
wants to merge 8 commits into from
Closed
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions .chloggen/983.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
change_type: enhancement
component: process
note: Add `process.entity_id` attribute
issues: [983]
24 changes: 22 additions & 2 deletions docs/attributes-registry/process.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
| `process.command_args` | string[] | All the command arguments (including the command/executable itself) as received by the process. On Linux-based systems (and some other Unixoid systems supporting procfs), can be set according to the list of null-delimited strings extracted from `proc/[pid]/cmdline`. For libc-based executables, this would be the full argv vector passed to `main`. | `[cmd/otecol, --config=config.yaml]` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `process.command_line` | string | The full command used to launch the process as a single string representing the full command. On Windows, can be set to the result of `GetCommandLineW`. Do not set this if you have to assemble it just for monitoring; use `process.command_args` instead. | `C:\cmd\otecol --config="my directory\config.yaml"` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `process.creation.time` | string | The date and time the process was created, in ISO 8601 format. | `2023-11-21T09:25:34.853Z` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `process.entity_id` | string | A unique identifier of the process. [1] | `9f26bf6c-2508-43e7-a048-ed1822d9ff52`; `903b0daafe95e8ed9c9c0c9070259c29` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we give it a more descriptive name? E.g. process.global.id ? Entity id seems to be very broad and does not seem to match the description

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could rename this, but do you think we should make a namespace for one field? Or can you think of other things we might want to put in global in the future?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not global, it should be unique for given process but I don't think we can claim it is global

Also this field is used in 23 elastic integrations as well as internally. I wouldn't change it without sufficient solid grounds

mjwolf marked this conversation as resolved.
Show resolved Hide resolved
| `process.executable.name` | string | The name of the process executable. On Linux based systems, can be set to the `Name` in `proc/[pid]/status`. On Windows, can be set to the base name of `GetProcessImageFileNameW`. | `otelcol` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `process.executable.path` | string | The full path to the process executable. On Linux based systems, can be set to the target of `proc/[pid]/exe`. On Windows, can be set to the result of `GetProcessImageFileNameW`. | `/usr/bin/cmd/otelcol` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `process.exit.code` | int | The exit code of the process. | `127` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
Expand All @@ -31,7 +32,26 @@
| `process.session_leader.pid` | int | The PID of the process's session leader. This is also the session ID (SID) of the process. | `14` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `process.user.id` | int | The effective user ID (EUID) of the process. | `1001` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `process.user.name` | string | The username of the effective user of the process. | `root` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `process.vpid` | int | Virtual process identifier. [1] | `12` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `process.vpid` | int | Virtual process identifier. [2] | `12` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

**[1]:** The process ID within a PID namespace. This is not necessarily unique across all processes on the host but it is unique within the process namespace that the process exists within.
**[1]:** A globally unique identifier of the process. This can be any format
of unique indentifier.

On Linux, where possible, it is RECOMMENDED that this is calculated
Copy link
Contributor

@lmolkova lmolkova May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other other limitations except Linux processes?

Can we do either "On Linux it is RECOMMENDED" or "Where possible it is RECOMMENDED"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all OSs should have PID and process start time, the only thing that's needed for this to work is a unique boot id which can be read by all collectors. AFAIK, only Linux has a boot UUID, but there could be other ways to generate one. Maybe some sort of IPC, where the first running collector on the system generates a boot UUID, and then the others read it. But that will start to get more complicated, and I think less likely that the recommended method is adopted.

I also don't know Windows or macos very well, if anyone knows of a way to get a boot UUID on those systems, I could add it to the note, and this should work for those too.

For now, I've changed this to "Where possible it is RECOMMENDED..."

MD5 hash of the string `<pid>_<starttime>_<bootid>`.
pyohannes marked this conversation as resolved.
Show resolved Hide resolved
mjwolf marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the first time we define an attribute value as a hash? I couldn't find any other instance at this point but if we do have a similar definition elsewhere, we should consistently stick to one algorithm so that instrumentation libraries can use the same hashing library and don't have to resort to multiple different ones. If it's the first time we define one, we should be fine with sticking to MD5 also for all other such hash-derived values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

service.instance.id is kind of a hash. It's not MD5 though.

Personally I'd prefer NOT to specify which hash algorithm is used, instead requirements (e.g. <N% collision over X cardinality). OR - we can agree that recommended algorithms can be updated as non-breaking changes - otherwise I think it risks change in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important to specify what hash algorithm to use, otherwise different collectors could use different algorithms, and the resulting values wouldn't match for the same process. The whole intention of this recommended method is to have the identical values generated for a process. The algorithm used doesn't matter as much. Maybe it should be something other than MD5, if we use only one, and it can be used for other things that could be more affected by collisions.

Is there a place we can specify that entity_id might not match between versions? If so, then I think we can say that changing the algorithm between versions is a non-breaking change. Since this recommended algorithm can't be used in all situations (or some implementors could choose not to use it), it won't be possible for anything to rely on the entity_id being identical between collectors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important to specify what hash algorithm to use, otherwise different collectors could use different algorithms, and the resulting values wouldn't match for the same process. The whole intention of this recommended method is to have the identical values generated for a process. The algorithm used doesn't matter as much. Maybe it should be something other than MD5, if we use only one, and it can be used for other things that could be more affected by collisions.

Makes sense.

What would you think of taking the tuple string directly instead of using a hash?

Since attribute value size can be of concern, we could consider using the last fragment of boot_id only, which would yield a string shorter than the MD5 hash would be. Not sure if the format is well-defined and portable, though.
man for random(4) only "defines" it by means of an example:

   uuid and boot_id
         These read-only files contain random strings like
         6fd5a44b-35f4-4ad4-a9b9-6b9be13e1fe9.  The former is
         generated afresh for each read, the latter was generated
         once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a place we can specify that entity_id might not match between versions? If so, then I think we can say that changing the algorithm between versions is a non-breaking change.

There isn't a syntax in YAML to define it like that but we could consider putting that in a note.
Your collectors might not run the same version but I'd hope we'd only change the hash algorithm (if we use one) once or twice in years, so it shouldn't be a large concern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of adding a note to the field describing hashing algorithm
I wouldn't use tuple though. Strings are easier to compare and hash is also a very common thing to produce such ids.
Internally in one of security product we use for example hash of instrumentation uid and process uid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think of taking the tuple string directly instead of using a hash?

Some Elastic products are actually currently using a tuple, but we'd like to move away from it. Since systems can generate a very large number of process events, there's value in trying to keep event logs as small as possible, and we haven't ever really had a use for the data within the entity_id tuple, so we think it's better to use a hash.

I think using a tuple with only partial uuid segments would increase the chances of collision, although I'm not sure by how much.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With tuple I wasn't actually referring to a semantic combination of multiple attributes but actually using a single attribute with a single string value that's composed of the individual components (by concatenating them with some character, just like the input used for the hash function, i.e.. "<pid>_<starttime>_<bootid>").


e.g.
mjwolf marked this conversation as resolved.
Show resolved Hide resolved

| Key | Description | Example Value |
| --- | --- | --- |
| `pid` | PID | "1201" |
| `starttime` | start time in seconds since boot | "3029" |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `starttime` | start time in seconds since boot | "3029" |
| `starttime` | The number of seconds since the last boot | "3029" |

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've change this to The process start time in seconds since system boot, just to make it clear that's what time it's referring to

| `bootid` | boot ID, as read from `/proc/sys/kernel/random/boot_id` | "16b6f4a4-179e-4480-9561-0df471d8c6d4" |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `bootid` | boot ID, as read from `/proc/sys/kernel/random/boot_id` | "16b6f4a4-179e-4480-9561-0df471d8c6d4" |
| `bootid` | The boot ID, as read from `/proc/sys/kernel/random/boot_id` | "16b6f4a4-179e-4480-9561-0df471d8c6d4" |


md5("1201_3029_16b6f4a4-179e-4480-9561-0df471d8c6d4") -> "230d8ee6147ded8e31dcef082bdbc383"

By following this method, the same entity_id will be generated by all
observers of a given process.

**[2]:** The process ID within a PID namespace. This is not necessarily unique across all processes on the host but it is unique within the process namespace that the process exists within.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**[2]:** The process ID within a PID namespace. This is not necessarily unique across all processes on the host but it is unique within the process namespace that the process exists within.
**[2]:** The process ID within a PID namespace. This is not necessarily unique across all processes on the host but it is unique within this process namespace.

<!-- endsemconv -->
25 changes: 25 additions & 0 deletions model/registry/process.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,31 @@ groups:
brief: >
Process identifier (PID).
examples: [1234]
- id: entity_id
type: string
stability: experimental
brief: >
A unique identifier of the process.
note: |
A globally unique identifier of the process. This can be any format
mjwolf marked this conversation as resolved.
Show resolved Hide resolved
of unique indentifier.

On Linux, where possible, it is RECOMMENDED that this is calculated
trisch-me marked this conversation as resolved.
Show resolved Hide resolved
MD5 hash of the string `<pid>_<starttime>_<bootid>`.

e.g.

| Key | Description | Example Value |
| --- | --- | --- |
| `pid` | PID | "1201" |
| `starttime` | start time in seconds since boot | "3029" |
| `bootid` | boot ID, as read from `/proc/sys/kernel/random/boot_id` | "16b6f4a4-179e-4480-9561-0df471d8c6d4" |

md5("1201_3029_16b6f4a4-179e-4480-9561-0df471d8c6d4") -> "230d8ee6147ded8e31dcef082bdbc383"

By following this method, the same entity_id will be generated by all
observers of a given process.
examples: ["9f26bf6c-2508-43e7-a048-ed1822d9ff52", "903b0daafe95e8ed9c9c0c9070259c29"]
- id: parent_pid
type: int
stability: experimental
Expand Down
Loading