Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database metric semantic conventions #1076

Conversation

justinfoote
Copy link
Member

Fixes open-telemetry/semantic-conventions#707

This PR attempts to specify semanatic conventions for database metric instruments.

There are a few spicy parts that I'd love feedback about:

1. db.operation and db.table

We'd like to correlate the golden signal metrics to the spans that represent the same
operation. The easiest way to do that may have been to include the db.statement as a
label, but this has the potential to drastically increase cardinality.
The alternative that I see is to include db.operation and db.table as labels.

This has clearly been discussed already, and the trace semantic convention includes
this line:

While it would semantically make sense to set this, e.g., to a SQL keyword like SELECT or INSERT, it is not recommended to attempt any client-side parsing of db.statement just to get this property (the back end can do that if required).

But in my experience, these values are frequently available from an ORM library without
requiring SQL string parsing, and if this instrumentation may be eventually written into
those ORM libraries, I'd like to encourage them to provide this information on the
metric instruments.

2. Connection pool metrics

There is some prior art in go-redis of some very detailed metrics describing database
connection pool behavior. While this is pretty interesting, I think these values will not
be widely available to all database client libraries. So I've recommended that we record
a simpler db.connection_pool.limit and db.connection_pool.usage.

I've also included a MAY to allow instrumentation to collect more detailed information if
it is available. I'm thinking that if these concepts will be used by more than just
go-redis, it would be ideal if they looked the same.

Changes

This PR adds a new semantic conventions at specification/metrics/semantic_conventions/database.md

@justinfoote justinfoote requested review from a team October 8, 2020 21:50
@arminru
Copy link
Member

arminru commented Oct 9, 2020

Please add an entry in CHANGELOG.md

specification/metrics/semantic_conventions/database.md Outdated Show resolved Hide resolved
specification/metrics/semantic_conventions/database.md Outdated Show resolved Hide resolved
| `db.connections.new` | Counter | {connections} | The number of new connections created. |
| `db.connections.taken` | Counter | {connections} | The number of connections taken from the connection pool. |
| `db.connections.returned` | Counter | {connections} | The number of connections returned to the connection pool. |
| `db.connections.reused` | Counter | {connections} | The number of connections reused. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the criteria for knowing it's an existing connection that's being re-used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I had used this PR as reference, and had assumed that an instrumentation author would know best how to collect these numbers, but now I'm thinking we need definitions of exactly what value needs to be recorded here, or we run the risk of misunderstanding and inconsistency across instrumentation sources.

I'll give this some thought about how to describe this.

|---------------------------|------------|---------------|-------------|
| `db.connections.new` | Counter | {connections} | The number of new connections created. |
| `db.connections.taken` | Counter | {connections} | The number of connections taken from the connection pool. |
| `db.connections.returned` | Counter | {connections} | The number of connections returned to the connection pool. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it anticipated that db.connection_pool.usage is equal to db.connections.taken minus db.connections.returned?

If so, what is the benefit in having the additional instruments?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I expect db.connection_pool.usage = db.connections.taken - db .connections.returned. But the breakdown metrics can give much more detail than the use of the connection pool over time.
The usage/limit metrics use a LastValue aggregator by default, meaning you get a periodic snapshot of the number of connections currently in use.
The more detailed metrics allow you to see things like connection churn. With them, a user could differentiate between an application making millions of tiny database operations (and ending a harvest period with 4/5 connections in use) and an application with four very long and slow queries in progress (also ending the harvest period with 4 connections in use).

The connection_pool usage instruments are asynchronous -- they're intended to be polled on a set interval. The more detailed metrics are recording all the actions related to the connection pool in real time.

...but
Having said all of that, I don't have specific use cases to back up the inclusion of these extra instruments.
But, they're already being recorded by go-redis, and I can see the same sort of instruments being possible in a number of other frameworks. And if there's a desire to record them, I think we should spec them so we get consistent names and labels across implementations.

Copy link
Member Author

@justinfoote justinfoote Nov 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a few different ways of explaining this. I ultimately ended up with the pretty simple approach in ec58de4.
I'm certainly open to suggestions about how to clarify what these instruments are all about.

Copy link
Contributor

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Thanks for the examples, this is great.)

CHANGELOG.md Outdated Show resolved Hide resolved
specification/metrics/semantic_conventions/database.md Outdated Show resolved Hide resolved
|------------------|--------|--------------|----------|----------|
| `db.name` | string | If no [tech-specific label](#call-level-labels-for-specific-technologies) is defined, this attribute is used to report the name of the database being accessed. For commands that switch the database, this should be set to the target database (even if the command fails). [1] | `customers`<br>`main` | Required if applicable. |
| `db.operation` | string | The name of the operation being executed, e.g. the [MongoDB command name](https://docs.mongodb.com/manual/reference/command/#database-operations) such as `findAndModify`. [4][5] | `findAndModify`<br>`HMSET`<br>`SELECT`<br>`CONNECT` | Required if applicable. |
| `db.table` | string | The name of the primary table, collection, segment, etc... that the operation is acting upon. | `user_table` | Required if applicable. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be in favor of introducing this in the (more elaborate) tracing spec first, as you proposed on #1104, and then just refer to that definition from here to avoid duplication and potentially going out of sync. Once both kinds of semconvs are auto-generated, it should be fine again, however.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 I've created that PR now (here: #1141). After it has some reviews, I'll update this PR like you suggest.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #1141 has all the approvals it needs, so I've updated this one to match it.
Just that little bit of manual work to keep these two PRs has reminded me that we really need the semantic convention table generator to work for metrics.

specification/metrics/semantic_conventions/database.md Outdated Show resolved Hide resolved
@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 31, 2020
@jmacd
Copy link
Contributor

jmacd commented Nov 6, 2020

This is not stale.

@andrewhsu
Copy link
Member

at the metrics sig mtg today talked about this issue's staleness

Copy link
Member

@aabmass aabmass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 Would be good to add a link to database.md from README.md too

Comment on lines 30 to 36
| `db.system` | An identifier for the database management system (DBMS) product being used. [1] | `other_sql` | Yes |
| `db.connection_string` | The connection string used to connect to the database. It is recommended to remove embedded credentials. | `Server=(localdb)\v11.0;Integrated Security=true;` | No |
| `db.user` | Username for accessing the database. | `readonly_user`<br>`reporting_user` | No |
| `net.transport` | Transport protocol used. See note below. See [general network connection attributes](../../trace/semantic_conventions/span-general.md#general-network-connection-attributes). | `IP.TCP`<br>`Unix` | Conditional [2] |
| `net.peer.ip` | Remote address of the peer (dotted decimal for IPv4 or [RFC5952](https://tools.ietf.org/html/rfc5952) for IPv6) | `127.0.0.1` | No |
| `net.peer.port` | Remote port number. | `80`<br>`8080`<br>`443` | No |
| `net.peer.name` | Remote hostname or similar, see note below. | `example.com` | No |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

|----------------------|---------------|--------------|-------------|
| `db.client.duration` | ValueRecorder | milliseconds | The duration of the database operation. |

Database operations SHOULD include execution of queries, including DDL, DML,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Database operations SHOULD include execution of queries, including DDL, DML,
Measured database operations SHOULD include execution of queries, including DDL, DML,

Comment on lines 70 to 72
**[5]:** To reduce cardinality, the value for `db.operation` should have parameters
removed or substituted. The resulting value should be a low-cardinality value
represeting the statement or operation being executed on the database. It may be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is supposed to pertain to a db.statement parameter, which is missing in the table.

Suggested change
**[5]:** To reduce cardinality, the value for `db.operation` should have parameters
removed or substituted. The resulting value should be a low-cardinality value
represeting the statement or operation being executed on the database. It may be
**[5]:** To reduce cardinality, the value for `db.operation` should leave in placeholders
and not include any actual parameter values. The resulting value should be low-cardinality
and represent the statement or operation being executed on the database. It may be

Might also be worth adding "it is not recommended to attempt any client-side parsing of db.statement to remove parameters", although users shouldn't be adding formatting them in anyway. The trace conventions say "the value may be sanitized to exclude sensitive information," do you think that is mostly just about not substitution values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. I've updated it to be more like these words. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The db.operation label wouldn't have any placeholders I thought, since its just SELECT, UPDATE, etc. I was more thinking that advice would apply to db.statement, unless it was purposefully left out for cardinality reasons?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without db.statement, is there enough granularity? In the MySQL example, all SELECT queries to orders would be aggregated into that same metric

{
  "name": "db.client.duration",
  "labels": {
    "db.operation": "SELECT",
    "db.name": "ShopDb",
    "db.system": "mysql",
    "db.connection_string": "Server=shopdb.example.com;Database=ShopDb;Uid=billing_user;TableCache=true;UseCompression=True;MinimumPoolSize=10;MaximumPoolSize=50;",
    "db.user": "billing_user",
    "db.sql.table": "orders",
    "net.peer.name": "shopdb.example.com",
    "net.peer.ip": "192.0.2.12",
    "net.peer.port": "3306",
    "net.transport": "IP.TCP"
  }
}

| `db.connection_pool.usage` | ValueObserver | {connections} | The number of database connections _in use_. |

If the following detailed information is available, the following metric
instruments MAY be collected. They SHOULD have all [common](#common) labels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
instruments MAY be collected. They SHOULD have all [common](#common) labels
instruments MAY be used. They SHOULD have all [common](#common) labels


| Name | Instrument | Units | Description |
|---------------------------|------------|---------------|-------------|
| `db.connections.taken` | Counter | {connections} | Incremented when a connection is requested and returned from the pool. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These remind me of the MemStats variables, which are difficult to reason about. This makes me think they should be database-specific, e.g., db.redis.connections.taken, maybe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I'm on board. As I said before, I don't have strong use cases for these metrics, aside from the fact that go-redis is already collecting them.

My only hesitation is: if other database libraries decide to pick up these metrics, and they become somewhat common, I wonder if it diminishes their value to have them namespaced separately.
As a user, If I have redis and postgres connections, will I be frustrated that they report separately as db.redis.connections.taken and db.postgres.connections.taken?
Now that I've asked that, though, it doesn't seem very useful to aggregate connection pool information across multiple databases, so maybe it's fine to separate them.

The other option (that I'd be definitely in favor of) is to just drop these metrics entirely, so we'd only specify db.connection_pool.limit and db.connection_pool.usage.

Which of those would you prefer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the metrics SIG today: we're going to give guidance to use db.{db.system}.* for technology-specific instrument names.

And I'll include some examples using the names from go-redis.

@tigrannajaryan
Copy link
Member

Reviewers, please resolve comments that are addressed.

@github-actions
Copy link

github-actions bot commented Dec 7, 2020

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Dec 7, 2020
@github-actions
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Define metric semantic conventions for database operations
8 participants