Restore spanner receiver, and add new codeowners #35998

Merged
27 changes: 0 additions & 27 deletions .chloggen/codeboten_rm-unmaintained-component.yaml

This file was deleted.

1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -243,6 +243,7 @@ receiver/fluentforwardreceiver/ @open-teleme
receiver/githubreceiver/ @open-telemetry/collector-contrib-approvers @adrielp @andrzej-stencel @crobert-1 @TylerHelmuth
receiver/googlecloudmonitoringreceiver/ @open-telemetry/collector-contrib-approvers @dashpole @TylerHelmuth @abhishek-at-cloudwerx
receiver/googlecloudpubsubreceiver/ @open-telemetry/collector-contrib-approvers @alexvanboxel
receiver/googlecloudspannerreceiver/ @open-telemetry/collector-contrib-approvers @dashpole @dsimil @KiranmayiB @harishbohara11
receiver/haproxyreceiver/ @open-telemetry/collector-contrib-approvers @atoulme @MovieStoreGuy
receiver/hostmetricsreceiver/ @open-telemetry/collector-contrib-approvers @dmitryax @braydonk
receiver/httpcheckreceiver/ @open-telemetry/collector-contrib-approvers @codeboten
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/bug_report.yaml
@@ -237,6 +237,7 @@ body:
- receiver/github
- receiver/googlecloudmonitoring
- receiver/googlecloudpubsub
- receiver/googlecloudspanner
- receiver/haproxy
- receiver/hostmetrics
- receiver/httpcheck
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/feature_request.yaml
@@ -231,6 +231,7 @@ body:
- receiver/github
- receiver/googlecloudmonitoring
- receiver/googlecloudpubsub
- receiver/googlecloudspanner
- receiver/haproxy
- receiver/hostmetrics
- receiver/httpcheck
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/other.yaml
@@ -231,6 +231,7 @@ body:
- receiver/github
- receiver/googlecloudmonitoring
- receiver/googlecloudpubsub
- receiver/googlecloudspanner
- receiver/haproxy
- receiver/hostmetrics
- receiver/httpcheck
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/unmaintained.yaml
@@ -236,6 +236,7 @@ body:
- receiver/github
- receiver/googlecloudmonitoring
- receiver/googlecloudpubsub
- receiver/googlecloudspanner
- receiver/haproxy
- receiver/hostmetrics
- receiver/httpcheck
2 changes: 2 additions & 0 deletions cmd/otelcontribcol/builder-config.yaml
@@ -158,6 +158,7 @@ receivers:
- gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/githubreceiver v0.112.0
- gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudmonitoringreceiver v0.112.0
- gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver v0.112.0
- gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver v0.112.0
- gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/haproxyreceiver v0.112.0
- gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver v0.112.0
- gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/httpcheckreceiver v0.112.0
@@ -413,6 +414,7 @@ replaces:
- github.com/open-telemetry/opentelemetry-collector-contrib/extension/httpforwarderextension => ../../extension/httpforwarderextension
- github.com/open-telemetry/opentelemetry-collector-contrib/exporter/elasticsearchexporter => ../../exporter/elasticsearchexporter
- github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awscloudwatchlogsexporter => ../../exporter/awscloudwatchlogsexporter
- github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver => ../../receiver/googlecloudspannerreceiver
- github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver => ../../receiver/prometheusreceiver
- github.com/open-telemetry/opentelemetry-collector-contrib/exporter/sapmexporter => ../../exporter/sapmexporter
- github.com/open-telemetry/opentelemetry-collector-contrib/internal/kubelet => ../../internal/kubelet
1 change: 1 addition & 0 deletions receiver/googlecloudspannerreceiver/Makefile
@@ -0,0 +1 @@
include ../../Makefile.Common
83 changes: 83 additions & 0 deletions receiver/googlecloudspannerreceiver/README.md
@@ -0,0 +1,83 @@
# Google Cloud Spanner Receiver

<!-- status autogenerated section -->
| Status | |
| ------------- |-----------|
| Stability | [beta]: metrics |
| Distributions | [contrib] |
| Issues | [![Open issues](https://img.shields.io/github/issues-search/open-telemetry/opentelemetry-collector-contrib?query=is%3Aissue%20is%3Aopen%20label%3Areceiver%2Fgooglecloudspanner%20&label=open&color=orange&logo=opentelemetry)](https://github.com/open-telemetry/opentelemetry-collector-contrib/issues?q=is%3Aopen+is%3Aissue+label%3Areceiver%2Fgooglecloudspanner) [![Closed issues](https://img.shields.io/github/issues-search/open-telemetry/opentelemetry-collector-contrib?query=is%3Aissue%20is%3Aclosed%20label%3Areceiver%2Fgooglecloudspanner%20&label=closed&color=blue&logo=opentelemetry)](https://github.com/open-telemetry/opentelemetry-collector-contrib/issues?q=is%3Aclosed+is%3Aissue+label%3Areceiver%2Fgooglecloudspanner) |
| [Code Owners](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/CONTRIBUTING.md#becoming-a-code-owner) | [@dashpole](https://www.github.com/dashpole), [@dsimil](https://www.github.com/dsimil), [@KiranmayiB](https://www.github.com/KiranmayiB), [@harishbohara11](https://www.github.com/harishbohara11) |
| Emeritus | [@architjugran](https://www.github.com/architjugran), [@varunraiko](https://www.github.com/varunraiko) |

[beta]: https://github.com/open-telemetry/opentelemetry-collector#beta
[contrib]: https://github.com/open-telemetry/opentelemetry-collector-releases/tree/main/distributions/otelcol-contrib
<!-- end autogenerated section -->

Google Cloud Spanner enables you to investigate issues with your database
by exposing statistics via [Total and Top N built-in tables](https://cloud.google.com/spanner/docs/introspection):
- Query statistics
- Read statistics
- Transaction statistics
- Lock statistics
- and others

_Note_: Total and Top N built-in tables are queried with 1-minute statistics granularity.

The goal of the Google Cloud Spanner receiver is to collect those statistics and transform them into metrics
that are convenient for further analysis by users.

## Getting Started

The following is an example configuration:

```yaml
receivers:
  googlecloudspanner:
    collection_interval: 60s
    initial_delay: 1s
    top_metrics_query_max_rows: 100
    backfill_enabled: true
    cardinality_total_limit: 200000
    hide_topn_lockstats_rowrangestartkey: false
    truncate_text: false
    projects:
      - project_id: "spanner project 1"
        service_account_key: "path to spanner project 1 service account json key"
        instances:
          - instance_id: "id1"
            databases:
              - "db11"
              - "db12"
          - instance_id: "id2"
            databases:
              - "db21"
              - "db22"
      - project_id: "spanner project 2"
        service_account_key: "path to spanner project 2 service account json key"
        instances:
          - instance_id: "id3"
            databases:
              - "db31"
              - "db32"
          - instance_id: "id4"
            databases:
              - "db41"
              - "db42"
```

Brief description of configuration properties:
- **googlecloudspanner** - name of the Cloud Spanner receiver section in the OpenTelemetry Collector configuration file
- **collection_interval** - this receiver runs periodically. Each time it runs, it queries Google Cloud Spanner, creates metrics, and sends them to the next consumer (default: 1 minute). **It is not recommended to change the default collection interval, since new values for metrics in the Spanner database appear only once a minute.**
- **initial_delay** - defines how long this receiver waits before starting
- **top_metrics_query_max_rows** - maximum number of rows to fetch from the Top N built-in tables (100 by default)
- **backfill_enabled** - turns 1-hour data backfill on/off (turned off by default)
- **cardinality_total_limit** - limit of active series per 24-hour period. If specified, turns on cardinality filtering and handling. If zero or not specified, cardinality is not handled. See [this document](cardinality.md) for more information about cardinality handling and filtering.
- **hide_topn_lockstats_rowrangestartkey** - if true, masks PII (key values) in the row_range_start_key label of the "top minute lock stats" metric
- **truncate_text** - if true, query text is truncated to 1024 characters
- **projects** - list of GCP projects
  - **project_id** - identifier of the GCP project
  - **service_account_key** - path to the service account JSON key. It is highly recommended to set this property to the correct value. If it is empty, the [Application Default Credentials](https://google.aip.dev/auth/4110) will be used for the database connection.
  - **instances** - list of Google Cloud Spanner instances to connect to
    - **instance_id** - identifier of the Google Cloud Spanner instance
    - **databases** - list of databases used from this instance

80 changes: 80 additions & 0 deletions receiver/googlecloudspannerreceiver/cardinality.md
@@ -0,0 +1,80 @@
# Cloud Spanner: Controlling Metrics Cardinality for the OpenTelemetry Receiver

## Overview
This document describes the algorithm used to control the cardinality of the Top N metrics exported by the Cloud Spanner receiver in the OpenTelemetry Collector.
With cardinality limits enforced in Cloud Monitoring, and possibly in Prometheus, it is critical to control metric cardinality to avoid dropped time series, which would limit the usability of Cloud Spanner custom metrics.

## Background
Cloud Monitoring has a [limit](https://cloud.google.com/monitoring/quotas) of 200,000 active time series (streams) for custom metrics per project.
Active time series are streams that have received data in the last 24 hours.
This means the number of streams sent by the OpenTelemetry Collector must be kept within this limit, otherwise metrics will be dropped by Cloud Monitoring.

## Calculation
For simplicity, and to explain the calculation as it is currently implemented, let's assume we have 1 project, 1 instance, and 1 database.
These calculations are done when the collector starts. If the cardinality total limit is different (the calculations below assume 200,000), the resulting numbers will differ.

According to the metrics [metadata configuration](internal/metadataconfig/metrics.yaml) (at the moment of writing this document) we have:
- 30 low cardinality metrics (the Total N and active queries summary metrics);
- 26 high cardinality metrics (the Top N metrics).

For low cardinality metrics, the allowed number of time series is **projects x instances x databases x low cardinality metrics**, which in our case is **1 x 1 x 1 x 30 = 30** (1 per metric).

Thus, the remaining quota of allowed time series in 24 hours is **200,000 - 30 = 199,970**.

For high cardinality metrics we have **199,970 / 26 = 7,691** allowed time series per metric.

This means that each metric converted from these columns should not exceed a cardinality of 7,691 per day.

Each time series receives one data point per minute, meaning up to 1,440 (24 hours x 60 minutes = 1,440) data points per day.

Taking into account those 1,440 data points, we get **floor(7,691 / 1,440) = 5** new time series allowed per metric per minute.
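
The arithmetic above fits in a few lines. The Go sketch below is illustrative only (it is not part of the receiver code) and hard-codes the example values from this document: 1 project/instance/database, 30 low-cardinality metrics, 26 high-cardinality metrics, and a 200,000 active-series limit.

```go
package main

import "fmt"

func main() {
	// Example values from this document; in the receiver, the limit and metric
	// counts come from configuration and the metrics metadata configuration.
	const (
		totalLimit      = 200_000 // Cloud Monitoring active-series limit for custom metrics
		projects        = 1
		instances       = 1
		databases       = 1
		lowCardMetrics  = 30      // Total N + active queries summary metrics
		highCardMetrics = 26      // Top N metrics
		pointsPerDay    = 24 * 60 // one data point per minute
	)

	lowCardSeries := projects * instances * databases * lowCardMetrics // 30
	remaining := totalLimit - lowCardSeries                            // 199,970
	perMetricBudget := remaining / highCardMetrics                     // 7,691
	newSeriesPerMinute := perMetricBudget / pointsPerDay               // floor(7,691 / 1,440) = 5

	fmt.Println(lowCardSeries, remaining, perMetricBudget, newSeriesPerMinute)
}
```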


## Labels
For each metric in the Total N table, the labels that create a unique time series are **Project ID + Instance ID + Database ID**.

For each metric in the Top N table, the labels that create a unique time series are **Project ID + Instance ID + Database ID + one of Query/Txn/Read/RowKey**.

While the fingerprint and truncated fields are part of some Top N tables, they do not add to cardinality because there is a one-to-one correspondence between a Query/Txn/Read and its fingerprint and truncated information.

Cardinality is the number of unique sets of labels that are sent to Cloud Monitoring to create a new time series over a 24-hour period.
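
For illustration only, the hypothetical helpers below show how these label combinations identify a unique time series; the receiver's actual label names and key construction may differ.

```go
package main

import "fmt"

// totalNSeriesKey identifies a Total N time series:
// Project ID + Instance ID + Database ID.
func totalNSeriesKey(projectID, instanceID, databaseID string) string {
	return fmt.Sprintf("%s/%s/%s", projectID, instanceID, databaseID)
}

// topNSeriesKey adds the single high-cardinality dimension
// (query, transaction shape, read shape, or row range start key).
func topNSeriesKey(projectID, instanceID, databaseID, dimension string) string {
	return totalNSeriesKey(projectID, instanceID, databaseID) + "/" + dimension
}

func main() {
	fmt.Println(topNSeriesKey("my-project", "id1", "db11", "SELECT * FROM t WHERE id = @id"))
}
```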

## Detailed Design
To avoid confusion between labels and queries/transaction shapes/read shapes/row keys, the term "queries" is used loosely in the design below to make it easier to understand.

Also, the **7,691** time series limit is used as an example.
This needs to be made configurable based on the number of databases, the number of metrics per table, and the future growth of metrics per table.

With the **7,691** time series limit and each metric sent once per minute, **60 minutes x 24 hours = 1,440** data points per time series are sent to Cloud Monitoring each day.

To allow new queries (new Top queries which are not seen in the last 24 hours) to be sent to Cloud Monitoring, each minute should have an opportunity to send new queries.
Assuming a uniform distribution of new queries being created in Top N tables, the algorithm should handle **floor(7691/1440) = 5** new queries per minute.
These 5 new queries should be the Top 5 queries which were not exported in the last 24 hours.

While the existing time series (queries which have been exported in the last 24 hours) are fine to keep exporting, exporting more than 100 (the **top_metrics_query_max_rows** receiver config parameter) queries per minute may be overkill.
Customers can tune the maximum number of queries per minute per metric to a smaller value if required.

An LRU cache with a TTL of 24 hours is maintained for each Top N metric.
For each collection interval (1 minute), the cache is populated with no more than 5 new queries, i.e. if the cache does not have an entry for a query, the query is added to the cache, and this happens no more than 5 times per interval.
This means that after the first minute the collector can send 5 queries; in the second minute it can send 10 queries (5 new plus the 5 existing queries sent in the first minute that also appear in the top 100 for the second minute); in the third minute it can send 15 queries (5 new plus the 10 existing queries sent in the first two minutes that also appear in the top 100 for the third minute).
In a normal state, after about 20 minutes the collector will be sending about 100 queries per minute, with the top 95 queries already in the cache and 5 new queries (if they appear in the top 100) added to the cache in that interval.
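
A minimal sketch of this admission policy follows, assuming a simple map-based cache (the receiver's actual cache implementation may differ): entries expire after 24 hours, already-cached queries are always exported, and at most 5 previously unseen queries are admitted per collection interval.

```go
package main

import (
	"fmt"
	"time"
)

// seriesCache is an illustrative stand-in for the per-metric cache with a
// 24-hour TTL described above; it models admission only, not LRU eviction.
type seriesCache struct {
	ttl               time.Duration
	maxNewPerInterval int
	lastSeen          map[string]time.Time // query identity -> last export time
}

func newSeriesCache() *seriesCache {
	return &seriesCache{
		ttl:               24 * time.Hour,
		maxNewPerInterval: 5,
		lastSeen:          map[string]time.Time{},
	}
}

// filter returns the queries that may be exported this interval: every query
// already in the cache plus up to maxNewPerInterval new ones. The input is
// assumed to be ordered Top N first, so the top new queries win.
func (c *seriesCache) filter(now time.Time, queries []string) []string {
	// Drop entries not exported within the last 24 hours.
	for q, last := range c.lastSeen {
		if now.Sub(last) > c.ttl {
			delete(c.lastSeen, q)
		}
	}

	var out []string
	admittedNew := 0
	for _, q := range queries {
		if _, ok := c.lastSeen[q]; ok {
			c.lastSeen[q] = now // refresh TTL for an existing series
			out = append(out, q)
			continue
		}
		if admittedNew < c.maxNewPerInterval {
			c.lastSeen[q] = now // admit a new series
			out = append(out, q)
			admittedNew++
		}
		// Otherwise the query is dropped this minute; it may be admitted later.
	}
	return out
}

func main() {
	c := newSeriesCache()
	now := time.Now()
	fmt.Println(c.filter(now, []string{"q1", "q2", "q3", "q4", "q5", "q6", "q7"})) // q1..q5
	fmt.Println(c.filter(now.Add(time.Minute), []string{"q1", "q6", "q7", "q8"}))  // q1 plus up to 5 new
}
```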

Since the LRU cache lives in the memory of a collector instance, each instance keeps its own cache and limits per instance-metric.
The ability to use external caching for this is currently out of scope.
If multiple collector instances are used, keep in mind that enforcing the overall total cardinality limit across all metrics is the user's responsibility and needs to be properly reflected in the receiver configuration.
Also, cardinality handling can be turned on/off via the receiver configuration (it is turned off by default).

## Limitation
While this algorithm is pessimistic, it guarantees that the top 5 new queries within the top 100 are always captured during normal operation without losing information.
If 6 new queries are seen in a minute, the top 5 new queries will be sent and the 6th will not be sent in that minute.
The next minute may capture that query if it is still among the top 5 new queries.

To alleviate this 5-query limitation at collector start, the receiver can collect the last 20 minutes of top queries (as a replay) and populate the LRU cache.
This allows the receiver to send 100 queries from the first minute.
Backfilling also populates the cache with the last hour of queries, which avoids the 5-query limitation while still maintaining 5 new queries per minute going forward.

## Alternatives Considered
One option considered was to allow aggressive collection, where the top 100 queries are always sent until the limit is reached.
This means that if every minute brings 100 new top queries (the worst case), new queries cannot be sent after the limit of 7,691 is reached, which can happen in about 76 minutes (7,691 / 100 new queries per minute ≈ 76 minutes).
Though it is impractical to see 100 new queries in the top 100 all the time, the limit can easily be hit if the number of time series per metric is limited (for example, to 2,500 per day to allow more databases to be supported) or if queries use literals instead of parameters.