Storage backends for adaptive sampling #3305

yurishkuro · 2021-10-05T19:12:32Z

Since v1.27 adaptive sampling is supported in the backend, but it only works with Cassandra as the backing store. We need to implement it for other types of stores, e.g.

- memory-only (for all-in-one): Add in-memory storage support for adaptive sampling #3335
- Badger: feat: Add sampling store support to Badger #4834
- Elasticsearch: Add Elasticsearch storage support for adaptive sampling #5158
- OpenSearch
- gRPC remote storage
- Documentation once all of the above are done

srikanthccv · 2021-10-17T04:41:47Z

I wanted to try out this feature but realised it is not supported for other backends. I can take a stab at this if nobody is already working on it.

albertteoh · 2021-10-17T10:52:12Z

That would be appreciated, @lonewolf3739.

james-ryans · 2023-04-17T17:49:11Z

Hi, is anyone working on this? I would like to work on Elasticsearch storage support.

james-ryans · 2023-04-20T14:33:02Z

I have some questions before I start implementing the feature.

What is the purpose of the bucket column in the operation_throughput and sampling_probabilities tables in the Cassandra storage backend? Is it solely for performance, or are there other considerations I'm missing?
Do I need to use index-per-day pattern? Do I need to support rollover and index-cleaner for adaptive sampling?

Here is my idea for how to store the documents. Feedback is welcome!

jaeger-throughputs
Would it be better to encode the service, operation, count, and probabilities fields into a single string, since we only query the timestamp field?

// mapping
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "long"
      },
      "service": {
        "type": "keyword",
        "index": false
      },
      "operation": {
        "type": "keyword",
        "index": false
      },
      "count": {
        "type": "long",
        "index": false
      },
      "probabilities": {
        "type": "keyword",
        "index": false
      }
    }
  }
}

// example
{
  "timestamp": 1485467191639875,
  "service": "svc",
  "operation": "op",
  "count": 40,
  "probabilities": ["0.1", "0.5"]
}
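The single-string question above can be made concrete with a small sketch. It is written in Python purely for illustration (Jaeger itself is written in Go), and the `payload` field name and both helper functions are hypothetical, not part of any Jaeger or Elasticsearch API. It keeps `timestamp` as the only queryable field and packs the remaining fields into one JSON string.

```python
import json


def encode_throughput(timestamp_us, service, operation, count, probabilities):
    """Keep `timestamp` as a separate field and pack everything else into one string."""
    payload = json.dumps(
        {
            "service": service,
            "operation": operation,
            "count": count,
            "probabilities": probabilities,
        },
        sort_keys=True,  # stable key order so equal inputs give equal strings
    )
    # "payload" is a hypothetical field name chosen for this sketch.
    return {"timestamp": timestamp_us, "payload": payload}


def decode_throughput(doc):
    """Inverse of encode_throughput: rebuild the flat document from the example."""
    fields = json.loads(doc["payload"])
    return {"timestamp": doc["timestamp"], **fields}
```

The trade-off is that the packed fields can no longer be inspected or filtered individually in the store; since the proposed mapping already sets `"index": false` on them, the practical difference may be small.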

jaeger-probabilities-and-qps

// mapping
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "long"
      },
      "hostname": {
        "type": "keyword",
        "index": false
      },
      "probabilities": {
        "type": "object",
        "dynamic": false,
        "properties": {
          "operations": {
            "type": "object",
            "dynamic": false,
            "properties": {
              "operation": {
                "type": "keyword",
                "index": false
              },
              "probability": {
                "type": "keyword",
                "index": false
              },
              "qps": {
                "type": "long",
                "index": false
              }
            }
          },
          "service": {
            "type": "keyword",
            "index": false
          }
        }
      }
    }
  }
}

// example
{
  "timestamp": 1485467191639875,
  "hostname": "localhost",
  "probabilities": [
    {
      "service": "svc",
      "operations": [
        {
          "operation": "op1",
          "probability": "0.1",
          "qps": 40
        },
        {
          "operation": "op2",
          "probability": "0.2",
          "qps": 50
        }
      ]
    },
    {
      "service": "another_svc",
      "operations": [
        {
          "operation": "op3",
          "probability": "0.4",
          "qps": 20
        },
        {
          "operation": "op4",
          "probability": "0.5",
          "qps": 30
        }
      ]
    }
  ]
}

Since Elasticsearch 5+ does not support the _ttl mapping, my idea is to store an expire_timestamp field and check it on retrieval to decide whether the lease has expired. This approach also works if we need to support an index-per-day pattern, which can be managed with es-rollover and es-index-cleaner. One of its biggest advantages is that it supports millisecond (or microsecond) granularity.

jaeger-leases

// mapping
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "owner": {
        "type": "keyword"
      },
      "expire_timestamp": {
        "type": "long"
      }
    }
  }
}

// example
{
  "name": "sampling_store_leader",
  "owner": "localhost",
  "expire_timestamp": 1681998717000000
}
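The expire_timestamp idea can be sketched as a pure function over the lease document. This is an illustrative Python sketch, not the Jaeger implementation (which is in Go); the `Lease` class and `try_acquire` function are hypothetical names, and a real backend would also need the read-then-write to be atomic, which is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    name: str
    owner: str
    expire_timestamp: int  # microseconds since epoch, as in the example document


def try_acquire(current: Optional[Lease], name: str, candidate: str,
                now_us: int, ttl_us: int) -> Lease:
    """Return the lease that should be stored after `candidate` tries to acquire it."""
    expired = current is None or current.expire_timestamp <= now_us
    if expired or current.owner == candidate:
        # Free, expired, or already ours: take it and push the expiry forward.
        return Lease(name, candidate, now_us + ttl_us)
    # Another owner still holds a live lease; leave it untouched.
    return current
```

Because expiry is decided by comparing integers at read time, the granularity is whatever unit the timestamp uses, with no dependency on a store-side TTL mechanism.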

yurishkuro · 2023-04-20T22:52:14Z

> What is the purpose of the bucket column in the operation_throughput and sampling_probabilities tables in the Cassandra storage backend? Is it solely for performance, or are there other considerations I'm missing?

bucket in Cassandra is used to avoid hot spots in the hash ring (bucket is a random number 1..n), because without this field the primary key is just the timestamp, and all collectors write sampling data at the same time.
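A minimal sketch of the bucketing idea, in Python for illustration only: the function name, the bucket count of 10, and the tuple layout are assumptions, not the actual Cassandra schema or Jaeger code. Prepending a random bucket to the key means writes that share a timestamp land on different partitions.

```python
import random
from collections import Counter
from typing import Tuple


def partition_key(timestamp_us: int, num_buckets: int, rng: random.Random) -> Tuple[int, int]:
    """Build a (bucket, timestamp) key; bucket is a random number in 1..num_buckets."""
    return (rng.randint(1, num_buckets), timestamp_us)


# Many writes at the same instant no longer share a single partition key.
rng = random.Random(0)  # seeded only to make the sketch repeatable
keys = [partition_key(1485467191639875, 10, rng) for _ in range(1000)]
spread = Counter(bucket for bucket, _ in keys)  # number of writes per bucket
```

The cost of this scheme is on the read side: a reader has to query every bucket for a given time range and merge the results.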

> Do I need to use index-per-day pattern? Do I need to support rollover and index-cleaner for adaptive sampling?

I think it should be treated as any other index. The main difference between sampling data and trace/span data is that, while both are always growing, the sampling data is only valuable for the last N writes. The LAST write is the most important as it provides the initial seed of the probabilities, while the last N writes are used to compute the next iteration of sampling probabilities (e.g. using exponential decay of the older data). In theory, the whole adaptive sampling storage can be modeled with these N slots (in a round-robin fashion), but in practice we found it useful to keep the history for a few days in order to investigate how sampling rates change over time. Hence my suggestion to use the same TTL / rotation / rollover as the main span indices (which also makes the implementation simpler and the maintenance streamlined).
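The "N slots with exponential decay" model can be sketched as follows. This is a Python illustration under stated assumptions, not Jaeger's actual calculation: the class name, the `decay` parameter, and the weighted-average formula are all hypothetical choices made to show the shape of the idea.

```python
from collections import deque


class ThroughputHistory:
    """Keep only the last N throughput observations, like N round-robin slots."""

    def __init__(self, slots: int, decay: float) -> None:
        self.decay = decay                 # weight multiplier per step back in time, 0 < decay < 1
        self.counts = deque(maxlen=slots)  # the oldest entry is dropped automatically

    def record(self, count: float) -> None:
        self.counts.append(count)

    def weighted_average(self) -> float:
        """Exponentially decayed average; the newest write has weight 1."""
        if not self.counts:
            return 0.0
        weights = [self.decay ** age for age in range(len(self.counts))]
        newest_first = reversed(self.counts)
        total = sum(w * c for w, c in zip(weights, newest_first))
        return total / sum(weights)
```

With `slots=3` and `decay=0.5`, recording 10, 20, 30, 40 leaves only 20, 30, 40 in the history, weighted 0.25, 0.5, and 1 respectively, so the most recent write dominates the result.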

slayer321 · 2023-09-20T05:40:50Z

Hey @yurishkuro, I'd like to work on implementing Badger storage support. I'm currently going through the memory-only and Cassandra implementations and will share more on the Badger implementation in some time.

yurishkuro · 2023-09-20T17:08:51Z

@slayer321 I would strongly recommend starting with adding new tests in the storage e2e integration test, which today does not cover sampling storage. Then you will have a clear blueprint of what needs to be implemented in another backend.

## Which problem is this PR solving?
Related #3305

## Description of the changes
- Implemented badger db for sampling store

## How was this change tested?
- Added Unit test and also tested it with the already Implemented integration test

## Checklist
- [x] I have read https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: slayer321 <[email protected]>

Pushkarm029 · 2024-01-27T17:04:13Z

I would like to implement Adaptive Sampling support for Elasticsearch.

akagami-harsh · 2024-01-30T12:21:38Z

hey @Pushkarm029, are you working on it?

Pushkarm029 · 2024-01-30T12:23:58Z

@akagami-harsh, yeah, I am halfway. I will complete it within 2-3 days.

## Which problem is this PR solving?
- #3305

## Description of the changes
- Implemented Elasticsearch storage for adaptive sampling

## How was this change tested?
- not tested yet

## Checklist
- [x] I have read https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: Pushkar Mishra <[email protected]>
Co-authored-by: Yuri Shkuro <[email protected]>

Pushkarm029 · 2024-02-27T04:44:21Z

Should we update the documents to reflect the current state?

Adaptive sampling requires a storage backend to store the observed traffic data and computed probabilities. At the moment memory (for all-in-one deployment) and cassandra are supported as sampling storage backends. We are seeking help in implementing support for other backends ( tracking issue ).

https://www.jaegertracing.io/docs/1.54/sampling/#adaptive-sampling

yurishkuro · 2024-02-27T06:19:43Z

yes

gmandrade21 · 2024-03-20T20:21:18Z

@yurishkuro is anybody currently working on the OpenSearch backend for this feature?

yurishkuro · 2024-03-20T22:58:24Z

OpenSearch is already supported via Elasticsearch code (they are the same)

rsafonseca · 2024-04-11T09:16:21Z

Is it really supported?

When I try to start jaeger-collector (tested with 1.55.0 and 1.56.0) with SAMPLING_STORAGE_TYPE=elasticsearch I get the following:

{"level":"fatal","ts":1712826901.3422914,"caller":"collector/main.go:92","msg":"Failed to create sampling store factory","error":"storage factory of type elasticsearch does not support sampling store","stacktrace":"main.main.func1\n\tgithub.com/jaegertracing/jaeger/cmd/collector/main.go:92\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/[email protected]/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/[email protected]/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/[email protected]/command.go:1039\nmain.main\n\tgithub.com/jaegertracing/jaeger/cmd/collector/main.go:157\nruntime.main\n\truntime/proc.go:271"}

In addition, according to the docs "By default adaptive sampling will attempt to use the backend specified by SPAN_STORAGE_TYPE to store data."
But if I set SPAN_STORAGE_TYPE=elasticsearch and don't set SAMPLING_STORAGE_TYPE, I get this when starting the collector:

{"level":"fatal","ts":1712825412.326171,"caller":"collector/main.go:97","msg":"Failed to init sampling strategy store factory","error":"sampling store factory is nil. Please configure a backend that supports adaptive sampling","stacktrace":"main.main.func1\n\tgithub.com/jaegertracing/jaeger/cmd/collector/main.go:97\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/[email protected]/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/[email protected]/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/[email protected]/command.go:1039\nmain.main\n\tgithub.com/jaegertracing/jaeger/cmd/collector/main.go:157\nruntime.main\n\truntime/proc.go:271"}

yurishkuro · 2024-04-11T15:16:25Z

@Pushkarm029 can you please take a look at this ^ report?

Pushkarm029 · 2024-04-13T05:47:05Z

> @Pushkarm029 can you please take a look at this ^ report?

👀looking into it.

yurishkuro added the "help wanted" label Oct 5, 2021

albertteoh assigned srikanthccv Oct 17, 2021

srikanthccv mentioned this issue Oct 21, 2021: Add in-memory storage support for adaptive sampling #3335 (Merged)

slayer321 mentioned this issue Sep 22, 2023: Add e2e test for sampling storage #4772 (Merged)

slayer321 mentioned this issue Oct 11, 2023: feat: Add sampling store support to Badger #4834 (Merged)

Pushkarm029 mentioned this issue Feb 2, 2024: Add Elasticsearch storage support for adaptive sampling #5158 (Merged)

Wise-Wizard mentioned this issue Feb 6, 2024: Added Adaptive sampling support for Elasticsearch #5169 (Closed)

Pushkarm029 mentioned this issue Mar 3, 2024: List all storage backends are currently supported for adaptive sampling jaegertracing/documentation#676 (Closed)

Pushkarm029 mentioned this issue Apr 23, 2024: Create sampling templates when creating sampling store #5349 (Merged)

akstron mentioned this issue Dec 4, 2024: [WIP] Add Adaptive Sampling Support for gRPC Remote Storage #6308 (Draft)
