
[exporter/loki] retry/queue causes any log to be blocked until queue resolves #18060

Closed
wiardvanrij opened this issue Jan 26, 2023 · 11 comments

Labels: bug (Something isn't working), exporter/loki (Loki Exporter), Stale

Comments

@wiardvanrij
Contributor

Component(s)

exporter/loki

What happened?

Description

If the queue starts building up, it blocks all other logs (that would be perfectly fine to ship) from being shipped to Loki. For example, it would be possible to stop shipping logs to Loki entirely on a ~5k logs/s collector by pushing ~100 logs/s that get rejected by Loki (for ease of testing, using a log with a very old timestamp, which then gets retried due to this bug: #18059).
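
For context, the queue/retry behaviour in question is the exporter's standard sending_queue and retry_on_failure configuration; a minimal sketch (values and the endpoint are illustrative, not the configuration used below):

exporters:
  loki:
    endpoint: http://loki:3100/loki/api/v1/push   # placeholder
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s   # a failing batch is retried for up to this long before being dropped
    sending_queue:
      enabled: true
      num_consumers: 10        # if all consumers are stuck retrying rejected batches, the queue backs up for every tenant
      queue_size: 5000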

Steps to Reproduce

  1. Setup a stream of logs that are perfectly fine. Use tenant: foo
  2. See that ^ gets into Loki at, for example, 1k logs/s
  3. Setup a new stream of logs that will fail. Use a different tenant: bar
  4. See the processor + exporter logs sent per second drop
  5. See the queue rise
  6. See no improvement until you 'kill' step 3, i.e. stop ingesting the 'bad logs'.

Expected Result

  1. 'bad logs' enter the queue and retry (or ideally not for 400 errors, but that's a different issue)
  2. Since logs should be grouped by tenant, this behaviour from tenant bar should have zero impact on tenant foo

Actual Result

Completely deadlocked environment

Collector version

otel/opentelemetry-collector-contrib:0.66.0

Environment information

Environment

K8s

OpenTelemetry Collector configuration

config:
  receivers:
    fluentforward:
      endpoint: 0.0.0.0:8006
    # otlp is referenced by the logs/external pipeline below but was not included
    # in the snippet; assumed default protocols
    otlp:
      protocols:
        grpc:
        http:
  processors:
    memory_limiter: null
    # batch is referenced by both pipelines below but was not included in the
    # snippet; assumed default settings
    batch: {}
    filter/excludeloki:
      logs:
        exclude:
          match_type: strict
          record_attributes:
            - key: namespace_name
              value: "loki"
    attributes/delete:
      actions:
        - key: annotations
          action: delete
        - key: container_hash
          action: delete
        - key: docker_id
          action: delete
        - key: fluent.tag
          action: delete
        - key: labels_pod-template-hash
          action: delete
    filter/excludetenants:
      logs:
        exclude:
          match_type: regexp
          record_attributes:
            - key: tenant
              value: .*.
    filter/includetenants:
      logs:
        include:
          match_type: regexp
          record_attributes:
            - key: tenant
              value: .*.
    attributes/xxxx:
      actions:
      - action: insert
        key: loki.attribute.labels
        value: stream, app, pod_name, account, cluster_name, namespace_name, region, container_name
      - action: insert
        key: loki.tenant
        value: region
    attributes/external:
      actions:
      - action: insert
        key: loki.attribute.labels
        value: account, region, team, environment, service, container_name
      - action: insert
        key: loki.tenant
        value: tenant
  exporters:
    loki:
      endpoint: redacted
  service:
    pipelines:
      logs/xxxx:
        exporters:
          - loki
        processors:
          - memory_limiter
          - filter/excludetenants
          - filter/excludeloki
          - attributes/delete
          - attributes/xxxx
          - batch
        receivers:
          - fluentforward
      logs/external:
        exporters:
          - loki
        processors:
          - memory_limiter
          - filter/includetenants
          - attributes/external
          - batch
        receivers:
          - otlp
          - fluentforward

Log output

No response

Additional context

Example visualisation: [screenshot omitted]

@wiardvanrij wiardvanrij added bug Something isn't working needs triage New item requiring triage labels Jan 26, 2023
@github-actions github-actions bot added the exporter/loki Loki Exporter label Jan 26, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Member

I have mixed feelings about this: you certainly have a good point about all tenants sharing the same pipe to the backend, but perhaps you could use the routing processor and have each tenant on their own exporter? I'm not sure the logic to split a connection per tenant would belong to the Loki exporter.

I'm eager to hear other component owners' opinions. What do you think, @kovrus and @mar4uk?
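
As a sketch of that suggestion, a routing processor with one Loki exporter per tenant could look roughly like this (exporter names, endpoints, and the attribute key are hypothetical, and it assumes the tenant is available as a resource attribute or via request context):

processors:
  routing:
    from_attribute: tenant
    attribute_source: resource
    default_exporters: [loki/default]
    table:
      - value: foo
        exporters: [loki/foo]
      - value: bar
        exporters: [loki/bar]

exporters:
  loki/default:
    endpoint: http://loki-default:3100/loki/api/v1/push
  loki/foo:
    endpoint: http://loki-foo:3100/loki/api/v1/push
  loki/bar:
    endpoint: http://loki-bar:3100/loki/api/v1/push

Each exporter then gets its own sending queue and retry loop, so a backlog for bar would not starve foo.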

@wiardvanrij
Contributor Author

I share your thoughts. In retrospect, I'm wondering what retries could potentially cause this. At the moment it's more of a bug that logs rejected by Loki get retried.

Normally this should be more related to a service/network outage impacting the entire "backend" rather than a specific stream. But I could be mistaken here.

@mar4uk
Contributor

mar4uk commented Feb 15, 2023

I agree that the Loki exporter shouldn't be responsible for splitting the connection.

  1. 'bad logs' enter the queue and retry (or ideally not for 400 errors, but that's a different issue)

was fixed in #18083

  2. Since logs should be grouped by tenant, this behaviour from tenant bar should have zero impact on tenant foo

The fix in #18083 has probably eliminated the impact. I don't think we should implement splitting logic on the exporter side. The logic is already implemented on the Loki side via the X-Scope-OrgID header. So if tenant bar affects tenant foo, the problem should be solved on the Loki side.
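
For reference, the tenant split on the Loki side comes down to which X-Scope-OrgID header a push request carries. The exporter derives it from the loki.tenant hint, but with per-tenant exporters it could also be pinned statically through the HTTP client's headers setting; a minimal sketch (names and endpoints are hypothetical, and it assumes no loki.tenant hint is set on those pipelines):

exporters:
  loki/foo:
    endpoint: http://loki-gateway:3100/loki/api/v1/push
    headers:
      X-Scope-OrgID: foo
  loki/bar:
    endpoint: http://loki-gateway:3100/loki/api/v1/push
    headers:
      X-Scope-OrgID: bar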

@wiardvanrij
Contributor Author

Well, the thing is, even with a separation of concerns from the start we would still have needed that fix, but it would also mean the issue does not blow up the entire collector stack for all tenants.
Maybe in the future we add a feature x or introduce a bug y. Then I want to make sure that whatever tenant A does never, ever impacts tenant B. So having that separation keeps the blast radius smaller.

@jpkrohling
Member

The same concern applies to other aspects of the collector, not just the connection to the backend. I would argue that each tenant should have its own exporter, perhaps tying them all together with connectors, similar to what we have with the routing processor today.

cc @kovrus
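
A sketch of that connector-based variant (pipeline names, endpoints, and the OTTL condition are hypothetical; it assumes the routing connector and a tenant resource attribute):

connectors:
  routing:
    default_pipelines: [logs/other]
    table:
      - statement: route() where attributes["tenant"] == "foo"
        pipelines: [logs/foo]
      - statement: route() where attributes["tenant"] == "bar"
        pipelines: [logs/bar]

service:
  pipelines:
    logs/in:
      receivers: [fluentforward]
      processors: [memory_limiter, batch]
      exporters: [routing]
    logs/foo:
      receivers: [routing]
      exporters: [loki/foo]
    logs/bar:
      receivers: [routing]
      exporters: [loki/bar]
    logs/other:
      receivers: [routing]
      exporters: [loki/default]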

@atoulme atoulme removed the needs triage New item requiring triage label Mar 7, 2023
@github-actions
Contributor

github-actions bot commented May 8, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label May 8, 2023
@jpkrohling
Member

I created a thread on Slack inviting folks to share their thoughts and ideas around this: https://cloud-native.slack.com/archives/C01N6P7KR6W/p1686168282317269

@github-actions github-actions bot removed the Stale label Jun 8, 2023
@github-actions
Contributor

github-actions bot commented Aug 8, 2023 (same inactivity notice as above)

@github-actions github-actions bot added the Stale label Aug 8, 2023
@jpkrohling jpkrohling removed the Stale label Aug 22, 2023
@jpkrohling
Member

@wiardvanrij, a few months later I wanted to check back and see how you have progressed with this. The thread I created generated some thoughts and ideas, but I still have the feeling that we cannot prevent noisy neighbors from influencing other tenants if they share the same collector instance.

@github-actions
Contributor

(same inactivity notice as above)
