promtail-linux-amd64.gz 0.4.0 full of errors after upgrade #1199

Closed · Alexvianet opened this issue Oct 24, 2019 · 15 comments

@Alexvianet

Alexvianet commented Oct 24, 2019

Describe the bug
After updating promtail to the 0.4.0 binary (CentOS 7 VM), the promtail log is full of:

turned HTTP status 429 Too Many Requests (429): ingestion rate limit (4194304) exceeded while adding 3 lines"
Oct 24 17:22:47 loki.local promtail[24373]: level=error ts=2019-10-24T22:22:47.41979532Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=429 error="server returned HTTP status 429 Too Many Requests (429): ingestion rate limit (4194304) exceeded while adding 2 lines"
Oct 24 17:22:47 loki.local promtail[24373]: level=error ts=2019-10-24T22:22:47.424316433Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-24 20:18:33.8265 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki..local\", job=\"systemd-journal\", unit=\"loki.service\"},"

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki 0.4.0 binary: https://github.com/grafana/loki/releases/download/v0.4.0/loki-linux-amd64.gz
    loki_config:
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 0.0.0.0
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 1

schema_config:
  configs:
  - from: 2018-04-15
    store: boltdb
    object_store: filesystem
    schema: v9
    index:
      prefix: index_
      period: "720h"

storage_config:
  boltdb:
    directory: "/srv/loki/index"

  filesystem:
    directory: "/srv/loki/chunks"

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: "720h"

chunk_store_config:
  max_look_back_period: 0

table_manager:
  chunk_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  index_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  retention_deletes_enabled: false
  retention_period: 0
  2. Started Promtail 0.4.0 binary: https://github.com/grafana/loki/releases/download/v0.4.0/promtail-linux-amd64.gz
    promtail_config:
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  grpc_server_max_recv_msg_size: 6291456
  grpc_server_max_send_msg_size: 6291456
positions:
  filename: /tmp/positions.yaml

client:
  url: http://127.0.0.1:3100/api/prom/push

scrape_configs:
  - job_name: journal
    journal:
      path: /var/log/journal
      labels:
        job: systemd-journal
        hostname: "loki.local"
        env: "infrastucture"
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
  - job_name: nginx
    static_configs:
    - targets:
        - localhost
      labels:
        hostname: "loki.local"
        job: nginx
        env: "infrastucture"
        __path__: /var/log/nginx/*log

After downgrading promtail to 0.3.0 with the same configuration, everything looks OK:

Oct 24 17:27:03 loki.local promtail[25145]: level=info ts=2019-10-24T22:27:03.852070243Z caller=main.go:55 msg="Starting Promtail" version="(version=HEAD-e23c141, branch=HEAD, revision=v0.3.0)"
Oct 24 17:27:08 loki.local promtail[25145]: level=info ts=2019-10-24T22:27:08.85068113Z caller=filetargetmanager.go:257 msg="Adding target" key="{env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"nginx\"}"

Expected behavior
Promtail 0.4.0 should work without errors.

Environment:

  • Infrastructure: CentOS 7 VM
  • Deployment tool: Curl
@Alexvianet changed the title from "promtail-linux-amd64.gz 0.4.0 full of errors after update from 0.3.0" to "promtail-linux-amd64.gz 0.4.0 full of errors after upgrade" Oct 24, 2019
@slim-bean
Collaborator

Hey @Alexvianet, additional limits functionality was added to Loki. These limits mostly apply when running Loki in multi-tenant mode, but they are still enforced in any scenario.

I believe you are seeing the difference between promtail 0.3 and 0.4 because of some batching changes made in promtail 0.4, where it will send larger batches.
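
For context, the batching behavior can be tuned on the promtail side. A sketch (batchwait and batchsize are promtail client options not shown in the configs above, so check them against your promtail version's docs; the values here are only illustrative):

client:
  url: http://127.0.0.1:3100/api/prom/push
  # hypothetical tuning values, not taken from this issue:
  batchwait: 1s        # maximum time to wait before sending a batch
  batchsize: 102400    # maximum batch size in bytes before sending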

Regardless, the limits are set in Loki, and the descriptions of these configs can be found in the Loki configuration documentation.

I believe you should be able to add something like ingestion_rate: 50000 to the limits_config section in your Loki config to double the amount of traffic Loki will allow (or increase this even more if you never want it to interfere)
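
Something like this in the Loki config, building on the limits_config shown above (a sketch; pick a value that fits your volume):

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: "720h"
  ingestion_rate: 50000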

I will update the release notes and changelog tomorrow to call better attention to this change, as I expect others will see it too. Thanks for reporting! Let us know if changing this fixes the problem for you.

@Alexvianet
Author

I have updated loki_config:

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 0.0.0.0
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 1

schema_config:
  configs:
  - from: 2018-04-15
    store: boltdb
    object_store: filesystem
    schema: v9
    index:
      prefix: index_
      period: "720h"

storage_config:
  boltdb:
    directory: "/srv/loki/index"

  filesystem:
    directory: "/srv/loki/chunks"

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: "720h"
  ingestion_rate: 75000
  ingestion_burst_size: 100000

chunk_store_config:
  max_look_back_period: 0

table_manager:
  chunk_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  index_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  retention_deletes_enabled: false
  retention_period: 0

promtail_config:

server:
  http_listen_port: 9080
  grpc_listen_port: 0
  grpc_server_max_recv_msg_size: 6291456
  grpc_server_max_send_msg_size: 6291456
positions:
  filename: /tmp/positions.yaml

client:
  url: http://127.0.0.1:3100/api/prom/push
            
scrape_configs:
  - job_name: journal
    journal:
      path: /var/log/journal
      labels:
        job: systemd-journal
        hostname: "loki.local"
        hostip: "192.168.5.62"
        env: "infrastucture"
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
  - job_name: nginx
    static_configs:
    - targets:
        - localhost
      labels:
        hostname: "loki.local"
        hostip: "192.168.5.62"
        job: nginx
        env: "infrastucture"
        __path__: /var/log/nginx/*log

After restarting promtail I get:

Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.171116495Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:29:11.148225 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"lok...
Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.173545654Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:29:11.172746 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"lok...
Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.181702906Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:30:11.117778 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"lok...
Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.188515599Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:35:13.78647 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki...

@cyriltovena
Contributor

The rate limit error is gone. It looks like promtail is trying to send old logs, which can happen if the position has not been saved correctly. Are you still seeing this permanently?

@Alexvianet
Author

Alexvianet commented Oct 25, 2019

My /tmp/positions.yaml was not changed:

positions:
  journal-journal: s=0fc1537c6b1840e1877999baf0cb73ea;i=293ba;b=0ac4bb9da52f49d79a6bd2d163c1da11;m=26e7b3c01f3;t=595bbdc1ea6a6;x=e141da8f3b5c3d38

It shows up after each promtail restart and then goes silent:

Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.013418766Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 03:20:01.759025 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\"},"
Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.065720055Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 04:40:01.74643 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\"},"
Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.140274973Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 05:53:33.938057 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.162363497Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:29:10.869248 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"promtail.service\"},"
Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.164695805Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:29:10.970477 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.168694927Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:29:10.994415 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"promtail.service\"},"
Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.171116495Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:29:11.148225 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.173545654Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:29:11.172746 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"promtail.service\"},"
Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.181702906Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:30:11.117778 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"systemd-logind.service\"},"
Oct 25 02:38:03 loki.local promtail[3593]: level=error ts=2019-10-25T07:38:03.188515599Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:35:13.78647 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"promtail.service\"},"
Oct 25 02:38:07 loki.local promtail[3593]: level=info ts=2019-10-25T07:38:07.866754039Z caller=filetargetmanager.go:257 msg="Adding target" key="{env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"nginx\"}"
Oct 25 08:13:32 loki.local systemd[1]: Stopping promtail...
Oct 25 08:13:32 loki.local promtail[3593]: level=info ts=2019-10-25T13:13:32.278052023Z caller=signals.go:54 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
Oct 25 08:13:32 loki.local promtail[3593]: level=info ts=2019-10-25T13:13:32.27823752Z caller=filetargetmanager.go:318 msg="Removing target" key="{env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"nginx\"}"
Oct 25 08:13:32 loki.local systemd[1]: Stopped promtail.
Oct 25 08:13:32 loki.local systemd[1]: Started promtail.
Oct 25 08:13:32 loki.local promtail[9748]: level=warn ts=2019-10-25T13:13:32.309585928Z caller=filetargetmanager.go:98 msg="WARNING!!! entry_parser config is deprecated, please change to pipeline_stages"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.310880623Z caller=journaltarget.go:226 msg="received error reading saved journal position" err="failed to get realtime timestamp: 99"
Oct 25 08:13:32 loki.local promtail[9748]: level=info ts=2019-10-25T13:13:32.311206382Z caller=server.go:121 http=[::]:9080 grpc=[::]:42673 msg="server listening on addresses"
Oct 25 08:13:32 loki.local promtail[9748]: level=info ts=2019-10-25T13:13:32.311387242Z caller=main.go:60 msg="Starting Promtail" version="(version=, branch=, revision=)"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.362092272Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 06:20:01.24096 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.374332169Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:28:20.938094 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.37682463Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:29:10.970477 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.379264113Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:29:10.973722 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.381672507Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:29:11.148225 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.38460642Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:29:11.151865 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.393822577Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:32:14.939277 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"sshd.service\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.400407259Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:38:02.834845 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.40317915Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:38:03.013007 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.40543603Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:38:03.118287 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.430744673Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 07:47:51.624904 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"systemd-logind.service\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.504668195Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 08:08:36.533847 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.554298106Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 09:30:01.677583 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.604250281Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 10:50:01.185296 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\"},"
Oct 25 08:13:32 loki.local promtail[9748]: level=error ts=2019-10-25T13:13:32.649238214Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 12:10:01.384705 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\"},"
Oct 25 08:13:37 loki.local promtail[9748]: level=info ts=2019-10-25T13:13:37.311708415Z caller=filetargetmanager.go:257 msg="Adding target" key="{env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"nginx\"}"

Then I removed my /tmp/positions.yaml, restarted promtail, and still get:

Oct 25 08:23:13 loki.local promtail[10001]: level=error ts=2019-10-25T13:23:13.643684448Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 13:13:32.374574 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"promtail.service\"},"
Oct 25 08:23:13 loki.local promtail[10001]: level=error ts=2019-10-25T13:23:13.646245179Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 13:13:32.482415 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 08:23:13 loki.local promtail[10001]: level=error ts=2019-10-25T13:23:13.656656879Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 13:20:41.229254 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"polkit.service\"},"
Oct 25 08:23:13 loki.local promtail[10001]: level=error ts=2019-10-25T13:23:13.661142377Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 13:21:54.368167 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"promtail.service\"},"
Oct 25 08:23:13 loki.local promtail[10001]: level=error ts=2019-10-25T13:23:13.663586223Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 13:21:54.528743 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 08:23:13 loki.local promtail[10001]: level=error ts=2019-10-25T13:23:13.666531283Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 13:21:54.551121 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"promtail.service\"},"
Oct 25 08:23:13 loki.local promtail[10001]: level=error ts=2019-10-25T13:23:13.669273437Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 13:21:54.604777 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"loki.service\"},"
Oct 25 08:23:13 loki.local promtail[10001]: level=error ts=2019-10-25T13:23:13.677783135Z caller=client.go:206 component=client host=127.0.0.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2019-10-25 13:21:54.627821 +0000 UTC ignored, reason: 'entry out of order' for stream: {env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"systemd-journal\", unit=\"promtail.service\"},"
Oct 25 08:23:18 loki.local promtail[10001]: level=info ts=2019-10-25T13:23:18.302178291Z caller=filetargetmanager.go:257 msg="Adding target" key="{env=\"infrastucture\", hostip=\"192.168.5.62\", hostname=\"loki.local\", job=\"nginx\"}"

@slim-bean
Collaborator

It's possible this is normal behavior: if promtail re-sends a line (or a line older than the newest entry for a given stream), Loki will reject it with that error.

Deleting the positions file would definitely cause some lines to be resent and trigger this error.

I would, however, expect those errors to go away over time as promtail starts sending newer logs which Loki has not seen yet.

@Alexvianet
Author

Alexvianet commented Oct 25, 2019

New logs are arriving in Grafana and the functionality looks like it's working; the problem is only that promtail spams error logs after each restart (on v0.3.0 everything was OK). Loki does receive newer logs it has not seen yet, but restarting promtail changes nothing: every restart produces the same error spam. Also, promtail does not log any info about processing journal logs; in Grafana Explore (live) I can see new entries, but the promtail log says nothing about journal logs, only about the static nginx logs.

@imrankhan11

We are having the same issue. What is the highest value we can set for ingestion_rate?

@roidelapluie
Contributor

ingestion_rate is no longer correct. Use ingestion_rate_mb:

#1278
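
For example (a sketch assuming the ingestion_rate_mb / ingestion_burst_size_mb option names introduced in #1278; the values here are only illustrative):

limits_config:
  ingestion_rate_mb: 8          # per-tenant ingestion rate, in MB/s
  ingestion_burst_size_mb: 16   # per-tenant burst allowance, in MB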

@thirdeyenick

We were also hitting Loki's ingestion rate limit when using promtail to ship logs. I was wondering whether promtail actually tries to resend the batch once it gets a 429 HTTP status code from Loki (saying that the ingestion rate limit was exceeded). If I interpret this line correctly, promtail only retries on 5xx HTTP errors and connection errors. Is there a particular reason for this?

@chancez
Contributor

chancez commented Dec 19, 2019

I just hit this with fluentd using the loki plugin also.

@stale

stale bot commented Jan 18, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 18, 2020
@slim-bean
Collaborator

> We were also hitting Loki's ingestion rate limit when using promtail to ship logs. I was wondering whether promtail actually tries to resend the batch once it gets a 429 HTTP status code from Loki (saying that the ingestion rate limit was exceeded). If I interpret this line correctly, promtail only retries on 5xx HTTP errors and connection errors. Is there a particular reason for this?

You are correct, rate-limited logs are dropped. The reasoning is that if you are hitting the rate limit, retrying those logs would never succeed and would just dig the hole deeper. (New logs still coming in while you retry the failed ones would only add to the problem.)
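
For reference, the retries promtail does perform (on 5xx and connection errors) are governed by the client's backoff settings; 429s bypass this and are dropped. A sketch, assuming the backoff_config block and field names from the promtail client docs (verify against your version):

client:
  url: http://127.0.0.1:3100/api/prom/push
  backoff_config:
    min_period: 500ms   # initial backoff between retries
    max_period: 5m      # upper bound on the backoff interval
    max_retries: 10     # give up on a batch after this many attempts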

@stale stale bot removed the stale label Jan 21, 2020
@slim-bean
Collaborator

Also, sorry for the slow reply; the stale bot nudged this and I realized I never responded. I'm going to close this issue, however, as I believe the initial issue was addressed.

@jgehrcke
Contributor

jgehrcke commented Apr 9, 2020

> You are correct, rate-limited logs are dropped. The reasoning is that if you are hitting the rate limit, retrying those logs would never succeed and would just dig the hole deeper. (New logs still coming in while you retry the failed ones would only add to the problem.)

Chiming in to say that while this is a possible scenario, it is also a worst-case scenario.

When the other end tells you to back off (429), there are at least two rather different ways to handle the situation: 1) retry later, or 2) drop the data immediately. Deciding between the two is easy when you know a lot about the context, specifically about the expected time evolution of the data push rate. When you know that the pressure is going to decrease at some point in the future, then (1) is wise. When you know that the push rate will remain as-is or increase further, then (2) is the right choice. In other words, if the current log push rate peaks only for a "short time window" (short relative to the local buffer size) and drops again thereafter, then one can avoid dropping data by backing off when seeing 429 responses and sending the data once there is some breathing room again.

This is part of a larger optimization problem, and of course one needs to know exactly what to optimize for. In many cases dropping data comes with a large penalty in that optimization problem, and therefore the strategy that does not imply dropping data is very much preferable.

There is no one-size-fits-all default strategy; one must decide between (1) and (2) based on knowledge of the wider context.

@slim-bean
Collaborator

slim-bean commented Apr 9, 2020 via email
