Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Ingester out-of-order error for faster troubleshooting #963

Closed
wardbekker opened this issue Sep 3, 2019 · 4 comments · Fixed by #1008
Closed

Improve Ingester out-of-order error for faster troubleshooting #963

wardbekker opened this issue Sep 3, 2019 · 4 comments · Fixed by #1008
Assignees
Labels
component/loki good first issue These are great first issues. If you are looking for a place to start, start here! type/enhancement Something existing could be improved

Comments

@wardbekker
Copy link
Member

wardbekker commented Sep 3, 2019

Is your feature request related to a problem? Please describe.
Out-of-order 400 http is a very common (user) error when starting out with Loki. Due too mislabeling, sending old logs, or the logging client is indeed sending out-of-order events (happened to me when creating loki_logger for elixir). Current HTTP Status 400 msg entry only logs e.g. "entry out of order for stream: {filename="/var/log/sntpc.log", job="varlogs"}", making it hard to understand what entry/entries are causing this. And ultimately it makes debugging/troubleshooting harder.

Additionally, a HTTP POST with stream entries is not atomic. So, you don't know how many entries failed, aka. just a blip or something more fundamental goes wrong.

Describe the solution you'd like
Include additional details to the "out of order" error:

  • number entries failed
  • number of entries successfully committed.
  • timestamp(s) of failing entry/entries vs current max timestamp(s) in loki index

Also "final error sending batch" is a bit cryptic msg. Perhaps change it to something that describes Loki's ingester behaviour e.g; "Batch post of entries not successful. Some and/or all entries might not be committed to Loki"

I feel this will help the user to get up and running with Loki faster.

Describe alternatives you've considered

  • (Semi)Manual inspection of logs for out-of-order dates.

Additional context

@wardbekker wardbekker self-assigned this Sep 3, 2019
@wardbekker wardbekker added component/loki good first issue These are great first issues. If you are looking for a place to start, start here! type/enhancement Something existing could be improved labels Sep 3, 2019
@slim-bean
Copy link
Collaborator

Also "final error sending batch" is a bit cryptic msg. Perhaps change it to something that describes Loki's ingester behaviour e.g; "Batch post of entries not successful. Some and/or all entries might not be committed to Loki"

If possible it would be nice to log how many entries succeeded since it sounds like we will be extracting that info anyway to return to the client

@pracucci
Copy link
Contributor

pracucci commented Sep 5, 2019

From my perspective, it's desirable a push is not atomic, as far as loki gives back enough information to the client to identify which log entries have been rejected.

In order to fix this, we need to enrich the push response in case of error. On the client side (promtail) we should have enough information to understand which log entries have been successfully ingested and which rejected.

An option is to switch to a structured error format (JSON) with the following structure:

{
  "failed_streams": [
    {
      "labels": "{foo=\"bar\"}",
      "entries": [
        { "ts": "2018-12-18T08:28:06.801064-04:00", "line": "aaa", "cause": "entry out of order" },
        { "ts": "2018-12-18T08:28:06.801064-04:00", "line": "bbb", "cause": "entry out of order" }
      ]
    }
  ]
}

Given we do expect an ingestion error not to be the normal case, we do return the most expressive error message we can (including the full failed entries), in order to give to the client enough data to easily spot the issue.

Such change breaks backward compatibility. We can both support old and new response format in promtail, while third parties integrations directly pushing to loki may potentially break the error response parsing.

What's your take?

@wardbekker
Copy link
Member Author

wardbekker commented Sep 11, 2019

@pracucci not sure if return that json struct would result in breaking backwards compatibility. HTTP status is still the same (400). Error body not significant for promtail client :

err = fmt.Errorf("server returned HTTP status %s (%d): %s", resp.Status, resp.StatusCode, line)

@pracucci
Copy link
Contributor

@wardbekker In the option describe above, the new version of promtail will parse the JSON response and act accordingly. If the new promtail will push logs to an older loki, it may break if we do support backward compatibility in promtail, because the older loki will reply as text which the newer promtail will try to parse as JSON.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/loki good first issue These are great first issues. If you are looking for a place to start, start here! type/enhancement Something existing could be improved
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants