Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Level detection #12645

Open
cyriltovena opened this issue Apr 17, 2024 · 25 comments
Open

Improve Level detection #12645

cyriltovena opened this issue Apr 17, 2024 · 25 comments
Assignees

Comments

@cyriltovena
Copy link
Contributor

We've seen case where the log would contains Warn as the level and an error message but it would still eventually end up as error because we don't properly search for Warn with a first capital level. This should be the very least.

We might want to investigate to support logfmt and json level, severity keys to make this more robust.

cc @shantanualsi @sandeepsukhani

@cyriltovena
Copy link
Contributor Author

We should also consider using another field like detected_level

@bentonam
Copy link

I am working on Alloy modules for log processing porting what was done for module v1 in agent to alloy modules v2 this is what i'm currently doing to detect / default a log-level if it helps.

https://github.com/grafana/alloy-modules/blob/log-modules/modules/utils/logs/log-levels.alloy

@jtackaberry
Copy link
Contributor

Looking forward to the upcoming improvements. Currently we have some JSON logs containing "level":"INFO" that are being misdetected as debug. I understand it'll never be as sophisticated as what @bentonam is doing on the Alloy side, but getting level detection of common formats for free is a nice value-add.

We should also consider using another field like detected_level

If so, could be configurable?

@YevhenLodovyi
Copy link

@cyriltovena Hi, sorry for troubling. I am using fluent-bit to send logs to loki 3.0.0. I decided no try non-released version of loki(due to fixes in loki-bloom feature), but noticed that I do not have level label anymore and logs are gray in logs volume.

just in case i have enabled log level discovery:

discover_log_levels: true
allow_structured_metadata: true

version: grafana/loki:main-fa2c789'

Screenshot 2024-06-04 at 17 59 49

Is it expected?

@jtackaberry
Copy link
Contributor

jtackaberry commented Jul 2, 2024

I see Loki 3.1 just landed and indeed level has been renamed to detected_level.

Is there going to be a way to reconfigure this to level? Users have gotten accustomed to it, and also Grafana uses level for the log volume histogram.

Should I open a separate issue?

@YevhenLodovyi
Copy link

Is there going to be a way to reconfigure this to level? Users have gotten accustomed to it, and also Grafana uses level for the log volume histogram.

As far as i can see there is a task in grafana: grafana/grafana#87564

To be honest it is a bit disappoint that it was changed in loki, but grafana code was not changed accordingly. I have to wait for grafana/grafana#87564 and only after that i can update to 3.1 version.

@cyriltovena
Copy link
Contributor Author

@svennergr can we do something for Grafana to fallback on detected level ?

@diranged
Copy link

Since this is sort of a breaking change - is there a way to push a feature to revert this optionally, until Grafana is updated?

@jtackaberry
Copy link
Contributor

jtackaberry commented Jul 17, 2024

I'd love the label name to be configurable myself. Even though I completely understand the it was changed, I'm partial to level instead of detected_level from both an aesthetic perspective as well as, more importantly, a user familiarity perspective.

AFAICT it's just a single line change so I'm almost tempted to do our own internal build, but it's hard to claim that's sane from a long-term support perspective.

I'd be happy to submit a PR if someone from Grafana Labs could assure me that wouldn't be wasted time.

@runenilsenoe
Copy link

1up on this

@lko23
Copy link

lko23 commented Aug 20, 2024

A further difference I see between level and detected_level, detected_level is not usable in stream selector (only as label filter expression in log pipeline).

{source=~".+", level="debug"} | detected_level="debug"

@jtackaberry
Copy link
Contributor

jtackaberry commented Aug 20, 2024

A further difference I see between level and detected_level, detected_level is not usable in stream selector (only as label filter expression in log pipeline).

That isn't a change. The auto-detected log level was always structured metadata, not a log stream label, so label filter expressions were always required. If you had a level label it must be added by your log collector.

Possibly the difference here is that in Loki 3.0, if a level label existed in the log stream it would skip autodetection, whereas in Loki 3.1 it always does it?

@lko23
Copy link

lko23 commented Aug 20, 2024

Thanks for clarification @jtackaberry. Indeed in my environment the label level was always populated both in loki 3.0 and 3.1 and therefore not auto-detected in both version. The label detected_level was never populated and is therefore auto-detected in version 3.1.

@jtackaberry
Copy link
Contributor

Thanks @lko23. This is another unfortunate consequence of the rename to detected_level: for log streams that explicitly add their own level label or structured metadata, we still get the detected_level field which is a bit pointless in this case. And this can only be disabled globally at the tenant level.

We're in the same boat: a small handful of our services define a level structured metadata field at collection time. I really don't want a detected_level in these cases. But it seems like this has to be all or nothing.

We haven't updated to Loki 3.1 yet but I'm starting to feel we'll need a custom build that reverts detected_level back to level in order to preserve the desired experience.

@trevorwhitney
Copy link
Collaborator

@jtackaberry hi, sorry for the disruption this has caused you.

so, the reason for moving to detected_level from level was a lot of people have level as and indexed/stream label, and we didn't want to conflict with it.

the reason we introduced this feature was that people had level in lots of different places, some via a label, some via the log line, others like yourself via structured metadata, and still others in no type of structure at all. what we needed was a label we can always rely on being present, in a place we can always count on it being. thus, we moved to detected_label injected into structured metadata. can you think of a solution that would solve your issues while still guaranteeing that, when someone turns on level detection, we have a single place to look for level for everyone?

@jtackaberry
Copy link
Contributor

@trevorwhitney, thanks for the thoughtful comment.

so, the reason for moving to detected_level from level was a lot of people have level as and indexed/stream label, and we didn't want to conflict with it.

I'm a bit confused by this, and I think this might indicate we have different goals for the this functionality, possibly influenced by future integrations Grafana has planned, or there are some nuances I don't understand yet. Also ...

what we needed was a label we can always rely on being present, in a place we can always count on it being

... except we can't always rely on this, at least not universally, because it's configurable, so it can be disabled per deployment.

For me, the ideal outcome takes in account the following:

  • Service operators may choose to enable log level detection at the log collector. If they are able, this is great as it's going to be the most reliable approach, because they are able to tailor it to the software whose logs are being collected. (We have a couple services that do this, but most don't.)
    • In this case, I don't need Loki to spend time trying to autodetect the level. I'd want the level detection to be skipped -- Loki isn't likely to do better at this than the service-specific log collector configuration.
  • If the logs aren't being shipped with a level label or structured metadata, only then should Loki's level detection logic run.
    • FWIW, I think structured metadata is a sane choice for the detected level

Given that, IMO the most desirable solution is:

  1. Allow the auto-detected level structured metadata field name to be configurable. This would allow me to change it from detected_level to level, preserving compatibility with Loki 3.0 (and current Grafana, though I appreciate that will change eventually). And, admittedly trivial, I also just happen to think that's better aesthetically, otherwise detected_level would be our only such field with an underscore.
  2. If the log line contains either a log stream label or a structure metadata field with the same name as the configurable auto-detected level field name, then skip the level detection for that log line.

This way, whether the level is auto-detected by Loki or set by the log collector, users searching logs only have to know about one label, and when I upgrade away from Loki 3.0, I don't need to ask people to update their saved queries/dashboards/wikis. And I'm in a position to dictate to everyone shipping logs to the platform what labels and structured metadata is valid and what isn't, and police their values to ensure compliance.

can you think of a solution that would solve your issues while still guaranteeing that, when someone turns on level detection, we have a single place to look for level for everyone?

Ultimately I guess I don't see why this guarantee needs to exist, at least for values of "everyone" that means all Loki users across all deployments. OTOH, if "everyone" means all Loki users of a given deployment or at a given organization, I believe this can be accomplished by the proposed solution above, because organizations are in a position to enforce consistency across the services under their operation.

I suppose that may be wishful thinking for some large organizations. Although in those egregious cases there's nothing preventing some smartass from configuring a label or structured metadata field called detected_level and we're right back to square one. :)

@lko23
Copy link

lko23 commented Aug 21, 2024

@jtackaberry fully agree with your proposed solution. Log lines that already have a label or metadata field called level, loki should not try to auto-detect any further level information. Making the names for auto-detected fields configurable is also a good idea.

And, admittedly trivial, I also just happen to think that's better aesthetically, otherwise detected_level would be our only such field with an underscore.

I think the first auto-detected field with an underscore introduced by loki was service_name and the aesthetics bothered my already then. I would have preferred service or servicename.

@trevorwhitney
Copy link
Collaborator

Got it, thanks for the feedback and helping me understand the use case. I think you nailed it that we might have different goals here.

This was added largely to support Explore Logs, our new, LogQL free UI for interacting with Loki. This experience is largely driven by metric queries, visualizing the logs over time by various aspects (levels, index labels, structured log lines, etc.). A lot of these charts are broken up by level as that's the number one aspect users tend to break up their logs by. In order to do this, Explore Logs needs to know where to find the level. Similarly, the logs histogram in Grafana has a similar problem, as it too attempts to visualize by level.

So, this is what I mean I say "we needed was a label we can always rely on being present, in a place we can always count on it being". Sure, you can turn off level detection completely, but when it's on, these UIs need a consistent place to look for it. So if we exposed a configuration to change the structured metadata field it's in, we would also have to expose an endpoint the UIs could hit to get that.

Currently, if the log line has an indexed level (there's a few label names we accept for this), we skip detection and just copy that value to the known detected_level field. I think we could add logic for when a log line either has detected_level or level as structured metadata, we skip detection as well. However, I'm resistant to the idea of being able to change this label due the complication that would introduce to UIs relying on it, especially since if you use level, it's still there.

@jtackaberry
Copy link
Contributor

Wow, how has it been a whole month? :/

This was added largely to support Explore Logs, our new, LogQL free UI for interacting with Loki.

Aha, yes, things have clicked into place for me here. I understand the use case and the challenges. Thanks @trevorwhitney

I think we could add logic for when a log line either has detected_level or level as structured metadata, we skip detection as well. However, I'm resistant to the idea of being able to change this label due the complication that would introduce to UIs relying on it, especially since if you use level, it's still there.

Presumably another concern isn't just that the level can be reliably found in a specific place, but also one of normalization: if level (whether a label or structured metadata) is pushed during collection, it may not be in the form required by Explore Logs. So does Loki actually copy the value of level as-is through to detected_level, or does it try to normalize it? I have to imagine normalization would also be important for Explore Logs?

But detected_level isn't guaranteed to avoid this problem, it's just less likely to have the problem. Based on other precedent in Loki, one could argue it should be called __level__. But that's ugly as hell and we surely don't want that surfaced to the user (although I could make the same argument about __stream_shard__ and __tenant_id__ which are surfaced to the user), so detected_level isn't an unreasonable compromise. It's very probably safe from conflicts.

Ultimately what I don't like about it is:

  1. For those of our services that define level as structured metadata, we'll have two fields with the same thing (one of them possibly normalized), creating a bit of noise in the UI
  2. detected_level is, as I mentioned earlier, aesthetically objectionable to my eyes

My personal preference would be to rule with an iron fist:

  • configure Loki to assert its dominion over level: for log streams that pass it as a label, use it as a hint for level detection (normalization) and then drop the label from the log stream, setting level as structured metadata instead. If level is passed and is structured metadata, rewrite/normalize it as needed.
  • configure Grafana to use level for Explore Logs

Default configuration could be detected_level and those of us who want to preserve original behavior or just dislike detected_level could configure Loki and Grafana as level instead.

Presumably you will need to do something similar to the first bullet anyway, to deal with the corner case where detected_level is passed by the log collector (either as a label or structured metadata)? Then it's just a matter of configuration to tell Loki and Grafana exactly which structured metadata field should be used for log level during presentation.

Is that too impractical? Is that heroics just because I really like the conciseness of level? :)

@svennergr
Copy link
Contributor

@svennergr can we do something for Grafana to fallback on detected level ?

FWIW, we just fixed this behavior in Grafana so that level and/or detected_level can be used: grafana/grafana#93574

@camrossi
Copy link

camrossi commented Oct 8, 2024

Reading this thread and the doc it seems to me the level detection uses hard coded values.
I have a system that send logs that are matching the hard coded values plus a custom level called "cleared" that ends up being set to "unknown"

This is expected but I am un-sure on what would be the best way to proceed here to have this extra log level saved as "cleared" in the detected_level aside adding this as an option in the go code.

Should I add a pipelineStage to parse just this specific message and add a a level or detected_level label ?

@trevorwhitney
Copy link
Collaborator

trevorwhitney commented Oct 17, 2024

@camrossi the values are only hardcoded when we try to detect the level from the log line. If you send an indexed or structured metatdata label that matches the allowed label names we will use the raw value for the level (which we then store in strctured metadata as detected_level).

@camrossi
Copy link

@trevorwhitney thanks ! Unfortunately I can't control the log format and the format is something on these lines:

%LOG_LOCAL7-6-SYSTEM_MSG [E4204936][transition][info][sys] <logMessage>
So the 3rd value in the [] is always the level.

I ended up doing this to find lines with my levels (that are also not matching the hard coded ones in Grafana for logs graph colours) and this has been working perfectly. I hope is a sensible way to do it!

          pipeline_stages:
            - regex: 
                expression: '\[(?P<level>critical|major|minor|warning|cleared|info)\]'
            - replace:
                expression: '^(major)$'
                source: level
                replace: 'error'
            - replace:
                expression: '^(minor)$'
                source: level
                replace: 'warning'
            - replace:
                expression: '^(cleared)$'
                source: level
                replace: 'info'
            - labels:
                level:

@KyriosGN0
Copy link

@camrossi where did you implement this, in loki or the collector ?

@camrossi
Copy link

@camrossi where did you implement this, in loki or the collector ?

I put it in the promtail ScrapeConfig

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests