(v1.8.3) Built-in multiline parser (cri) failing to concatenate large logs #3938

Closed
Fullplate opened this issue Aug 10, 2021 · 13 comments

@Fullplate

Behaviour:

Containerd uses the CRI log format. Log lines on disk are limited to a maximum of 16k each, so longer application lines are split into partial entries that must be concatenated to reconstruct the original log line.
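
For reference, a single long line ends up on disk as a run of CRI entries along these lines (timestamps and content below are placeholders, not taken from the actual logs); the P tag marks a partial chunk and the F tag marks the final chunk, which is what the built-in cri multiline parser relies on to stitch the line back together:

    2021-08-10T00:00:00.000000001Z stdout P <first ~16k of the line>
    2021-08-10T00:00:00.000000002Z stdout P <next ~16k of the line>
    2021-08-10T00:00:00.000000003Z stdout F <remainder of the line>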

Log lines up to around 200k in size are parsed and concatenated correctly.

Log lines around 300k+ in size are split into multiple logs (usually 2-4 depending on original log size).

My theory is that the parser's default flush timeout of 4 seconds is too short to capture all of the partial lines once the total size passes a certain point. I'm not sure whether reading new lines resets this timeout, but that seems like it would be the correct behaviour. It would also be nice if this value were configurable; from my reading of the code it appears to be hardcoded.

Environment: Kubernetes 1.20.7, containerd runtime

Config:

[SERVICE]
    Daemon Off
    Flush 1
    Log_Level {{ .Values.logLevel }}
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    Health_Check On
[INPUT]
    Name tail
    Path /var/log/containers/*.log
    multiline.parser cri
    Tag containers.*
    DB /var/log/fluent/fluentbit-containers.db
    Read_from_Head True
    Refresh_Interval 10
    Mem_Buf_Limit 256m
    Buffer_Chunk_Size 2m
    Buffer_Max_Size 64m
[FILTER]
    Name kubernetes
    Match containers.*
    Kube_Tag_Prefix containers.var.log.containers.
    Merge_Log On
    Keep_Log Off
[OUTPUT]
    Name                 forward
    Match                *
    Host                 xxxxx
    Port                 24224
    Require_ack_response True
@agup006
Member

agup006 commented Aug 12, 2021

Thanks for filing, @Fullplate. Are you able to build a version with that hard-coded value modified and see if that fixes the issue? If so, I think the solution should be to add it as a configurable option for large-log use cases.

@Fullplate
Author

@agup006 Thanks for the reply. I'll have a go at doing that and report back.

@Fullplate
Author

Fullplate commented Aug 19, 2021

@agup006 I confirmed that the "auto flush" function isn't an issue -- I disabled it completely and was still able to reproduce.

Here is a minimal reproduction of the issue:

  • create 2 containers, which results in 2 log files being tailed
  • in container #1, log a 10MB line, which will be split into many partial lines and take seconds to process (see the generator sketch after this list)
  • while partial lines are being processed, log something in container #2
  • this condition will be met, and the group for container #1 will be flushed too early
  • finally, we'll get a rule match for the last partial line and the group for container #1 will be flushed again
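
A minimal generator for the 10MB line in step two could be as simple as the following (Python; the exact size and the use of stdout are just assumptions for illustration):

    import sys

    # One ~10MB payload with no embedded newlines; the container runtime
    # writes it to disk as many 16k CRI partial entries that the cri
    # multiline parser then has to concatenate back into a single record.
    sys.stdout.write("x" * (10 * 1024 * 1024) + "\n")
    sys.stdout.flush()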

I was able to fix the issue by removing the extra flush; however, without understanding the rationale behind this bit of code, I'm not sure that's safe to do.
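
For illustration only, here is a toy model of the interaction described above. It is not Fluent Bit's actual code, and the trigger condition is an assumption based on the observed behaviour: a record arriving for a stream with no open group flushes every other stream's pending group, and the group for container #1 is then flushed a second time when its final line matches.

    from collections import defaultdict

    pending = defaultdict(list)   # open concatenation group per container stream
    emitted = []

    def flush_pending():
        # the "extra flush": close every open group, regardless of which
        # stream just produced data
        for stream, parts in pending.items():
            if parts:
                emitted.append((stream, "".join(parts)))
                parts.clear()

    def on_record(stream, chunk, is_final):
        if not pending[stream]:
            # hypothetical trigger: a record for a stream with no open
            # group flushes whatever the other streams had pending
            flush_pending()
        pending[stream].append(chunk)
        if is_final:                 # rule matched the closing "F" line
            emitted.append((stream, "".join(pending[stream])))
            pending[stream].clear()

    # container #1 emits two partial chunks; container #2 interleaves one line
    on_record("c1", "part-1 ", is_final=False)
    on_record("c2", "hello from c2", is_final=True)   # c1's group is flushed early here
    on_record("c1", "part-2", is_final=True)

    print(emitted)
    # [('c1', 'part-1 '), ('c2', 'hello from c2'), ('c1', 'part-2')]
    # container #1's single logical line arrives as two records instead of one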

@edsiper would you be able to shed light on this? (sorry if pinging maintainers is a no-no, but I think you may be the best person to answer this one)

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Sep 18, 2021
@Fullplate
Author

It would be great to get a response on this issue, even if it's just that the current implementation doesn't support this use case. Currently it's blocking our adoption of fluent-bit.

@github-actions github-actions bot removed the Stale label Sep 20, 2021
@edsiper
Member

edsiper commented Oct 6, 2021

@nokute78 can you take a look at this ?

@github-actions
Contributor

github-actions bot commented Nov 6, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Nov 6, 2021
@Fullplate
Author

Not stale

@github-actions github-actions bot removed the Stale label Nov 8, 2021
@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 10, 2021
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@cameronattard

This change seems to help to some degree - #6552

@nokute78
Collaborator

Sorry for missing this issue.

I read #3938 (comment) and it may be solved by #5564.

@cameronattard

cameronattard commented Jan 31, 2023

The above 2 changes do seem to make a significant improvement. It still happens occasionally but nowhere near as much as before. The remaining occurrences seem to be around the time of log file rotation.
