
[receiver/filelog] It doesn't respect poll_interval when files are too large #17846

Open
hvaghani221 opened this issue Jan 18, 2023 · 17 comments
Labels: never stale, pkg/stanza, priority:p2, receiver/filelog, release:required-for-ga

Comments

@hvaghani221 (Member)

Component(s)

receiver/filelog

Describe the issue you're reporting

We are using the filelog receiver to read container logs in the Splunk OpenTelemetry Collector for Kubernetes.

One of our customers noticed that logs were missing for some of their apps. On further inspection, we found that one container was producing logs so quickly that its log file was rotated every 105ms-6s. We observed that some log files were missed entirely; the likely reason is that a file was rotated away before the agent picked it up.

By inspecting the file consumer from stanza, I found that polling is done synchronously.

So the interval between two polls is the maximum of poll_interval and the time taken by the previous poll; poll_interval is not respected whenever a single poll takes longer than poll_interval.
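
For illustration, here is a minimal sketch of that loop shape (not the actual stanza source; the package and function names are made up):

package example

import (
	"context"
	"time"
)

// startPolling is a simplified stand-in for the consumer's polling loop.
// Because poll runs synchronously on each tick, a slow poll delays the next
// one: the effective interval is max(pollInterval, duration of the last poll).
func startPolling(ctx context.Context, pollInterval time.Duration, poll func(context.Context)) {
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Blocks until the poll finishes; the ticker does not queue up
			// missed ticks, so slow polls are never "caught up".
			poll(ctx)
		}
	}
}

With this shape, shrinking poll_interval does not help once a single poll regularly exceeds it.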

The poll method finds the paths of all matching files and consumes them in batches of MaxConcurrentFiles/2 to limit the number of open file descriptors.

The next batch is not processed until every file in the current batch has been read to the end, so each batch waits on its slowest file and, most of the time, fewer than MaxConcurrentFiles/2 files are actually being read.

var wg sync.WaitGroup
for _, reader := range readers {
	wg.Add(1)
	go func(r *Reader) {
		defer wg.Done()
		r.ReadToEnd(ctx)
	}(reader)
}
wg.Wait()

I think we can move to a thread pooling pattern to utilise MaxConcurrentFiles.
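
For illustration, a minimal worker-pool sketch of what I have in mind (the package, the reader interface, and the function name are stand-ins, not the stanza implementation):

package example

import (
	"context"
	"sync"
)

// reader is a stand-in for stanza's *Reader; only ReadToEnd matters here.
type reader interface {
	ReadToEnd(ctx context.Context)
}

// consumeWithPool starts a fixed pool of workers that pull readers from a
// channel, so a few very large files no longer hold back an entire batch the
// way the per-batch WaitGroup above does.
func consumeWithPool(ctx context.Context, readers []reader, maxConcurrentFiles int) {
	queue := make(chan reader)
	var wg sync.WaitGroup

	for i := 0; i < maxConcurrentFiles; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range queue {
				r.ReadToEnd(ctx)
			}
		}()
	}

	for _, r := range readers {
		queue <- r // hand each file to whichever worker is free next
	}
	close(queue)
	wg.Wait()
}

The sketch glosses over the open-file-descriptor cap and the rotation-detection logic that the current batching provides; both would still need to be handled.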

@hvaghani221 hvaghani221 added the needs triage New item requiring triage label Jan 18, 2023
@github-actions (Contributor)

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@hvaghani221 (Member, Author)

cc: @atoulme

@matthewmodestino commented Jan 18, 2023

In practice, we have seen many Kubernetes providers using the default kubelet file rotation of 10MB, which is not a production-ready default in my experience.

--container-log-max-size string Default: 10Mi

--container-log-max-files int32 Default: 5

https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

This causes any real production service with any real traffic load to spin files at a ridiculous rate.

As part of this issue we should also champion better production defaults in the Kubernetes community; as anyone who ran production services before Kubernetes knows, rotation is generally in the realm of 500MB to 1GB or higher!

This would also impact tools like kubectl logs, not just OTel.

Do we know what flavour of Kubernetes this particular example involves? (Open Source K8s or a managed provider?)

@hvaghani221 (Member, Author)

> Do we know what flavour of Kubernetes this particular example involves? (Open Source K8s or a managed provider?)

They are using GKE. Unfortunately, GKE doesn't expose the container-log-max-size config (https://cloud.google.com/kubernetes-engine/docs/how-to/node-system-config), so they are stuck with the default value.

@matthewmodestino
Yeah, we need to make some noise with cloud providers on this. I wonder if they are able to request it through their support. Will reach out and see what I can find out!

@djaglowski djaglowski added priority:p2 Medium and removed needs triage New item requiring triage labels Jan 18, 2023
@djaglowski (Member)

> I think we can move to a thread pooling pattern to utilise MaxConcurrentFiles.

I agree there is an opportunity to increase utilization of the allowed resources here.

One consideration to keep in mind is that we are currently leaning on the batching strategy to detect files that are rotated out of the receiver's include pattern. A pool-based solution may also require rethinking the detection algorithm.

Given that this may potentially involve some major changes, I believe we should look at adding a parallel implementation that can be opted into with a feature gate. Once proven out sufficiently, we can likely migrate to this strategy as default.
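
For example, a sketch of what the opt-in could look like, assuming the collector's featuregate package (the gate ID and description below are hypothetical, not agreed-upon names):

package fileconsumer

import "go.opentelemetry.io/collector/featuregate"

// useThreadPoolGate guards the experimental pool-based implementation.
var useThreadPoolGate = featuregate.GlobalRegistry().MustRegister(
	"filelog.useThreadPool", // hypothetical gate ID
	featuregate.StageAlpha,
	featuregate.WithRegisterDescription("When enabled, the file consumer reads files via a worker pool instead of fixed batches."),
)

The poll path would then branch on useThreadPoolGate.IsEnabled() to pick between the existing batch-based strategy and the pool-based one, keeping the current behaviour as the default until the new path is proven out.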

@github-actions (Contributor)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


@github-actions github-actions bot added the Stale label May 22, 2023
@djaglowski djaglowski removed the Stale label May 22, 2023
VihasMakwana pushed a commit to VihasMakwana/opentelemetry-collector-contrib that referenced this issue Jun 4, 2023

@VihasMakwana (Contributor)

@djaglowski @dmitryax
I ran some benchmarks to measure log throughput for my thread pool model.
Each test was executed ten times per model and the average time calculated.

For example, I created 100 files: half of them were 17MB in size and the others were quite small, around 160 bytes.
The current model took 13.1 seconds on average to receive all the logs.
The same test with the threadpool feature gate enabled received the logs in 11.5 seconds on average.

That is nearly a 15% improvement in log throughput.

Below is the improvement of thread pooling over the current approach for different file sizes and different batch sizes.
The improvement varies in the 13-20% range, but it is evident in every case.

For files of 17MB and 162 bytes: [benchmark results screenshot, 2023-08-22]

For files of 8MB and 162 bytes: [benchmark results screenshot, 2023-08-22]

For files of 2MB and 162 bytes: [benchmark results screenshot, 2023-08-22]
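
For reference, a rough sketch of how a benchmark along these lines can be structured (the sizes mirror the 17MB scenario above; consumeAll is a placeholder for running the consumer to EOF on every file, not a real helper in the repo):

package fileconsumer_test

import (
	"fmt"
	"os"
	"path/filepath"
	"testing"
)

// writeFiles creates n files in dir, alternating between bigBytes and
// smallBytes to reproduce the mixed-size scenario.
func writeFiles(tb testing.TB, dir string, n, bigBytes, smallBytes int) []string {
	paths := make([]string, 0, n)
	for i := 0; i < n; i++ {
		size := smallBytes
		if i%2 == 0 {
			size = bigBytes
		}
		p := filepath.Join(dir, fmt.Sprintf("file-%d.log", i))
		if err := os.WriteFile(p, make([]byte, size), 0o600); err != nil {
			tb.Fatal(err)
		}
		paths = append(paths, p)
	}
	return paths
}

// consumeAll is a placeholder: the real benchmark wires up the file consumer
// here and blocks until all logs from paths have been received.
func consumeAll(tb testing.TB, paths []string) {
	_ = paths
}

// BenchmarkMixedFileSizes times end-to-end consumption of 100 files, half 17MB
// and half ~160 bytes.
func BenchmarkMixedFileSizes(b *testing.B) {
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		dir := b.TempDir()
		paths := writeFiles(b, dir, 100, 17<<20, 160)
		b.StartTimer()

		consumeAll(b, paths)
	}
}

Each configuration is then run with the feature gate enabled and disabled to compare the two models.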

@github-actions github-actions bot removed the Stale label Aug 23, 2023
@djaglowski (Member)

@VihasMakwana, thanks for sharing these benchmarks. Looks like a nice improvement.

I'll comment on your PR as well, but I think it will be important to have benchmarks in our codebase for this scenario as well as others (e.g. all files of similar size) so that we can ensure all changes move us in the right direction.

@VihasMakwana (Contributor)

@djaglowski makes sense.
I performed the same benchmarks for files of similar sizes. Results are similar for both models:

100 files of 160KB each: takes ~600ms to receive all the logs.
100 files of 1.6MB each: takes ~2.7s to receive all the logs.
100 files of 17MB each: takes ~26-27s to receive all the logs.

@VihasMakwana (Contributor)

I will add those test cases in my PR.


@github-actions (Contributor) commented Feb 2, 2024

Pinging code owners for pkg/stanza: @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski djaglowski added the release:required-for-ga Must be resolved before GA release label Feb 12, 2024


@github-actions github-actions bot added the Stale label Jun 17, 2024
@djaglowski djaglowski removed the Stale label Jun 17, 2024
@djaglowski djaglowski added the never stale Issues marked with this label will be never staled and automatically removed label Jun 28, 2024