
[receiver/filelog] It doesn't respect poll_interval when files are too large #17846

Open
hvaghani221 opened this issue Jan 18, 2023 · 17 comments
Labels: never stale, pkg/stanza, priority:p2, receiver/filelog, release:required-for-ga

Comments

@hvaghani221 (Member)

Component(s)

receiver/filelog

Describe the issue you're reporting

We are using the filelog receiver to read container logs in the Splunk OpenTelemetry Collector for Kubernetes.

One of our customers noticed that logs were missing for some of their apps. On further inspection, we found that one container was producing logs so quickly that its log file was rotated every 105ms-6s. We observed that some log files were missed entirely; the likely reason is that a file was rotated away before the agent picked it up.

By inspecting the file consumer from stanza, I found that polling is done synchronously.

So the interval between two polls is the maximum of poll_interval and the time taken by the previous poll; poll_interval is not respected whenever a single poll takes longer than poll_interval.
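
For illustration, here is a minimal sketch of that loop shape (not the actual stanza source; the package and function names are made up):

package example

import (
	"context"
	"time"
)

// startPolling is a simplified stand-in for the consumer's polling loop.
// Because poll runs synchronously on each tick, a slow poll delays the next
// one: the effective interval is max(pollInterval, duration of the last poll).
func startPolling(ctx context.Context, pollInterval time.Duration, poll func(context.Context)) {
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Blocks until the poll finishes; the ticker does not queue up
			// missed ticks, so slow polls are never "caught up".
			poll(ctx)
		}
	}
}

With this shape, shrinking poll_interval does not help once a single poll regularly exceeds it.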

The poll method finds the paths of all matching files and consumes them in batches of MaxConcurrentFiles/2 to limit the number of open file descriptors.

The next batch is not processed until every file in the current batch has been read to the end, so each batch waits on its slowest file and, most of the time, fewer than MaxConcurrentFiles/2 files are actually being read.

var wg sync.WaitGroup
for _, reader := range readers {
	wg.Add(1)
	go func(r *Reader) {
		defer wg.Done()
		r.ReadToEnd(ctx)
	}(reader)
}
wg.Wait()

I think we can move to a thread pooling pattern to utilise MaxConcurrentFiles.
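
For illustration, a minimal worker-pool sketch of what I have in mind (the package, the reader interface, and the function name are stand-ins, not the stanza implementation):

package example

import (
	"context"
	"sync"
)

// reader is a stand-in for stanza's *Reader; only ReadToEnd matters here.
type reader interface {
	ReadToEnd(ctx context.Context)
}

// consumeWithPool starts a fixed pool of workers that pull readers from a
// channel, so a few very large files no longer hold back an entire batch the
// way the per-batch WaitGroup above does.
func consumeWithPool(ctx context.Context, readers []reader, maxConcurrentFiles int) {
	queue := make(chan reader)
	var wg sync.WaitGroup

	for i := 0; i < maxConcurrentFiles; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range queue {
				r.ReadToEnd(ctx)
			}
		}()
	}

	for _, r := range readers {
		queue <- r // hand each file to whichever worker is free next
	}
	close(queue)
	wg.Wait()
}

The sketch glosses over the open-file-descriptor cap and the rotation-detection logic that the current batching provides; both would still need to be handled.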

@hvaghani221 hvaghani221 added the needs triage New item requiring triage label Jan 18, 2023
@github-actions (Contributor)

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@hvaghani221 (Member, Author)

cc: @atoulme

@matthewmodestino commented Jan 18, 2023

In practice, we have seen many Kubernetes providers using the default kubelet file rotation of 10MB, which is not a production-ready default in my experience.

--container-log-max-size string Default: 10Mi

--container-log-max-files int32 Default: 5

https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

This causes any real production service with any real traffic load to spin files at a ridiculous rate.

As part of this issue we should also champion better production defaults in the Kubernetes community; as anyone who ran production services before Kubernetes knows, rotation is generally in the realm of 500MB to 1GB or higher!

This would also impact tools like kubectl logs, not just OTel.

Do we know what flavour of Kubernetes this particular example involves? (Open Source K8s or a managed provider?)

@hvaghani221 (Member, Author)

> Do we know what flavour of Kubernetes this particular example involves? (Open Source K8s or a managed provider?)

They are using GKE. Unfortunately, GKE doesn't expose the container-log-max-size config (https://cloud.google.com/kubernetes-engine/docs/how-to/node-system-config), so they are stuck with the default value.

@matthewmodestino
Yeah, we need to make some noise with cloud providers on this. I wonder if they are able to request it through their support. Will reach out and see what I can find out!

@djaglowski djaglowski added priority:p2 Medium and removed needs triage New item requiring triage labels Jan 18, 2023
@djaglowski (Member)

> I think we can move to a thread pooling pattern to utilise MaxConcurrentFiles.

I agree there is an opportunity to increase utilization of the allowed resources here.

One consideration to keep in mind is that we are currently leaning on the batching strategy to detect files that are rotated out of the receiver's include pattern. A pool-based solution may also require rethinking the detection algorithm.

Given that this may potentially involve some major changes, I believe we should look at adding a parallel implementation that can be opted into with a feature gate. Once proven out sufficiently, we can likely migrate to this strategy as default.
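
For example, a sketch of what the opt-in could look like, assuming the collector's featuregate package (the gate ID and description below are hypothetical, not agreed-upon names):

package fileconsumer

import "go.opentelemetry.io/collector/featuregate"

// useThreadPoolGate guards the experimental pool-based implementation.
var useThreadPoolGate = featuregate.GlobalRegistry().MustRegister(
	"filelog.useThreadPool", // hypothetical gate ID
	featuregate.StageAlpha,
	featuregate.WithRegisterDescription("When enabled, the file consumer reads files via a worker pool instead of fixed batches."),
)

The poll path would then branch on useThreadPoolGate.IsEnabled() to pick between the existing batch-based strategy and the pool-based one, keeping the current behaviour as the default until the new path is proven out.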

@github-actions (Contributor)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


@github-actions github-actions bot added the Stale label May 22, 2023
@djaglowski djaglowski removed the Stale label May 22, 2023
VihasMakwana pushed a commit to VihasMakwana/opentelemetry-collector-contrib that referenced this issue Jun 4, 2023

@VihasMakwana (Contributor)

@djaglowski @dmitryax
I ran some benchmarks to measure log throughput for my thread pool model.
Each test was executed ten times per model and the average time calculated.

For example, I created 100 files: half of them were 17MB in size and the others were quite small, around 160 bytes.
The current model took 13.1 seconds on average to receive all the logs.
The same test with the threadpool feature gate enabled received the logs in 11.5 seconds on average.

That is nearly a 15% improvement in log throughput.

Below is the improvement of thread pooling over the current approach for different file sizes and different batch sizes.
The improvement varies in the 13-20% range, but it is evident in every case.

For files of 17MB and 162 bytes: [benchmark results screenshot, 2023-08-22]

For files of 8MB and 162 bytes: [benchmark results screenshot, 2023-08-22]

For files of 2MB and 162 bytes: [benchmark results screenshot, 2023-08-22]
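
For reference, a rough sketch of how a benchmark along these lines can be structured (the sizes mirror the 17MB scenario above; consumeAll is a placeholder for running the consumer to EOF on every file, not a real helper in the repo):

package fileconsumer_test

import (
	"fmt"
	"os"
	"path/filepath"
	"testing"
)

// writeFiles creates n files in dir, alternating between bigBytes and
// smallBytes to reproduce the mixed-size scenario.
func writeFiles(tb testing.TB, dir string, n, bigBytes, smallBytes int) []string {
	paths := make([]string, 0, n)
	for i := 0; i < n; i++ {
		size := smallBytes
		if i%2 == 0 {
			size = bigBytes
		}
		p := filepath.Join(dir, fmt.Sprintf("file-%d.log", i))
		if err := os.WriteFile(p, make([]byte, size), 0o600); err != nil {
			tb.Fatal(err)
		}
		paths = append(paths, p)
	}
	return paths
}

// consumeAll is a placeholder: the real benchmark wires up the file consumer
// here and blocks until all logs from paths have been received.
func consumeAll(tb testing.TB, paths []string) {
	_ = paths
}

// BenchmarkMixedFileSizes times end-to-end consumption of 100 files, half 17MB
// and half ~160 bytes.
func BenchmarkMixedFileSizes(b *testing.B) {
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		dir := b.TempDir()
		paths := writeFiles(b, dir, 100, 17<<20, 160)
		b.StartTimer()

		consumeAll(b, paths)
	}
}

Each configuration is then run with the feature gate enabled and disabled to compare the two models.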

@github-actions github-actions bot removed the Stale label Aug 23, 2023
@djaglowski (Member)

@VihasMakwana, thanks for sharing these benchmarks. Looks like a nice improvement.

I'll comment on your PR as well, but I think it will be important to have benchmarks in our codebase for this scenario as well as others (e.g. all files of similar size) so that we can ensure all changes move us in the right direction.

@VihasMakwana (Contributor)

@djaglowski makes sense.
I performed the same benchmarks for files of similar sizes. Results are similar for both models:

100 files of 160KB each: takes ~600ms to receive all the logs.
100 files of 1.6MB each: takes ~2.7s to receive all the logs.
100 files of 17MB each: takes ~26-27s to receive all the logs.

@VihasMakwana (Contributor)

I will add those test cases in my PR.


@github-actions (Contributor) commented Feb 2, 2024

Pinging code owners for pkg/stanza: @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski djaglowski added the release:required-for-ga Must be resolved before GA release label Feb 12, 2024


@github-actions github-actions bot added the Stale label Jun 17, 2024
@djaglowski djaglowski removed the Stale label Jun 17, 2024
@djaglowski djaglowski added the never stale Issues marked with this label will be never staled and automatically removed label Jun 28, 2024