[receiver/filelog] It doesn't respect poll_interval when files are too large #17846
Comments
Pinging code owners: See Adding Labels via Comments if you do not have permissions to add labels yourself.
cc: @atoulme
In practice, we have seen many Kubernetes providers using the default kubelet file rotation of 10MB, which is not a production-ready default in my experience:

--container-log-max-size string (default: 10Mi)
--container-log-max-files int32 (default: 5)

https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

This causes any real production service with any real traffic load to spin files at a ridiculous rate. As part of this issue we should be championing better production defaults in the Kubernetes community; as anyone who has worked on production services before Kubernetes knows, rotation is generally in the realm of 500MB to 1GB or higher! This would also impact tools like kubectl logs, not just OTel.

Do we know what flavour of Kubernetes this particular example involves? (Open source K8s or a managed provider?)
They are using GKE. Unfortunately, GKE doesn't expose the container-log-max-size config (https://cloud.google.com/kubernetes-engine/docs/how-to/node-system-config), so they are stuck with the default value.
Yeah, we need to make some noise with cloud providers on this. I wonder if they are able to request it through their support. Will reach out and see what I can find out!
I agree there is an opportunity to increase utilization of the allowed resources here. One consideration to keep in mind is that we are currently leaning on the batching strategy to detect files that are rotated out of the receiver's include pattern, so a pool-based solution may also require rethinking the detection algorithm. Given that this may involve some major changes, I believe we should look at adding a parallel implementation that can be opted into with a feature gate. Once proven out sufficiently, we can likely migrate to this strategy as the default.
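A minimal sketch of what the opt-in could look like, assuming the collector's `featuregate` package; the gate ID, description, and wiring are hypothetical placeholders, not the actual implementation:

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/featuregate"
)

// useThreadPoolGate is a hypothetical opt-in gate for the pool-based
// implementation; the ID and description are illustrative only.
var useThreadPoolGate = featuregate.GlobalRegistry().MustRegister(
	"filelog.useThreadPool", // hypothetical gate ID
	featuregate.StageAlpha,
	featuregate.WithRegisterDescription("Consume files with a bounded worker pool instead of fixed half-batches."),
)

func main() {
	if useThreadPoolGate.IsEnabled() {
		fmt.Println("pool-based consumption enabled")
		// poll() would dispatch paths to a worker pool here.
	} else {
		fmt.Println("default batch-based consumption")
		// poll() would keep the existing batch-of-MaxConcurrentFiles/2 behavior.
	}
}
```

Users could then opt in (e.g. via the collector's --feature-gates flag) while the batch-based strategy remains the default until the new path is proven out.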
@djaglowski @dmitryax Here's the improvement of thread pooling over the normal approach for different file sizes and different batch sizes. Benchmark charts were attached for files of 17 MB and 162 bytes, 8 MB and 162 bytes, and 2 MB and 162 bytes.
@VihasMakwana, thanks for sharing these benchmarks. Looks like a nice improvement. I'll comment on your PR as well, but I think it will be important to have benchmarking for this scenario as well as others (e.g. all similar-sized files) in our codebase so that we can ensure all changes move us in the right direction.
@djaglowski makes sense. 100 files of 160KB each: takes ~600ms to receive all the logs.
I will add those test cases in my PR.
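For reference, a standalone Go benchmark sketch of the "many similar-sized files" scenario discussed above; it does not use the fileconsumer package, and the file count and sizes simply mirror the numbers quoted in this thread:

```go
package benchsketch

import (
	"bufio"
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"testing"
)

// BenchmarkScanSimilarSizedFiles writes 100 files of ~160KB each (162-byte
// lines) and measures scanning them all to the end. It only illustrates the
// scenario to be benchmarked; a real benchmark would drive the fileconsumer.
func BenchmarkScanSimilarSizedFiles(b *testing.B) {
	dir := b.TempDir()
	line := append(bytes.Repeat([]byte("x"), 161), '\n') // 162-byte entries
	const filesCount, linesPerFile = 100, 1000           // 1000 * 162B ≈ 160KB per file

	paths := make([]string, 0, filesCount)
	for i := 0; i < filesCount; i++ {
		p := filepath.Join(dir, fmt.Sprintf("app-%03d.log", i))
		f, err := os.Create(p)
		if err != nil {
			b.Fatal(err)
		}
		for j := 0; j < linesPerFile; j++ {
			if _, err := f.Write(line); err != nil {
				b.Fatal(err)
			}
		}
		f.Close()
		paths = append(paths, p)
	}

	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		for _, p := range paths {
			f, err := os.Open(p)
			if err != nil {
				b.Fatal(err)
			}
			sc := bufio.NewScanner(f)
			for sc.Scan() {
				// consume the line; a real benchmark would feed the pipeline
			}
			f.Close()
		}
	}
}
```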
Component(s)
receiver/filelog
Describe the issue you're reporting
We are using the filelog receiver to read container logs in the Splunk OpenTelemetry Collector for Kubernetes.
One of our customers noticed that logs were missing for some of their apps. On further inspection, we found that one container was producing logs so quickly that its log file was rotated at intervals ranging from 105ms to 6s. We observed that some log files were completely missing. The reason could be that the file was rotated before it was picked up by the agent.
By inspecting the file consumer from stanza, I found that poll is done synchronously.
opentelemetry-collector-contrib/pkg/stanza/fileconsumer/file.go
Line 101 in cdf0a42
So, the time interval between two polls is the max of `poll_interval` and the time taken by the previous poll, and `poll_interval` will not be respected when a single poll takes more time than `poll_interval`.
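To make the effect concrete, here is a small self-contained Go sketch (illustrative names only, not the stanza source) showing that a synchronous, ticker-driven poll loop can only achieve an effective interval of max(poll_interval, previous poll duration):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	pollInterval := 200 * time.Millisecond
	// slowPoll stands in for a poll that has to read a large batch of big files.
	slowPoll := func() { time.Sleep(1 * time.Second) }

	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()

	start := time.Now()
	for i := 0; i < 3; i++ {
		<-ticker.C // wait for the next tick
		fmt.Printf("poll %d starts at %v\n", i, time.Since(start).Round(10*time.Millisecond))
		slowPoll() // the next tick can only be consumed after this returns
	}
	// Prints start times roughly 1s apart rather than 200ms apart: the
	// effective interval is max(poll_interval, duration of the previous poll).
}
```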
The `poll` method finds the paths of all the files and consumes them in batches of length `MaxConcurrentFiles/2` to limit the number of open file descriptors. Until all the paths in a batch are read to the end, the next batch is not processed; it waits for the current batch to finish. So, most of the time, the receiver will be reading fewer than `MaxConcurrentFiles/2` files.
opentelemetry-collector-contrib/pkg/stanza/fileconsumer/file.go
Lines 132 to 140 in cdf0a42
I think we can move to a thread pooling pattern to utilise `MaxConcurrentFiles`.
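A minimal sketch of that thread-pooling idea, assuming a fixed pool of `MaxConcurrentFiles` workers fed from a channel of paths; the names and structure are illustrative, not the actual fileconsumer code:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sync"
)

// consumePooled reads every matched path with a fixed pool of maxConcurrentFiles
// workers. Unlike a batch-of-MaxConcurrentFiles/2 approach, a worker that
// finishes a short file immediately picks up the next path, so the open-file
// budget stays fully utilized instead of waiting on the slowest file in a batch.
func consumePooled(paths []string, maxConcurrentFiles int) {
	work := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < maxConcurrentFiles; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for path := range work {
				readToEnd(path)
			}
		}()
	}
	for _, p := range paths {
		work <- p
	}
	close(work)
	wg.Wait()
}

// readToEnd stands in for the reader that tails a file from its checkpointed
// offset to EOF and emits the entries.
func readToEnd(path string) {
	f, err := os.Open(path)
	if err != nil {
		return
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// emit sc.Bytes() into the pipeline
	}
}

func main() {
	consumePooled([]string{"/var/log/pods/a.log", "/var/log/pods/b.log"}, 10)
	fmt.Println("done")
}
```

Note that this sketch ignores offset tracking and rotated-file detection, which, as mentioned in the comments above, is the part that would need rethinking before a pool-based approach could replace the batching strategy.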