When disk I/O was very heavy, klog calling file sync with locking caused all log recording to be blocked for a long time. #403
Comments
@pohly I'm facing an issue and would appreciate it if you could take a look when you have a chance.
So your proposal is that Do you want to work on a PR?
/assign
Working on the PR sounds great. I'm available and would be happy to take that on.
I suggest acquiring the lock when calling file.Flush() (which writes the klog buffer into the kernel buffer), since flushing without it may lead to unexpected errors. However, file.Sync() (the fsync system call, which flushes the kernel buffer to disk) does not need the lock; let the kernel handle it.
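To make the suggestion above concrete, here is a minimal sketch of the lock scoping. It is not klog's actual code: `logSink`, `FlushAndSync`, and `run` are hypothetical names, and the sketch assumes a `bufio.Writer` in front of an `*os.File`.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sync"
)

// logSink is a hypothetical stand-in for a buffered log file writer.
type logSink struct {
	mu   sync.Mutex
	buf  *bufio.Writer
	file *os.File
}

// Write appends log data to the user-space buffer under the lock.
func (s *logSink) Write(p []byte) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.buf.Write(p)
}

// FlushAndSync holds the lock only while the user-space buffer is copied
// into the kernel. The fsync runs after the lock has been released, so
// writers are not blocked while the disk is busy.
func (s *logSink) FlushAndSync() error {
	s.mu.Lock()
	err := s.buf.Flush()
	f := s.file // capture the handle while still holding the lock
	s.mu.Unlock()
	if err != nil {
		return err
	}
	// Assumption: nothing closes f concurrently (no log rotation here).
	return f.Sync()
}

// run writes one line through the sink and returns what landed in the file.
func run() (string, error) {
	f, err := os.CreateTemp("", "sink-*.log")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	s := &logSink{buf: bufio.NewWriter(f), file: f}
	if _, err := s.Write([]byte("hello\n")); err != nil {
		return "", err
	}
	if err := s.FlushAndSync(); err != nil {
		return "", err
	}
	data, err := os.ReadFile(f.Name())
	return string(data), err
}

func main() {
	out, err := run()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", out)
}
```

A real implementation would also have to deal with the file being rotated or closed while the unlocked Sync is in flight; the sketch leaves that out.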
Indeed,
/triage accepted
/kind bug
What steps did you take and what happened:
We found an issue in our Kubernetes cluster. When disk I/O was very heavy, the node would occasionally become not ready.
We identified that this was because klog was calling fsync with locking, which caused kubelet's log writing to be blocked. This resulted in the container runtime status check time exceeding 10 seconds. As a result, kubelet considered the container runtime to be unhealthy and set the node to not ready status.
The kernel has an efficient buffered I/O mechanism. fsync ensures that logs are persisted to disk, but we should not call it while holding the lock: doing so significantly hurts performance and gives up the benefit of the kernel's buffered I/O, making it little different from direct I/O (which is generally only used by databases).
A higher-performance approach is to spawn a goroutine that periodically calls fsync without holding the lock. (fsync does not block buffered I/O.)
https://github.com/kubernetes/klog/blob/main/klog.go#L1224
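A rough sketch of what such a goroutine could look like is below. This is only an illustration under my own assumptions, not klog's API: `syncer`, `countingFile`, `startPeriodicSync`, and `demo` are hypothetical names, and sync errors are simply ignored.

```go
package main

import (
	"fmt"
	"os"
	"sync/atomic"
	"time"
)

// syncer is the only capability the background goroutine needs;
// *os.File satisfies it.
type syncer interface {
	Sync() error
}

// countingFile wraps *os.File and counts Sync calls (demonstration only).
type countingFile struct {
	n int64 // first field so atomic access stays 64-bit aligned
	f *os.File
}

func (c *countingFile) Sync() error {
	atomic.AddInt64(&c.n, 1)
	return c.f.Sync()
}

// startPeriodicSync calls s.Sync() every interval in its own goroutine.
// It takes no logging lock, so writers are never blocked by the fsync.
// The returned stop function must be called exactly once; it waits for
// the goroutine to exit.
func startPeriodicSync(s syncer, interval time.Duration) (stop func()) {
	done := make(chan struct{})
	exited := make(chan struct{})
	go func() {
		defer close(exited)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				_ = s.Sync() // errors are ignored in this sketch
			case <-done:
				return
			}
		}
	}()
	return func() {
		close(done)
		<-exited
	}
}

// demo writes a line, lets the syncer run for wait, and reports how many
// times fsync was called.
func demo(interval, wait time.Duration) (int64, error) {
	f, err := os.CreateTemp("", "periodic-*.log")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	c := &countingFile{f: f}
	stop := startPeriodicSync(c, interval)
	// The writer keeps using buffered I/O while fsync runs in the background.
	if _, err := f.WriteString("log line\n"); err != nil {
		stop()
		return 0, err
	}
	time.Sleep(wait)
	stop()
	return atomic.LoadInt64(&c.n), nil
}

func main() {
	n, err := demo(10*time.Millisecond, 100*time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("fsync calls:", n)
}
```

The interval trades durability for disk load: a longer interval means more log data can be lost on a crash, but fewer fsync calls compete with other I/O.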
What did you expect to happen:
klog should not block normal log recording while dirty pages are being flushed to disk (under heavy I/O).