Extended deployment tests and others flake all the time because of journald rate limiting #14785
This is a journald logging issue where we are losing log messages...
So at this point this is now my "journald breaks deployment tests and causes hundreds of flakes" tracking issue. @stevekuznetsov @mfojtik @soltysh @derekwaynecarr @sjenning @jcantrill @portante Journald rate limiting on pod logs causes flakes on e2e tests that depend on looking at log output. Log output is something that is supposed to work no matter what. We have a few options on our test runs:
What other options do we have? This is failing 1/3 or 1/4 e2e runs and making me angry. Let's get a decision.
What are the journald settings?
@stevekuznetsov do we change anything except what's in here? https://github.com/openshift/origin-ci-tool/blob/151e40a674f15fcf77dd7ca12b75738e06f93972/oct/ansible/oct/roles/dependencies/tasks/pre_install.yml If not, then it's stock RHEL/CentOS. We turn rate limiting off in a few test setups, but it may be that we just want to turn off rate limiting on the docker journal via openshift-ansible (for the single- and multi-node setups). Basically, dropping logs via rate limiting has no place in our test environment for end-user containers. Only blocking would be acceptable.
Nope, nothing other than that
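For anyone reproducing this, the rate-limit knobs can be inspected directly on the host. The paths and the quoted defaults below are general RHEL/CentOS 7 assumptions, not values pulled from our CI config:

```sh
# Check the journald rate-limit settings in effect on the host. Stock
# RHEL/CentOS 7 ships these commented out, so the built-in defaults apply
# (roughly 1000 messages per 30s per service; verify on your systemd version).
grep -E '^#?RateLimit' /etc/systemd/journald.conf

# Any drop-ins overriding the defaults would live here:
ls /etc/systemd/journald.conf.d/ 2>/dev/null
```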
I vote for json-file (option 2). That's what I do on my personal cluster deployments, along with switching to the overlay storage driver. Not sure what the historical reason was for doing journald support in docker. Remote log aggregation? This is part of my overall opinion that we should align with upstream kube as much as possible. It leads to things like this when we try to run kube e2e:
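As an aside on option 2, a json-file setup would look roughly like the following. The rotation sizes and the overlay choice are illustrative, not settings taken from any OpenShift playbook:

```sh
# Illustrative /etc/docker/daemon.json: json-file logging with rotation,
# plus the overlay storage driver mentioned above.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  },
  "storage-driver": "overlay"
}
EOF
sudo systemctl restart docker
```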
I was not under the impression that we suggested anything but journald for production customer deployments?
We don't, but it has challenges in docker because the driver can't control rate limiting per service. The journal is still our default because it handles things like rollover and rate limiting that we'd otherwise have to deploy additional solutions for. Effectively this issue is just "rate limiting with drop should be opt-in for journal, not opt-out", but we can't fix that yet.
@smarterclayton, proposed PR openshift/origin-ci-tool#133 to drop rate-limiting.
How do you want to achieve that? That's the reason you rate limit: to drop whatever is above the limit. IMHO, for our tests we should be turning off the limits entirely. Our docs should warn users about it until docker-journald has a proper fix in place, which is that each container should be treated as a separate service.
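For completeness, turning the limits off on a host amounts to something like the drop-in below. This is a generic systemd sketch, not the contents of the linked origin-ci-tool PR:

```sh
# Disable journald rate limiting via a drop-in: setting both knobs to 0
# turns rate limiting off entirely, per journald.conf(5).
sudo mkdir -p /etc/systemd/journald.conf.d
cat <<'EOF' | sudo tee /etc/systemd/journald.conf.d/99-no-rate-limit.conf
[Journal]
RateLimitInterval=0
RateLimitBurst=0
EOF
sudo systemctl restart systemd-journald
```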
Bumping to P0, same as the linked PR. They are both the same issue; the only difference is that they touch different test regions.
Also seen in #15790 (comment)
I've double-checked the logs you pointed to, @tnozicka, and it doesn't look like this is related to journald rate limiting. The test ran after journald was reconfigured, and during the test window I saw no indication that journald was rate-limiting any messages. I will close this issue once #17597 merges and I'm more confident this is no longer happening.
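A quick way to confirm whether rate limiting kicked in during a test window is to look for journald's own suppression messages; this is a generic check, not the exact one used above:

```sh
# journald logs "Suppressed N messages from ..." under its own unit whenever
# rate limiting drops anything; the time window here is arbitrary.
journalctl -u systemd-journald --since "1 hour ago" | grep -i suppressed
```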
Automatic merge from submit-queue (batch tested with PRs 17217, 17597, 17606). Remove journald limits. @mfojtik this is dropping the hacks we had in place to tweak journald, now that Ansible is doing it (openshift/openshift-ansible#3753 and openshift/openshift-ansible#5796). I'm additionally bringing back the deployment e2es that were suffering from it. Let's see how far we can go with it. /cc @tnozicka Fixes #14785
https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1060