
Extended deployment tests and others flake all the time because of journald rate limiting #14785

Closed
bparees opened this issue Jun 20, 2017 · 14 comments
Labels: component/apps, component/logging, kind/test-flake, priority/P0

@bparees (Contributor) commented Jun 20, 2017

```
/go/src/github.com/openshift/origin/test/extended/deployments/deployments.go:423
Expected
    <string>: --> pre: Running hook pod ...
to contain substring
    <string>: hello bar
/go/src/github.com/openshift/origin/test/extended/deployments/deployments.go:422
```

https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1060

@mfojtik (Contributor) commented Jun 23, 2017

This is a journald logging issue where we're losing log messages...

@smarterclayton changed the title from "Extended.deploymentconfigs with env in params referencing the configmap [Conformance] should expand the config map key to a value" to "Extended deployment tests and others flake all the time because of journald rate limiting" on Aug 1, 2017
@smarterclayton (Contributor):

So at this point this is now my "journald breaks deployment tests and causes hundreds of flakes" tracking issue.

@stevekuznetsov @mfojtik @soltysh @derekwaynecarr @sjenning @jcantrill @portante

Journald rate limiting on pod logs causes flakes on e2e tests that depend on looking at log output. Log output is something that is supposed to work no matter what. We have a few options on our test runs:

  1. turn off journald rate limiting on our test instances
  2. turn off journald logging and use jsonfile on our test instances
  3. ???

What other options do we have? This is failing 1/3 or 1/4 e2e runs and making me angry. Let's get a decision.
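For reference, a minimal sketch of option 1. The drop-in path and key names below assume a RHEL/CentOS-era systemd; newer releases spell the first key `RateLimitIntervalSec=`, and the drop-in file name is arbitrary:

```ini
# /etc/systemd/journald.conf.d/99-no-rate-limit.conf
[Journal]
# Per journald.conf(5), setting either value to 0 turns rate limiting off entirely.
RateLimitInterval=0
RateLimitBurst=0
```

followed by `systemctl restart systemd-journald` on the test instance.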

@portante commented Aug 1, 2017 via email

@smarterclayton (Contributor):

@stevekuznetsov do we change anything except what's in here? https://github.com/openshift/origin-ci-tool/blob/151e40a674f15fcf77dd7ca12b75738e06f93972/oct/ansible/oct/roles/dependencies/tasks/pre_install.yml

If not, then stock rhel/centos. We turn rate limiting off in a few test setups, but it may be that we just want to turn off rate limiting on the docker journal via openshift-ansible (for the single and multi-node setups).

Basically, dropping logs via rate limiting has no place in our test environment for end user containers. Only blocking would be acceptable.
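A rough sketch of what that openshift-ansible change could look like; this is a hypothetical playbook for illustration, not the actual PR that was later filed:

```yaml
# Hypothetical sketch only; the real change lives in openshift-ansible.
- name: disable journald rate limiting on the docker journal
  hosts: nodes
  become: true
  tasks:
    - name: install a journald drop-in that turns off rate limiting
      copy:
        dest: /etc/systemd/journald.conf.d/99-no-rate-limit.conf
        content: |
          [Journal]
          RateLimitInterval=0
          RateLimitBurst=0

    - name: restart journald so the new limits take effect
      service:
        name: systemd-journald
        state: restarted
```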

@stevekuznetsov (Contributor):

Nope, nothing other than that

@sjenning (Contributor) commented Aug 1, 2017

I vote for json-file (option 2). That's what I do on my personal cluster deployments, along with switching to the overlay storage driver. I'm not sure what the historical reason was for wiring docker up to journald. Remote log aggregation?

This is part of my overall opinion that we should align with upstream kube as much as possible.

Leads to things like this when we try to run kube e2e:
kubernetes/kubernetes#43479
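For option 2, a minimal `/etc/docker/daemon.json` along the lines sjenning describes; the size and rotation values are placeholders, not a recommendation:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  },
  "storage-driver": "overlay2"
}
```

followed by a restart of the docker daemon on the test instances.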

@stevekuznetsov (Contributor):

I was not under the impression that we suggested anything but journald for production customer deployments?

@smarterclayton (Contributor) commented Aug 1, 2017 via email

@portante commented Aug 2, 2017

@smarterclayton, proposed PR openshift/origin-ci-tool#133 to drop rate-limiting.

@soltysh (Contributor) commented Aug 4, 2017

> rate limiting with drop should be opt-in for the journal, not opt-out

How would you achieve that? Dropping what's above the limit is the whole point of rate limiting. IMHO for our tests we should be turning the limits off. Our docs should warn users about this until docker/journald has a proper fix in place, namely treating each container as a separate service.
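To make the "separate service" point concrete: journald applies its rate limit per systemd unit, and with the journald log driver every container's output is emitted by the docker daemon, so all containers on a node share docker.service's single burst budget. A quick way to see that (the container name below is made up):

```sh
# Per-container journal fields added by the log driver only filter output;
# they do not get their own rate-limit bucket.
journalctl CONTAINER_NAME=my-app-pod --since "-10min"

# Everything above is still accounted against one unit's limit:
journalctl -u docker.service --since "-10min"
```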

@soltysh self-assigned this Aug 4, 2017
@soltysh (Contributor) commented Aug 4, 2017

Bumping to P0 to match the linked PR. They both track the same problem; the only difference is that they touch different test regions.

@tnozicka (Contributor):

also seen in #15790 (comment)

@soltysh (Contributor) commented Dec 5, 2017

I've double-checked the logs you pointed to, @tnozicka, and it doesn't look like this one is related to journald rate limiting. That test ran after journald had been reconfigured, and during the test window I saw no indication that journald was rate limiting any messages. I'll close this issue once #17597 merges and I'm more confident this is no longer happening.
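For anyone re-checking this kind of failure later, the tell-tale is journald's own suppression notice; a check along these lines should come back empty if no rate limiting happened during the window of interest:

```sh
# journald logs "Suppressed <N> messages from <unit>" whenever its rate limit kicks in.
journalctl -u systemd-journald | grep -i suppressed
```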

openshift-merge-robot added a commit that referenced this issue Dec 6, 2017
Automatic merge from submit-queue (batch tested with PRs 17217, 17597, 17606).

Remove journald limits

@mfojtik this is dropping the hacks we had in place to tweak journald, now that ansible is doing it (openshift/openshift-ansible#3753 and openshift/openshift-ansible#5796). I'm additionally bringing back the deployments e2e's that were suffering from it. Let's see how far we can go with it. 

/cc @tnozicka 

Fixes #14785