
Extended deployment tests and others flake all the time because of journald rate limiting #14785

Closed
bparees opened this issue Jun 20, 2017 · 14 comments
Labels: component/apps, component/logging, kind/test-flake, priority/P0

@bparees (Contributor) commented Jun 20, 2017

```
/go/src/github.com/openshift/origin/test/extended/deployments/deployments.go:423
Expected
    <string>: --> pre: Running hook pod ...
to contain substring
    <string>: hello bar
/go/src/github.com/openshift/origin/test/extended/deployments/deployments.go:422
```

https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1060

@mfojtik (Contributor) commented Jun 23, 2017

This is a journald logging issue where we're losing log messages...

@smarterclayton changed the title from "Extended.deploymentconfigs with env in params referencing the configmap [Conformance] should expand the config map key to a value" to "Extended deployment tests and others flake all the time because of journald rate limiting" on Aug 1, 2017
@smarterclayton (Contributor):

So at this point this is now my "journald breaks deployment tests and causes hundreds of flakes" tracking issue.

@stevekuznetsov @mfojtik @soltysh @derekwaynecarr @sjenning @jcantrill @portante

Journald rate limiting on pod logs causes flakes on e2e tests that depend on looking at log output. Log output is something that is supposed to work no matter what. We have a few options on our test runs:

  1. turn off journald rate limiting on our test instances
  2. turn off journald logging and use jsonfile on our test instances
  3. ???

What other options do we have? This is failing 1/3 or 1/4 e2e runs and making me angry. Let's get a decision.
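For reference, a minimal sketch of option 1. The drop-in path and key names below assume a RHEL/CentOS-era systemd; newer releases spell the first key `RateLimitIntervalSec=`, and the drop-in file name is arbitrary:

```ini
# /etc/systemd/journald.conf.d/99-no-rate-limit.conf
[Journal]
# Per journald.conf(5), setting either value to 0 turns rate limiting off entirely.
RateLimitInterval=0
RateLimitBurst=0
```

followed by `systemctl restart systemd-journald` on the test instance.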

@portante commented Aug 1, 2017 via email

@smarterclayton (Contributor):

@stevekuznetsov do we change anything except what's in here? https://github.com/openshift/origin-ci-tool/blob/151e40a674f15fcf77dd7ca12b75738e06f93972/oct/ansible/oct/roles/dependencies/tasks/pre_install.yml

If not, then stock rhel/centos. We turn rate limiting off in a few test setups, but it may be that we just want to turn off rate limiting on the docker journal via openshift-ansible (for the single and multi-node setups).

Basically, dropping logs via rate limiting has no place in our test environment for end user containers. Only blocking would be acceptable.
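A rough sketch of what that openshift-ansible change could look like; this is a hypothetical playbook for illustration, not the actual PR that was later filed:

```yaml
# Hypothetical sketch only; the real change lives in openshift-ansible.
- name: disable journald rate limiting on the docker journal
  hosts: nodes
  become: true
  tasks:
    - name: install a journald drop-in that turns off rate limiting
      copy:
        dest: /etc/systemd/journald.conf.d/99-no-rate-limit.conf
        content: |
          [Journal]
          RateLimitInterval=0
          RateLimitBurst=0

    - name: restart journald so the new limits take effect
      service:
        name: systemd-journald
        state: restarted
```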

@stevekuznetsov (Contributor):

Nope, nothing other than that

@sjenning (Contributor) commented Aug 1, 2017

I vote for json-file (option 2). That's what I do on my personal cluster deployments, along with switching to the overlay storage driver. I'm not sure what the historical reason was for wiring docker up to journald. Remote log aggregation?

This is part of my overall opinion that we should align with upstream kube as much as possible.

Leads to things like this when we try to run kube e2e:
kubernetes/kubernetes#43479
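For option 2, a minimal `/etc/docker/daemon.json` along the lines sjenning describes; the size and rotation values are placeholders, not a recommendation:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  },
  "storage-driver": "overlay2"
}
```

followed by a restart of the docker daemon on the test instances.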

@stevekuznetsov (Contributor):

I was not under the impression that we suggested anything but journald for production customer deployments?

@smarterclayton (Contributor) commented Aug 1, 2017 via email

@portante commented Aug 2, 2017

@smarterclayton, proposed PR openshift/origin-ci-tool#133 to drop rate-limiting.

@soltysh (Contributor) commented Aug 4, 2017

> rate limiting with drop should be opt-in for the journal, not opt-out

How would you achieve that? Dropping what's above the limit is the whole point of rate limiting. IMHO for our tests we should be turning the limits off. Our docs should warn users about this until docker/journald has a proper fix in place, namely treating each container as a separate service.
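To make the "separate service" point concrete: journald applies its rate limit per systemd unit, and with the journald log driver every container's output is emitted by the docker daemon, so all containers on a node share docker.service's single burst budget. A quick way to see that (the container name below is made up):

```sh
# Per-container journal fields added by the log driver only filter output;
# they do not get their own rate-limit bucket.
journalctl CONTAINER_NAME=my-app-pod --since "-10min"

# Everything above is still accounted against one unit's limit:
journalctl -u docker.service --since "-10min"
```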

@soltysh self-assigned this Aug 4, 2017
@soltysh (Contributor) commented Aug 4, 2017

Bumping to P0 to match the linked PR. They both track the same problem; the only difference is that they touch different test regions.

@tnozicka (Contributor):

also seen in #15790 (comment)

@soltysh (Contributor) commented Dec 5, 2017

I've double-checked the logs you pointed to, @tnozicka, and it doesn't look like this one is related to journald rate limiting. That test ran after journald had been reconfigured, and during the test window I saw no indication that journald was rate limiting any messages. I'll close this issue once #17597 merges and I'm more confident this is no longer happening.
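For anyone re-checking this kind of failure later, the tell-tale is journald's own suppression notice; a check along these lines should come back empty if no rate limiting happened during the window of interest:

```sh
# journald logs "Suppressed <N> messages from <unit>" whenever its rate limit kicks in.
journalctl -u systemd-journald | grep -i suppressed
```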

openshift-merge-robot added a commit that referenced this issue Dec 6, 2017
Automatic merge from submit-queue (batch tested with PRs 17217, 17597, 17606).

Remove journald limits

@mfojtik this is dropping the hacks we had in place to tweak journald, now that ansible is doing it (openshift/openshift-ansible#3753 and openshift/openshift-ansible#5796). I'm additionally bringing back the deployments e2e's that were suffering from it. Let's see how far we can go with it. 

/cc @tnozicka 

Fixes #14785