
DCOS e2e test is unreliable #3080

Closed
CecileRobertMichon opened this issue May 24, 2018 · 14 comments · Fixed by #3081, #3092 or #3686
@CecileRobertMichon

The DCOS e2e tests fail frequently with this error:

```
  should be able to install marathon
  /go/src/github.com/Azure/acs-engine/test/e2e/dcos/dcos_test.go:66

• Failure [8.101 seconds]
Azure Container Cluster using the DCOS Orchestrator
/go/src/github.com/Azure/acs-engine/test/e2e/dcos/dcos_test.go:45
  regardless of agent pool type
  /go/src/github.com/Azure/acs-engine/test/e2e/dcos/dcos_test.go:46
    should be able to install marathon [It]
    /go/src/github.com/Azure/acs-engine/test/e2e/dcos/dcos_test.go:66

    Expected error:
        <*ssh.ExitError | 0xc420264ec0>: {
            Waitmsg: {status: 1, signal: "", msg: "", lang: ""},
        }
        Process exited with status 1
    not to have occurred

    /go/src/github.com/Azure/acs-engine/test/e2e/dcos/dcos_test.go:68
------------------------------

Summarizing 1 Failure:

[Fail] Azure Container Cluster using the DCOS Orchestrator regardless of agent pool type [It] should be able to install marathon
```
@CecileRobertMichon

/assign @dmitsh

@dmitsh could you please take a look? If it's not reliable, we might have to disable the DCOS e2e test until we improve it.


dmitsh commented May 24, 2018

@CecileRobertMichon
The error message is not very descriptive.
Could you add more logs and see where the problem happens?
Also, does it happen if you run the test in Jenkins instead of CircleCI?

@CecileRobertMichon

Here is an example: https://circleci.com/gh/Azure/acs-engine/28211

@CecileRobertMichon

This is where the error occurs:

It("should be able to install marathon", func() {


dmitsh commented May 24, 2018

@CecileRobertMichon We are running dozens of daily DCOS tests in our own framework, and we don't experience this type of problem. Do you see the same issue when you run this test in Jenkins? Could it be that CircleCI is slightly different?

@CecileRobertMichon

We don't run DCOS in Jenkins, only in CircleCI. If it is a CircleCI issue, should we disable the test on CircleCI?


dmitsh commented May 24, 2018

No, we should not disable tests in CircleCI. Could you run an ad-hoc test in Jenkins and see whether you can reproduce the problem?


dmitsh commented May 24, 2018

Regarding CircleCI, is there a way to ssh to the node that runs the test?

@CecileRobertMichon

I don't have time to investigate today; I'll look into it tomorrow or next week. We can ssh into the node as long as it hasn't been cleaned up, which in the case above it has been. I'll set up a Jenkins job to try to repro; a single test run won't be enough since this error is transient.


dmitsh commented May 24, 2018

I can point you to our Jenkins server, which runs multiple DCOS regression tests and performs similar (actually stricter) validation of the cluster. We do not observe this type of failure there.
I can pair with you on this investigation, but I would start by adding more error details. The *ssh.ExitError is returned when the command completes unsuccessfully or is interrupted by a signal. It would be helpful (in fact necessary) to see the actual error message.
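One way to surface that detail, sketched under the assumption that the e2e tests drive golang.org/x/crypto/ssh directly: capture the remote command's combined output and fold it into the returned error. runWithOutput is a made-up helper name, not existing test code.

```go
package e2e

import (
	"fmt"

	"golang.org/x/crypto/ssh"
)

// runWithOutput runs cmd on the remote host and, on failure, wraps the
// *ssh.ExitError together with the command's combined stdout/stderr, so the
// CI log shows more than "Process exited with status 1".
func runWithOutput(client *ssh.Client, cmd string) (string, error) {
	session, err := client.NewSession()
	if err != nil {
		return "", fmt.Errorf("creating SSH session: %v", err)
	}
	defer session.Close()

	// CombinedOutput captures both stdout and stderr of the remote command.
	out, err := session.CombinedOutput(cmd)
	if err != nil {
		return string(out), fmt.Errorf("%q failed: %v\noutput:\n%s", cmd, err, out)
	}
	return string(out), nil
}
```

With something like this in place, the Ginkgo failure would print the remote command's stdout and stderr instead of only the exit status.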

@CecileRobertMichon

I started #3081 to enable more logging. Would you be able to adapt the Jenkins test you have to the CircleCI e2e tests?


dmitsh commented May 25, 2018

I submitted PR #3092.
The DCOS e2e test passed on the first try.

@CecileRobertMichon

Reopening, as the DCOS test has been failing almost consistently for the last week.


CecileRobertMichon commented Jun 20, 2018
