
Flaky e2e tests #1779

Closed
nagar-ajay opened this issue Mar 16, 2023 · 11 comments · Fixed by #1801

Comments

@nagar-ajay
Contributor

nagar-ajay commented Mar 16, 2023

While running the e2e integration tests locally, I encountered a few scenarios in which they fail.

  • Sometimes tests fail because of the following condition.
  conditions = client.get_job_conditions(name, namespace, job_kind)
  if len(conditions) != 1:
      raise Exception(f"{job_kind} conditions are invalid: {conditions}")

https://github.com/kubeflow/training-operator/blob/master/sdk/python/test/e2e/utils.py#L22

I think the reason is that if a container created by the test starts running immediately, the job will already have two conditions (Created plus Running) by the time the check runs, so the test fails.
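A more robust check might assert on the condition types that are present rather than on the exact count. A minimal sketch, assuming the helper name and the plain-dict condition representation are illustrative (the SDK returns condition objects):

```python
def assert_job_created(conditions):
    """Pass as long as a Created condition is present, no matter how many
    other conditions (e.g. Running) have already been appended.
    Conditions are represented as plain dicts for illustration."""
    types = {c["type"] for c in conditions}
    if "Created" not in types:
        raise Exception(f"Job was never created: {conditions}")
```

This way a fast-starting container cannot break the assertion, because an extra Running condition is tolerated.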

  • Sometimes tests fail because of the following condition.
  conditions = client.get_job_conditions(name, namespace, job_kind)
  if len(conditions) != 3:
      raise Exception(f"{job_kind} conditions are invalid: {conditions}")

https://github.com/kubeflow/training-operator/blob/master/sdk/python/test/e2e/utils.py#L40-L42

In these scenarios, I found that the Running condition is missing from the training job's conditions.
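Similarly, the second check could assert on the set of condition types instead of requiring exactly three entries, so that a skipped Running condition does not fail the test. A sketch under the same assumptions (hypothetical helper name, dicts for illustration):

```python
def assert_job_succeeded(conditions):
    """Require Created and Succeeded; do not insist that a Running
    condition was ever recorded (it may be skipped for very fast jobs).
    Conditions are represented as plain dicts for illustration."""
    types = {c["type"] for c in conditions}
    missing = {"Created", "Succeeded"} - types
    if missing:
        raise Exception(f"{missing} conditions are missing: {conditions}")
```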

@johnugeorge
Member

@tenzen-y
For issue 1, is the verification via len(conditions) correct? Separately, unless the job goes to Running, can we say that job scheduling works?

For issue 2, this looks like a bug. This can happen if the job succeeds before the reconciler even updates it to the Running state.

@tenzen-y
Member

/kind bug

@nagar-ajay Thank you for reporting this issue.

For issue 1, is the verification via len(conditions) correct? Separately, unless the job goes to Running, can we say that job scheduling works?

You're right. Confirming a Running condition is better.
Also, for future work, we should set un-schedulable resources in runPolicy.minResources instead of setting an un-schedulable number of replicas in runPolicy.minAvailable.

unschedulable_tfjob = generate_tfjob(worker, V1SchedulingPolicy(min_available=10), job_namespace)

For issue 2, this looks like a bug. This can happen if the job succeeds before the reconciler even updates it to the Running state.

@nagar-ajay Which frameworks did you find this bug in?

@nagar-ajay
Contributor Author

Which frameworks did you find this bug in?

XGBoost

@tenzen-y
Member

/kind e2e-test-failure

I think that if the XGBoostJob reaches a Succeeded or Failed condition without ever having had a Running condition, adding a Running condition is better.

@johnugeorge @nagar-ajay WDYT?

@johnugeorge
Member

You mean adding Running=False as well if the Running condition is missing?

@tenzen-y
Member

You mean adding Running=False as well if the Running condition is missing?

Yes, that's right.
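The agreed fix might be sketched as follows. This is a hedged illustration, not the operator's actual code: the real controller in kubeflow/training-operator is written in Go, and the function name and dict representation here are hypothetical.

```python
def append_terminal_condition(conditions, terminal_type):
    """Before recording a Succeeded or Failed condition, backfill a
    Running=False condition if the job never reported Running.
    Conditions are plain dicts for illustration."""
    types = {c["type"] for c in conditions}
    if terminal_type in ("Succeeded", "Failed") and "Running" not in types:
        # The job skipped straight to a terminal state, so record that
        # it is (and was last observed) not running.
        conditions.append({"type": "Running", "status": "False"})
    conditions.append({"type": terminal_type, "status": "True"})
    return conditions
```

With this logic, the e2e assertion that expects a Running condition alongside Created and Succeeded would hold even for jobs that finish before the reconciler ever observes them running.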

@tenzen-y
Member

@johnugeorge Friendly ping.

@johnugeorge
Member

Sorry for the late response. It makes sense. I don't think there is a better way if the reconciler sees a terminal condition even before the Running state.

@nagar-ajay
Contributor Author

As a short-term fix for the 2nd issue, I think we can increase the running time of the jobs. WDYT? @tenzen-y @johnugeorge

@tenzen-y
Member

As a short-term fix for the 2nd issue, I think we can increase the running time of the jobs. WDYT? @tenzen-y @johnugeorge

@nagar-ajay I will work on fixing the 2nd issue using my proposal above next week.

@tenzen-y
Member

/assign
