
Flaky e2e tests #1779

Closed
nagar-ajay opened this issue Mar 16, 2023 · 11 comments · Fixed by #1801

Comments

@nagar-ajay
Contributor

nagar-ajay commented Mar 16, 2023

While running the e2e integration tests locally, I encountered a few scenarios in which they fail.

  • Sometimes tests fail because of the following condition.
  conditions = client.get_job_conditions(name, namespace, job_kind)
  if len(conditions) != 1:
      raise Exception(f"{job_kind} conditions are invalid: {conditions}")

https://github.com/kubeflow/training-operator/blob/master/sdk/python/test/e2e/utils.py#L22

I think the reason is that if a container created by the test starts running immediately, the job will already have two conditions (Created plus Running) by the time the check runs, so the test fails.
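A more robust check might assert on the condition types that are present rather than on the exact count. A minimal sketch, assuming the helper name and the plain-dict condition representation are illustrative (the SDK returns condition objects):

```python
def assert_job_created(conditions):
    """Pass as long as a Created condition is present, no matter how many
    other conditions (e.g. Running) have already been appended.
    Conditions are represented as plain dicts for illustration."""
    types = {c["type"] for c in conditions}
    if "Created" not in types:
        raise Exception(f"Job was never created: {conditions}")
```

This way a fast-starting container cannot break the assertion, because an extra Running condition is tolerated.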

  • Sometimes tests fail because of the following condition.
  conditions = client.get_job_conditions(name, namespace, job_kind)
  if len(conditions) != 3:
      raise Exception(f"{job_kind} conditions are invalid: {conditions}")

https://github.com/kubeflow/training-operator/blob/master/sdk/python/test/e2e/utils.py#L40-L42

In these scenarios, I found that the Running condition is missing from the training job's conditions.
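Similarly, the second check could assert on the set of condition types instead of requiring exactly three entries, so that a skipped Running condition does not fail the test. A sketch under the same assumptions (hypothetical helper name, dicts for illustration):

```python
def assert_job_succeeded(conditions):
    """Require Created and Succeeded; do not insist that a Running
    condition was ever recorded (it may be skipped for very fast jobs).
    Conditions are represented as plain dicts for illustration."""
    types = {c["type"] for c in conditions}
    missing = {"Created", "Succeeded"} - types
    if missing:
        raise Exception(f"{missing} conditions are missing: {conditions}")
```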

@johnugeorge
Member

@tenzen-y
For issue 1, is the verification via len(conditions) correct? Separately, unless the job goes to Running, can we say that job scheduling works?

For issue 2, this looks like a bug. This can happen if the job succeeds before the reconciler even updates it to the Running state.

@tenzen-y
Member

/kind bug

@nagar-ajay Thank you for reporting this issue.

For issue 1, is the verification via len(conditions) correct? Separately, unless the job goes to Running, can we say that job scheduling works?

You're right. Confirming a Running condition is better.
Also, for future work, we should set un-schedulable resources in runPolicy.minResources instead of setting an un-schedulable number of replicas in runPolicy.minAvailable.

unschedulable_tfjob = generate_tfjob(worker, V1SchedulingPolicy(min_available=10), job_namespace)

For issue 2, this looks like a bug. This can happen if the job succeeds before the reconciler even updates it to the Running state.

@nagar-ajay Which frameworks did you find this bug in?

@nagar-ajay
Contributor Author

Which frameworks did you find this bug in?

XGBoost

@tenzen-y
Member

/kind e2e-test-failure

I think that if the XGBoostJob reaches a Succeeded or Failed condition without ever having had a Running condition, adding a Running condition is better.

@johnugeorge @nagar-ajay WDYT?

@johnugeorge
Member

You mean adding Running=False as well if the Running condition is missing?

@tenzen-y
Member

You mean adding Running=False as well if the Running condition is missing?

Yes, that's right.
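The agreed fix might be sketched as follows. This is a hedged illustration, not the operator's actual code: the real controller in kubeflow/training-operator is written in Go, and the function name and dict representation here are hypothetical.

```python
def append_terminal_condition(conditions, terminal_type):
    """Before recording a Succeeded or Failed condition, backfill a
    Running=False condition if the job never reported Running.
    Conditions are plain dicts for illustration."""
    types = {c["type"] for c in conditions}
    if terminal_type in ("Succeeded", "Failed") and "Running" not in types:
        # The job skipped straight to a terminal state, so record that
        # it is (and was last observed) not running.
        conditions.append({"type": "Running", "status": "False"})
    conditions.append({"type": terminal_type, "status": "True"})
    return conditions
```

With this logic, the e2e assertion that expects a Running condition alongside Created and Succeeded would hold even for jobs that finish before the reconciler ever observes them running.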

@tenzen-y
Member

@johnugeorge Friendly ping.

@johnugeorge
Member

Sorry for the late response. It makes sense. I don't think there is a better way if the reconciler sees a terminal condition even before the Running state.

@nagar-ajay
Contributor Author

As a short-term fix for the 2nd issue, I think we can increase the running time of the jobs. WDYT? @tenzen-y @johnugeorge

@tenzen-y
Member

As a short-term fix for the 2nd issue, I think we can increase the running time of the jobs. WDYT? @tenzen-y @johnugeorge

@nagar-ajay I will work on fixing the 2nd issue using my proposal above next week.

@tenzen-y
Member

/assign
