Flaky e2e tests #1779
Comments
@tenzen-y For issue2, this looks like a bug. If job succeeds before even reconcile update to |
/kind bug @nagar-ajay Thank you for reporting this issue.
You're right. Confirming a
@nagar-ajay Which frameworks did you find this bug in?
XGBoost
/kind e2e-test-failure I think if the XGBoostJob doesn't have a
@johnugeorge @nagar-ajay WDYT?
You mean, adding |
Yes, that's right.
@johnugeorge Friendly ping.
Sorry for the late response. It makes sense. I don't think there is a better way if reconcile sees a terminal condition even before the running state.
As a short-term fix for the 2nd issue, I think we can increase the running time of the jobs. WDYT? @tenzen-y @johnugeorge
@nagar-ajay I will work on fixing the 2nd issue using my proposal above next week.
/assign
While running the e2e integration tests locally, I encountered a few scenarios in which these tests fail.
https://github.com/kubeflow/training-operator/blob/master/sdk/python/test/e2e/utils.py#L22
I think the cause is that if a container created by the tests starts running instantaneously, the test fails because the job then has two conditions.
https://github.com/kubeflow/training-operator/blob/master/sdk/python/test/e2e/utils.py#L40-L42
In these scenarios, I found that the running condition is missing from the training job's conditions.
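To illustrate the second scenario, here is a minimal sketch of a check that does not fail when the running condition was never recorded. It assumes the job conditions are available as a list of dicts with `type` and `status` keys. The helper names `has_condition` and `reached_running_or_terminal` are hypothetical and are not taken from `utils.py` or the SDK.

```python
from typing import Dict, List

# Condition type names assumed for illustration only.
RUNNING = "Running"
TERMINAL = ("Succeeded", "Failed")


def has_condition(conditions: List[Dict[str, str]], cond_type: str) -> bool:
    """Return True if a condition of the given type is present with status "True"."""
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in conditions
    )


def reached_running_or_terminal(conditions: List[Dict[str, str]]) -> bool:
    """Accept a terminal condition as evidence that the job got past the
    running state, even if no Running condition was ever recorded."""
    return has_condition(conditions, RUNNING) or any(
        has_condition(conditions, t) for t in TERMINAL
    )


# A job that finished before the running condition was written.
fast_job = [
    {"type": "Created", "status": "True"},
    {"type": "Succeeded", "status": "True"},
]
print(reached_running_or_terminal(fast_job))
```

A test written this way would still fail for a job that only has a created condition, so it keeps catching jobs that never start.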