update unittest; fix fail_count bug in issue #113 #115
Conversation
Codecov Report

    @@            Coverage Diff             @@
    ##           master     #115      +/-   ##
    ==========================================
    + Coverage   49.32%   49.65%   +0.32%
    ==========================================
      Files          21       21
      Lines        1865     1863       -2
    ==========================================
    + Hits          920      925       +5
    + Misses        945      938       -7

Continue to review full report at Codecov.
dpdispatcher/submission.py (Outdated)

    dlog.info(f"job: {self.job_hash} {self.job_id} terminated;"
              "fail_cout is {self.fail_count}; resubmitting job")
    if self.fail_count > 3:
        raise RuntimeError(f"job:{self.job_hash}failed 3 times.job_detail:{self}")
raise RuntimeError(f"job:{self.job_hash}failed 3 times.job_detail:{self}") | |
raise RuntimeError(f"job:{self.job_hash} failed 4 times. job_detail:{self}") |
Also, I suggest giving more guidance to users. In DP-GEN's issues, many users have asked what to do about this error.
It's actually failing 4 times... Is that expected?
It should say 4 times in the printed message.
Maybe we could print detailed debug information and show the user how to debug before exiting? For example, we could provide the script location, the file directory, etc.? @njzjz
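For illustration, a minimal sketch of what a more helpful failure message could look like; the remedy text is an assumption drawn from this thread, not the merged code:

    # Sketch: enrich the RuntimeError with debugging hints before exit.
    if self.fail_count > 3:
        raise RuntimeError(
            f"job:{self.job_hash} {self.job_id} failed {self.fail_count} times.\n"
            f"job_detail:{self}\n"
            "Possible checks: inspect the submission script and the task "
            "directories on the remote machine for stderr output, and verify "
            "that the required commands (e.g. gmx, lmp) exist in the environment."
        )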
Makes sense. You can see many users don't know what to do:
deepmodeling/dpgen#434, deepmodeling/dpgen#398, deepmodeling/dpgen#394, deepmodeling/dpgen#360
dpdispatcher/submission.py (Outdated)

    self.fail_count += 1
    dlog.info(f"job: {self.job_hash} {self.job_id} terminated;"
              "fail_cout is {self.fail_count}; resubmitting job")
I think this "resubmitting" should only be printed when fail_count <= 3?
Yes, you are right.
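A sketch of the reordered logic this exchange converges on: increment first, raise before logging, so "resubmitting" is only printed when a resubmission will actually happen (an assumed shape within the existing method, not the merged diff):

    if job_state == JobStatus.terminated:
        self.fail_count += 1
        if self.fail_count > 3:
            raise RuntimeError(
                f"job:{self.job_hash} failed {self.fail_count} times. job_detail:{self}"
            )
        # Only reached when we will in fact resubmit.
        dlog.info(f"job: {self.job_hash} {self.job_id} terminated; "
                  f"fail_count is {self.fail_count}; resubmitting job")
        self.submit_job()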
    @@ -519,17 +519,18 @@ def handle_unexpected_job_state(self):
                raise RuntimeError("job_state for job {job} is unknown".format(job=self))

            if job_state == JobStatus.terminated:
                dlog.info(f"job: {self.job_hash} {self.job_id} terminated; restarting job")
                if self.fail_count > 3:
                    raise RuntimeError("job:job {job} failed 3 times".format(job=self))
                self.fail_count += 1
This behavior may not be correct when the submission is recovered from the error: +1, error, restart, +1, error, restart...
OK. Maybe raise an error and reset fail_count to 0? Or raise an error every N failures?
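A sketch of the "every N failures" option, which a later comment in this thread says was adopted (errors are raised when fail_count % 3 == 0); the exact threshold handling is an assumption:

    # Sketch: raise periodically instead of once, so a job that keeps
    # failing after manual restarts still surfaces an error, while a job
    # recovered by the user can continue without resetting the counter.
    self.fail_count += 1
    if self.fail_count % 3 == 0:
        raise RuntimeError(
            f"job:{self.job_hash} failed {self.fail_count} times. job_detail:{self}"
        )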
What usually causes JobStatus to be terminated? Is it an HPC resource-allocation error or a timeout?
- Some systems like dpcloudserver may be unstable, and some jobs may be killed while running (due to Aliyun/AWS spot instances).
- Some commands cannot be executed correctly due to an incorrect environment (e.g. the gmx or lmp command cannot be found).
@felix5572 I assume in most cases it would be the second situation. It would cost us core time and debugging effort if these incorrect tasks are reattempted many times. Is there any way to forward the shell error to dlog.info?
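One generic way to do this, as a hedged sketch (this is not dpdispatcher's actual execution path; run_and_log is an illustrative helper, not a real API):

    import subprocess

    def run_and_log(cmd, dlog):
        # Run a shell command, capture its stderr, and forward it to the
        # log when the command exits with a nonzero code.
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if result.returncode != 0:
            dlog.info(f"command {cmd!r} exited with code {result.returncode}; "
                      f"stderr: {result.stderr.strip()}")
        return result.returncode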
Agreed. I will think about how to implement it.
In the following snippet:

dpdispatcher/dpdispatcher/pbs.py, lines 56 to 60 in 87e3ee8:

    if str("qstat: Unknown Job Id") in err_str or str("Job has finished") in err_str:
        if self.check_finish_tag(job=job):
            return JobStatus.finished
        else:
            return JobStatus.terminated
I don't have a lot of experience with standard error output. Is there something in stderr that flags the task actually being killed, or would stderr report an incorrect environment?
We only use the exit code to determine whether the program exits normally.
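A minimal sketch of that idea: the job script writes a finish tag only when the wrapped command exits with code 0, so a check like check_finish_tag() in the snippet above can tell "finished" apart from "terminated". The tag file name here is an illustrative assumption:

    import os

    def check_finish_tag_sketch(work_path):
        # The submission script would run: <command> && touch tag_finished
        # so the tag file only exists when the command's exit code was 0.
        return os.path.isfile(os.path.join(work_path, "tag_finished"))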
@njzjz Misoperation, fixed now. For now, an error is raised if fail_count % 3 == 0. This may help the user restart if they continue calculating when some jobs fail.
raise RuntimeError("job:job {job} failed 3 times".format(job=self)) | ||
# self.fail_count += 1 | ||
# if self.fail_count > 3: | ||
# raise RuntimeError("job:job {job} failed 3 times".format(job=self)) | ||
self.submit_job() | ||
dlog.info("job: {job_hash} submit; job_id is {job_id}".format(job_hash=self.job_hash, job_id=self.job_id)) |
Something unrelated: we need to check the job status here, because sometimes the submission is not successful due to a full queue.
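A hedged sketch of what such a check might look like, assuming an empty job_id signals a failed submission (an assumption for illustration, not dpdispatcher's documented behavior):

    self.submit_job()
    if self.job_id:
        dlog.info(f"job: {self.job_hash} submit; job_id is {self.job_id}")
    else:
        # e.g. the scheduler rejected the job because the queue was full
        dlog.info(f"job: {self.job_hash} submission failed; will retry")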