update unittest; fix fail_count bug in issue #113 #115
Conversation
Codecov Report

    @@            Coverage Diff             @@
    ##           master     #115      +/-   ##
    ==========================================
    + Coverage   49.32%   49.65%   +0.32%
    ==========================================
      Files          21       21
      Lines        1865     1863       -2
    ==========================================
    + Hits          920      925       +5
    + Misses        945      938       -7

Continue to review full report at Codecov.
dpdispatcher/submission.py (Outdated)

    dlog.info(f"job: {self.job_hash} {self.job_id} terminated;"
              "fail_cout is {self.fail_count}; resubmitting job")
    if self.fail_count > 3:
        raise RuntimeError(f"job:{self.job_hash}failed 3 times.job_detail:{self}")
raise RuntimeError(f"job:{self.job_hash}failed 3 times.job_detail:{self}") | |
raise RuntimeError(f"job:{self.job_hash} failed 4 times. job_detail:{self}") |
Also, I suggest giving more guidance to users. In DP-GEN's issues, many users have asked what to do about this error.
It's actually failing 4 times... Is that expected?
It should say 4 times in the printed message.
Maybe we could print detailed debug information and show the user how to debug before exiting? For example, we could provide the script location, the file directory, etc.? @njzjz
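For illustration, a minimal sketch of what a more helpful failure message could look like; the remedy text is an assumption drawn from this thread, not the merged code:

    # Sketch: enrich the RuntimeError with debugging hints before exit.
    if self.fail_count > 3:
        raise RuntimeError(
            f"job:{self.job_hash} {self.job_id} failed {self.fail_count} times.\n"
            f"job_detail:{self}\n"
            "Possible checks: inspect the submission script and the task "
            "directories on the remote machine for stderr output, and verify "
            "that the required commands (e.g. gmx, lmp) exist in the environment."
        )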
Makes sense. You can see many users don't know what to do:
deepmodeling/dpgen#434, deepmodeling/dpgen#398, deepmodeling/dpgen#394, deepmodeling/dpgen#360
dpdispatcher/submission.py (Outdated)

    self.fail_count += 1
    dlog.info(f"job: {self.job_hash} {self.job_id} terminated;"
              "fail_cout is {self.fail_count}; resubmitting job")
I think this "resubmitting" should only be printed when fail_count <= 3?
Yes, you are right.
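A sketch of the reordered logic this exchange converges on: increment first, raise before logging, so "resubmitting" is only printed when a resubmission will actually happen (an assumed shape within the existing method, not the merged diff):

    if job_state == JobStatus.terminated:
        self.fail_count += 1
        if self.fail_count > 3:
            raise RuntimeError(
                f"job:{self.job_hash} failed {self.fail_count} times. job_detail:{self}"
            )
        # Only reached when we will in fact resubmit.
        dlog.info(f"job: {self.job_hash} {self.job_id} terminated; "
                  f"fail_count is {self.fail_count}; resubmitting job")
        self.submit_job()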
    @@ -519,17 +519,18 @@ def handle_unexpected_job_state(self):
                raise RuntimeError("job_state for job {job} is unknown".format(job=self))

            if job_state == JobStatus.terminated:
                dlog.info(f"job: {self.job_hash} {self.job_id} terminated; restarting job")
                if self.fail_count > 3:
                    raise RuntimeError("job:job {job} failed 3 times".format(job=self))
                self.fail_count += 1
This behavior may not be correct when the submission is recovered from the error: +1, error, restart, +1, error, restart...
OK. Maybe raise an error and reset fail_count to 0? Or raise an error every N failures?
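A sketch of the "every N failures" option, which a later comment in this thread says was adopted (errors are raised when fail_count % 3 == 0); the exact threshold handling is an assumption:

    # Sketch: raise periodically instead of once, so a job that keeps
    # failing after manual restarts still surfaces an error, while a job
    # recovered by the user can continue without resetting the counter.
    self.fail_count += 1
    if self.fail_count % 3 == 0:
        raise RuntimeError(
            f"job:{self.job_hash} failed {self.fail_count} times. job_detail:{self}"
        )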
What usually causes JobStatus to be terminated? Is it an HPC resource-allocation error or a timeout?
- Some systems like dpcloudserver may be unstable, and some jobs may be killed while running (due to Aliyun/AWS spot instances).
- Some commands cannot be executed correctly due to an incorrect environment (e.g. the gmx or lmp command cannot be found).
@felix5572 I assume in most cases it would be the second situation. It would cost us core time and debugging effort if these incorrect tasks are reattempted many times. Is there any way to forward the shell error to dlog.info?
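One generic way to do this, as a hedged sketch (this is not dpdispatcher's actual execution path; run_and_log is an illustrative helper, not a real API):

    import subprocess

    def run_and_log(cmd, dlog):
        # Run a shell command, capture its stderr, and forward it to the
        # log when the command exits with a nonzero code.
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if result.returncode != 0:
            dlog.info(f"command {cmd!r} exited with code {result.returncode}; "
                      f"stderr: {result.stderr.strip()}")
        return result.returncode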
Agreed. I will think about how to implement it.
In the following snippet:

dpdispatcher/dpdispatcher/pbs.py, lines 56 to 60 in 87e3ee8:

    if str("qstat: Unknown Job Id") in err_str or str("Job has finished") in err_str:
        if self.check_finish_tag(job=job):
            return JobStatus.finished
        else:
            return JobStatus.terminated
I don't have a lot of experience with standard error output. Is there something in stderr that flags the task actually being killed, or would stderr report an incorrect environment?
We only use the exit code to determine whether the program exits normally.
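A minimal sketch of that idea: the job script writes a finish tag only when the wrapped command exits with code 0, so a check like check_finish_tag() in the snippet above can tell "finished" apart from "terminated". The tag file name here is an illustrative assumption:

    import os

    def check_finish_tag_sketch(work_path):
        # The submission script would run: <command> && touch tag_finished
        # so the tag file only exists when the command's exit code was 0.
        return os.path.isfile(os.path.join(work_path, "tag_finished"))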
@njzjz Misoperation, fixed now. For now, an error is raised if fail_count % 3 == 0. This may help the user restart if they continue calculating when some jobs fail.
raise RuntimeError("job:job {job} failed 3 times".format(job=self)) | ||
# self.fail_count += 1 | ||
# if self.fail_count > 3: | ||
# raise RuntimeError("job:job {job} failed 3 times".format(job=self)) | ||
self.submit_job() | ||
dlog.info("job: {job_hash} submit; job_id is {job_id}".format(job_hash=self.job_hash, job_id=self.job_id)) |
Something unrelated: we need to check the job status here, because sometimes the submission is not successful due to a full queue.
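A hedged sketch of what such a check might look like, assuming an empty job_id signals a failed submission (an assumption for illustration, not dpdispatcher's documented behavior):

    self.submit_job()
    if self.job_id:
        dlog.info(f"job: {self.job_hash} submit; job_id is {self.job_id}")
    else:
        # e.g. the scheduler rejected the job because the queue was full
        dlog.info(f"job: {self.job_hash} submission failed; will retry")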