
update unittest; fix fail_count bug in issue #113 #115

Merged
merged 7 commits into deepmodeling:master on Aug 1, 2021

Conversation

felix5572
Collaborator

update unittest; fix fail_count bug in issue #113

@felix5572 felix5572 requested a review from njzjz July 26, 2021 09:05
@codecov-commenter

codecov-commenter commented Jul 26, 2021

Codecov Report

Merging #115 (7aceb74) into master (84a3e47) will increase coverage by 0.32%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #115      +/-   ##
==========================================
+ Coverage   49.32%   49.65%   +0.32%     
==========================================
  Files          21       21              
  Lines        1865     1863       -2     
==========================================
+ Hits          920      925       +5     
+ Misses        945      938       -7     
Impacted Files                Coverage Δ
dpdispatcher/submission.py    76.06% <0.00%> (+1.84%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 84a3e47...7aceb74.

dlog.info(f"job: {self.job_hash} {self.job_id} terminated;"
"fail_cout is {self.fail_count}; resubmitting job")
if self.fail_count > 3:
raise RuntimeError(f"job:{self.job_hash}failed 3 times.job_detail:{self}")
Member

Suggested change
raise RuntimeError(f"job:{self.job_hash}failed 3 times.job_detail:{self}")
raise RuntimeError(f"job:{self.job_hash} failed 4 times. job_detail:{self}")

Also, I suggest giving more guidance to users. In DP-GEN's issues, many users have asked what to do about this error.

It's actually failing 4 times... Is that expected?

Collaborator Author

It should be 4 times in the printed message.
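
To spell out the off-by-one, here is a minimal standalone trace of the check as written in this PR (increment first, then test fail_count > 3); it confirms the error only fires on the 4th failure:

fail_count = 0
for failure in range(1, 6):
    fail_count += 1  # a terminated job bumps the counter first
    if fail_count > 3:
        # fires at failure 4, while the message claims "failed 3 times"
        print(f"failure {failure}: raise RuntimeError")
        break
    print(f"failure {failure}: resubmit (fail_count={fail_count})")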

Collaborator Author
@felix5572 felix5572 Jul 26, 2021

Maybe we could print detailed debug information and show the user how to debug before exiting? For example, we could provide the script location, file directory, etc.? @njzjz
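
One possible shape for such a message, as a rough sketch (it only uses attributes already shown in this diff — job_hash, job_id, fail_count — and the exact wording is illustrative):

raise RuntimeError(
    f"job: {self.job_hash} {self.job_id} failed {self.fail_count} times.\n"
    # point the user at the artifacts they need instead of just exiting
    f"Please check the submission script and the job's log/error files "
    f"in the working directory, fix the problem, and then restart.\n"
    f"job_detail: {self}"
)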

Member

Makes sense. You can see many users don't know what to do:
deepmodeling/dpgen#434, deepmodeling/dpgen#398, deepmodeling/dpgen#394, deepmodeling/dpgen#360

self.fail_count += 1
dlog.info(f"job: {self.job_hash} {self.job_id} terminated;"
"fail_cout is {self.fail_count}; resubmitting job")
Member

I think this "resubmitting" should only be printed when fail_count<=3?

Collaborator Author

Yes, you are right.
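
A sketch of the reordering being agreed on here (not the exact merged code): increment first, raise once the limit is exceeded, and log "resubmitting" only when a resubmission will actually happen:

self.fail_count += 1
if self.fail_count > 3:
    raise RuntimeError(
        f"job:{self.job_hash} failed {self.fail_count} times. job_detail:{self}"
    )
# reached only when fail_count <= 3, so the message is now truthful
dlog.info(
    f"job: {self.job_hash} {self.job_id} terminated; "
    f"fail_count is {self.fail_count}; resubmitting job"
)
self.submit_job()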

@@ -519,17 +519,18 @@ def handle_unexpected_job_state(self):
raise RuntimeError("job_state for job {job} is unknown".format(job=self))

if job_state == JobStatus.terminated:
dlog.info(f"job: {self.job_hash} {self.job_id} terminated; restarting job")
if self.fail_count > 3:
raise RuntimeError("job:job {job} failed 3 times".format(job=self))
self.fail_count += 1
Member

This behavior may not be correct when the submission is recovered after the error: +1, error, restart, +1, error, restart...

Collaborator Author

OK. Maybe raise an error and reset fail_count to 0? Or raise an error every N failures?

Contributor

What usually causes JobStatus to be terminated? Is it an error in HPC resource allocation, or a timeout?

Collaborator Author
@felix5572 felix5572 Jul 26, 2021

@Feiyang472

  1. Some systems like dpcloudserver may be unstable, and jobs can be killed while running (due to Aliyun/AWS spot instances).

  2. Some commands cannot be executed correctly due to an incorrect environment (e.g., the gmx or lmp command cannot be found).

Contributor

@felix5572 I assume in most cases it would be the second situation. It would cost us core time and debugging effort if these broken tasks are reattempted many times. Is there any way to forward shell errors to dlog.info?

Collaborator Author
@felix5572 felix5572 Jul 27, 2021

Agreed. I will think about how to implement it.

Contributor
@Feiyang472 Feiyang472 Jul 28, 2021

In the following snippet:

if str("qstat: Unknown Job Id") in err_str or str("Job has finished") in err_str:
if self.check_finish_tag(job=job) :
return JobStatus.finished
else :
return JobStatus.terminated

I don't have much experience with standard error output. Is there something in stderr that flags the task actually being killed, or will stderr report the incorrect environment?

Member

We only use the exit code to determine whether the program exited normally.
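
For illustration, a self-contained sketch of that convention combined with the stderr-forwarding idea above (subprocess and the logging stand-in for dlog are assumptions here, not dpdispatcher's actual implementation):

import logging
import subprocess

dlog = logging.getLogger("dpdispatcher")  # stand-in for dpdispatcher's dlog

try:
    result = subprocess.run(["gmx", "--version"], capture_output=True, text=True)
except FileNotFoundError as err:
    # the "cannot find gmx or lmp command" case mentioned above
    dlog.info("command not found: %s", err)
else:
    if result.returncode != 0:
        # only the exit code decides success; stderr is forwarded for debugging
        dlog.info("command exited with code %d; stderr: %s",
                  result.returncode, result.stderr)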

@njzjz njzjz linked an issue Jul 26, 2021 that may be closed by this pull request
@felix5572 felix5572 requested a review from njzjz July 31, 2021 07:24
@njzjz
Member

njzjz commented Jul 31, 2021

Did you revert 927f4cc in d6b2b72? What happened here?

@felix5572
Collaborator Author

@njzjz A misoperation; it is fixed now. Now an error is raised whenever fail_count % 3 == 0. This may help users restart when they continue calculating after some jobs fail.
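
In other words, something along these lines (a sketch of the described behavior, not the literal merged diff):

self.fail_count += 1
if self.fail_count > 0 and self.fail_count % 3 == 0:
    # errors out after failures 3, 6, 9, ...; if the user restarts the
    # recovered submission anyway, the job gets another 3 attempts
    raise RuntimeError(f"job:{self.job_hash} failed {self.fail_count} times. job_detail:{self}")
self.submit_job()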

raise RuntimeError("job:job {job} failed 3 times".format(job=self))
# self.fail_count += 1
# if self.fail_count > 3:
# raise RuntimeError("job:job {job} failed 3 times".format(job=self))
self.submit_job()
dlog.info("job: {job_hash} submit; job_id is {job_id}".format(job_hash=self.job_hash, job_id=self.job_id))
Member

Something unrelated: we need to check the job status here, because sometimes submission is not successful due to a full queue.
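
Something like the following could work (a sketch only; check_status and its signature are hypothetical, not dpdispatcher's actual API):

self.submit_job()
# verify that the scheduler actually accepted the job; a full queue can
# make submission fail silently
state = self.machine.check_status(self)  # hypothetical call for illustration
if state == JobStatus.unsubmitted:
    dlog.info(f"job: {self.job_hash} submission rejected; will retry later")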

@njzjz njzjz merged commit b1cba4e into deepmodeling:master Aug 1, 2021
Successfully merging this pull request may close these issues.

terminated job resubmitted 4 times.