Running multiple jobs prevents detection of stuck jobs #26

Open
rkwalters opened this issue Dec 18, 2015 · 0 comments

The motherscripts detect failing/stuck tasks by comparing the current call in send_jobarray to the last task queued (as recorded in the master log file for that motherscript). If the call would submit the same job that was attempted previously, the motherscript errors and stops as intended.

This comparison reads only the last line of the master log file and matches against it. If 2+ jobs sharing the same master log file are running simultaneously (e.g. postimp with different phenotypes, pca with and without reference population samples, etc.), a job's previous task submission may no longer be on the last line of the log file, because one of the other jobs may have submitted a task in the interim.

In the worst case, 2+ jobs that are all failing can cycle on the same task indefinitely. None of them stops, because none sees its own previous task on the last line of the log file, and depending on the task they can generate hundreds of thousands of temp files in the process.

Sample log below, based on a real-world example:

/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921    Thu_Dec_17_13:00:22_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921    Thu_Dec_17_13:02:02_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921    Thu_Dec_17_13:03:37_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921    Thu_Dec_17_13:05:05_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921    Thu_Dec_17_13:06:21_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921    Thu_Dec_17_13:07:39_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_w_cov --addcov cov.txt danerlong.000921    Thu_Dec_17_13:09:01_2015
/path/to/datadir    postimp_navi_17 --mds file.mds_cov --coco 1,2,3,4 --out gwas_wo_cov danerlong.000921    Thu_Dec_17_13:10:18_2015
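The interleaving in the log above can be reproduced in a few lines. This is only an illustrative Python sketch, not the motherscripts' actual code: the function names, the shortened commands, and the assumption that the timestamp is the final whitespace-separated field are all hypothetical. It contrasts a last-line-only comparison with one that scans a window of recent lines.

```python
# Hypothetical, shortened stand-ins for the two interleaved jobs in the log.
W_COV = "/path/to/datadir    postimp_navi_17 --out gwas_w_cov --addcov cov.txt danerlong.000921"
WO_COV = "/path/to/datadir    postimp_navi_17 --out gwas_wo_cov danerlong.000921"


def normalize(text):
    """Collapse runs of whitespace so entries compare reliably."""
    return " ".join(text.split())


def entry_key(line):
    """Return a log line without its trailing timestamp field (assumed layout)."""
    return " ".join(line.split()[:-1])


def matches_last_line(log_lines, submission):
    """Last-line-only check: stuck only if the newest entry is this submission."""
    return bool(log_lines) and entry_key(log_lines[-1]) == normalize(submission)


def matches_recent_line(log_lines, submission, window=10):
    """Alternative check: stuck if any of the last `window` entries matches."""
    target = normalize(submission)
    return any(entry_key(line) == target for line in log_lines[-window:])


# Two failing jobs writing to the same master log, alternating submissions.
log = [
    W_COV + "    Thu_Dec_17_13:00:22_2015",
    WO_COV + "    Thu_Dec_17_13:02:02_2015",
]

# The w_cov job is about to resubmit, but the last line belongs to wo_cov.
print(matches_last_line(log, W_COV))    # the repeat goes undetected
print(matches_recent_line(log, W_COV))  # the repeat is detected
```

Scanning a window would also flag a task that was deliberately rerun shortly afterwards, so whether this is acceptable depends on how the master log is used in practice; separate log files per job would be another option.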