Problems with red step, having 477k jobs #935

Closed
drashutosh opened this issue Jun 1, 2018 · 7 comments

@drashutosh

Hi,
I am using Canu v1.7 +75 and have an issue similar to @ptranvan's.
The memory (job size) configuration in my Canu run is:

--
--                            (tag)Concurrency
--                     (tag)Threads          |
--            (tag)Memory         |          |
--        (tag)         |         |          |     total usage     algorithm
--        -------  ------  --------   --------  -----------------  -----------------------------
-- Local: meryl     64 GB   16 CPUs x   1 job     64 GB   16 CPUs  (k-mer counting)
-- Local: cormhap   32 GB   16 CPUs x   6 jobs   192 GB   96 CPUs  (overlap detection with mhap)
-- Local: obtmhap   32 GB   16 CPUs x   6 jobs   192 GB   96 CPUs  (overlap detection with mhap)
-- Local: utgmhap   32 GB   16 CPUs x   6 jobs   192 GB   96 CPUs  (overlap detection with mhap)
-- Local: ovb        4 GB    1 CPU  x  96 jobs   384 GB   96 CPUs  (overlap store bucketizer)
-- Local: ovs       16 GB    1 CPU  x  63 jobs  1008 GB   63 CPUs  (overlap store sorting)
-- Local: red        8 GB    4 CPUs x  24 jobs   192 GB   96 CPUs  (read error detection)
-- Local: oea        4 GB    1 CPU  x  96 jobs   384 GB   96 CPUs  (overlap error adjustment)
-- Local: bat      256 GB   16 CPUs x   1 job    256 GB   16 CPUs  (contig construction)
-- Local: gfa       16 GB   16 CPUs x   1 job     16 GB   16 CPUs  (GFA alignment and processing)
--
-- In 'epauciflora.gkpStore', found Nanopore reads:
--   Raw:        0
--   Corrected:  477295
--   Trimmed:    477295
--
-- Generating assembly 'ep' in '/home/ashutosh/canu_assembly'
--
-- Parameters:
--
--  genomeSize        500000000

The job stopped at the following point, after running 477295 red.sh jobs:

  -- Starting 'red' concurrent execution on Sun May 20 03:15:50 2018 with 6287.189 GB free disk space(477295 processes; 24 concurrently)

   cd unitigging/3-overlapErrorAdjustment
   ./red.sh 1 > ./red.000001.out 2>&1
   ./red.sh 2 > ./red.000002.out 2>&1
   ./red.sh 3 > ./red.000003.out 2>&1
   ...
   ./red.sh 474655 > ./red.474655.out 2>&1
   ./red.sh 475704 > ./red.475704.out 2>&1
   ./red.sh 476813 > ./red.476813.out 2>&1
-- Finished on Thu May 31 04:46:20 2018 (955830 seconds) with 29331.671 GB free disk space
----------------------------------------
--
-- Read error detection jobs failed, retry.
--   job 00737.red FAILED.
--   job 02142.red FAILED.
--   job 02256.red FAILED.
--   job 02521.red FAILED.

In my run, redMemory is 8 GB and oeaMemory is probably 4 GB. I added AS_UTL_closeFile() to ovStoreFile.C, as shown in commit 6bb19fc.

I renamed the 3-overlapErrorAdjustment directory and reran Canu, but it still shows 477295 red processes. As I recall, it took 11 days (955830 seconds) to finish those 477295 red processes.
Can I rerun Canu and keep these files, or do I need to delete the 3-overlapErrorAdjustment directory?

Please advise me on how to get rid of this issue.

Thanks in advance

@brianwalenz
Member

You've got enough memory to support 16 or 24 gb jobs, so give that a try. You're aiming for a few hundred jobs.

Most, if not all, of the time used in these jobs is just I/O overhead: each job spends nearly all its time loading data and very little time actually computing. It won't take 11 days the next time.
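A back-of-the-envelope check of the I/O point, using only figures from the first run's log (955830 seconds of wall time, 477295 red jobs, 24 running concurrently). It assumes all 24 slots stayed busy for the whole run, so the result is only an average:

```python
# Rough average slot time per 'red' job in the first (8 GB) run.
# Assumption: all 24 concurrent slots were busy for the entire run.
WALL_SECONDS = 955_830   # "(955830 seconds)" from the log
JOBS = 477_295           # number of red.sh processes
CONCURRENT = 24          # "24 concurrently"


def seconds_per_job(wall, jobs, concurrent):
    """Slot-seconds spent per job, averaged over the whole run."""
    return wall * concurrent / jobs


avg = seconds_per_job(WALL_SECONDS, JOBS, CONCURRENT)
print(f"~{avg:.0f} s per job, {WALL_SECONDS / 86400:.1f} days total")
```

That comes to roughly 48 seconds per job, for jobs that each handle a single read, which fits the picture of per-job data loading dominating the run.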

At the start of the log, it reports "Configure RED for ..." and then lists each job. I'd like to see this logging, and also the logging for the broken run if you've still got it (I don't need the full list of jobs, just 50 or so).

@drashutosh
Author

Thank you @brianwalenz for your reply.
The job list from the log of the current run is as follows:

-- Loading read lengths.
-- Loading number of overlaps per read.
sh: line 1: 76994 Aborted                 (core dumped) /home/ashutosh/tools/canu/Linux-amd64/bin/ovStoreDump -G unitigging/epauciflora.gkpStore -O unitigging/epauciflora.ovlStore -counts 2> /dev/null
--
-- Configure RED for 8gb memory.
--                   Batches of at most (unlimited) reads.
--                                      500000000 bases.
--                   Expecting evidence of at most 4989260841 bases per iteration.
--
--           Total                                               Reads                 Olaps Evidence
--    Job   Memory      Read Range         Reads        Bases   Memory        Olaps   Memory   Memory  (Memory in MB)
--   ---- -------- ------------------- --------- ------------ -------- ------------ -------- --------
--      1 11564.77         1-1                 1        44560     0.51           43     0.00  9516.26
--      2 11565.01         2-2                 1        64576     0.74          542     0.01  9516.26
--      3 11564.88         3-3                 1        53982     0.62           99     0.00  9516.26
--      4 11565.01         4-4                 1        65360     0.75           87     0.00  9516.26
--      5 11564.88         5-5                 1        54234     0.62           85     0.00  9516.26
--      6 11564.90         6-6                 1        55717     0.64           77     0.00  9516.26
--      7 11565.04         7-7                 1        67945     0.78           53     0.00  9516.26
--      8 11564.85         8-8                 1        51375     0.59           15     0.00  9516.26
--      9 11564.89         9-9                 1        55113     0.63          127     0.00  9516.26
--     10 11564.68        10-10                1        36387     0.42           79     0.00  9516.26
--     11 11564.73        11-11                1        40970     0.47           63     0.00  9516.26
--     12 11564.92        12-12                1        57230     0.65           83     0.00  9516.26
--     13 11564.80        13-13                1        47109     0.54          151     0.00  9516.26
--     14 11564.77        14-14                1        44702     0.51           94     0.00  9516.26
--     15 11564.87        15-15                1        53459     0.61           53     0.00  9516.26
--     16 11564.68        16-16                1        36743     0.42           80     0.00  9516.26
--     17 11564.85        17-17                1        49765     0.57         2161     0.02  9516.26
--     18 11564.73        18-18                1        40716     0.47           73     0.00  9516.26
--     19 11564.70        19-19                1        38515     0.44           43     0.00  9516.26
--     20 11565.30        20-20                1        90850     1.04           65     0.00  9516.26
--     21 11565.42        21-21                1        70548     0.81        31240     0.36  9516.26
--     22 11564.96        22-22                1        61116     0.70           67     0.00  9516.26
--     23 11564.72        23-23                1        40290     0.46           52     0.00  9516.26
--     24 11564.77        24-24                1        44500     0.51           50     0.00  9516.26
--     25 11565.68        25-25                1        82562     0.94        41449     0.47  9516.26
--     26 11565.21        26-26                1        55327     0.63        27570     0.32  9516.26
--     27 11564.93        27-27                1        58183     0.67           42     0.00  9516.26
--     28 11564.81        28-28                1        47621     0.55           35     0.00  9516.26
--     29 11564.71        29-29                1        39217     0.45           32     0.00  9516.26
--     30 11564.75        30-30                1        42495     0.49           60     0.00  9516.26
--     31 11564.71        31-31                1        38901     0.45           33     0.00  9516.26
--     32 11564.66        32-32                1        35076     0.40           37     0.00  9516.26
--     33 11564.73        33-33                1        40692     0.47           84     0.00  9516.26
--     34 11564.95        34-34                1        60611     0.69           84     0.00  9516.26
--     35 11564.73        35-35                1        41376     0.47           38     0.00  9516.26
--     36 11565.41        36-36                1        66760     0.76        33340     0.38  9516.26
--     37 11564.70        37-37                1        38256     0.44           68     0.00  9516.26
--     38 11565.29        38-38                1        89816     1.03          120     0.00  9516.26
--     39 11565.51        39-39                1       109367     1.25          113     0.00  9516.26
--     40 11564.72        40-40                1        40079     0.46           25     0.00  9516.26
--     41 11564.85        41-41                1        51818     0.59           31     0.00  9516.26
--     42 11564.69        42-42                1        37485     0.43           51     0.00  9516.26
--     43 11564.87        43-43                1        51866     0.59         1098     0.01  9516.26
--     44 11564.79        44-44                1        46112     0.53           74     0.00  9516.26
--     45 11564.90        45-45                1        56181     0.64           91     0.00  9516.26
--     46 11564.86        46-46                1        52711     0.60           67     0.00  9516.26
--     47 11565.08        47-47                1        43885     0.50        27517     0.31  9516.26
--     48 11564.95        48-48                1        60428     0.69           71     0.00  9516.26
--     49 11564.69        49-49                1        37784     0.43           49     0.00  9516.26
--     50 11564.78        50-50                1        45685     0.52           61     0.00  9516.26
--     51 11564.97        51-51                1        61687     0.71           78     0.00  9516.26
--     52 11565.38        52-52                1        57484     0.66        40255     0.46  9516.26
--     53 11564.69        53-53                1        37819     0.43           70     0.00  9516.26
--     54 11564.80        54-54                1        46851     0.54           75     0.00  9516.26
--     55 11564.89        55-55                1        55397     0.63           37     0.00  9516.26
--     56 11564.73        56-56                1        41275     0.47           51     0.00  9516.26
--     57 11565.10        57-57                1        73085     0.84           80     0.00  9516.26
--     58 11564.73        58-58                1        41235     0.47           67     0.00  9516.26
--     59 11564.99        59-59                1        64037     0.73           74     0.00  9516.26

And here is the one for the broken run:

-- Loading read lengths.
-- Loading number of overlaps per read.
--
-- Configure RED for 8gb memory.
--                   Batches of at most (unlimited) reads.
--                                      500000000 bases.
--                   Expecting evidence of at most 4989260841 bases per iteration.
--
--           Total                                               Reads                 Olaps Evidence
--    Job   Memory      Read Range         Reads        Bases   Memory        Olaps   Memory   Memory  (Memory in MB)
--   ---- -------- ------------------- --------- ------------ -------- ------------ -------- --------
--      1 11564.77         1-1                 1        44560     0.51           43     0.00  9516.26
--      2 11565.01         2-2                 1        64576     0.74          542     0.01  9516.26
--      3 11564.88         3-3                 1        53982     0.62           99     0.00  9516.26
--      4 11565.01         4-4                 1        65360     0.75           87     0.00  9516.26
--      5 11564.88         5-5                 1        54234     0.62           85     0.00  9516.26
--      6 11564.90         6-6                 1        55717     0.64           77     0.00  9516.26
--      7 11565.04         7-7                 1        67945     0.78           53     0.00  9516.26
--      8 11564.85         8-8                 1        51375     0.59           15     0.00  9516.26
--      9 11564.89         9-9                 1        55113     0.63          127     0.00  9516.26
--     10 11564.68        10-10                1        36387     0.42           79     0.00  9516.26
--     11 11564.73        11-11                1        40970     0.47           63     0.00  9516.26
--     12 11564.92        12-12                1        57230     0.65           83     0.00  9516.26
--     13 11564.80        13-13                1        47109     0.54          151     0.00  9516.26
--     14 11564.77        14-14                1        44702     0.51           94     0.00  9516.26
--     15 11564.87        15-15                1        53459     0.61           53     0.00  9516.26
--     16 11564.68        16-16                1        36743     0.42           80     0.00  9516.26
--     17 11564.85        17-17                1        49765     0.57         2161     0.02  9516.26
--     18 11564.73        18-18                1        40716     0.47           73     0.00  9516.26
--     19 11564.70        19-19                1        38515     0.44           43     0.00  9516.26
--     20 11565.30        20-20                1        90850     1.04           65     0.00  9516.26
--     21 11565.42        21-21                1        70548     0.81        31240     0.36  9516.26
--     22 11564.96        22-22                1        61116     0.70           67     0.00  9516.26
--     23 11564.72        23-23                1        40290     0.46           52     0.00  9516.26
--     24 11564.77        24-24                1        44500     0.51           50     0.00  9516.26
--     25 11565.68        25-25                1        82562     0.94        41449     0.47  9516.26
--     26 11565.21        26-26                1        55327     0.63        27570     0.32  9516.26
--     27 11564.93        27-27                1        58183     0.67           42     0.00  9516.26
--     28 11564.81        28-28                1        47621     0.55           35     0.00  9516.26
--     29 11564.71        29-29                1        39217     0.45           32     0.00  9516.26
--     30 11564.75        30-30                1        42495     0.49           60     0.00  9516.26
--     31 11564.71        31-31                1        38901     0.45           33     0.00  9516.26
--     32 11564.66        32-32                1        35076     0.40           37     0.00  9516.26
--     33 11564.73        33-33                1        40692     0.47           84     0.00  9516.26
--     34 11564.95        34-34                1        60611     0.69           84     0.00  9516.26
--     35 11564.73        35-35                1        41376     0.47           38     0.00  9516.26
--     36 11565.41        36-36                1        66760     0.76        33340     0.38  9516.26
--     37 11564.70        37-37                1        38256     0.44           68     0.00  9516.26
--     38 11565.29        38-38                1        89816     1.03          120     0.00  9516.26
--     39 11565.51        39-39                1       109367     1.25          113     0.00  9516.26
--     40 11564.72        40-40                1        40079     0.46           25     0.00  9516.26
--     41 11564.85        41-41                1        51818     0.59           31     0.00  9516.26
--     42 11564.69        42-42                1        37485     0.43           51     0.00  9516.26
--     43 11564.87        43-43                1        51866     0.59         1098     0.01  9516.26
--     44 11564.79        44-44                1        46112     0.53           74     0.00  9516.26
--     45 11564.90        45-45                1        56181     0.64           91     0.00  9516.26
--     46 11564.86        46-46                1        52711     0.60           67     0.00  9516.26
--     47 11565.08        47-47                1        43885     0.50        27517     0.31  9516.26
--     48 11564.95        48-48                1        60428     0.69           71     0.00  9516.26
--     49 11564.69        49-49                1        37784     0.43           49     0.00  9516.26
--     50 11564.78        50-50                1        45685     0.52           61     0.00  9516.26
--     51 11564.97        51-51                1        61687     0.71           78     0.00  9516.26
--     52 11565.38        52-52                1        57484     0.66        40255     0.46  9516.26
--     53 11564.69        53-53                1        37819     0.43           70     0.00  9516.26
--     54 11564.80        54-54                1        46851     0.54           75     0.00  9516.26
--     55 11564.89        55-55                1        55397     0.63           37     0.00  9516.26
--     56 11564.73        56-56                1        41275     0.47           51     0.00  9516.26
--     57 11565.10        57-57                1        73085     0.84           80     0.00  9516.26
--     58 11564.73        58-58                1        41235     0.47           67     0.00  9516.26
--     59 11564.99        59-59                1        64037     0.73           74     0.00  9516.26

Nothing has changed in the current run.

Please advise

@skoren
Member

skoren commented Jun 4, 2018

Did you increase redMemory and oeaMemory? Remove the 3-overlapErrorAdjustment directory and add redMemory=24 oeaMemory=24 to your canu command.
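skoren's point can be checked against the job table itself: in the 8 GB run, every job's logged Total Memory (about 11565 MB) is already above 8 GB before any batching happens. A minimal sketch that scans "Configure RED" rows and reports jobs over the requested memory. It assumes the row layout shown in this thread and that redMemory=N means N × 1024 MB; neither is taken from Canu's documentation.

```python
import re

# A "Configure RED" job row: "--  <job> <total MB>  <first>-<last> ..."
ROW = re.compile(r"^--\s+(\d+)\s+([\d.]+)\s+(\d+)-(\d+)\s")


def jobs_over_budget(log_lines, mem_gb):
    """Return (job, total_mb) for job rows whose Total Memory exceeds mem_gb."""
    budget_mb = mem_gb * 1024  # assumption: 1 GB = 1024 MB in these tables
    over = []
    for line in log_lines:
        m = ROW.match(line)
        if m and float(m.group(2)) > budget_mb:
            over.append((int(m.group(1)), float(m.group(2))))
    return over


# Rows copied from the 8 GB and 24 GB tables in this thread.
rows_8gb = [
    "--      1 11564.77         1-1                 1        44560     0.51           43     0.00  9516.26",
    "--      2 11565.01         2-2                 1        64576     0.74          542     0.01  9516.26",
]
rows_24gb = [
    "--      1 17472.38         1-10105         10105    500020844  5722.60     16210869   185.52  9516.26",
    "--     47 13271.94    474340-477295         2956    139098567  1591.95     10112371   115.73  9516.26",
]
print(jobs_over_budget(rows_8gb, 8))    # [(1, 11564.77), (2, 11565.01)]
print(jobs_over_budget(rows_24gb, 24))  # []
```

Header, separator, and totals lines do not match the pattern, so the whole log can be passed in unchanged.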

@brianwalenz
Member

I'm troubled by the crash reported in the current run's output. If it appears again when you restart with 24 GB, check whether it also occurs when you run the reported ovStoreDump command by hand.
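Running the command by hand matters because the logged invocation ends in 2> /dev/null, which hides ovStoreDump's own error message. A small convenience wrapper (not part of Canu) that runs a command with stderr captured and turns a negative return code into a signal name, so an "Aborted (core dumped)" shows up as SIGABRT. The command line is copied from the log; the assumption that ovStoreDump is on PATH is mine (the log uses the full path under Linux-amd64/bin).

```python
import shutil
import signal
import subprocess


def describe_returncode(rc):
    """Turn a subprocess return code into a short status string."""
    if rc == 0:
        return "ok"
    if rc < 0:  # POSIX: the child was terminated by signal number -rc
        return f"killed by signal {signal.Signals(-rc).name}"
    return f"exited with status {rc}"


def run_and_report(cmd, stdout_path):
    """Run cmd with stdout sent to a file and stderr captured, not discarded."""
    with open(stdout_path, "w") as out:
        proc = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)
    return describe_returncode(proc.returncode), proc.stderr


# Command from the log, minus "2> /dev/null"; use the full path if not on PATH.
CMD = ["ovStoreDump", "-G", "unitigging/epauciflora.gkpStore",
       "-O", "unitigging/epauciflora.ovlStore", "-counts"]

if shutil.which(CMD[0]):
    status, err = run_and_report(CMD, "counts")
    print(status)
    print(err[-2000:])  # tail of stderr, where a failure message would appear
```

A plain shell rerun without the 2> /dev/null redirect shows the same information; the wrapper just makes the signal explicit.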

Increasing to 24gb will solve the problem. It's caused by the long read length (and a broken algorithm for setting up the jobs).
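The job tables in this thread fit a simple per-job memory model: Total ≈ 2048 MB + Evidence (9516.26 MB) + about 12 bytes per read base + about 12 bytes per overlap. These constants are back-calculated from the logged columns here, not read from Canu's source. The sketch below is a hypothetical greedy batcher built on that model, not Canu's job-setup code, but it reproduces both behaviours seen in this thread: at 8 GB the fixed cost alone exceeds the budget, so every job collapses to one read, and at 24 GB batches stop at the 500000000-base limit instead. Unlike the real log, where batches overshoot the limit slightly, this sketch stops just below it.

```python
from dataclasses import dataclass

MB = 1024 * 1024
# Constants back-calculated from the job tables in this thread (assumptions):
FIXED_MB = 2048.0         # Total minus (Reads + Olaps + Evidence) memory
EVIDENCE_MB = 9516.26     # "Evidence Memory", identical for every job
BYTES_PER_BASE = 12       # approx. "Reads Memory" / "Bases"
BYTES_PER_OLAP = 12       # approx. "Olaps Memory" / "Olaps"
MAX_BASES = 500_000_000   # "Batches of at most ... 500000000 bases"


@dataclass
class Read:
    length: int  # bases
    olaps: int   # overlaps stored for this read


def job_memory_mb(bases, olaps):
    """Estimated total memory (MB) of a job holding `bases` and `olaps`."""
    return FIXED_MB + EVIDENCE_MB + (bases * BYTES_PER_BASE + olaps * BYTES_PER_OLAP) / MB


def make_batches(reads, mem_gb):
    """Greedily pack consecutive reads into jobs; every job gets >= 1 read."""
    budget_mb = mem_gb * 1024
    batches, current, bases, olaps = [], [], 0, 0
    for r in reads:
        fits = (bases + r.length <= MAX_BASES
                and job_memory_mb(bases + r.length, olaps + r.olaps) <= budget_mb)
        if current and not fits:
            batches.append(current)  # close the job before this read
            current, bases, olaps = [], 0, 0
        current.append(r)  # a read that never fits still gets a job of its own
        bases += r.length
        olaps += r.olaps
    if current:
        batches.append(current)
    return batches


# Hypothetical input: 20,000 reads of 50 kbp with 3,000 overlaps each.
reads = [Read(length=50_000, olaps=3_000) for _ in range(20_000)]
for gb in (8, 24):
    print(gb, "GB ->", len(make_batches(reads, gb)), "jobs")
# 8 GB -> 20000 jobs
# 24 GB -> 2 jobs
```

The design point the model makes is that a memory budget smaller than the fixed per-job cost does not fail loudly; it silently degrades to one read per job, which is the 477295-job configuration in the first run.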

Also, some reads have an incredible number of overlaps; e.g., read 25 is 82 kbp and has 40,000 overlaps. This could just be repeat pile-ups, but I'd suggest analyzing that read: it's the 25th read in the trimmedReads.fasta file. You can also dump its overlaps with ovStoreDump -G *gkpStore -O *ovlStore -p 25 (warning: the dump will have about 40,000 lines).
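To pull read 25 out for a closer look, a small sketch that returns the n-th record of a FASTA file, handling sequences wrapped over several lines. The exact file name, and whether it is gzip-compressed, depends on the Canu version and assembly prefix, so treat the path as something to check in your assembly directory.

```python
import gzip


def nth_record(lines, n):
    """Return (header, sequence) of the n-th FASTA record (1-based) from lines."""
    if n < 1:
        raise ValueError("n is 1-based")
    count, header, seq = 0, None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if count == n:
                break  # reached the record after the one we want
            count += 1
            header, seq = line[1:], []
        elif count == n:
            seq.append(line)  # sequence may be wrapped over several lines
    if count < n:
        raise ValueError(f"only {count} records found")
    return header, "".join(seq)


def nth_fasta_record(path, n):
    """Open a plain or gzip-compressed FASTA file and fetch its n-th record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return nth_record(fh, n)
```

For example, nth_fasta_record("<prefix>.trimmedReads.fasta.gz", 25) should return the read that appears as job 25 in the 8 GB table, provided read IDs follow file order as described above.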

@drashutosh
Author

drashutosh commented Jun 5, 2018

@skoren and @brianwalenz, thank you for your suggestions.
After adding redMemory=24 oeaMemory=24 to my canu command I got 47 red jobs, but these also failed. Do I need to dump the overlaps for all reads with an incredible number of overlaps, e.g., >3k overlaps?

-- Loading read lengths.
-- Loading number of overlaps per read.
sh: line 1: 75369 Aborted                 (core dumped) /home/ashutosh/tools/canu/Linux-amd64/bin/ovStoreDump -G unitigging/epauciflora.gkpStore -O unitigging/epauciflora.ovlStore -counts 2> /dev/null
--
-- Configure RED for 24gb memory.
--                   Batches of at most (unlimited) reads.
--                                      500000000 bases.
--                   Expecting evidence of at most 4989260841 bases per iteration.
--
--           Total                                               Reads                 Olaps Evidence
--    Job   Memory      Read Range         Reads        Bases   Memory        Olaps   Memory   Memory  (Memory in MB)
--   ---- -------- ------------------- --------- ------------ -------- ------------ -------- --------
--      1 17472.38         1-10105         10105    500020844  5722.60     16210869   185.52  9516.26
--      2 17469.47     10106-20134         10029    500038780  5722.81     15939093   182.41  9516.26
--      3 17504.70     20135-29999          9865    500014722  5722.52     19042045   217.92  9516.26
--      4 17633.88     30000-40362         10363    500011176  5722.50     30331529   347.12  9516.26
--      5 17649.88     40363-50307          9945    500023165  5722.62     31719304   363.00  9516.26
--      6 17652.87     50308-60283          9976    500036633  5722.78     31966794   365.83  9516.26
--      7 17596.68     60284-70778         10495    500001913  5722.40     27089963   310.02  9516.26
--      8 17610.00     70779-81261         10483    500029549  5722.71     28226091   323.02  9516.26
--      9 17624.93     81262-91770         10509    500023164  5722.64     29537682   338.03  9516.26
--     10 17662.17     91771-102559        10789    500000496  5722.39     32813294   375.52  9516.26
--     11 17625.33    102560-113247        10688    500002611  5722.41     29591957   338.65  9516.26
--     12 17644.21    113248-123673        10426    500015501  5722.55     31230020   357.40  9516.26
--     13 17638.01    123674-134080        10407    500034400  5722.77     30669533   350.98  9516.26
--     14 17638.90    134081-144589        10509    500050749  5722.96     30730547   351.68  9516.26
--     15 17641.55    144590-155113        10524    500028912  5722.71     30983855   354.58  9516.26
--     16 17635.58    155114-165649        10536    500013532  5722.53     30477870   348.79  9516.26
--     17 17552.09    165650-175889        10240    500020918  5722.61     23175419   265.22  9516.26
--     18 17535.53    175890-186113        10224    500018156  5722.58     21731624   248.70  9516.26
--     19 17580.00    186114-196516        10403    500025113  5722.66     25609365   293.08  9516.26
--     20 17607.56    196517-207006        10490    500036768  5722.80     28006226   320.51  9516.26
--     21 17592.83    207007-217472        10466    500018849  5722.59     26736707   305.98  9516.26
--     22 17544.32    217473-227503        10031    500055519  5723.00     22462910   257.07  9516.26
--     23 17542.50    227504-237461         9958    500028962  5722.69     22330520   255.55  9516.26
--     24 17529.90    237462-247351         9890    500046495  5722.89     21212118   242.75  9516.26
--     25 17578.09    247352-257342         9991    500031111  5722.72     25438326   291.12  9516.26
--     26 17621.19    257343-267378        10036    500037339  5722.79     29197677   334.14  9516.26
--     27 17611.22    267379-277279         9901    500058864  5723.03     28304934   323.92  9516.26
--     28 17617.25    277280-287061         9782    500011718  5722.49     28879555   330.50  9516.26
--     29 17635.94    287062-296916         9855    500014573  5722.52     30510128   349.16  9516.26
--     30 17670.23    296917-306855         9939    500019214  5722.58     33500956   383.39  9516.26
--     31 17659.01    306856-317024        10169    500013883  5722.52     32525877   372.23  9516.26
--     32 17638.21    317025-327737        10713    500015018  5722.55     30705052   351.39  9516.26
--     33 17660.43    327738-338477        10740    500063499  5723.11     32598611   373.06  9516.26
--     34 17623.37    338478-348985        10508    500025634  5722.67     29398192   336.44  9516.26
--     35 17548.30    348986-359070        10085    500058254  5723.03     22807094   261.01  9516.26
--     36 17567.99    359071-369278        10208    500031973  5722.73     24553612   280.99  9516.26
--     37 17588.34    369279-379732        10454    500029799  5722.72     26333290   301.36  9516.26
--     38 17587.69    379733-390165        10433    500040343  5722.84     26266511   300.60  9516.26
--     39 17584.09    390166-401130        10965    500033959  5722.78     25956626   297.05  9516.26
--     40 17647.26    401131-411462        10332    500048748  5722.93     31463766   360.07  9516.26
--     41 17649.95    411463-421448         9986    500012297  5722.50     31736180   363.19  9516.26
--     42 17639.13    421449-431604        10156    500032697  5722.74     30769468   352.13  9516.26
--     43 17667.64    431605-442355        10751    500055755  5723.02     33236491   380.36  9516.26
--     44 17655.25    442356-453023        10668    500006676  5722.46     32203176   368.54  9516.26
--     45 17676.51    453024-463699        10676    500053552  5722.99     34013526   389.25  9516.26
--     46 17682.61    463700-474339        10640    500000813  5722.39     34599304   395.96  9516.26
--     47 13271.94    474340-477295         2956    139098567  1591.95     10112371   115.73  9516.26
--   ---- -------- ------------------- --------- ------------ -------- ------------ -------- --------
--                                                23140391213            1302936058
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'red' concurrent execution on Tue Jun  5 10:59:02 2018 with 30140.936 GB free disk space (47 processes; 24 concurrently)

    cd unitigging/3-overlapErrorAdjustment
    ./red.sh 1 > ./red.000001.out 2>&1
    ./red.sh 2 > ./red.000002.out 2>&1
    ...
    ./red.sh 46 > ./red.000046.out 2>&1
    ./red.sh 47 > ./red.000047.out 2>&1

-- Finished on Tue Jun  5 11:46:02 2018 (2820 seconds) with 29859.274 GB free disk space
----------------------------------------
--
-- Read error detection jobs failed, retry.
--   job 00001.red FAILED.
--   job 00002.red FAILED.
--   job 00003.red FAILED.
--   job 00004.red FAILED.
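The 47 jobs follow from two totals in the 24 GB table: 23140391213 bases over 477295 reads, split at a 500000000-base batch limit. A quick arithmetic check in Python; since the logged batches overshoot the limit slightly when they take whole reads, the ceiling division is an estimate that happens to match here. The mean read length also shows why read length drives these jobs.

```python
# Figures copied from the 24 GB "Configure RED" table in this thread.
TOTAL_BASES = 23_140_391_213   # total of the Bases column
TOTAL_READS = 477_295          # reads 1-477295
MAX_BASES_PER_JOB = 500_000_000


def estimate_jobs(total_bases, max_bases):
    """Approximate batch count: integer ceiling division, no floats."""
    return -(-total_bases // max_bases)


print("estimated jobs:", estimate_jobs(TOTAL_BASES, MAX_BASES_PER_JOB))
print("mean trimmed read length:", round(TOTAL_BASES / TOTAL_READS), "bases")
```

This gives 47 jobs and a mean read length of about 48 kbp.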

@brianwalenz
Member

If you run ovStoreDump -G unitigging/epauciflora.gkpStore -O unitigging/epauciflora.ovlStore -counts > counts by hand, what does it fail with? The output isn't huge, just one number per read, and it should run quickly.
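Once the counts file exists, it also answers the earlier question about reads with more than 3k overlaps without dumping every overlap. A sketch that lists reads above a threshold. The column layout of ovStoreDump -counts is not shown in this thread, so the format here is an assumption: one line per read, the count in the last field, and an optional read ID in the first field (otherwise the line position is used).

```python
import os


def high_overlap_reads(lines, threshold):
    """Yield (read_id, count) for reads with more than `threshold` overlaps.

    Assumed format (check against your file): one line per read, the overlap
    count in the last field, and optionally the read ID in the first field.
    Without an ID field, the 1-based line position is used as the read ID.
    """
    read_no = 0
    for line in lines:
        fields = line.split()
        if not fields or not fields[-1].isdigit():
            continue  # skip blank or header lines
        read_no += 1
        if len(fields) >= 2 and fields[0].isdigit():
            read_id = int(fields[0])
        else:
            read_id = read_no
        count = int(fields[-1])
        if count > threshold:
            yield read_id, count


if os.path.exists("counts"):
    with open("counts") as fh:
        for read_id, count in high_overlap_reads(fh, 3000):
            print(read_id, count)
```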

What errors are reported at the end of the unitigging/3-overlapErrorAdjustment/red.*.out files?
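With 47 log files, a small helper that prints the last few lines of each red.*.out saves opening them one by one. The glob pattern is the path from the question above; the choice of five lines is arbitrary. From the shell, tail -n 5 unitigging/3-overlapErrorAdjustment/red.*.out does much the same.

```python
import glob
from collections import deque


def tail_lines(lines, n=5):
    """Return the last n items of an iterable of lines, using bounded memory."""
    return list(deque(lines, maxlen=n))


def tails(pattern, n=5):
    """Map each file matching the glob pattern to its last n lines."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as fh:
            result[path] = tail_lines(fh, n)
    return result


for path, lines in tails("unitigging/3-overlapErrorAdjustment/red.*.out").items():
    print(f"== {path}")
    print("".join(lines), end="")
```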

@brianwalenz
Member

Idle. Reopen if needed.
