Getting a number of CPU cores on PBS_pro #1912

Closed · mmokrejs opened this issue Mar 4, 2021 · 5 comments
mmokrejs commented Mar 4, 2021

Hi,
when rerunning a canu-2.1.1 job on a different machine, I realized that canu picks up the total number of CPU cores available on the machine instead of respecting what I reserved through the queuing system. Here is how I started the job:

#PBS -l select=1:ncpus=240:mem=6000gb:scratch_local=12tb,walltime=48:00:00

...

canu useGrid=false ... genomeSize=6.8g correctedErrorRate=0.16 corMhapSensitivity=high ovsMemory=1024 ovsConcurrency=5
-- Detected 504 CPUs and 10074 gigabytes of memory.
-- Detected PBSPro '19.0.0' with 'pbsnodes' binary in /opt/pbs/bin/pbsnodes.
-- Grid engine and staging disabled per useGrid=false option.
--
--                                (tag)Concurrency
--                         (tag)Threads          |
--                (tag)Memory         |          |
--        (tag)             |         |          |       total usage      algorithm
--        -------  ----------  --------   --------  --------------------  -----------------------------
-- Local: meryl     64.000 GB    8 CPUs x  63 jobs 4032.000 GB 504 CPUs  (k-mer counting)
-- Local: hap       16.000 GB   63 CPUs x   8 jobs  128.000 GB 504 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   64.000 GB   14 CPUs x  36 jobs 2304.000 GB 504 CPUs  (overlap detection with mhap)
-- Local: obtovl    24.000 GB   14 CPUs x  36 jobs  864.000 GB 504 CPUs  (overlap detection)
-- Local: utgovl    24.000 GB   14 CPUs x  36 jobs  864.000 GB 504 CPUs  (overlap detection)
-- Local: cor       24.000 GB    4 CPUs x 126 jobs 3024.000 GB 504 CPUs  (read correction)
-- Local: ovb        4.000 GB    1 CPU  x 504 jobs 2016.000 GB 504 CPUs  (overlap store bucketizer)
-- Local: ovs     1024.000 GB    1 CPU  x   5 jobs 5120.000 GB   5 CPUs  (overlap store sorting)
-- Local: red       64.000 GB    9 CPUs x  56 jobs 3584.000 GB 504 CPUs  (read error detection)
-- Local: oea        8.000 GB    1 CPU  x 504 jobs 4032.000 GB 504 CPUs  (overlap error adjustment)
-- Local: bat     1024.000 GB   64 CPUs x   1 job  1024.000 GB  64 CPUs  (contig construction with bogart)
-- Local: cns        -.--- GB    8 CPUs x   - jobs    -.--- GB   - CPUs  (consensus)

It picked up 504 CPU cores and 10 TB of RAM, although I have the following in the environment:

PBS_NCPUS=240
PBS_NGPUS=0
PBS_NUM_NODES=1
PBS_NUM_PPN=240
PBS_RESC_MEM=6442450944000
PBS_RESC_SCRATCH_SSD=13194139533312
PBS_RESC_SCRATCH_VOLUME=13194139533312
PBS_RESC_TOTAL_MEM=6442450944000
PBS_RESC_TOTAL_PROCS=240
PBS_RESC_TOTAL_SCRATCH_VOLUME=13194139533312
PBS_RESC_TOTAL_WALLTIME=172800
SCRATCH=/scratch.ssd/mmokrejs/job_2227881.cerit-pbs.cerit-sc.cz
SCRATCHDIR=/scratch.ssd/mmokrejs/job_2227881.cerit-pbs.cerit-sc.cz
SCRATCH_TYPE=ssd
SCRATCH_VOLUME=13194139533312
TORQUE_RESC_MEM=6442450944000
TORQUE_RESC_PROC=240
TORQUE_RESC_SCRATCH_SSD=13194139533312
TORQUE_RESC_SCRATCH_VOLUME=13194139533312
TORQUE_RESC_TOTAL_MEM=6442450944000
TORQUE_RESC_TOTAL_PROCS=240
TORQUE_RESC_TOTAL_SCRATCH_VOLUME=13194139533312
TORQUE_RESC_TOTAL_WALLTIME=172800
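
For reference, those byte counts line up with the resources requested in the #PBS line; a quick sanity check in the shell (a sketch, assuming bash arithmetic):

    # 6442450944000 bytes / 1024^3 = 6000 GB, matching mem=6000gb in the job request
    echo $(( PBS_RESC_MEM / 1024 / 1024 / 1024 ))   # prints 6000
    echo $PBS_NCPUS                                 # prints 240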

I see some code in canu/src/utility/src/utility/system.C; although its comments mention more PBS Pro variables, only PBS_NUM_PPN is looked up (in theory).

Could it be that this code is bypassed altogether because I started canu with useGrid=false? That would be bad. I only wanted to avoid submitting child jobs into the queuing system; of course, I expected canu to understand that it is still running under a job scheduling system, on an exec host assigned to me, and to respect its limits (6 TB RAM and only 240 CPUs).

-- BEGIN CORRECTION
--
--
-- Creating overlap store correction/my_genome.ovlStore using:
--    147 buckets
--    616 slices
--        using at most 29 GB memory each
-- Finished stage 'cor-overlapStoreConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'ovB' concurrent execution on Thu Mar  4 09:11:40 2021 with 214055.941 GB free disk space (147 processes; 504 concurrently)

    cd correction/my_genome.ovlStore.BUILDING
    ./scripts/1-bucketize.sh 1 > ./logs/1-bucketize.000001.out 2>&1
brianwalenz (Member) commented

Job configuration in canu is pretty messy. Canu wants to do it all itself, telling the various binaries explicitly how many cores and how much memory to use. The code in utility/ isn't used by canu directly, but is used by some of the binaries (e.g., meryl) that are useful outside of canu.

Canu gets its max memory/thread limits (when not using a grid) from getNumberOfCPUs() and getPhysicalMemorySize() in src/pipelines/canu/Defaults.pm. Checking the various environment variables and using those numbers if they're valid is a good suggestion. I'll add that in the next day or two.

The other way is to set maxThreads=$PBS_NCPUS and maxMemory=.... when canu is run. maxMemory is a bit of a challenge since it wants memory in GB.
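
For example, a pure-shell version of that (a sketch, assuming bash arithmetic and the PBS_RESC_MEM byte count shown above) could be:

    # convert the PBS byte limit to whole gigabytes for canu's maxMemory option
    canu useGrid=false \
         maxThreads=$PBS_NCPUS \
         maxMemory=$(( PBS_RESC_MEM / 1024 / 1024 / 1024 )) \
         ...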

mmokrejs (Author) commented Mar 4, 2021

That would be excellent; I can recompile from git master. Using $PBS_NUM_PPN would be better than $PBS_NCPUS, IMHO.

mmokrejs (Author) commented Mar 4, 2021

BTW, ideally the following output would be improved to show the variable names and their respective values:

-- Reset concurrency from 64 to 1.
-- Reset concurrency from 5 to 1.

These are probably ovsConcurrency and ovbConcurrency, but it is hard to get a list of such options from the manual; I am just guessing their names and trying blindly.
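
If those guesses are right, they would be set like the other (tag)Concurrency options; a hypothetical example (ovsConcurrency appears in my original command, ovbConcurrency is the guessed name):

    canu useGrid=false ... ovbConcurrency=64 ovsConcurrency=5 ...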

brianwalenz added a commit that referenced this issue Mar 16, 2021
…hat for configuration when running in non-grid mode. Issue #1912.
mmokrejs (Author) commented Mar 24, 2021

I think canu could try to use something like the following, which works for me:

maxThreads=$PBS_NCPUS maxMemory=`python -c "print(int($PBS_RESC_TOTAL_MEM/1024/1024/1024))"`

You could probably come up with a syntax that calls the bc(1) calculator instead.

I see TORQUE_RESC_MEM=5476083302400 and PBS_RESC_MEM=5476083302400 being set, for example, to the amount I was assigned by PBSPro. Likewise, canu could pick whichever of $PBS_NCPUS and $PBS_NUM_PPN is non-empty.
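
Putting both ideas together, a wrapper sketch (assuming bc(1) is installed and that at least one of the two CPU variables is set):

    # prefer PBS_NCPUS, fall back to PBS_NUM_PPN when it is empty or unset
    NCPUS=${PBS_NCPUS:-$PBS_NUM_PPN}
    # bc(1) variant of the bytes-to-gigabytes conversion
    MEM_GB=$(echo "$PBS_RESC_TOTAL_MEM / 1024 / 1024 / 1024" | bc)
    canu useGrid=false maxThreads=$NCPUS maxMemory=$MEM_GB ...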

brianwalenz (Member) commented
Added PBS_NCPUS (PBS_NUM_PPN was already there) and PBS_RESC_MEM. I can't find any documentation on the Torque variants, and only spotty documentation on the PBS variables.
