Getting a number of CPU cores on PBS_pro #1912

Closed · mmokrejs opened this issue Mar 4, 2021 · 5 comments
mmokrejs commented Mar 4, 2021

Hi,
when rerunning a canu-2.1.1 job on a different machine, I realized that canu picks up the total number of CPU cores available on the machine instead of respecting what I reserved through the queuing system. Here is how I started the job:

#PBS -l select=1:ncpus=240:mem=6000gb:scratch_local=12tb,walltime=48:00:00

...

canu useGrid=false ... genomeSize=6.8g correctedErrorRate=0.16 corMhapSensitivity=high ovsMemory=1024 ovsConcurrency=5
-- Detected 504 CPUs and 10074 gigabytes of memory.
-- Detected PBSPro '19.0.0' with 'pbsnodes' binary in /opt/pbs/bin/pbsnodes.
-- Grid engine and staging disabled per useGrid=false option.
--
--                                (tag)Concurrency
--                         (tag)Threads          |
--                (tag)Memory         |          |
--        (tag)             |         |          |       total usage      algorithm
--        -------  ----------  --------   --------  --------------------  -----------------------------
-- Local: meryl     64.000 GB    8 CPUs x  63 jobs 4032.000 GB 504 CPUs  (k-mer counting)
-- Local: hap       16.000 GB   63 CPUs x   8 jobs  128.000 GB 504 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   64.000 GB   14 CPUs x  36 jobs 2304.000 GB 504 CPUs  (overlap detection with mhap)
-- Local: obtovl    24.000 GB   14 CPUs x  36 jobs  864.000 GB 504 CPUs  (overlap detection)
-- Local: utgovl    24.000 GB   14 CPUs x  36 jobs  864.000 GB 504 CPUs  (overlap detection)
-- Local: cor       24.000 GB    4 CPUs x 126 jobs 3024.000 GB 504 CPUs  (read correction)
-- Local: ovb        4.000 GB    1 CPU  x 504 jobs 2016.000 GB 504 CPUs  (overlap store bucketizer)
-- Local: ovs     1024.000 GB    1 CPU  x   5 jobs 5120.000 GB   5 CPUs  (overlap store sorting)
-- Local: red       64.000 GB    9 CPUs x  56 jobs 3584.000 GB 504 CPUs  (read error detection)
-- Local: oea        8.000 GB    1 CPU  x 504 jobs 4032.000 GB 504 CPUs  (overlap error adjustment)
-- Local: bat     1024.000 GB   64 CPUs x   1 job  1024.000 GB  64 CPUs  (contig construction with bogart)
-- Local: cns        -.--- GB    8 CPUs x   - jobs    -.--- GB   - CPUs  (consensus)

It picked up 504 CPU cores and 10 TB of RAM, although I have the following in the environment:

PBS_NCPUS=240
PBS_NGPUS=0
PBS_NUM_NODES=1
PBS_NUM_PPN=240
PBS_RESC_MEM=6442450944000
PBS_RESC_SCRATCH_SSD=13194139533312
PBS_RESC_SCRATCH_VOLUME=13194139533312
PBS_RESC_TOTAL_MEM=6442450944000
PBS_RESC_TOTAL_PROCS=240
PBS_RESC_TOTAL_SCRATCH_VOLUME=13194139533312
PBS_RESC_TOTAL_WALLTIME=172800
SCRATCH=/scratch.ssd/mmokrejs/job_2227881.cerit-pbs.cerit-sc.cz
SCRATCHDIR=/scratch.ssd/mmokrejs/job_2227881.cerit-pbs.cerit-sc.cz
SCRATCH_TYPE=ssd
SCRATCH_VOLUME=13194139533312
TORQUE_RESC_MEM=6442450944000
TORQUE_RESC_PROC=240
TORQUE_RESC_SCRATCH_SSD=13194139533312
TORQUE_RESC_SCRATCH_VOLUME=13194139533312
TORQUE_RESC_TOTAL_MEM=6442450944000
TORQUE_RESC_TOTAL_PROCS=240
TORQUE_RESC_TOTAL_SCRATCH_VOLUME=13194139533312
TORQUE_RESC_TOTAL_WALLTIME=172800
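
For reference, those byte counts line up with the resources requested in the #PBS line; a quick sanity check in the shell (a sketch, assuming bash arithmetic):

    # 6442450944000 bytes / 1024^3 = 6000 GB, matching mem=6000gb in the job request
    echo $(( PBS_RESC_MEM / 1024 / 1024 / 1024 ))   # prints 6000
    echo $PBS_NCPUS                                 # prints 240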

I see some code in canu/src/utility/src/utility/system.C; although its comments mention more PBS Pro variables, only PBS_NUM_PPN is looked up (in theory).

Could it be that this code is bypassed altogether because I started canu with useGrid=false? That would be bad. I only wanted to avoid submitting child jobs into the queuing system; of course, I expected canu to understand that it is still running under a job scheduling system, on an exec host assigned to me, and to respect its limits (6 TB RAM and only 240 CPUs).

-- BEGIN CORRECTION
--
--
-- Creating overlap store correction/my_genome.ovlStore using:
--    147 buckets
--    616 slices
--        using at most 29 GB memory each
-- Finished stage 'cor-overlapStoreConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'ovB' concurrent execution on Thu Mar  4 09:11:40 2021 with 214055.941 GB free disk space (147 processes; 504 concurrently)

    cd correction/my_genome.ovlStore.BUILDING
    ./scripts/1-bucketize.sh 1 > ./logs/1-bucketize.000001.out 2>&1
brianwalenz (Member) commented

Job configuration in canu is pretty messy. Canu wants to do it all itself, telling the various binaries explicitly how many cores and how much memory to use. The code in utility/ isn't used by canu directly, but is used by some of the binaries (e.g., meryl) that are useful outside of canu.

Canu gets its max memory/thread limits (when not using a grid) from getNumberOfCPUs() and getPhysicalMemorySize() in src/pipelines/canu/Defaults.pm. Checking the various environment variables and using those numbers if they're valid is a good suggestion. I'll add that in the next day or two.

The other way is to set maxThreads=$PBS_NCPUS and maxMemory=.... when canu is run. maxMemory is a bit of a challenge since it wants memory in GB.
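
For example, a pure-shell version of that (a sketch, assuming bash arithmetic and the PBS_RESC_MEM byte count shown above) could be:

    # convert the PBS byte limit to whole gigabytes for canu's maxMemory option
    canu useGrid=false \
         maxThreads=$PBS_NCPUS \
         maxMemory=$(( PBS_RESC_MEM / 1024 / 1024 / 1024 )) \
         ...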

mmokrejs (Author) commented Mar 4, 2021

That would be excellent; I can recompile from git master. Using $PBS_NUM_PPN would be better than $PBS_NCPUS, IMHO.

mmokrejs (Author) commented Mar 4, 2021

BTW, ideally the following output would be improved to show the variable names and their respective values:

-- Reset concurrency from 64 to 1.
-- Reset concurrency from 5 to 1.

These are probably ovsConcurrency and ovbConcurrency, but it is hard to get a list of such options from the manual; I am just guessing their names and trying blindly.
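
If those guesses are right, they would be set like the other (tag)Concurrency options; a hypothetical example (ovsConcurrency appears in my original command, ovbConcurrency is the guessed name):

    canu useGrid=false ... ovbConcurrency=64 ovsConcurrency=5 ...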

brianwalenz added a commit that referenced this issue Mar 16, 2021
…hat for configuration when running in non-grid mode. Issue #1912.
mmokrejs (Author) commented Mar 24, 2021

I think canu could try to use something like the following, which works for me:

maxThreads=$PBS_NCPUS maxMemory=`python -c "print(int($PBS_RESC_TOTAL_MEM/1024/1024/1024))"`

You could probably come up with a syntax that calls the bc(1) calculator instead.

I see TORQUE_RESC_MEM=5476083302400 and PBS_RESC_MEM=5476083302400 being set, for example, to the amount I was assigned by PBSPro. Likewise, canu could pick whichever of $PBS_NCPUS and $PBS_NUM_PPN is non-empty.
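
Putting both ideas together, a wrapper sketch (assuming bc(1) is installed and that at least one of the two CPU variables is set):

    # prefer PBS_NCPUS, fall back to PBS_NUM_PPN when it is empty or unset
    NCPUS=${PBS_NCPUS:-$PBS_NUM_PPN}
    # bc(1) variant of the bytes-to-gigabytes conversion
    MEM_GB=$(echo "$PBS_RESC_TOTAL_MEM / 1024 / 1024 / 1024" | bc)
    canu useGrid=false maxThreads=$NCPUS maxMemory=$MEM_GB ...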

brianwalenz (Member) commented
Added PBS_NCPUS (PBS_NUM_PPN was already there) and PBS_RESC_MEM. I can't find any documentation on the Torque variants, and only spotty documentation on the PBS variables.
