diff --git a/CHANGES.md b/CHANGES.md index 2934e84..f1690e9 100644 --- a/CHANGES.md +++ b/CHANGES.md @@ -1,8 +1,24 @@ # Release notes # -Data: 2025-01-15 + +## [v0.2.4] +Data: 2025-01-22 +## Enhancement +- `pairs2depth`, speed up by pipeline +- `alleles`, add `--trim-length` to trim the both end of contigs to remove the + effect of overlapping from hifiasm assembly graph. +- `pipeline` + - `--preset precision`: Optimize parameters to improve accuracy at the expense of anchor rate + - `--preset sensitive`: Using in some complex genome, which contain many fragmented contigs and low signals contigs + ## Bug fixes - `activate_cphasing`, fixed bug that exit shell window when pixi failed to install -- `pipeline`, fixed bug that program can not exit when error occurred +- `pipeline` + - fixed bug that program can not exit when error occurred + - fixed bug that alleles parameters cannot affect in phasing mode + - fixed bug that reported in issue #14, which `--chimeric-correct` mode cannot use correct fasta in 3.hyperpartition +- `plot`, fixed bug that program can not coarsen adjusted matrix to plot another binsize matrix + +## [v0.2.3] Data: 2025-01-14 ## Enhancement - `pipeline`, report peak memory usage @@ -13,8 +29,8 @@ Data: 2025-01-14 ## Bug fixes - `pairs2cool`, fixed bug that "bin1_id > bin2_id" in some cases -Data: 2025-01-10 ## [v0.2.2] +Data: 2025-01-10 ## Enhancement - `pipeline`, When mode=phasing, pipeline integrating 1.alleles into 3.hyperpartition to speed up ## Bug fixes diff --git a/README.md b/README.md index d96ab51..4dea1a2 100644 --- a/README.md +++ b/README.md @@ -27,6 +27,9 @@ git clone https://github.com/wangyibin/CPhasing.git ## activate environment (For the first configuration, run it when the network is accessible.) source ./CPhasing/bin/activate_cphasing + +## deactivate environment +exit ``` ### Via Anaconda @@ -43,22 +46,6 @@ export PYTHONPATH=/path/to/CPhasing:$PYTHONPATH ## The hic pipeline require GLIBCXX_3.4.29, or you can add this command to your environment (.bash_profile) export LD_LIBRARY_PATH=/path/to/anaconda3/envs/cphasing/lib:$LD_LIBRARY_PATH ``` -### Custom install the dependencies -#### Install C-Phasing -```bash -## Download C-Phasing and install python dependencies -git clone https://github.com/wangyibin/CPhasing.git -cd CPhasing -pip install . - -## Add following to the .bash_profile or .bashrc -export PATH=/path/to/CPhasing/bin:$PATH -``` -#### Dependencies -1. For core function - - [pigz](https://github.com/madler/pigz) -2. For Pore-C pipeline - - [minimap2](https://github.com/lh3/minimap2)(>= v2.24) ## One command pipeline of C-Phasing @@ -73,8 +60,8 @@ cphasing pipeline -f draft.asm.fasta -pcd sample.fastq.gz -t 10 -n 8:4 ```bash cphasing pipeline -f draft.asm.fasta -pcd sample1.fastq.gz -pcd sample2.fastq.gz -t 10 -n 8:4 ``` - -Note: If you want to run on cluster system and submit them to multiple nodes, you can use `cphasing mapper` and `cphasing-rs porec-merge` to generate the merged `porec.gz` file and input it by `-pct` parameter. +> [!NOTE] +> If you want to run on cluster system and submit them to multiple nodes, you can use `cphasing mapper` and `cphasing-rs porec-merge` to generate the merged `porec.gz` file and input it by `-pct` parameter. - Start from a **pore-c table (porec.gz)**, which is generated by `cphasing mapper`. ```bash @@ -85,8 +72,10 @@ cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -n 8:4 ```bash cphasing pipeline -f draft.asm.fasta -hic1 Lib_R1.fastq.gz -hic2 Lib_R2.fastq.gz -t 10 -n 8:4 ``` -Note1: If you want to run multiple samples, you can use `cphasing hic mapper` and `cphasing-rs pairs-merge` to generate the merged `pairs.gz` file, and input it by `-prs` parameter. -Note2: If the total length of your input genome is larger than 8 Gb, the `-hic-mapper-k 27 -hic-mapper-w 14` should be specified, to avoid the error of chromap. +> [!NOTE] +> If you want to run multiple samples, you can use `cphasing hic mapper` and `cphasing-rs pairs-merge` to generate the merged `pairs.gz` file, and input it by `-prs` parameter. +> [!NOTE] +> If the total length of your input genome is larger than 8 Gb, the `-hic-mapper-k 27 -hic-mapper-w 14` should be specified, to avoid the error of chromap. - Start from a 4DN pairs file, @@ -117,7 +106,8 @@ cphasing pairs2mnd sample.pairs.gz -o sample.mnd.txt cphasing utils agp2assembly groups.agp > groups.assembly bash ~/software/3d-dna/visualize/run-assembly-visualizer.sh sample.assembly sample.mnd.txt ``` -Note: if chimeric corrected, please use `groups.corrected.agp` and generate a new `corrected.pairs.gz` by `cphasing-rs pairs-break` +> [!NOTE] +> if chimeric corrected, please use `groups.corrected.agp` and generate a new `corrected.pairs.gz` by `cphasing-rs pairs-break` - After curation ```bash ## convert assembly to agp @@ -130,120 +120,23 @@ cphasing agp2fasta groups.review.agp draft.asm.fasta --contigs > contigs.fasta cphasing agp2fasta groups.review.agp draft.asm.fasta > groups.review.asm.fasta ``` - ### Rename Rename and orient chromosome according a monoploid reference (or genome of closely related species). ```bash cphasing rename -r mono.fasta -f draft.asm.fasta -a groups.review.agp -t 20 ``` -Note: To reduce the time consumed, we only align the first haplotype (g1) to the monoploid, which the orientation among different haplotypes has already been set to the same in the `scaffolding` step. If not, you can set `—unphased` to align all haplotypes to the monoploid to adjust the orientation. +> [!NOTE] +> To reduce the time consumed, we only align the first haplotype (g1) to the monoploid, which the orientation among different haplotypes has already been set to the same in the `scaffolding` step. If not, you can set `—unphased` to align all haplotypes to the monoploid to adjust the orientation. ----------------------------------------------- +---------------------------------------------- ## Pipeline of Ultra-long data [Optional] C-Phasing enable to use ultra-long to correct chimeric and identify the high confidence regions (HCRs) to help assembly. ### **[hitig tutorial](cphasing/hitig)** -## Step by step pipeline of Pore-C data -0. **mapping** - > Mapping pore-c data into contig-level assembly and output the pore-c table and 4DN pairs. - ```bash - ## results are `sample.porec.gz` and `sample.pairs.gz` - cphasing mapper draft.asm.fasta sample.fastq.gz -t 10 - - ``` - > For Hi-C data please use `cphasing hic mapper`. - - Note: If you are mapping multiple pore-c data, the multiple `pairs.gz` files should be merged by following steps: - ```bash - cphasing-rs pairs-merge sample1.pairs.gz sample2.pairs.gz -o sample.merge.pairs.gz - ``` - -1. **alleles** (Optional for phasing mode) - > Identify the inter-allelic contig pairs. - ```bash - ## result is `draft.asm.allele.table` - cphasing alleles -f draft.asm.fasta - ``` - Note: the memory usage of this step is depend on the genome size of input fasta, when input the length larger 10 Gb, the peak memory will reach 130 Gb, or more. If you want to reduce the memory usage, you can increase the `k` and `w`, for example, when we benchmark a 20 Gb hexaploid, we set the `k` and `w` to 27 and 24, respectively. - -2. **prepare** - > Prepare some data for subsequence analysis - ```bash - ## results are `sample.counts_AAGCTT.txt`, `sample.clm.gz`, `sample.split.contacts`, `sample.contacts` - cphasing prepare draft.asm.fasta sample.pairs.gz - ``` - -3. **hyperpartition** - ```bash - ## result is `output.clusters.txt` - ## for haploid scaffolding - cphasing hyperpartition sample.porec.gz draft.asm.contigsizes output.clusters.txt - ## for polyploid or diploid phasing must add prune information and use the incremental partition mode - ### chromsome number aware, 8:4 indicate that this is a tetraploid with 8 chromosome in each haplotype - cphasing hyperpartition sample.porec.gz draft.asm.contigsizes output.clusters.txt -pt prune.contig.table -inc -n 8:4 -t 4 - ### auto generate groups - cphasing hyperpartition sample.porec.gz draft.asm.contigsizes output.clusters.txt -pt prune.contig.table -inc -t 4 - ``` - - Note: If you are not satisfied with the number of groups, you can adjust the resolution parameters (`-r1` or `-r2`) to make difference groups. `-r1` controls the first cluster while `-r2` controls the second cluster in phasing mode. Increasing the resolution can generate more groups while decreasing it will get fewer groups. - If you want to group some contigs, you can input a contig list with the `-wl` parameter. - -4. **scaffolding** - ```bash - ## result is `groups.agp` - cphasing scaffolding output.clusters.txt sample.counts_AAGCTT.txt sample.clm.gz -sc sample.split.contacts -f draft.asm.fasta -t 4 - - ## for polyploid can specified allele table to adjust the orientation of different haplotypes - cphasing scaffolding output.clusters.txt sample.counts_AAGCTT.txt sample.clm -at draft.asm.allele.table -sc sample.split.contacts -f draft.asm.fasta - ``` -5. **plot** - > Adjust the contig-level contacts matrix into chromosome-level and plot a heatmap. - - ```bash - ## result is `sample.10000.cool` and `groups.wg.png` - cphasing pairs2cool sample.pairs.gz draft.asm.contigsizes sample.10000.cool - cphasing plot -a groups.agp -m sample.10000.cool -o groups.wg.png - ``` - - > `plot` also enable only plot the heatmap from a matrix file (.cool) - - ```bash - ## directly plot the heatmap in the input resolution, - cphasing plot -m sample.500k.chrom.cool -o sample.500k.chrom.png - - ## or first coarsen small binsize to larger binsize, e.g. 10k to 500k, specify the `-bs 500k` parameter. - - cphasing plot -m sample.10k.chrom.cool -o sample.500k.chrom.png -bs 500k - ``` - - Note: If the number of bins in matrix is too large (large genome with small binsize), the memory may overflow. To fix it, users can improve the binsize or specified several chromosomes with `-c` parameter. - - ```bash - cphasing plot -m sample.100k.chrom -c Chr01g1,Chr01g2,Chr01g3,Chr01g4 -o Chr01.100k.png - ``` - -## FAQ -### The results of the first round partition are unsatisfactory. -In our two-round partition algorithm, the first round partition depends on the h-trans errors between homologous chromosomes; if you input a contig assembly with low level switch errors or input a high accuracy pore-c data, the h-trans will be not enough to cluster all contigs to correct homologous groups, resulting in unsatisfactory results. You can set the `-q1 0` for `hyperpartition` to increase the rate of h-trans errors. However, this parameter may raise error of `out of memory` when you input huge pore-c data in porec table or hic contacts in pairs file. - -### How to set the `-n` parameter when assembling an aneuploid genome. -The aneuploid genome, such as modern cultivated sugarcane, contains unequal homologous chromosomes. The `-n` parameter can be set to zero (`-n 10:0`) to automatically partition contigs into different chromosomes within a homologous group. -However, we also allow the user to input a file with two columns: the first column is the index(1-base) of the first round partition, and the second column is the chromosome number of each homologous. And then specified the `-n 10:second.number.tsv` in `cphasing pipeline` or `cphasing hyperpartition`. -- `second.number.tsv` -```text -1 13 -2 12 -3 12 -4 11 -5 10 -6 12 -7 12 -8 10 -9 12 -10 12 -``` - +## More +More details please check the documentation: +[Documentation](https://wangyibin.github.io/CPhasing/latest) | [中文文档](https://wangyibin.github.io/CPhasing/latest/zh) diff --git a/bin/cphasing-rs b/bin/cphasing-rs index 25821ac..0f9882a 100755 Binary files a/bin/cphasing-rs and b/bin/cphasing-rs differ diff --git a/bin/deactivate_cphasing b/bin/deactivate_cphasing index eb4d711..c7cc48f 100755 --- a/bin/deactivate_cphasing +++ b/bin/deactivate_cphasing @@ -13,7 +13,7 @@ software_dir="$(dirname "$script_dir")" if [ ! -f $software_dir/pixi.toml ]; then echo "pixi.toml not found in $software_dir" - exit 1 + return 1 fi is_tty=$(tty) diff --git a/cphasing/__init__.py b/cphasing/__init__.py index ed1a571..271e6b8 100644 --- a/cphasing/__init__.py +++ b/cphasing/__init__.py @@ -29,7 +29,7 @@ __email__ = ("yibinwang96@outlook.com", "zhangxingtan@caas.cn") __license__ = "BSD" __status__ = "Development" -__version__ = "0.2.3.r287" +__version__ = "0.2.4.r288" __epilog__ = f""" \b Version: {__version__} | \n diff --git a/cphasing/_config.py b/cphasing/_config.py index d9bedb8..2ea2ca0 100644 --- a/cphasing/_config.py +++ b/cphasing/_config.py @@ -14,4 +14,24 @@ ## Datatype of HyperGraph order HYPERGRAPH_ORDER_DTYPE = "int8" -HYPERGRAPH_COL_DTYPE = "int32" \ No newline at end of file +HYPERGRAPH_COL_DTYPE = "int32" + + +## Default parameters for CLI +ALLELES_TRIM_LENGTH = 25_000 + +## HyperGraph construction +EDGE_LENGTH = "500k" + + +## HyperPartition +HCR_LOWER = 0.01 +HCR_UPPER = 1.5 +MIN_CONTACTS = 5 +MIN_CIS_WEIGHT = 5.0 +MIN_WEIGHT = 0.1 + +MIN_SCAFFOLD_LENGTH = 2e6 +MIN_LENGTH = 10_000 + + diff --git a/cphasing/algorithms/hypergraph.py b/cphasing/algorithms/hypergraph.py index fb4e59a..7cd3883 100644 --- a/cphasing/algorithms/hypergraph.py +++ b/cphasing/algorithms/hypergraph.py @@ -240,9 +240,10 @@ def clique_expansion_init(H, P_allelic_idx=None, if min_weight > 0: mask = A >= min_weight - A.multiply(mask) - - + # mask = (A >= min_weight).astype(bool) + # mask += (A <= -min_weight).astype(bool) + A = A.multiply(mask) + if P_allelic_idx or P_weak_idx: # P = np.ones((H.shape[0], H.shape[0]), dtype=np.int8) # P[np.diag_indices_from(P)] = 0 @@ -288,7 +289,7 @@ def to_contacts(A, nodes, df = pd.DataFrame({0: V[A.row], 1: V[A.col], 2: A.data}) df = df[df[0] <= df[1]] - df = df[df[2] >= min_weight] + df = df[(df[2] >= min_weight) | (df[2] <= -min_weight)] df = df[abs(df[2]) != np.inf] # df = df.query(f"0 <= 1 & 2 >= {min_weight} & abs(2) != inf") df.to_csv(output, sep='\t', header=False, index=False) @@ -461,6 +462,7 @@ def IRMM(H, A=None, -------- >>> IRMM(H, vertices) """ + if max_round > 1: import dask.array as da # if P_allelic_idx or P_weak_idx: @@ -497,6 +499,8 @@ def IRMM(H, A=None, if min_weight > 0: mask = A >= min_weight + # mask = (A >= min_weight).astype(bool) + # mask += (A <= -min_weight).astype(bool) A = A.multiply(mask) # normalization if NW is not None: diff --git a/cphasing/algorithms/recluster.py b/cphasing/algorithms/recluster.py new file mode 100644 index 0000000..94b2d22 --- /dev/null +++ b/cphasing/algorithms/recluster.py @@ -0,0 +1,34 @@ +#!/usr/bin/env python +# -*- coding:utf-8 -*- + +""" +recluster the contigs on existing clusters +""" + +import argparse +import logging +import os +import os.path as op +import sys + +import pandas as pd +import numpy as np + +from ..core import AlleleTable, PruneTable, ClusterTable + + +class Recluster: + """ + Developping... + """ + def __init__(self, clustertable, contacts, countre, prunetable, min_length=50000): + self.clustertable = clustertable + self.contacts = contacts + self.countre = countre + self.prunetable = prunetable + + self.min_length = min_length + + + + diff --git a/cphasing/cli.py b/cphasing/cli.py index 59d821b..e157a9c 100644 --- a/cphasing/cli.py +++ b/cphasing/cli.py @@ -22,14 +22,18 @@ from shutil import which from . import __version__, __epilog__ +from ._config import * from .core import ( ClusterTable, CountRE, ) from .utilities import ( + compress_cmd, + decompress_cmd, run_cmd, humanized2numeric, to_humanized2, + to_humanized3, is_compressed_table_empty, ) @@ -81,6 +85,17 @@ "alleles", "hyperpartition", "scaffolding", "build", "rename", "pairs2cool", "plot"] }, + + { + "name": "Formats Converter (link to cphasing-rs)", + "commands": ["bam2pairs", "bam2paf", + + "paf2porec", "paf2pairs", + "porec-merge", "porec2pairs", + "pairs-merge", "pairs-filter", + "pairs2mnd", "pairs2cool",] + + }, { "name": "Other Useful Commands", "commands": ["hic", "alignments", "hcr", "alleles2", @@ -109,9 +124,7 @@ ] } - -click.rich_click.OPTION_GROUPS = { - "cphasing pipeline": [ +PIPE_OPTION_GROUPS = [ { "name": "Options of Input Files", "options": ["--fasta", "--porec-data", "--porectable", @@ -145,83 +158,19 @@ }, { "name": "Options of Alleles", - "options": ["--alleles-k", "--alleles-w", "--alleles-m", "--alleles-d"] - }, - { - "name": "Options of HyperPartition", - "options": ["-n", "--use-pairs", "--min-contacts", - "--Nx", "--min-length", "--min-weight", - "--min-cis-weight", "--whitelist", - "--resolution1", "--resolution2", - "--init-resolution1", "--init-resolution2", - "--first-cluster", "--normalize", - "--min-quality1", "--min-quality2", - "--min-scaffold-length", - "--allelic-similarity", "--min-allelic-overlap", - "--disable-merge-in-first", "--exclude-group-from-first", - "--exclude-group-to-second", "--enable-misassembly-remove" - ] - }, - { - "name": "Options of Scaffolding", - "options": ["--scaffolding-method"] - }, - { - "name": "Options of Plot", - "options": ["--binsize"] - }, - { - "name": "Global Options", - "options": ["--low-memory", "--threads", "--help"] - } - - ], - "cphasing pipe": [ - { - "name": "Options of Input Files", - "options": ["--fasta", "--porec-data", "--porectable", - "--pairs", "--hic1", "--hic2"] - }, - { - "name": "Advance Options of Pipeline", - "options": ["--mode", "--steps", "--skip-steps"] - }, - { - "name": "Options of Hitig", - "options": ["--ul-data", "--use-existed-hitig"] - }, - { - "name": "Options of Pore-C Mapper", - "options": ["--mapper-k", "--mapper-w", "--mapping-quality"] - }, - { - "name": "Options of Hi-C Mapper", - "options": ["--hic-mapper-k", "--hic-mapper-w", "--mapping-quality"] - }, - { "name": "Options of chimeric correction", - "options": ["--chimeric-correct", "--chimeric-corrected"] - - }, - { - "name": "Options of HCR", - "options": ["--hcr", "--pattern", "--hcr-lower", "--hcr-upper", - "--collapsed-contigs-ratio", "--hcr-binsize", - "--hcr-bed", "--hcr-invert"] - }, - { - "name": "Options of Alleles", - "options": ["--alleles-k", "--alleles-w", "--alleles-m", "--alleles-d"] + "options": ["--alleles-k", "--alleles-w", "--alleles-m", "--alleles-d", + "--alleles-trim-length"] }, { "name": "Options of HyperPartition", - "options": ["-n", "--use-pairs", "--min-contacts", + "options": ["-n", "--use-pairs", "--edge-length", "--min-contacts", "--Nx", "--min-length", "--min-weight", "--min-cis-weight", "--whitelist", "--resolution1", "--resolution2", "--init-resolution1", "--init-resolution2", "--first-cluster", "--normalize", "--min-quality1", "--min-quality2", - "--min-scaffold-length", + "--min-scaffold-length", "--allelic-factor", "--allelic-similarity", "--min-allelic-overlap", "--disable-merge-in-first", "--exclude-group-from-first", "--exclude-group-to-second", "--enable-misassembly-remove" @@ -233,14 +182,17 @@ }, { "name": "Options of Plot", - "options": ["--binsize"] + "options": ["--binsize", "--colormap", "--whitered"] }, { "name": "Global Options", - "options": ["--low-memory", "--threads", "--help"] - } - - ], + "options": ["--preset", "--low-memory", "--threads", "--help"] + } +] + +click.rich_click.OPTION_GROUPS = { + "cphasing pipeline": PIPE_OPTION_GROUPS, + "cphasing pipe": PIPE_OPTION_GROUPS, "cphasing plot": [ { "name": "Options of Matrix Operation", @@ -256,6 +208,7 @@ "name": "Options of Heatmap", "options": [ "--chromosomes", + "--regex", "--disable-natural-sort", "--per-chromosomes", "--only-chr", "--chr-prefix", @@ -483,20 +436,22 @@ def cli(verbose, quiet): ) @click.option( "-hcr-l", + "-hcr-lower", "--hcr-lower", "hcr_lower", help="Lower value of peak value.", type=float, - default=.01, + default=HCR_LOWER, show_default=True, ) @click.option( "-hcr-u", + "-hcr-upper", "--hcr-upper", "hcr_upper", help="Upper value of peak value.", type=float, - default=1.5, + default=HCR_UPPER, show_default=True, ) @click.option( @@ -614,6 +569,15 @@ def cli(verbose, quiet): default=.1, show_default=True ) +@click.option( + "-alleles-tl", + "--alleles-trim-length", + help="Trim sequences in the both end of contig when run `alleles`, " + "which aim to remove the effect of the overlapping of unitig in the assembly graph from hifiasm or other assembler. " + "If set to 0, will not trim.", + type=int, + default=ALLELES_TRIM_LENGTH, +) @click.option( "--use-pairs", "use_pairs", @@ -623,6 +587,15 @@ def cli(verbose, quiet): default=False, is_flag=True ) +@click.option( + '-e', + '--edge-length', + 'edge_length', + help='Only load contacts that in the both ends of contigs to construct hypergraph', + metavar="STR", + default="500k", + show_default=True +) @click.option( "-n", help=""" @@ -653,7 +626,7 @@ def cli(verbose, quiet): metavar="FLOAT", help="Resolution of the second partition", type=click.FloatRange(-1.0, 10.0), - default=1.0, + default=-1.0, show_default=True ) @click.option( @@ -675,7 +648,7 @@ def cli(verbose, quiet): type=float, help="Initial resolution of the automatic search resolution in second partition" ", only used when `--resolution2=-1`.", - default=0.8, + default=1.0, show_default=True ) @click.option( @@ -726,6 +699,16 @@ def cli(verbose, quiet): show_default=True, default=False, ) +@click.option( + "-af", + "--allelic-factor", + "allelic_factor", + metavar="INT", + help="Factor of inter-allelic weight.", + default=-1, + type=float, + show_default=True +) @click.option( "-as", "--allelic-similarity", @@ -752,7 +735,7 @@ def cli(verbose, quiet): "min_weight", help="Minimum weight of graph", type=float, - default=0.1, + default=MIN_WEIGHT, show_default=True ) @click.option( @@ -761,7 +744,7 @@ def cli(verbose, quiet): "min_cis_weight", help="Minimum conitg's cis weight of graph", type=float, - default=5.0, + default=MIN_CIS_WEIGHT, show_default=True ) @click.option( @@ -789,7 +772,7 @@ def cli(verbose, quiet): help="Minimum contacts of contigs", metavar="INT", type=int, - default=25, + default=MIN_CONTACTS, show_default=True ) @click.option( @@ -799,7 +782,7 @@ def cli(verbose, quiet): help="Minimum length of contigs", metavar="INT", type=int, - default=10000, + default=MIN_LENGTH, show_default=True ) @click.option( @@ -817,7 +800,7 @@ def cli(verbose, quiet): "min_scaffold_length", help="The minimum length of the output scaffolding.", type=float, - default=5e5, + default=MIN_SCAFFOLD_LENGTH, show_default=True ) @click.option( @@ -845,25 +828,16 @@ def cli(verbose, quiet): '--scaffolding-method', 'scaffolding_method', metavar='STR', - help="The method of scaffolding, `['precision', 'allhic', 'fast']`." - "precision: haphic_fastsort + allhic, which will quicker than allhic only. " - "It is worth noting that `allhic` in `C-Phasing` is parameterized to " - "achieve better results than the previous version", + help=""" + The method of scaffolding, `['precision', 'allhic', 'fast']`. + precision: fast + allhic, which will quicker than allhic only. + It is worth noting that `allhic` in `C-Phasing` is parameterized to + achieve better results than the previous version + """, default="precision", type=click.Choice(["precision", "allhic", "fast"]), show_default=True ) -# @click.option( -# '--factor', -# '-k', -# metavar="INT", -# help='Factor of plot matrix. ' -# 'If you input 10k matrix and want to plot heatmap at 500k, ' -# 'factor should be set with 50.', -# type=int, -# default=50, -# show_default=True -# ) @click.option( '-bs', '--binsize', @@ -872,6 +846,52 @@ def cli(verbose, quiet): default='auto', show_default=True ) +@click.option( + '-cmap', + "-cm", + '--cmap', + "--colormap", + help=""" + Colormap of heatmap. + Available values can be seen : + https://pratiman-91.github.io/colormaps/ + and http://matplotlib.org/examples/color/colormaps_reference.html and `whitered` . + """, + default='redp1_r', + show_default=True +) +@click.option( + '-wr', + '--whitered', + help=""" + Preset of `--scale none --colormap whitered` + """, + is_flag=True, + default=False, + show_default=True +) +@click.option( + '-x', + '--preset', + metavar='STR', + help="""Preset of the pipeline: +   precision: -msc 20 -mc 25 -mw 3 +   **sensitive**: -msc 5 -mc 5 -mw 0.1 +   very-sensitive -e 1m -msc 1.0 -mc 5 -mw 0.1 +   simulation: -e 0 -msc 0 -mc 5 -mw 0.1 --alleles-tl 0 + + > This parameter will overwrite the aboving parameters. + \ is designed for remove more low quality contigs to increase the precision of the result; + \ is designed for cluster more low signal or fragmented contigs; + \ is designed for cluster more low signal or fragmented contigs, and it is very sensitive; + \ is designed for simulation data in our study which is simulated from chromosome-level assembly. + + ["precision", "sensitive", "very-sensitive", "simulation"] + """, + default="sensitive", + show_default=True, + type=click.Choice(["precision", "sensitive", "very-sensitive", "simulation"]), +) @click.option( '--low-memory', help="Reduce memory usage. Only used in chimeric correct and hyperpartition, depend on input data size.", @@ -895,7 +915,7 @@ def cli(verbose, quiet): '--threads', help='Number of threads.', type=int, - default=4, + default=8, metavar='INT', show_default=True, ) @@ -929,8 +949,10 @@ def pipeline(fasta, alleles_window_size, alleles_minimum_similarity, alleles_diff_thres, + alleles_trim_length, n, use_pairs, + edge_length, resolution1, resolution2, init_resolution1, @@ -940,6 +962,7 @@ def pipeline(fasta, exclude_group_from_first, exclude_group_to_second, normalize, + allelic_factor, allelic_similarity, min_allelic_overlap, min_weight, @@ -954,6 +977,9 @@ def pipeline(fasta, whitelist, scaffolding_method, binsize, + cmap, + whitered, + preset, low_memory, outdir, threads): @@ -1012,7 +1038,45 @@ def pipeline(fasta, if all([(len(porec_data) > 0), ((hic1 is not None) and (hic2 is not None))]): logger.warning("Simulataneously process Pore-C and Hi-C is not yet supported, only use Pore-C data for subsequently steps.") - + + if preset == "precision": + min_contacts = 25 + min_weight = 2.0 + min_cis_weight = 20 + logger.info(f"Use the preset of {preset}. Set min_contacts={min_contacts}, " + f"min_weight={min_weight}, min_cis_weight={min_cis_weight}, edge_length={edge_length}") + + + # elif preset == "sensitive": + # min_contacts = 5 + # min_weight = 2.0 + # min_cis_weight = 5 + # logger.info(f"Use the preset of {preset}. Set min_contacts={min_contacts}, " + # f"min_weight={min_weight}, min_cis_weight={min_cis_weight}, edge_length={edge_length}") + + elif preset == "very-sensitive": + edge_length = "1m" + min_contacts = 5 + min_weight = 0.1 + min_cis_weight = 1.0 + logger.info(f"Use the preset of {preset}. Set min_contacts={min_contacts}, " + f"min_weight={min_weight}, min_cis_weight={min_cis_weight}, edge_length={edge_length}") + elif preset == "simulation": + min_contacts = 5 + min_weight = 0.1 + min_cis_weight = 0 + edge_length = 0 + alleles_trim_length = 0 + logger.info(f"Use the preset of {preset}. Set min_contacts={min_contacts}, " + f"min_weight={min_weight}, min_cis_weight={min_cis_weight}, " + f"edge_length={edge_length}, alleles_trim_length={alleles_trim_length}") + + + if edge_length != "None" and edge_length != "none" and edge_length is not None: + edge_length = edge_length + else: + edge_length = None + if steps: if steps == "all": @@ -1059,15 +1123,18 @@ def pipeline(fasta, alleles_window_size=alleles_window_size, alleles_minimum_similarity=alleles_minimum_similarity, alleles_diff_thres=alleles_diff_thres, + alleles_trim_length=alleles_trim_length, scaffolding_method=scaffolding_method, n=n, use_pairs=use_pairs, + edge_length=edge_length, resolution1=resolution1, resolution2=resolution2, init_resolution1=init_resolution1, init_resolution2=init_resolution2, first_cluster=first_cluster, normalize=normalize, + allelic_factor=allelic_factor, disable_merge_in_first=disable_merge_in_first, exclude_group_to_second=exclude_group_to_second, exclude_group_from_first=exclude_group_from_first, @@ -1084,6 +1151,8 @@ def pipeline(fasta, min_scaffold_length=min_scaffold_length, enable_misassembly_remove=enable_misassembly_remove, binsize=binsize, + colormap=cmap, + whitered=whitered, outdir=outdir, threads=threads, low_memory=low_memory) @@ -1647,6 +1716,7 @@ def pairs_intersect(pairs, bed, output): # p.intersection(bed, output) log_dir = Path("logs") log_dir.mkdir(parents=True, exist_ok=True) + cmd = ["cphasing-rs", "pairs-intersect", pairs, bed, "-o", output] flag = run_cmd(cmd, log=f"logs/pairs-intersect.log") assert flag == 0, "Failed to execute command, please check log." @@ -1927,7 +1997,7 @@ def chimeric(fasta, pairs, depth, window_size, "--lower", help="Lower value of peak value.", type=float, - default=.01, + default=HCR_LOWER, show_default=True, ) @click.option( @@ -1935,7 +2005,7 @@ def chimeric(fasta, pairs, depth, window_size, "--upper", help="Upper value of peak value.", type=float, - default=1.5, + default=HCR_UPPER, show_default=True, ) @click.option( @@ -2072,12 +2142,20 @@ def hcr(fasta, porectable, pairs, contigsize, logger.warning(f"Use exists depth file of `{depth_file}`.") else: - if is_file_changed(str(pairs)) or not Path(depth_file).exists(): - cmd = ['cphasing-rs', 'pairs2depth', - '-q', str(min_quality), - str(pairs), "-o", depth_file] - flag = run_cmd(cmd, log=f"logs/pairs2depth.log") + if str(pairs).endswith(".gz"): + cmd0 = decompress_cmd(pairs, threads=10) + logger.info("Calculating the depth from pairs ...") + cmd = ['cphasing-rs', 'pairs2depth', + '-q', str(min_quality), + '-', "-o", depth_file] + flag = os.system(" ".join(cmd0) + f" 2>logs/pairs2depth.decompress.log" + " | " + " ".join(cmd) + f" 2>logs/pairs2depth.log") + + else: + cmd = ['cphasing-rs', 'pairs2depth', + '-q', str(min_quality), + str(pairs), "-o", depth_file] + flag = run_cmd(cmd, log=f"logs/pairs2depth.log") assert flag == 0, "Failed to execute command, please check log." else: @@ -2097,7 +2175,8 @@ def hcr(fasta, porectable, pairs, contigsize, df1 = pd.read_csv(depth_file, sep='\t', header=None, index_col=None, names=['chrom', 'start', 'end', 'depth']) - df2 = pd.read_csv("tmp.depth.countre.txt", sep='\s+', header=0, index_col=None, + df2 = pd.read_csv("tmp.depth.countre.txt", sep='\s+', + header=0, index_col=None, usecols=[0, 1], names=['item', "count"]) @@ -2216,7 +2295,7 @@ def hcr(fasta, porectable, pairs, contigsize, '--threads', help="Number of threads. ", type=int, - default=4, + default=10, metavar='INT', show_default=True, ) @@ -2236,6 +2315,9 @@ def prepare(fasta, pairs, min_mapq, """ from .prepare import pipe + if pattern == "None" or pattern == "none": + pattern = None + pipe(fasta, pairs, pattern, min_mapq, min_contacts, skip_pairs2clm=skip_pairs2clm, @@ -2322,6 +2404,15 @@ def prepare(fasta, pairs, min_mapq, show_default=True, hidden=True ) +@click.option( + "-tl", + "--trim-length", + help="Trim sequences in the both end of contig, " + "which aim to remove the effect of the overlapping of unitig in the assembly graph from hifiasm or other assembler. " + "If set to 0, will not trim.", + type=int, + default=25000, +) @click.option( "-wl", "--whitelist", @@ -2358,6 +2449,7 @@ def alleles(fasta, output, kmer_size, window_size, max_occurance, min_cnt, minimum_similarity, diff_thres, + trim_length, whitelist, first_cluster, min_length, threads): """ @@ -2368,7 +2460,7 @@ def alleles(fasta, output, from .alleles import PartigAllele from .core import AlleleTable, ClusterTable from .utilities import is_file_changed - from joblib import Parallel, delayed + from joblib import Parallel, delayed, parallel_backend if not output: fasta_prefix = Path(Path(fasta).name).with_suffix("") @@ -2389,7 +2481,7 @@ def alleles(fasta, output, ct = ClusterTable(first_cluster) fasta = Path(fasta).absolute() Path("alleles_workdir/").mkdir(exist_ok=True) - fastas = ct.to_fasta(fasta, outdir="alleles_workdir") + fastas = ct.to_fasta(fasta, trim_length=trim_length, outdir="alleles_workdir") fastas = list(map(lambda x: Path(x).name, fastas)) os.chdir("alleles_workdir") @@ -2421,10 +2513,11 @@ def func(fa, kmer_size, window_size, max_occurance, logger.info("Identifing allelic contig pairs in each homologous group ...") if len(args) > 0: - Parallel(n_jobs=min(threads, len(args)))(delayed(func)(*a) for a in args) + with parallel_backend('loky'): + Parallel(n_jobs=min(threads, len(args)))(delayed(func)(*a) for a in args) else: logger.info("Load existing allele table to generate total allele table.") - + for fa in fastas: if Path(fa).exists(): os.remove(fa) @@ -2434,7 +2527,10 @@ def func(fa, kmer_size, window_size, max_occurance, os.chdir('..') ## merge allele table logger.setLevel(logging.WARNING) - allele_df = pd.concat([AlleleTable(at, sort=False, fmt="allele2").data for at in allele_tables], axis=0) + res = [AlleleTable(at, sort=False, fmt="allele2").data for at in allele_tables] + + res = list(filter(lambda x: len(x) > 0, res)) + allele_df = pd.concat(res, axis=0) allele_df = allele_df.reset_index(drop=True).reset_index().reset_index() logger.setLevel(logging.INFO) @@ -2819,6 +2915,15 @@ def kprune(alleletable, contacts, show_default=True, hidden=True ) +@click.option( + '-e', + '--edge-length', + 'edge_length', + help='Only load contacts that in the both ends of contigs', + metavar="STR", + default=EDGE_LENGTH, + show_default=True +) @click.option( '-q', '--min_quality', @@ -2908,7 +3013,8 @@ def hypergraph(contacts, output, min_order, max_order, - min_alignments, + min_alignments, + edge_length, min_quality, pairs, fofn, @@ -2940,6 +3046,10 @@ def hypergraph(contacts, contigsizes = read_chrom_sizes(contigsize) contigs = contigsizes.index.values.tolist() # contigs = natsorted(contigsizes.index.values.tolist()) + if edge_length != "None" and edge_length is not None and edge_length != "none": + edge_length = humanized2numeric(edge_length) + else: + edge_length = None if whitelist: whitelist = set([i.strip() for i in open(whitelist) if i.strip()]) @@ -2959,7 +3069,7 @@ def hypergraph(contacts, if not split: he = HyperExtractor(pore_c_tables, contig_idx, contigsizes.to_dict()['length'], min_order, max_order, min_alignments, - hcr_bed=hcr_bed, hcr_invert=hcr_invert, + hcr_bed=hcr_bed, hcr_invert=hcr_invert, edge_length=edge_length, min_quality=min_quality, threads=threads) he.save(output) else: @@ -2977,7 +3087,7 @@ def hypergraph(contacts, if not split: e = Extractor(pairs_files, contig_idx, contigsizes.to_dict()['length'], - hcr_bed= hcr_bed, hcr_invert=hcr_invert, + hcr_bed= hcr_bed, hcr_invert=hcr_invert, edge_length=edge_length, min_quality=min_quality, threads=threads) e.save(output) else: @@ -3023,11 +3133,20 @@ def hypergraph(contacts, default=False, show_default=True, ) +@click.option( + '-e', + '--edge-length', + 'edge_length', + help='Only load contacts that in the both ends of contigs to construct hypergraph', + metavar="STR", + default=EDGE_LENGTH, + show_default=True +) @click.option( '-c', '--contacts', help=""" - contacts file for kprune, generate from prepare + contacts file for kprune, generate from prepare, default generate from graph or hypergraph. """, default=None, show_default=True, @@ -3128,6 +3247,15 @@ def hypergraph(contacts, default=.1, show_default=True ) +@click.option( + "-alleles-tl", + "--alleles-trim-length", + help="Trim sequences in the both end of contig when run `alleles`, " + "which aim to remove the effect of the overlapping of unitig in the assembly graph from hifiasm or other assembler. " + "If set to 0, will not trim.", + type=int, + default=ALLELES_TRIM_LENGTH, +) @click.option( "-at", "--alleletable", @@ -3162,7 +3290,7 @@ def hypergraph(contacts, "--allelic-factor", "allelic_factor", metavar="INT", - help="Factor of allelic weight.", + help="Factor of inter-allelic weight.", default=-1, type=float, show_default=True @@ -3327,7 +3455,7 @@ def hypergraph(contacts, help="Minimum contacts of contigs", metavar="INT", type=int, - default=25, + default=MIN_CONTACTS, show_default=True ) @click.option( @@ -3337,7 +3465,7 @@ def hypergraph(contacts, help="Minimum length of contigs", metavar="INT", type=int, - default=10000, + default=MIN_LENGTH, show_default=True ) @click.option( @@ -3355,7 +3483,7 @@ def hypergraph(contacts, metavar="FLOAT", help="Resolution of the second partition", type=click.FloatRange(-1.0, 10.0), - default=1.0, + default=-1.0, show_default=True ) @click.option( @@ -3377,7 +3505,7 @@ def hypergraph(contacts, type=float, help="Initial resolution of the automatic search resolution in second partition" ", only used when `--resolution2=-1`.", - default=0.8, + default=1.0, show_default=True ) @click.option( @@ -3386,7 +3514,7 @@ def hypergraph(contacts, "min_weight", help="Minimum weight of graph", type=float, - default=0.1, + default=MIN_WEIGHT, show_default=True ) @click.option( @@ -3395,7 +3523,7 @@ def hypergraph(contacts, "min_cis_weight", help="Minimum conitg's cis weight of graph", type=float, - default=5.0, + default=MIN_CIS_WEIGHT, show_default=True ) @click.option( @@ -3422,7 +3550,7 @@ def hypergraph(contacts, "min_scaffold_length", help="The minimum length of the output scaffolding.", type=float, - default=5e5, + default=MIN_SCAFFOLD_LENGTH, show_default=True ) @click.option( @@ -3482,6 +3610,7 @@ def hyperpartition(hypergraph, output, pairs, porec, + edge_length, contacts, ultra_long, ul_weight, @@ -3492,6 +3621,7 @@ def hyperpartition(hypergraph, alleles_window_size, alleles_minimum_similarity, alleles_diff_thres, + alleles_trim_length, alleletable, prunetable, normalize, @@ -3548,6 +3678,12 @@ def hyperpartition(hypergraph, assert not all([porec, pairs]), "confilct parameters, only support one type data" + if edge_length != "None" and edge_length is not None and edge_length != "none": + edge_length = humanized2numeric(edge_length) + else: + edge_length = None + + ultra_complex = None # mode = "basal" if mode == "haploid" else mode n = re.split(":|x|\|", n) if n else None @@ -3655,36 +3791,44 @@ def hyperpartition(hypergraph, # contigs = natsorted(contigs) contig_idx = defaultdict(None, dict(zip(contigs, range(len(contigs))))) - if is_file_changed(hcr_bed) or is_file_changed(hypergraph) or not Path(f"{prefix}.q{min_quality1}.hg").exists(): + if edge_length: + hypergraph_path = f"{prefix}.q{min_quality1}.e{to_humanized2(edge_length)}.hg" + else: + hypergraph_path = f"{prefix}.q{min_quality1}.hg" + if is_file_changed(hcr_bed) or is_file_changed(hypergraph) or not Path(hypergraph_path).exists(): logger.info(f"Load raw hypergraph from porec table `{hypergraph}`") he = HyperExtractor(hypergraph, contig_idx, contigsizes.to_dict()['length'], - min_quality=min_quality1, hcr_bed=hcr_bed, + min_quality=min_quality1, hcr_bed=hcr_bed, edge_length=edge_length, hcr_invert=hcr_invert, threads=threads) - he.save(f"{prefix}.q{min_quality1}.hg") + he.save(hypergraph_path) hypergraph = he.edges else: if hcr_bed: - logger.warning(f"Load raw hypergraph from existed file of `{prefix}.q{min_quality1}.hg`, if the {hcr_bed} changed, you should remove this existing hg.") + logger.warning(f"Load raw hypergraph from existed file of `{hypergraph_path}`, if the {hcr_bed} changed, you should remove this existing hg.") else: - logger.warning(f"Load raw hypergraph from existed file of `{prefix}.q{min_quality1}.hg`.") - hypergraph = msgspec.msgpack.decode(open(f"{prefix}.q{min_quality1}.hg", 'rb').read(), type=HyperEdges) + logger.warning(f"Load raw hypergraph from existed file of `{hypergraph_path}`.") + hypergraph = msgspec.msgpack.decode(open(hypergraph_path, 'rb').read(), type=HyperEdges) elif pairs: contigs = contigsizes.index.values.tolist() contigs = list(map(str, contigs)) # contigs = natsorted(contigs) contig_idx = defaultdict(None, dict(zip(contigs, range(len(contigs))))) - if is_file_changed(hcr_bed) or is_file_changed(hypergraph) or not Path(f"{prefix}.q{min_quality1}.hg").exists(): + if edge_length: + hypergraph_path = f"{prefix}.q{min_quality1}.e{to_humanized2(edge_length)}.hg" + else: + hypergraph_path = f"{prefix}.q{min_quality1}.hg" + if is_file_changed(hcr_bed) or is_file_changed(hypergraph) or not Path(hypergraph_path).exists(): logger.info(f"Load raw hypergraph from pairs file `{hypergraph}`") he = Extractor(hypergraph, contig_idx, contigsizes.to_dict()['length'], - min_quality=min_quality1, hcr_bed=hcr_bed, + min_quality=min_quality1, hcr_bed=hcr_bed, edge_length=edge_length, hcr_invert=hcr_invert, threads=threads) - he.save(f"{prefix}.q{min_quality1}.hg") + he.save(hypergraph_path) hypergraph = he.edges else: - logger.warning(f"Load raw hypergraph from exists file of `{prefix}.q{min_quality1}.hg`, if the input pairs changed, you should remove this existing hg.") - hypergraph = msgspec.msgpack.decode(open(f"{prefix}.q{min_quality1}.hg", 'rb').read(), type=HyperEdges) + logger.warning(f"Load raw hypergraph from exists file of `{hypergraph_path}`, if the input pairs changed, you should remove this existing hg.") + hypergraph = msgspec.msgpack.decode(open(hypergraph_path, 'rb').read(), type=HyperEdges) else: @@ -3770,6 +3914,7 @@ def hyperpartition(hypergraph, alleles_window_size, alleles_minimum_similarity, alleles_diff_thres, + alleles_trim_length, alleletable, prunetable, normalize, @@ -4155,7 +4300,39 @@ def scaffolding(clustertable, count_re, clm, allele_table=allele_table, fasta=fasta, output=output, threads=threads) hs.run() -@cli.command(hidden=True) + if corrected: + output_agp = f"{Path(output).stem}.corrected.agp" + else: + output_agp = output + if Path(output_agp).exists(): + try: + statagp.main(args=[ + output, + "-o", + f"{output_agp}.stat"], + prog_name='statagp') + except SystemExit as e: + exc_info = sys.exc_info() + exit_code = e.code + if exit_code is None: + exit_code = 0 + + if exit_code != 0: + raise e + + if Path(f"{output_agp}.stat").exists(): + df = pd.read_csv(f"{output_agp}.stat", sep='\t', index_col=0) + anchor_rate = df.loc["Anchor rate (%):"].iloc[0] + ctg_length = df.loc["Total Length of contigs (bp):"].iloc[0] + anchored_ctg_length = df.loc["Total Length of anchored contigs (bp):"].iloc[0] + + logger.info(f"Anchor rate: {anchor_rate}%, " + f"Total Length of anchored contigs: {to_humanized3(anchored_ctg_length)}, " + f"Total Length of contigs: {to_humanized3(ctg_length)}" + ) + logger.info(f"Stat the agp file to `{output}.stat`") + +@cli.command(cls=RichCommand, hidden=True) @click.argument( "hypergraph", metavar="HyperGraph", @@ -4310,7 +4487,7 @@ def chromsizes(fasta, output): assert flag == 0, "Failed to execute command, please check log." -@cli.command(hidden=HIDDEN, short_help="Convert paf to pairs. (hidden)") +@cli.command(cls=RichCommand, short_help="Convert porec alignment paf to pairs. ") @click.argument( "paf", metavar="INPUT_PAF_PATH", @@ -4417,7 +4594,7 @@ def paf2pairs(paf, contigsizes, assert flag == 0, "Failed to execute command, please check log." -@cli.command(cls=RichCommand, hidden=HIDDEN, short_help="Convert paf to porec table. (hidden)") +@cli.command(cls=RichCommand, short_help="Convert paf to porec table.") @click.argument( "paf", metavar="INPUT_PAF_PATH", @@ -4505,7 +4682,41 @@ def paf2porec(paf, bed, min_quality, assert flag == 0, "Failed to execute command, please check log." -@cli.command(cls=RichCommand, hidden=HIDDEN, short_help="Convert porec to pairs. (hidden)") +@cli.command(cls=RichCommand, short_help="Merge multiple porec tables.") +@click.argument( + "porec", + metavar="INPUT_POREC_PATH", + required=True, + nargs=-1, + type=click.Path(exists=True) +) +@click.option( + "-o", + "--output", + metavar="OUTPUT", + help="output path", + default="-", + show_default=True +) +def porec_merge(porec, output): + """ + Merge multiple porec tables. + + INPUT_POREC_PATH: The porec table generated from `cphasing-rs paf2porec`. + + """ + import shutil + if len(porec) == 1: + logger.warning("Only one porec table, skip merge.") + shutil.copy(porec[0], output) + cmd = ["cphasing-rs", "porec-merge", *porec, "-o", output] + + flag = run_cmd(cmd) + assert flag == 0, "Failed to execute command, please check log." + + + +@cli.command(cls=RichCommand, short_help="Convert porec table to pairs. ") @click.argument( "porec", metavar="INPUT_POREC_PATH", @@ -4560,8 +4771,37 @@ def porec2pairs(porec, min_mapq, flag = run_cmd(cmd) assert flag == 0, "Failed to execute command, please check log." +@cli.command(cls=RichCommand, short_help="Convert porec alignment bam to paf file.") +@click.argument( + "bam", + metavar="INPUT_BAM_PATH", + type=click.Path(exists=True) +) +@click.option( + '-s', + '--secondary', + is_flag=True, + default=False, + help='Output secondary alignments.' +) +@click.option( + "-o", + "--output", + metavar="OUTPUT", + default="-", + show_default=True +) +def bam2paf(bam, secondary, output): + cmd = ["cphasing-rs", "bam2paf", f"{bam}", + + "-o", f"{output}"] + if secondary: + cmd.append("-s") + + flag = run_cmd(cmd) + assert flag == 0, "Failed to execute command, please check log." -@cli.command(cls=RichCommand, hidden=HIDDEN, short_help="Convert pairs to bam file. (hidden)") +@cli.command(cls=RichCommand, short_help="Convert bwa bam (or another paired bam) to pairs file.") @click.argument( "bam", metavar="INPUT_BAM_PATH", @@ -4590,7 +4830,7 @@ def bam2pairs(bam, min_quality, output): flag = run_cmd(cmd) assert flag == 0, "Failed to execute command, please check log." -@cli.command(cls=RichCommand, hidden=HIDDEN, short_help="Convert pairs to bam file. (hidden)") +@cli.command(cls=RichCommand, short_help="Filter pairs by mapping quality.") @click.argument( "pairs", metavar="INPUT_PAIRS_PATH", @@ -4605,6 +4845,18 @@ def bam2pairs(bam, min_quality, output): default=1, show_default=True ) +@click.option( + "-wl", + "--whitelist", + metavar="PATH", + help=""" + Path to 1-column list file containing + contigs to include in pairs file. + """, + default=None, + show_default=True, + type=click.Path(exists=True) +) @click.option( '-t', '--threads', @@ -4621,9 +4873,12 @@ def bam2pairs(bam, min_quality, output): default="-", show_default=True ) -def pairs_filter(pairs, min_quality, threads, output): +def pairs_filter(pairs, min_quality, whitelist, threads, output): cmd = ["cphasing-rs", "pairs-filter", f"{pairs}", "-t", f"{threads}", "-q", f"{min_quality}", "-o", f"{output}"] + + if whitelist: + cmd.extend(["-w", f"{whitelist}"]) flag = run_cmd(cmd) assert flag == 0, "Failed to execute command, please check log." @@ -4650,7 +4905,7 @@ def pairs_filter(pairs, min_quality, threads, output): help='Minimum quality of mapping [0, 255].', metavar='INT', type=click.IntRange(0, 255, clamp=True), - default=1, + default=0, show_default=True ) @click.option( @@ -4658,7 +4913,7 @@ def pairs_filter(pairs, min_quality, threads, output): '--threads', help='Number of threads. (unused)', type=int, - default=4, + default=8, metavar='INT', show_default=True, ) @@ -4670,9 +4925,16 @@ def pairs_filter(pairs, min_quality, threads, output): show_default=True ) def pairs2clm(pairs, min_contacts, min_quality, threads, output): - cmd = ["cphasing-rs", "pairs2clm", f"{pairs}", "-c", f"{min_contacts}", - "-q", f"{min_quality}", "-t", f"{threads}", "-o", f"{output}"] - flag = run_cmd(cmd) + if pairs.endswith(".gz"): + cmd0 = decompress_cmd(pairs, threads=threads) + cmd = ["cphasing-rs", "pairs2clm", "-", "-c", f"{min_contacts}", + "-q", f"{min_quality}", "-t", f"{threads}", "-o", f"{output}"] + flag = os.system(" ".join(cmd0) + " | " + " ".join(cmd)) + else: + cmd = ["cphasing-rs", "pairs2clm", f"{pairs}", "-c", f"{min_contacts}", + "-q", f"{min_quality}", "-t", f"{threads}", "-o", f"{output}"] + flag = run_cmd(cmd) + assert flag == 0, "Failed to execute command, please check log." @cli.command(cls=RichCommand, hidden=HIDDEN, short_help="Convert pairs to clm. (hidden)") @@ -4706,10 +4968,27 @@ def pairs2clm(pairs, min_contacts, min_quality, threads, output): default="-", show_default=True ) -def pairs2depth(pairs, binsize, min_quality, output): - cmd = ["cphasing-rs", "pairs2depth", f"{pairs}", "-b", f"{binsize}", - "-q", f"{min_quality}", "-o", f"{output}"] - flag = run_cmd(cmd) +@click.option( + '-t', + '--threads', + help='Number of threads.', + type=int, + default=10, + metavar='INT', + show_default=True, +) +def pairs2depth(pairs, binsize, min_quality, output, threads): + if pairs.endswith(".gz"): + cmd0 = decompress_cmd(pairs, threads=threads) + cmd = ["cphasing-rs", "pairs2depth", "-", "-b", f"{binsize}", + "-t", str(threads), "-q", f"{min_quality}", "-o", f"{output}"] + flag = os.system(" ".join(cmd0) + " | " + " ".join(cmd)) + + else: + cmd = ["cphasing-rs", "pairs2depth", f"{pairs}", "-b", f"{binsize}", + "-t", str(threads), "-q", f"{min_quality}", "-o", f"{output}"] + flag = run_cmd(cmd) + assert flag == 0, "Failed to execute command, please check log." @cli.command(hidden=True, short_help="Convert pairs to contacts. (hidden)") @@ -4763,7 +5042,7 @@ def pairs2contacts(pairs, min_contacts, split_num, assert flag == 0, "Failed to execute command, please check log." -@cli.command(cls=RichCommand, hidden=HIDDEN, short_help="Convert pairs to mnd file. (hidden)") +@cli.command(cls=RichCommand, short_help="Convert pairs to mnd file for generating .hic file.") @click.argument( "pairs", metavar="INPUT_PAIRS_PATH", @@ -4793,6 +5072,40 @@ def pairs2mnd(pairs, output, min_mapq): assert flag == 0, "Failed to execute command, please check log." +@cli.command(cls=RichCommand, short_help="Merge multiple pairs file into one.") +@click.argument( + "pairs", + metavar="INPUT_PAIRS_PATH", + nargs=-1, + required=True, + type=click.Path(exists=True) +) +@click.option( + "-o", + "--output", + metavar="OUTPUT", + default="-", + show_default=True +) +def pairs_merge(pairs, output): + """ + Merge multiple pairs file into one. + + INPUT_PAIRS_PATH : Path of pairs file, can be compressed. + + + """ + if len(pairs) == 0: + logger.error("At least one pairs file must be specified.") + return 1 + + cmd = ["cphasing-rs", "pairs-merge", *pairs, "-o", output] + flag = run_cmd(cmd) + assert flag == 0, "Failed to execute command, please check log." + + + + @cli.command(cls=RichCommand, epilog=__epilog__) @click.argument( "pairs", @@ -5159,6 +5472,13 @@ def pairs2cool2(pairs, outcool, 'Comma seperated. or a one column file', default='' ) +@click.option( + '-r', + '--regex', + is_flag=True, + help='Regular expression of chromosomes, only used for `--chromosomes`, e.g. `Chr01g.*`', + default=False +) @click.option( "-dns", "--disable-natural-sort", @@ -5316,6 +5636,7 @@ def plot(matrix, vmin, vmax, chromosomes, + regex, disable_natural_sort, per_chromosomes, only_chr, @@ -5368,10 +5689,10 @@ def plot(matrix, if agp is None: only_plot = True if not only_coarsen: - logger.warning( "Only plot the matrix. " + logger.info( "Only plot the matrix. " "If you want to adjust matrix to chromosome-level, please provide agp file. ") else: - logger.warning("Only coarsen the input matrix.") + logger.info("Only coarsen the input matrix.") if not cooler.fileops.is_cooler(matrix): logger.error(f"Input file `{matrix}` is not a cool file.") @@ -5379,6 +5700,7 @@ def plot(matrix, cool = cooler.Cooler(matrix) cool_binsize = cool.info['bin-size'] + bins = cool.bins()[:] if cool_binsize is None: cool_binsize = np.argmax(np.bincount(bins['end'] - bins['start'])) @@ -5409,6 +5731,7 @@ def plot(matrix, else: factor = 1 binsize = cool_binsize + if not only_plot: @@ -5421,11 +5744,11 @@ def plot(matrix, sys.exit() if not no_coarsen and factor > 1: - matrix = coarsen_matrix(matrix, factor, None, threads) + matrix = coarsen_matrix(matrix, cool_binsize, factor, None, threads) else: if only_coarsen or coarsen: - matrix = coarsen_matrix(matrix, factor, None, threads) + matrix = coarsen_matrix(matrix, cool_binsize, factor, None, threads) if balance: cool = cooler.Cooler(matrix) @@ -5443,24 +5766,46 @@ def plot(matrix, balance_matrix(matrix, threads) balanced = True + if op.exists(chromosomes): logger.info(f"Load chromosomes list from the file `{chromosomes}`.") chromosomes = [i.strip().split("\t")[0] for i in open(chromosomes) if i.strip()] else: + if chromosomes: + chromnames = cooler.Cooler(matrix).chromnames + if regex: + _regex = re.compile(chromosomes) + chromosomes = list(filter(lambda x: _regex.findall(x), chromnames)) + logger.info(f"Find chromosomes: {' '.join(chromosomes)}") + + else: + chromosomes = chromosomes.strip().strip(",").split(',') + + chromosomes = list(filter(lambda x: x in chromnames, chromosomes)) + if not chromosomes: + logger.warning(f"Specified chromosomes not found in the matrix, please check the input, plotting all chromosomes.") + else: + not_found_chromosomes = list(filter(lambda x: x not in chromnames, chromosomes)) + if not_found_chromosomes: + logger.warning(f"These chromosomes not found in the matrix: {', '.join(not_found_chromosomes)}") + - chromosomes = chromosomes.strip().strip(",").split(',') if chromosomes else None - + if not chromosomes and only_chr: + logger.info(f"Only plot the chromosomes (--chr-prefix='{chr_prefix}.*') that ignore unanchored contigs.") regex = re.compile(f"^{chr_prefix}") - chroms = cooler.Cooler(matrix).chromnames + chroms = cooler.Cooler(matrix).chromnames chromosomes = list(filter(lambda x: regex.findall(x), chroms)) + if len(chromosomes) == 0: + logger.warning(f"Chromosomes with prefix `{chr_prefix}` not found in the matrix, plotting all chromosomes.") if chromosomes and only_chr: if not disable_natural_sort: chromosomes = natsorted(chromosomes) + if no_ticks: xticks = False @@ -6256,6 +6601,91 @@ def cool2depth(coolfile, output): cool2depth(coolfile, output) +@utils.command(cls=RichCommand) +@click.argument( + "hypergraph", + metavar="INPUT_HYPERGRAPH_PATH", + +) +@click.option( + "-q", + "--min-mapq", + "min_mapq", + default=1, + metavar="INT", + help="Minimum mapping quality of alignments", + type=click.IntRange(0, 60), + show_default=True, +) +@click.option( + "-mc", + "--min-contacts", + "min_contacts", + help="Minimum contacts of contigs", + metavar="INT", + type=int, + default=5, + show_default=True +) +@click.option( + "-pt", + "--prunetable", + metavar="PATH", + help="Path to prune table that is generated from `kprune`. [defualt: None]", + default=None, + show_default=True, + type=click.Path(exists=True) +) +@click.option( + "-o", + "--output", + metavar="PATH", + help="Output of expansion contacts", + default="file_pefix.expansion.contacts", + show_default=True +) +def hg2contacts(hypergraph, min_mapq, min_contacts, prunetable, output): + """ + Convert hypergraph to contacts. + """ + import msgspec + from .algorithms.hypergraph import HyperEdges, HyperGraph + from .core import PruneTable + + + he = msgspec.msgpack.decode(open(hypergraph, 'rb').read(), type=HyperEdges) + he.to_numpy() + hg = HyperGraph(he, min_quality=min_mapq) + H = hg.incidence_matrix(min_contacts=min_contacts) + nodes = hg.nodes + + if prunetable: + vertices_idx = {v: i for i, v in enumerate(nodes)} + prunetable = PruneTable(prunetable) + pair_df = prunetable.data + pair_df['contig1'] = pair_df['contig1'].map(lambda x: vertices_idx.get(x, np.nan)) + pair_df['contig2'] = pair_df['contig2'].map(lambda x: vertices_idx.get(x, np.nan)) + pair_df = pair_df.dropna(axis=0).astype({'contig1': 'int', 'contig2': 'int', 'type': 'int'}) + + prunetable.data = pair_df + prunetable.symmetric_table() + pair_df = prunetable.data + pair_df = pair_df.drop_duplicates(subset=['contig1', 'contig2']) + pair_df = pair_df.reset_index(drop=True) + P_allelic_idx = [pair_df.loc[pair_df['type'] == 0, 'contig1'], pair_df.loc[pair_df['type'] == 0, 'contig2']] + P_weak_idx = [pair_df.loc[pair_df['type'] == 1, 'contig1'], pair_df.loc[pair_df['type'] == 1, 'contig2']] + else: + P_allelic_idx = None + P_weak_idx = None + min_weight = 2.0 + A = hg.clique_expansion_init(H, + min_weight=min_weight, + P_allelic_idx=P_allelic_idx, + P_weak_idx=P_weak_idx) + hg.to_contacts(A, nodes, output=output, min_weight=min_weight) + + + @utils.command(cls=RichCommand) @click.argument( "coolfile", diff --git a/cphasing/core.py b/cphasing/core.py index 0c865f8..30f0df6 100644 --- a/cphasing/core.py +++ b/cphasing/core.py @@ -325,7 +325,10 @@ def sort_row(row): usecols=self.columns, comment="#") df.index = df.index.astype('category') df = df.dropna(how='all', axis=1) - + if df.empty: + self.data = df + return + if self.fmt == "allele1": df = df.drop(1, axis=1) df = df.apply(sort_row, axis=1) if self.sort else df @@ -1128,7 +1131,7 @@ def to_agp(self, contigsizes, output, orientation_db=None): - def to_fasta(self, fasta, outdir="./"): + def to_fasta(self, fasta, trim_length=25000, outdir="./"): """ extract fasta by cluster table """ @@ -1145,7 +1148,16 @@ def to_fasta(self, fasta, outdir="./"): for record in fasta: if record.id in db: - file_db[db[record.id]].write(f'>{record.id}\n{record.seq}\n') + seq = record.seq + length = len(record.seq) + if length > trim_length * 3: + seq = seq[trim_length: length - trim_length + 1] + # if length > 1000_000: + # seq1 = seq[:500_000] + # seq2 = seq[500_000:] + # seq = seq1 + seq2 + + file_db[db[record.id]].write(f'>{record.id}\n{seq}\n') for group in file_db: file_db[group].close() diff --git a/cphasing/hitig/hcr/hcr_by_contacts.py b/cphasing/hitig/hcr/hcr_by_contacts.py index f830933..de16ac1 100644 --- a/cphasing/hitig/hcr/hcr_by_contacts.py +++ b/cphasing/hitig/hcr/hcr_by_contacts.py @@ -147,7 +147,8 @@ def hcr_by_contacts_cool(cool_file, output, lower_value=0.1, upper_value=1.75, def hcr_by_contacts(depth_file, output, lower_value=0.1, upper_value=1.75, - min_remove_whole_collapsed_contigs_rate=0.9): + min_remove_whole_collapsed_contigs_rate=0.9, + edge_length=None): depth = pl.read_csv(depth_file, separator='\t', has_header=False, @@ -223,8 +224,39 @@ def hcr_by_contacts(depth_file, output, lower_value=0.1, upper_value=1.75, collapsed_pr = pr.PyRanges(collapsed_df) hcr_regions_pr = hcr_regions_pr.intersect(collapsed_pr, invert=True) + if edge_length: + contigsizes = contigsizes.rename("length").to_frame() + large_contigs = contigsizes[contigsizes['length'] >= edge_length * 2] + small_contigs = contigsizes[contigsizes['length'] < edge_length * 2] + def func1(row): + row['start'] = 0 + row['end'] = int(2e6) - hcr_regions_pr.df.to_csv(output, sep='\t', index=None, header=None) + return row + + def func2(row2): + row2['start'] = row2['length'] - 2e6 + row2['end'] = row2['length'] + + return row2 + + res_df = pd.concat([large_contigs.reset_index().apply(func1, axis=1), + large_contigs.reset_index().apply(func2, axis=1), + ], axis=0).drop(['length'], axis=1).astype({"start": np.int64, "end": np.int64}) + small_contigs['start'] = 0 + small_contigs = small_contigs.reset_index().rename(columns={"length": "end"}) + + res_df = pd.concat([res_df, small_contigs], axis=0) + res_df.columns = ["Chromosome", "Start", "End"] + + edge_gr = pr.PyRanges(res_df) + report_regions_pr = edge_gr.join(hcr_regions_pr).new_position("intersection") + report_regions_pr.df[['Chromosome', 'Start_b', 'End_b']].to_csv( + output, sep='\t', index=None, header=None + ) + + else: + hcr_regions_pr.df.to_csv(output, sep='\t', index=None, header=None) logger.info(f"Successful output HCRs into `{output}`.") def main(args): diff --git a/cphasing/hypergraph.py b/cphasing/hypergraph.py index 6606ff7..01e078b 100644 --- a/cphasing/hypergraph.py +++ b/cphasing/hypergraph.py @@ -24,7 +24,10 @@ from subprocess import Popen, PIPE from .algorithms.hypergraph import HyperEdges -from .utilities import listify, list_flatten, is_compressed_table_empty +from .utilities import (listify, + list_flatten, + is_compressed_table_empty, + decompress_cmd) from ._config import * logger = logging.getLogger(__name__) @@ -60,7 +63,8 @@ class Extractor: """ def __init__(self, pairs_pathes, contig_idx, contigsizes, min_quality=1, hcr_bed=None, hcr_invert=False, - threads=4, low_memory=True): + threads=4, edge_length=2e6, low_memory=True, + log_dir="logs"): self.pairs_pathes = listify(pairs_pathes) self.contig_idx = contig_idx self.contigsizes = contigsizes @@ -68,7 +72,12 @@ def __init__(self, pairs_pathes, contig_idx, contigsizes, self.hcr_bed = hcr_bed self.hcr_invert = hcr_invert self.threads = threads + self.edge_length = edge_length self.low_memory = low_memory + + self.log_dir = Path(log_dir) + self.log_dir.mkdir(parents=True, exist_ok=True) + self.edges = self.generate_edges() @staticmethod @@ -90,6 +99,13 @@ def generate_edges(self): else: dtype={'mapq': pl.Int8} + if self.edge_length: + columns = ['chrom1', 'pos1', 'chrom2', 'pos2', 'mapq'] + usecols = [1, 2, 3, 4, 7] + else: + columns = ['chrom1', 'chrom2', 'mapq'] + usecols = [1, 3, 7] + if len(self.pairs_pathes) == 1: if is_compressed_table_empty(self.pairs_pathes[0]): @@ -101,23 +117,52 @@ def generate_edges(self): if not self.hcr_bed: input_file = self.pairs_pathes[0] else: - cmd = f"cphasing-rs pairs-intersect {self.pairs_pathes[0]} {self.hcr_bed} -q {self.min_mapq}" + pairs_prefix = Path(Path(self.pairs_pathes[0]).stem).with_suffix('') + while pairs_prefix.suffix in {'.pairs', '.gz'}: + pairs_prefix = pairs_prefix.with_suffix('') + + if str(self.pairs_pathes[0]).endswith(".gz"): + cmd0 = decompress_cmd(str(self.pairs_pathes[0]), str(self.threads)) + cmd = f"{' '.join(cmd0)} 2>{self.log_dir}/{pairs_prefix}.decompress.hcr.log | cphasing-rs pairs-intersect - {self.hcr_bed} -q {self.min_mapq} -o temp.{pairs_prefix}.hcr.pairs" + else: + cmd = f"cphasing-rs pairs-intersect {self.pairs_pathes[0]} {self.hcr_bed} -q {self.min_mapq} -o temp.{pairs_prefix}.hcr.pairs" + if self.hcr_invert: cmd += " --invert" - process = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE) - stdout, stderr = process.communicate() - input_file = stdout + cmd += f" 2>{self.log_dir}/{pairs_prefix}.pairs.intersect.log" + # process = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE) + + # stdout, stderr = process.communicate() + logger.info(f"Generating hcr pairs by {self.hcr_bed} ...") + flag = os.system(cmd) + assert flag == 0, "Failed to execute command, please check log." + input_file = f"temp.{pairs_prefix}.hcr.pairs" p = pd.read_csv(self.pairs_pathes[0], sep='\t', comment="#", header=None, index_col=None, nrows=1) if len(p.columns) >= 8 and isinstance(p[7].values[0], np.int64) and p[7].values[0] <= 60: p = pl.read_csv(input_file, separator='\t', has_header=False, - comment_prefix="#", columns=[1, 3, 7], - new_columns=['chrom1', 'chrom2', 'mapq'], + comment_prefix="#", columns=usecols, + new_columns=columns, dtypes=dtype) + if self.hcr_bed: + if Path(f"temp.{pairs_prefix}.hcr.pairs").exists(): + os.remove(f"temp.{pairs_prefix}.hcr.pairs") + + if self.min_mapq > 0: p = p.filter(pl.col('mapq') >= self.min_mapq) + + + if self.edge_length: + p = p.with_columns([ + pl.col('chrom1').map_elements(self.contigsizes.get).alias('length1'), + pl.col('chrom2').map_elements(self.contigsizes.get).alias('length2') + ]).filter( + (pl.col('pos1') < self.edge_length) | (pl.col('pos2') > pl.col('length2') - self.edge_length) + ).select(['chrom1', 'chrom2', 'mapq']) + p = p.to_pandas() res = Extractor._process_df(p, self.contig_idx, self.threads) @@ -136,7 +181,12 @@ def generate_edges(self): comment_prefix="#", columns=[1, 3], new_columns=['chrom1', 'chrom2'], dtypes=dtype) - + + if self.hcr_bed: + if Path(f"temp.{pairs_prefix}.hcr.pairs").exists(): + os.remove(f"temp.{pairs_prefix}.hcr.pairs") + + p = p.to_pandas() res = Extractor._process_df(p, self.contig_idx, self.threads) @@ -412,38 +462,21 @@ def save(self, output): logger.info(f"Successful output graph into `{output}`") -def process_pore_c_table(df, contig_idx, threads=1, +def process_pore_c_table(df, contig_idx, contigsizes, threads=1, min_order=2, max_order=50, - min_alignments=50, is_parquet=False): - - # pandarallel.initialize(nb_workers=threads, verbose=0) - # df['chrom_idx'] = df['chrom'].parallel_map(contig_idx.get) - # df.dropna(axis=0, inplace=True) - # df['chrom_idx'] = df['chrom_idx'].astype('int') - - - # # chrom_db = dict(zip(range(len(df.chrom.cat.categories)), - # # df.chrom.cat.categories)) - # # df = (df.assign(alignment_length=lambda x: x.end - x.start) - # # .query(f"alignment_length >= {min_alignments}") - # # .set_index('read_idx')) - # df = df.set_index('read_idx') - # df = df[['chrom_idx', 'mapping_quality']] - # # df_grouped = df.groupby('read_idx')['chrom_idx'] - # # df_grouped_nunique = df_grouped.nunique() - # # df = df.loc[(df_grouped_nunique >= min_order) - # # & (df_grouped_nunique <= max_order)] - # df_grouped_nunique = df.groupby('read_idx')['chrom_idx'].transform('nunique') - # df = df[(df_grouped_nunique >= min_order) & (df_grouped_nunique <= max_order)] - - # df = df[['chrom_idx', 'mapping_quality']].reset_index().drop_duplicates(['read_idx', 'chrom_idx']) - - # df['read_idx'] = df['read_idx'].astype('category').cat.codes + min_alignments=50, is_parquet=False, + edge_length=2e6 + ): df = df.with_columns(pl.col('chrom').map_elements(contig_idx.get).alias('chrom_idx')).drop_nulls() df = df.with_columns(pl.col('chrom_idx').cast(pl.UInt32)) + if edge_length: + df = df.with_columns(pl.col('chrom').map_elements(contigsizes.get).alias('length')).drop_nulls() + df = df.with_columns(pl.col('length').cast(pl.UInt32)) + df = df.filter(( pl.col('start') < edge_length) | (pl.col('end') > pl.col('length') - edge_length)) + + # df = df.with_columns(pl.col('read_idx').cast(pl.UInt32)) - df = df.with_columns(pl.col('read_idx').cast(pl.UInt32)) df = df.select(['read_idx', 'chrom_idx', 'mapping_quality']) df_grouped_nunique = df.group_by('read_idx').agg(pl.col('chrom_idx').n_unique().alias('chrom_idx_nunique')) df = df.join(df_grouped_nunique, on='read_idx') @@ -451,7 +484,7 @@ def process_pore_c_table(df, contig_idx, threads=1, df = df.unique(['read_idx', 'chrom_idx'], maintain_order=True) df = df.with_columns(pl.col('read_idx').cast(pl.Utf8).cast(pl.Categorical).to_physical()) - + df = df.to_pandas() return df @@ -486,10 +519,12 @@ def __init__(self, pore_c_table_pathes, max_order=50, min_alignments=30, min_quality=1, + edge_length=2e6, hcr_bed=None, hcr_invert=False, threads=4, - is_parquet=False): + is_parquet=False, + log_dir="logs"): self.pore_c_table_pathes = listify(pore_c_table_pathes) self.contig_idx = contig_idx @@ -498,10 +533,14 @@ def __init__(self, pore_c_table_pathes, self.max_order = max_order self.min_alignments = min_alignments self.min_quality = min_quality + self.edge_length = edge_length self.hcr_bed = hcr_bed self.hcr_invert = hcr_invert self.threads = threads self.is_parquet = is_parquet + + self.log_dir = Path(log_dir) + self.log_dir.mkdir(parents=True, exist_ok=True) self.pore_c_tables = self.import_pore_c_table() self.edges = self.generate_edges() @@ -512,20 +551,34 @@ def import_pore_c_table(self): import pore-c table from pore """ logger.info("Loading Pore-C table ...") + + if self.edge_length: + columns = ['read_idx', 'chrom', 'start', 'end', 'mapping_quality'] + usecols = [0, 5, 6, 7, 8] + else: + columns = ['read_idx', 'chrom', 'mapping_quality'] + usecols = [0, 5, 8] + if len(self.pore_c_table_pathes) == 1: - # if Path(self.pore_c_table_pathes[0]).is_symlink(): - # infile = os.readlink(self.pore_c_table_pathes[0]) - # else: if self.hcr_bed is None: infile = self.pore_c_table_pathes[0] else: - cmd = f"cphasing-rs porec-intersect {self.pore_c_table_pathes[0]} {self.hcr_bed}" + porec_prefix = str(Path(self.pore_c_table_pathes[0]).name).replace(".gz", "").rsplit(".", 1)[0] + if str(self.pore_c_table_pathes[0]).endswith(".gz"): + cmd0 = decompress_cmd(str(self.pore_c_table_pathes[0]), str(self.threads)) + cmd = f"{' '.join(cmd0)} 2>{porec_prefix}.decompress.hcr.log | cphasing-rs porec-intersect - {self.hcr_bed} -o temp.{porec_prefix}.hcr.porec" + else: + cmd = f"cphasing-rs porec-intersect {self.pore_c_table_pathes[0]} {self.hcr_bed} -o temp.{porec_prefix}.hcr.porec" if self.hcr_invert: cmd += " --invert" - process = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE) - stdout, stderr = process.communicate() - infile = stdout - + + cmd += f" 2>{self.log_dir}/{porec_prefix}.porec.intersect.log" + logger.info(f"Generating hcr porec table by {self.hcr_bed} ...") + flag = os.system(cmd) + assert flag == 0, "Failed to execute command, please check log." + + infile = f"temp.{porec_prefix}.hcr.porec" + if self.is_parquet: df = pd.read_parquet(infile, columns=['read_idx', 'chrom', @@ -534,50 +587,28 @@ def import_pore_c_table(self): #'filter_reason', ], engine=PQ_ENGINE) + else: logger.debug("Start to load one porec table.") try: df = pl.read_csv(infile, separator='\t', has_header=False, - columns=[0, 5, 8], - dtypes={'mapping_quality': pl.Int8, 'chrom': pl.Categorical}, - new_columns=['read_idx', 'chrom', 'mapping_quality']) + columns=usecols, + dtypes={'read_idx': pl.UInt32, 'mapping_quality': pl.Int8, 'chrom': pl.Categorical}, + # new_columns=['read_idx', 'chrom', 'mapping_quality']) + new_columns=columns) except pl.exceptions.NoDataError: logger.error(f"The pore-c table `{infile}` is empty, can not load anything, please check it.") sys.exit(-1) + except pl.exceptions.OutOfBoundsError: + logger.error(f"The pore-c table `{infile}` is incorrect, it's not a pore-c table, " + f"may be it is a pairs, if that you should change `-pct` to `-prs`.") + sys.exit(-1) - # try: - # logger.debug("Start to load one porec table.") - # if pandas_version == 2: - # df = pd.read_csv(infile, - # sep='\t', - # header=None, - # index_col=None, - # engine=CSV_ENGINE, - # dtype_backend=CSV_ENGINE, - # usecols=[0, 5, 8]) - # # usecols=['read_idx', 'chrom', - # # # 'start', 'end', - # # 'mapping_quality', - # # #'filter_reason' - # # ], - # else: - # df = pd.read_csv(infile, - # sep='\t', - # header=None, - # index_col=None, - # # engine=CSV_ENGINE, - # usecols=[0, 5, 8]) - # df.columns = ['read_idx', 'chrom', 'mapping_quality'] - # # names=self.HEADER) - - # except pd.errors.ParserError: - # logger.error("The input contacts are not the pore-c table format" - # "do you want to parse Pairs file? please add parameters of `--pairs`") - # sys.exit(-1) - # df = df.query("filter_reason == 'pass'") + if self.hcr_bed: + if Path(f"temp.{porec_prefix}.hcr.porec").exists(): + os.remove(f"temp.{porec_prefix}.hcr.porec") + if self.min_quality > 0: - - # df = df.query(f"mapping_quality >= {self.min_quality}") df = df.filter(pl.col('mapping_quality') >= self.min_quality) @@ -587,9 +618,6 @@ def import_pore_c_table(self): else: infiles = [] for i in self.pore_c_table_pathes: - # if Path(i).is_symlink(): - # infile = os.readlink(i) - # else: infiles.append(i) if not self.hcr_bed: @@ -597,13 +625,14 @@ def get_file(x): return x else: def get_file(x): - cmd = f"cphasing-rs porec-intersect {x} {self.hcr_bed}" + porec_prefix = str(Path(x).name).replace(".gz", "").rsplit(".", 1)[0] + cmd = f"cphasing-rs porec-intersect {x} {self.hcr_bed} -o temp.{porec_prefix}.hcr.porec" if self.hcr_invert: cmd += " --invert" process = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE) stdout, stderr = process.communicate() - return stdout + return f"temp.{porec_prefix}.hcr.porec" if self.is_parquet: df_list = list(map(lambda x: pd.read_parquet( @@ -654,25 +683,23 @@ def get_file(x): df_list = list(map(lambda x: pl.read_csv( get_file(x), separator='\t', has_header=False, - columns=[0, 5, 8], dtypes={'mapping_quality': pl.Int8, 'chrom': pl.Categorical}, - new_columns=['read_idx', 'chrom', 'mapping_quality'] + columns=usecols, dtypes={'mapping_quality': pl.Int8, 'chrom': pl.Categorical}, + new_columns=columns, ), infiles)) if self.min_quality > 0: - # df_list = Parallel(n_jobs=self.threads)(delayed( - # lambda x: x.query(#f"filter_reason == 'pass' &" - # f"'min_quality >= {self.min_quality}'") - # # .drop("filter_reason", axis=1) - # )(i) for i in df_list) df_list = Parallel(n_jobs=self.threads)(delayed( lambda x: x.filter(pl.col('mapping_quality') >= self.min_quality))(i) for i in df_list) - # else: - # df_list = Parallel(n_jobs=self.threads)(delayed( - # lambda x: x.query("filter_reason == 'pass'") - # .drop("filter_reason", axis=1) - # )(i) for i in df_list) - + + if self.edge_length: + df_list = Parallel(n_jobs=self.threads)(delayed( + lambda x: x.with_columns([ + pl.col('chrom').map_elements(self.contigsizes.get).alias('length'), + ]).filter( + (pl.col('start') < self.edge_length) | (pl.col('end') > pl.col('length') - self.edge_length) + ).select(['read_idx', 'chrom', 'mapping_quality']) + )(i) for i in df_list) return df_list @@ -708,17 +735,20 @@ def generate_edges(self): logger.debug("Processing Pore-C table ...") if len(self.pore_c_tables) > 1: for i, pore_c_table in enumerate(self.pore_c_tables): - args.append((pore_c_table, self.contig_idx, threads_2, + args.append((pore_c_table, self.contig_idx, + self.contigsizes, threads_2, self.min_order, self.max_order, - self.min_alignments, self.is_parquet)) + self.min_alignments, self.is_parquet, + self.edge_length)) res = Parallel(n_jobs=threads_1)( - delayed(process_pore_c_table)(i, j, k, l, m, n, o) - for i, j, k, l, m, n, o in args) + delayed(process_pore_c_table)(*a) for a in args) else: - res = [process_pore_c_table(self.pore_c_tables[0], self.contig_idx, threads_2, + res = [process_pore_c_table(self.pore_c_tables[0], self.contig_idx, + self.contigsizes, threads_2, self.min_order, self.max_order, - self.min_alignments, self.is_parquet)] + self.min_alignments, self.is_parquet, + self.edge_length)] idx = 0 mapping_quality_res = [] diff --git a/cphasing/hyperpartition.py b/cphasing/hyperpartition.py index e4b0d12..763e574 100644 --- a/cphasing/hyperpartition.py +++ b/cphasing/hyperpartition.py @@ -38,7 +38,7 @@ PruneTable ) from .kprune import KPruneHyperGraph -from .utilities import list_flatten, run_cmd +from .utilities import list_flatten, run_cmd, to_humanized2 from ._config import * logger = logging.getLogger(__name__) @@ -93,6 +93,7 @@ def __init__(self, edges, alleles_window_size=19, alleles_minimum_similarity=0.5, alleles_diff_thres=0.2, + alleles_trim_length=25000, alleletable=None, prunetable=None, normalize=False, @@ -139,6 +140,8 @@ def __init__(self, edges, self.alleles_window_size = alleles_window_size self.alleles_minimum_similarity = alleles_minimum_similarity self.alleles_diff_thres = alleles_diff_thres + self.alleles_trim_length = alleles_trim_length + self.prunetable = prunetable # self.alleletable = AlleleTable(alleletable, sort=False, fmt='allele2') if alleletable else None if alleletable: @@ -510,7 +513,7 @@ def single_partition(self, k=None, sort_by_length=True, sort_group=False): self.H, _, _ = extract_incidence_matrix2(self.H, retain_idx) self.vertices = self.vertices[retain_idx] - logger.info(f"Removed {raw_contig_counts - contig_counts} contigs that self edge weight < {self.min_cis_weight} (--min-cis-weight).") + logger.info(f"Removed {raw_contig_counts - contig_counts:,} contigs that self edge weight < {self.min_cis_weight} (--min-cis-weight).") idx_to_vertices = self.idx_to_vertices @@ -652,7 +655,7 @@ def _incremental_partition( if contig_counts == 0: return None, None, None if (raw_contig_counts - contig_counts) > 0: - logger.info(f"Removed {raw_contig_counts - contig_counts} contigs that self edge weight < {min_cis_weight} (--min-cis-weight).") + logger.info(f"Removed {raw_contig_counts - contig_counts:,} contigs that self edge weight < {min_cis_weight} (--min-cis-weight).") sub_H, _, sub_edge_idx = extract_incidence_matrix2(H, K) @@ -683,6 +686,7 @@ def _incremental_partition( sub_vertives_idx, kprune_norm_method=kprune_norm_method, threads=threads*2) + elif prune_pair_df is not None: sub_prune_pair_df = prune_pair_df.reindex(list(permutations(K, 2))).dropna().reset_index() @@ -740,10 +744,10 @@ def _incremental_partition( lambda x: sub_vertices_new_idx_sizes.reindex(x).sum().values[0] \ >= min_scaffold_length, new_K) ) - filtered_K = list(filter( - lambda x: sub_vertices_new_idx_sizes.reindex(x).sum().values[0] \ - < min_scaffold_length, new_K) - ) + # filtered_K = list(filter( + # lambda x: sub_vertices_new_idx_sizes.reindex(x).sum().values[0] \ + # < min_scaffold_length, new_K) + # ) new_K = list(map(list, new_K)) tmp_resolution += 0.2 @@ -892,8 +896,9 @@ def kprune(self, alleletable, first_cluster_file=None, contacts=None, is_run=Tru @staticmethod def alleles(fasta, first_cluster, k=19, w=19, m=0.5, d=0.2, c=60, - threads=4): + tl=25000, threads=4): from .cli import alleles + cwd = Path.cwd() try: alleles.main( args=[ @@ -907,7 +912,8 @@ def alleles(fasta, first_cluster, "-k", str(k), "-w", str(w), "-m", str(m), - "-d", str(d) + "-d", str(d), + "-tl", str(tl), ], prog_name='alleles' ) @@ -920,6 +926,7 @@ def alleles(fasta, first_cluster, if exit_code != 0: raise e + fasta_prefix = Path(Path(fasta).name).with_suffix("") while fasta_prefix.suffix in {".fasta", "gz", "fa", ".fa", ".gz"}: @@ -971,16 +978,18 @@ def incremental_partition(self, k, first_cluster=None): allelic_factor=self.allelic_factor, min_weight=self.min_weight) + dia = A.diagonal() + raw_contig_counts = A.shape[0] retain_idx = np.where(dia > self.min_cis_weight )[0] contig_counts = len(retain_idx) - + if len(retain_idx) < raw_contig_counts: A = A[retain_idx, :][:, retain_idx] self.H, _, _ = extract_incidence_matrix2(self.H, retain_idx) self.vertices = self.vertices[retain_idx] - logger.info(f"Removed {raw_contig_counts - contig_counts} contigs that self edge weight < {self.min_cis_weight} (--min-cis-weight).") + logger.info(f"Removed {raw_contig_counts - contig_counts:,} contigs that self edge weight < {self.min_cis_weight} (--min-cis-weight).") vertices_idx_sizes = self.vertices_idx_sizes vertices_idx_sizes = pd.DataFrame(vertices_idx_sizes, index=['length']).T @@ -994,7 +1003,7 @@ def incremental_partition(self, k, first_cluster=None): while result_K_length < k[0] and k[0] != 0: if tmp_resolution > 10.0 or auto_round > 50: break - logger.info(f"Automatic search best resolution ... {tmp_resolution:.1f}") + logger.info(f"Automatic search for best resolution ... {tmp_resolution:.1f}") _, A, _, self.K = IRMM(self.H, A, self.NW, None, None, self.allelic_factor, self.cross_allelic_factor, tmp_resolution, @@ -1127,18 +1136,22 @@ def incremental_partition(self, k, first_cluster=None): results = [] self.exclude_groups = [] idx_to_vertices = self.idx_to_vertices - + if self.fasta is not None and self.alleletable is None and self.prunetable is None: + self.alleletable = str(Path(self.alleles(self.fasta, first_cluster_file, k=self.alleles_kmer_size, w=self.alleles_window_size, d=self.alleles_diff_thres, m=self.alleles_minimum_similarity, + tl=self.alleles_trim_length, threads=self.threads)).absolute()) + + Path("kprune_workdir").mkdir(exist_ok=True) os.chdir("kprune_workdir") - + for num, sub_k in enumerate(self.K, 1): if isinstance(k[1], dict): sub_group_number = int(k[1][num - 1]) @@ -1167,8 +1180,9 @@ def incremental_partition(self, k, first_cluster=None): self.kprune_norm_method, sub_threads)) # results.append(HyperPartition._incremental_partition(args[-1]) - + with parallel_backend('loky'): + try: results = Parallel(n_jobs=min(self.threads, len(args)), return_as="generator")( delayed(HyperPartition._incremental_partition)(*a) for a in args) @@ -1176,7 +1190,7 @@ def incremental_partition(self, k, first_cluster=None): results = Parallel(n_jobs=min(self.threads, len(args)))( delayed(HyperPartition._incremental_partition)(*a) for a in args) results = list(filter(lambda x: x[2] is not None, results)) - + os.chdir("..") self.sub_A_list, self.cluster_assignments, results = zip(*results) @@ -1813,8 +1827,8 @@ def get_k_size(k, contigsizes, idx_to_vertices): return contigsizes.loc[k].sum().values[0] def filter_cluster(self, verbose=1): - if verbose == 1: - logger.info(f"Removed groups less than {self.min_scaffold_length} in length.") + if verbose == 1: + logger.info(f"Removed groups less than {to_humanized2(self.min_scaffold_length)} in length.") try: self.inc_chr_idx diff --git a/cphasing/mapper.py b/cphasing/mapper.py index 46439ce..180ee89 100644 --- a/cphasing/mapper.py +++ b/cphasing/mapper.py @@ -256,9 +256,13 @@ def __init__(self, reference, read1, read2, else: self._path = self._path.rstrip("_") - - if which('pigz') is None: + if which('crabz') is None and which('pigz') is None: raise ValueError(f"pigz: command not found") + + if which('crabz') is None: + self.compressor = 'pigz' + else: + self.compressor = 'crabz' self.prefix = Path(self.read1.stem).with_suffix('') while self.prefix.suffix in {'.fastq', 'gz', 'fq', '.fq', '.gz', '_R1', '_1', '_2', '.fasta', 'fasta', '.fa'}: @@ -323,7 +327,11 @@ def mapping(self): logger.info("Done.") def compress(self): - cmd = ['pigz', '-f', '-p', str(self.threads), f'{str(self.output_pairs)}'] + if self.compressor == 'crabz': + cmd = ['crabz', '-I', '-p', str(self.threads), f'{str(self.output_pairs)}'] + else: + cmd = ['pigz', '-f', '-p', str(self.threads), f'{str(self.output_pairs)}'] + flag = run_cmd(cmd, log=f'{str(self.log_dir)}/{self.prefix}.compress.log') assert flag == 0, "Failed to execute command, please check log." logger.info("Done.") diff --git a/cphasing/pipeline/pipeline.py b/cphasing/pipeline/pipeline.py index 1257531..0c0656c 100644 --- a/cphasing/pipeline/pipeline.py +++ b/cphasing/pipeline/pipeline.py @@ -73,15 +73,18 @@ def run(fasta, alleles_window_size=19, alleles_minimum_similarity=0.2, alleles_diff_thres=0.1, - scaffolding_method="haphic", + alleles_trim_length=25000, + scaffolding_method="precision", n="", use_pairs=False, + edge_length=2e6, resolution1=1, resolution2=1, init_resolution1=1, init_resolution2=1, first_cluster=None, normalize=False, + allelic_factor=-1, disable_merge_in_first=False, exclude_group_to_second=None, exclude_group_from_first=None, @@ -99,6 +102,8 @@ def run(fasta, whitelist=None, blacklist=None, binsize="500k", + colormap="viridis", + whitered=False, low_memory=False, outdir="cphasing_output", threads=4): @@ -220,10 +225,13 @@ def run(fasta, allele_table = None logger.info("The mode is `haploid`, skip step '1. alleles.'") elif mode == "phasing": - skip_steps.add("1") - allele_table = None - logger.info("The mode is `phasing`, the step '1. alleles' will be intergated into '3. hyparpartion'.") - + if "1" in steps: + skip_steps.add("1") + allele_table = None + logger.info("The mode is `phasing`, the step '1. alleles' will be intergated into '3. hyparpartion'.") + else: + allele_table = None + filtered_pairs = None if ul_data: @@ -422,10 +430,11 @@ def run(fasta, corrected_items = (f"{fasta_prefix}.chimeric.contigs.bed", f"{fasta_prefix}.corrected.fasta", f"{pairs_prefix}.corrected.pairs.gz") - if not all(map(lambda x: Path(x).exists, corrected_items)): + if not all(map(lambda x: Path(x).exists(), corrected_items)): corrected_items = () - if corrected_items and not Path(f"{porec_prefix}.corrected.porec.gz").exists: + + if corrected_items and not Path(f"{porec_prefix}.corrected.porec.gz").exists(): break_bed, fasta, pairs = corrected_items corrected = True cmd = ["cphasing-rs", "porec-break", f"../{porec_table}", @@ -474,8 +483,9 @@ def run(fasta, corrected_items = (f"{fasta_prefix}.chimeric.contigs.bed", f"{fasta_prefix}.corrected.fasta", f"{pairs_prefix}.corrected.pairs.gz") - if not all(map(lambda x: Path(x).exists, corrected_items)): + if not all(map(lambda x: Path(x).exists(), corrected_items)): corrected_items = () + logger.warning("The corrected files are not exists, please specify `--chimeric-correct`.") else: logger.info("Using exists corrected results.") @@ -510,6 +520,14 @@ def run(fasta, Path(fasta).symlink_to(f"{correct_dir}/{fasta}") except FileExistsError: pass + + contigsizes = f"{fasta_prefix}.contigsizes" + if not Path(contigsizes).exists() or is_empty(contigsizes): + cmd = ["cphasing-rs", "chromsizes", fasta, "-o", contigsizes] + flag = run_cmd(cmd, log=os.devnull) + assert flag == 0, "Failed to execute command, please check log." + + fasta_path = Path(fasta).absolute() else: pass # logger.info(f"Do not need correct, removed `{correct_dir}`") @@ -758,6 +776,8 @@ def run(fasta, hcr_upper, "-l", hcr_lower, + "-cr", + collapsed_contig_ratio, # "-q", # min_quality1, hcr_invert_string] @@ -778,6 +798,8 @@ def run(fasta, hcr_upper, "-l", hcr_lower, + "-cr", + collapsed_contig_ratio, # "-q", # min_quality1, ] @@ -797,6 +819,20 @@ def run(fasta, hcr_bed = f"{pairs_prefix}.{hcr_bs}.hcr.bed" + with open("hcr.cmd.sh", 'w') as _out_sh: + _out_sh.write("cphasing hcr \\") + _out_sh.write("\n") + + _hcr_args = [] + for i in range(0, len(args), 2): + if args[i+1] is not None: + _hcr_args.append(args[i]) + _hcr_args.append(args[i+1]) + + _out_sh.write(" ".join(pretty_cmd(map(str, _hcr_args)))) + _out_sh.write("\n") + + _mode = mode _mode = "phasing" if _mode == "phasing2" else _mode hyperpartition_args = [ @@ -806,6 +842,8 @@ def run(fasta, input_param, "--mode", _mode, + "-e", + edge_length, "-c", hyperpartition_contacts, "-r1", @@ -847,6 +885,9 @@ def run(fasta, "-n", n, ] + + if allelic_factor != -1: + hyperpartition_args.extend(["-af", allelic_factor]) if disable_merge_in_first: hyperpartition_args.append("--disable-merge-in-first") @@ -863,6 +904,11 @@ def run(fasta, if mode == "phasing": hyperpartition_args.extend(["-f", fasta_path]) + hyperpartition_args.extend(["--alleles-k", alleles_kmer_size, + "--alleles-w", alleles_window_size, + "--alleles-minimum-similarity", alleles_minimum_similarity, + "--alleles-diff-thres", alleles_diff_thres, + "--alleles-trim-length", alleles_trim_length]) else: hyperpartition_args.extend(["-at", @@ -1057,7 +1103,7 @@ def run(fasta, f"../{contigsizes}", f"../{scaffolding_dir}/{input_agp}", min_quality1, init_binsize, - binsize) + binsize, colormap, whitered) args = [ "-a", f"../{scaffolding_dir}/{input_agp}", @@ -1070,7 +1116,12 @@ def run(fasta, "-oc", ] - + if colormap != "redp1_r": + args.extend(["--colormap", colormap]) + if whitered: + args.append("--whitered") + + try: plot.main(args=args, prog_name='plot') @@ -1091,7 +1142,7 @@ def run(fasta, monitor.join() peak_memory = max(monitor.memory_buffer) / 1024 / 1024 / 1024 - logger.info(f"Pipeline finished in {today}. Elapsed time {end_time:.2f} s. " + logger.info(f"The pipeline finished in {today}. Elapsed time {end_time:.2f} s. " f"Peak memory: {peak_memory:.2f} Gb" ) logger.info(f"Results are store in `{outdir}`") diff --git a/cphasing/plot.py b/cphasing/plot.py index 94d1c2d..d2e5843 100644 --- a/cphasing/plot.py +++ b/cphasing/plot.py @@ -623,7 +623,7 @@ def func(row): return chrRangeID(row.values) return f'{outprefix}.chrom.cool' -def coarsen_matrix(cool, k, out, threads): +def coarsen_matrix(cool, cool_binsize, k, out, threads): """ coarsen a matrix @@ -649,7 +649,7 @@ def coarsen_matrix(cool, k, out, threads): """ from cooler.cli.coarsen import coarsen if not out: - input_resolution = cooler.Cooler(cool).binsize + input_resolution = cool_binsize out_resolution = f"{k * input_resolution}" out = cool.replace(str(input_resolution), to_humanized2(out_resolution)) @@ -1182,9 +1182,9 @@ def plot_heatmap_core(matrix, cbar.set_label("Contacts", fontsize=12) if add_lines and chrom_offset: - ax.hlines(chrom_offset[1:], *ax.get_xlim(), + ax.hlines(np.array(chrom_offset[1:-1]) - 0.5, *ax.get_xlim(), linewidth=0.5, color='black', linestyles="--") - ax.vlines(chrom_offset[1:], *ax.get_ylim(), + ax.vlines(np.array(chrom_offset[1:-1]) - 0.5, *ax.get_ylim(), linewidth=0.5, color='black', linestyles="--") if chrom_offset: diff --git a/cphasing/prepare.py b/cphasing/prepare.py index 92872f6..0dd5ad8 100644 --- a/cphasing/prepare.py +++ b/cphasing/prepare.py @@ -17,6 +17,7 @@ from pathlib import Path from .utilities import ( + decompress_cmd, digest, get_contig_length, read_fasta, @@ -96,12 +97,20 @@ def pipe(fasta, pairs, pattern="AAGCTT", min_mapq=0, min_contacts=3, ## pairs2clm if not skip_pairs2clm: - cmd = ["cphasing-rs", "pairs2clm", str(pairs), "-c", str(min_contacts), - "-t", str(threads), "-o", f"{outprefix}.clm.gz", "-q", str(min_mapq)] - # if low_memory: - # cmd.append("--low-memory") + if str(pairs).endswith('.gz'): + cmd0 = decompress_cmd(str(pairs), str(threads)) + logger.info(f"\nGenerating clm file from pairs file `{pairs}`") + cmd = ["cphasing-rs", "pairs2clm", "-", "-c", str(min_contacts), + "-t", str(threads), "-o", f"{outprefix}.clm.gz", "-q", str(min_mapq)] + + # print(" ".join(cmd0) + f" 2>{log_dir}/prepare.decompress.log" + " | " + " ".join(cmd) + f" 2>{log_dir}/prepare.pairs2clm.log") + flag = os.system(" ".join(cmd0) + f" 2>{log_dir}/prepare.decompress.log" + " | " + " ".join(cmd) + f" 2>{log_dir}/prepare.pairs2clm.log") + else: + cmd = ["cphasing-rs", "pairs2clm", str(pairs), "-c", str(min_contacts), + "-t", str(threads), "-o", f"{outprefix}.clm.gz", "-q", str(min_mapq)] - flag = run_cmd(cmd, log=f'{log_dir}/prepare.pairs2clm.log') + flag = run_cmd(cmd, log=f'{log_dir}/prepare.pairs2clm.log') + assert flag == 0, "Failed to execute command, please check log." # ## pairs2split_contacts @@ -117,3 +126,60 @@ def pipe(fasta, pairs, pattern="AAGCTT", min_mapq=0, min_contacts=3, split_contacts_to_contacts(f"{outprefix}.split.contacts", f"{outprefix}.contacts" ) logger.info(f"Output contacts `{outprefix}.contacts`") + + +def pairs2depth(pq): + """ + backend function for pairs2depth + """ + schema = { + "read_idx": pl.Utf8, + "chrom1": pl.Utf8, + "pos1": pl.UInt32, + "chrom2": pl.Utf8, + "pos2": pl.UInt32, + "strand1": pl.Categorical, + "strand2": pl.Categorical, + "mapq": pl.UInt8 + } + + os.environ["POLARS_MAX_THREADS"] = str(10) + df = pl.scan_parquet(f"{pq}", schema=schema) + + df = df.select([ + "chrom1", "pos1", "chrom2", "pos2", "mapq" + ]) + df = (df.filter(pl.col("mapq") >= 0) + .with_columns( + ((pl.col("pos1") - 1) // 10000).alias("pos1"), + ((pl.col("pos2") - 1) // 10000).alias("pos2"), + ) + ) + + df1 = df.select(["chrom1", "pos1", "mapq"]).group_by(["chrom1", "pos1"]).agg([ + pl.col("mapq").count().alias("count") + ]) + + df2 = df.select(["chrom2", "pos2", "mapq"]).group_by(["chrom2", "pos2"]).agg([ + pl.col("mapq").count().alias("count") + ]) + + + ## merge the two dataframes + df = df1.join(df2, left_on=["chrom1", "pos1"], right_on=["chrom2", "pos2"], how="full") + + df = df.drop_nulls(subset=["chrom1", "pos1", "chrom2", "pos2"]) + df = df.with_columns( + (pl.col("count") + pl.col("count_right")).alias("count") + ) + df = df.drop(["count_right"]) + df = df.with_columns( + ((pl.col("pos1") + 1) * 10000 - 10000).alias("start"), + ((pl.col("pos1") + 1) * 10000).alias("end") + ).drop(["pos1"]) + + df = df.sort(["chrom1", "start"]) + df = df.select(["chrom1", "start", "end", "count"]) + + df.collect().write_csv("output.csv", separator="\t", include_header=False) + \ No newline at end of file diff --git a/cphasing/utilities.py b/cphasing/utilities.py index 8fbff60..5c52c15 100644 --- a/cphasing/utilities.py +++ b/cphasing/utilities.py @@ -109,6 +109,48 @@ def cmd_exists(program): from shutil import which return which(program) is not None +def choose_compressor(): + if not cmd_exists('crabz') and not cmd_exists('pigz'): + raise ValueError(f"pigz or crabz: command not found") + + if cmd_exists('crabz'): + return 'crabz' + else: + if cmd_exists('pigz'): + return 'pigz' + else: + raise ValueError(f"pigz: command not found") + +def compress_cmd(infile, threads=10): + compressor = choose_compressor() + + if not Path(infile).exists(): + raise FileNotFoundError(f"File `{infile}` not found") + + if not Path(infile).is_file(): + raise ValueError(f"File `{infile}` is not a file") + + if compressor == 'crabz': + cmd = ["crabz", "-p", str(threads), infile] + else: + cmd = ["pigz", "-c", "-p", str(threads), infile] + + return cmd + +def decompress_cmd(infile, threads=10): + compressor = choose_compressor() + + if infile.endswith(".gz"): + if compressor == "crabz": + cmd = ["crabz", "-d", "-p", str(threads), infile] + else: + cmd = ["pigz", "-c", "-d", "-p", str(threads), infile] + else: + cmd = ["cat", infile] + + return cmd + + def run_cmd(command, log=sys.stderr, out2err=False): """ run command on shell @@ -180,6 +222,8 @@ def is_compressed_table_empty(_file): return True except pd.errors.EmptyDataError: return True + except UnicodeDecodeError: + raise ValueError(f"File `{_file}` is compressed file, you should add `.gz` suffix.") return False @@ -738,7 +782,12 @@ def humanized2numeric(size): >>> humanized2numeric("10k") 10000 """ - if not isinstance(size, str): + if isinstance(size, int): + return size + elif isinstance(size, float): + return int(size) + + elif not isinstance(size, str): logger.error("Please input correct string") raise ValueError("Value error") @@ -747,18 +796,20 @@ def humanized2numeric(size): try: if size[-1] == "k": - res_size = int(size[:-1]) * 1000 + res_size = float(size[:-1]) * 1000 elif size[-1] == "m": - res_size = int(size[:-1]) * 1000000 + res_size = float(size[:-1]) * 1000000 elif size[-1] == "g": - res_size = int(size[:-1]) * 1000000000 + res_size = float(size[:-1]) * 1000000000 else: - res_size = int(size) + res_size = float(size) except ValueError: logger.error("Please input correct string") raise ValueError("Value error") + res_size = int(res_size) + return res_size def to_humanized(size): @@ -791,7 +842,7 @@ def to_humanized2(size): 1 Mbp """ size = int(size) - if size <= 1e3: + if size < 1e3: label = "{:,.0f}".format((size)) + "" elif size < 1e6: label = "{:,.0f}".format((size / 1e3)) + "k" @@ -1064,9 +1115,17 @@ def generate_to_hic_cmd(agp, pairs, mapq=1, n=0, _3ddna_path="~/software/3d-dna" def generate_plot_cmd(pairs, pairs_prefix, contigsizes, agp, min_quality, init_binsize, - binsize, output="plot.cmd.sh"): + binsize, colormap, whitered, + output="plot.cmd.sh"): out_heatmap = f"groups.q${{min_quality}}.${{heatmap_binsize}}.wg.png" + plot_another_args = "" + if colormap != "redp1_r": + plot_another_args += f" -cm {colormap}" + if whitered: + plot_another_args += " --whitered" + + cmd = f"""#!/usr/bin/bash min_quality={min_quality} @@ -1075,7 +1134,7 @@ def generate_plot_cmd(pairs, pairs_prefix, contigsizes, agp, cphasing pairs2cool {pairs} {contigsizes} {pairs_prefix}.q${{min_quality}}.${{cool_binsize}}.cool -q ${{min_quality}} -bs ${{cool_binsize}} -cphasing plot -a {agp} -m {pairs_prefix}.q${{min_quality}}.${{cool_binsize}}.cool -o {out_heatmap} -bs ${{heatmap_binsize}} -oc +cphasing plot -a {agp} -m {pairs_prefix}.q${{min_quality}}.${{cool_binsize}}.cool -o {out_heatmap} -bs ${{heatmap_binsize}} -oc {plot_another_args} """ with open(output, 'w') as out: diff --git a/docs/CLI/alleles.md b/docs/CLI/alleles.md index dca561c..20fc6a7 100644 --- a/docs/CLI/alleles.md +++ b/docs/CLI/alleles.md @@ -1,5 +1,22 @@ Alleles aims to identify the allelic contig pairs by pairwise comparison. + +## Examples +### Total length of genome < 10 Gb +```shell +cphasing alleles -f draft.asm.fasta +``` +### Total length of genome >= 10 Gb +```shell +cphasing alleles -f draft.asm.fasta -k 27 -w 14 +``` + + + + +## Parameters + ```shell title="cphasing alleles -h" Usage: cphasing alleles [OPTIONS] @@ -38,16 +55,3 @@ Alleles aims to identify the allelic contig pairs by pairwise comparison. ``` - -## Examples -### Total length of genome < 10 Gb -```shell -cphasing alleles -f draft.asm.fasta -``` -### Total length of genome >= 10 Gb -```shell -cphasing alleles -f draft.asm.fasta -k 27 -w 14 -``` - - diff --git a/docs/CLI/mapper.md b/docs/CLI/mapper.md index d57d3a1..03df3c6 100644 --- a/docs/CLI/mapper.md +++ b/docs/CLI/mapper.md @@ -1,5 +1,48 @@ -`cphasing mapper` is designed for process Pore-C data, it output two file: (1) porec table (`.porec.gz`) which contain high-order contacts and (2) 4DN pairs (`pairs.gz`) which only retain VPCs. +`cphasing mapper` is designed for process Pore-C data, its output two files: + (1) porec table (`.porec.gz`) which contain high-order contacts and (2) 4DN pairs (`pairs.gz`) which only retain VPCs. + +## Examples + +### process one cell porec data +```shell +cphasing mapper draft.contigs.fasta sample.porec.fastq.gz -t 40 +``` + +### process multiple cell porec data +#### Process together +```shell +cphasing mapper draft.contigs.fasta sample1.porec.fastq.gz sample2.porec.fastq.gz -t 40 +``` +or +```shell +cphasing mapper draft.contigs.fasta sample*.porec.fastq.gz -t 40 +``` +!!! note + It will output results using the `sample1.porec` as the prefix + + +#### Submit each cell to the cluster + +- use multiple scripts to run mapper for each sample +```shell title="run_sample1.sh" +cphasing mapper draft.contigs.fasta sample1.porec.fastq.gz -t 40 +``` +```shell title="run_sample2.sh" +cphasing mapper draft.contigs.fasta sample2.porec.fastq.gz -t 40 +``` +```shell title="run_sample3.sh" +cphasing mapper draft.contigs.fasta sample3.porec.fastq.gz -t 40 +``` + +- merge results +```shell +cphasing-rs porec-merge sample1.porec.porec.gz sample2.porec.porec.gz sample3.porec.porec.gz -o sample.merge.porec.gz +cphasing-rs pairs-merge sample1.porec.pairs.gz sample2.porec.pairs.gz sample3.porec.pairs.gz -o sample.merge.pairs.gz +``` + + +## Parameters ```shell title="cphasing mapper -h" Usage: cphasing mapper [OPTIONS] REFERENCE FASTQ... @@ -51,42 +94,3 @@ │ --help -h,-help Show this message and exit. │ ╰─────────────────────────────────────────────────────────────────────────╯ ``` - -## Examples - -### process one cell porec data -```shell -cphasing mapper draft.contigs.fasta sample.porec.fastq.gz -t 40 -``` - -### process multiple cell porec data -#### Process together -```shell -cphasing mapper draft.contigs.fasta sample1.porec.fastq.gz sample2.porec.fastq.gz -t 40 -``` -or -```shell -cphasing mapper draft.contigs.fasta sample*.porec.fastq.gz -t 40 -``` -!!! note - It will output results using the `sample1.porec` as the prefix - - -#### Submit each cell to the cluster - -- use multiple scripts to run mapper for each sample -```shell title="run_sample1.sh" -cphasing mapper draft.contigs.fasta sample1.porec.fastq.gz -t 40 -``` -```shell title="run_sample2.sh" -cphasing mapper draft.contigs.fasta sample2.porec.fastq.gz -t 40 -``` -```shell title="run_sample3.sh" -cphasing mapper draft.contigs.fasta sample3.porec.fastq.gz -t 40 -``` - -- merge results -```shell -cphasing-rs porec-merge sample1.porec.porec.gz sample2.porec.porec.gz sample3.porec.porec.gz -o sample.merge.porec.gz -cphasing-rs pairs-merge sample1.porec.pairs.gz sample2.porec.pairs.gz sample3.porec.pairs.gz -o sample.merge.pairs.gz -``` diff --git a/docs/CLI/plot.md b/docs/CLI/plot.md index c2a2606..0adc3f9 100644 --- a/docs/CLI/plot.md +++ b/docs/CLI/plot.md @@ -1,135 +1,189 @@ -```shell title="cphasing plot -h" - - Usage: cphasing plot [OPTIONS] - - Adjust or Plot the contacts matrix after assembling. - - ▌ Usage: - ▌ • adjust the matrix by agp and plot a heatmap - - - cphasing plot -a groups.agp -m sample.10000.cool -o groups.500k.wg.png - - - ▌ • adjust the matrix by agp and plot a 100k resolution heatmap - - - cphasing plot -a groups.agp \ - -m sample.10000.cool \ - -o groups.100k.wg.png \ - -bs 100k - - - ▌ • only plot a heatmap - - - cphasing plot -m sample.100k.cool -o sample.100k.png - - - ▌ • Plot some chromosomes - - - cphasing plot -m sample.100k.cool -c Chr01,Chr02 -o Chr01_Chr02.100k.png - - -╭─ Options of Matrix Operation ────────────────────────────────────────────────╮ -│ * --matrix -m Contacts matrix stored by Cool format. │ -│ (COOL) │ -│ [required] │ -│ --binsize -bs Bin size of the heatmap you want to plot. Enabled │ -│ suffix with k or m. [defalt: input matrix binsize] │ -│ (STR) │ -│ --only-coarsen Only coarsen the input matrix, do not need plot the │ -│ heatmap. │ -│ --balance balance the matrix. │ -│ --balanced Plot balanced values, which need cool have weights │ -│ columns in bins. │ -╰──────────────────────────────────────────────────────────────────────────────╯ -╭─ Options of AGP Adjustment ──────────────────────────────────────────────────╮ -│ --agp -a (AGP) │ -│ --only-adjust Only adjust the matrix by agp, do not need plot the │ -│ heatmap. │ -╰──────────────────────────────────────────────────────────────────────────────╯ -╭─ Options of Heatmap ─────────────────────────────────────────────────────────╮ -│ --chromosomes -c Chromosomes and order in which │ -│ the chromosomes should be │ -│ plotted. Comma seperated. or a │ -│ one column file │ -│ (TEXT) │ -│ --disable-natural-sort -dns Disable natural sort of │ -│ chromosomes, only used for │ -│ --chromosomes or --only-chr │ -│ --per-chromosomes,--per-chromos… -pc Instead of plotting the whole │ -│ matrix, each chromosome is │ -│ plotted next to the other. │ -│ --only-chr -oc Only plot the chromosomes that │ -│ ignore unanchored contigs. When │ -│ --chromosomes specifed, this │ -│ parameter will be ignored. The │ -│ default use prefix of Chr to │ -│ find the chromosomes. │ -│ --chr-prefix can be used to │ -│ change this. │ -│ --chr-prefix -cp Prefix of the chromosomes, only │ -│ used for --only-chr │ -│ (STR) │ -│ [default: Chr] │ -│ --chrom-per-row -cpr Number of chromosome plot in │ -│ each row │ -│ (INTEGER) │ -│ [default: 4] │ -│ --vmin -vmin (FLOAT) │ -│ --vmax -vmax (FLOAT) │ -│ --scale -s Method of contact normalization │ -│ (STR) │ -│ [default: log1p] │ -│ --triangle Plot the heatmap in triangle │ -│ --fontsize Fontsize of the ticks, default │ -│ is auto │ -│ (INT) │ -│ --dpi -dpi Resolution for the image. │ -│ (INTEGER) │ -│ [default: 600] │ -│ --cmap,--colormap -cmap,-cm Colormap of heatmap. Available │ -│ values can be seen : │ -│ https://pratiman-91.github.io/… │ -│ and │ -│ http://matplotlib.org/examples… │ -│ and whitered . │ -│ (TEXT) │ -│ [default: redp1_r] │ -│ --no-lines -nl Don't add dash line in │ -│ chromosome boundaries. │ -│ --no-ticks -nt Don't add ticks both in x axis │ -│ and y axis. │ -│ --rotate-xticks -rx Rotate the x ticks │ -│ [default: True] │ -│ --rotate-yticks -ry Rotate the x ticks │ -╰──────────────────────────────────────────────────────────────────────────────╯ -╭─ Global Options ─────────────────────────────────────────────────────────────╮ -│ --output -o Output path of file. │ -│ (TEXT) │ -│ [default: plot.heatmap.png] │ -│ --threads -t Number of threads. (unused) │ -│ (INT) │ -│ [default: 4] │ -│ --help -help,-h Show this message and exit. │ -╰──────────────────────────────────────────────────────────────────────────────╯ - -``` +## Examples +### Plot chromosome-level heatmap from contig-level mapping result +- Convert pairs to cool file + ```shell + cphasing-rs contigsizes contigs.fa > contigs.contigsizes + cphasing pairs2cool sample.pairs.gz contigs.contigsizes sample.10k.cool + ``` +- Adjust contig-level matrix to chromosome-level by agp file + ```shell + cphasing plot -a groups.agp -m sample.10k.cool -o groups.500k.wg.png + ``` -## Examples -### Plot from `.agp` and contig-level `.cool` -```shell -cphasing plot -a groups.agp -m sample.10000.cool -o groups.500k.wg.png -``` !!! note This function will generate two intermedia `sample.10000.chrom.cool` and `sample.500k.chrom.cool`. -### Plot heatmap from `.cool` file + +
+ +
+ groups.500k.wg.png +
+ + +### Directly plot heatmap from `.cool` file ```shell -cphasing plot -m sample.500k.cool -o groups.500k.wg.png +cphasing plot -m sample.100k.chrom.cool -o groups.100k.wg.png +``` +!!!note + `--binsize` can be specified when you want to plot a larger resolution, e.g., 500k, `plot` will automatically generate a 1m matrix from the input matrix. Still, the binsize of the output matrix should be in integer multiples than the binsize of the input matrix. + + ```shell + cphasing plot -m sample.100k.chrom.cool -o groups.500k.wg.png -bs 500k + ``` + +
+ + +
+ groups.100k.wg.png +         + groups.500k.wg.png + +
+ + + + +### Gallery of different parameters + +![different_parameters](https://pic.superbed.cc/item/67921867fa9f77b4dce3785c.png) + +#### Plot specified chromosome +![chrom_plot](https://pic.superbed.cc/item/67921640fa9f77b4dce3654f.png) +#### Plot all chromosome split in one picture +![per_chrom](https://pic.superbed.cc/item/67921812fa9f77b4dce37540.png) + +## Parameters of plot + +```shell title="cphasing plot -h" + + + Usage: cphasing plot [OPTIONS] + + Adjust or Plot the contacts matrix after assembling. + + ▌ Usage: + ▌ • adjust the matrix by agp and plot a heatmap + + + cphasing plot -a groups.agp -m sample.10000.cool -o groups.500k.wg.png + + + ▌ • adjust the matrix by agp and plot a 100k resolution heatmap + + + cphasing plot -a groups.agp \ + -m sample.10000.cool \ + -o groups.100k.wg.png \ + -bs 100k + + + ▌ • only plot a heatmap + + + cphasing plot -m sample.100k.cool -o sample.100k.png + + + ▌ • Plot some chromosomes + + + cphasing plot -m sample.100k.cool -c Chr01,Chr02 -o Chr01_Chr02.100k.png + + +╭─ Options of Matrix Operation ────────────────────────────────────────────╮ +│ * --matrix -m Contacts matrix stored by Cool format. │ +│ (COOL) │ +│ [required] │ +│ --binsize -bs Bin size of the heatmap you want to plot. │ +│ Enabled suffix with k or m. [defalt: input │ +│ matrix binsize] │ +│ (STR) │ +│ --only-coarsen Only coarsen the input matrix, do not need plot │ +│ the heatmap. │ +│ --balance balance the matrix. │ +│ --balanced Plot balanced values, which need cool have │ +│ weights columns in bins. │ +╰──────────────────────────────────────────────────────────────────────────╯ +╭─ Options of AGP Adjustment ──────────────────────────────────────────────╮ +│ --agp -a (AGP) │ +│ --only-adjust Only adjust the matrix by agp, do not need plot the │ +│ heatmap. │ +╰──────────────────────────────────────────────────────────────────────────╯ +╭─ Options of Heatmap ─────────────────────────────────────────────────────╮ +│ --chromosomes -c Chromosomes and order in │ +│ which the chromosomes should │ +│ be plotted. Comma seperated. │ +│ or a one column file │ +│ (TEXT) │ +│ --regex -r Regular expression of │ +│ chromosomes, only used for │ +│ --chromosomes, e.g. Chr01g.* │ +│ --disable-natural-sort -dns Disable natural sort of │ +│ chromosomes, only used for │ +│ --chromosomes or --only-chr │ +│ --per-chromosomes,--per-chrom… -pc Instead of plotting the whole │ +│ matrix, each chromosome is │ +│ plotted next to the other. │ +│ --only-chr -oc Only plot the chromosomes │ +│ that ignore unanchored │ +│ contigs. When --chromosomes │ +│ specifed, this parameter will │ +│ be ignored. The default use │ +│ prefix of Chr to find the │ +│ chromosomes. --chr-prefix can │ +│ be used to change this. │ +│ --chr-prefix -cp Prefix of the chromosomes, │ +│ only used for --only-chr │ +│ (STR) │ +│ [default: Chr] │ +│ --chrom-per-row -cpr Number of chromosome plot in │ +│ each row │ +│ (INTEGER) │ +│ [default: 4] │ +│ --vmin -vmin (FLOAT) │ +│ --vmax -vmax (FLOAT) │ +│ --scale -s Method of contact │ +│ normalization │ +│ (STR) │ +│ [default: log1p] │ +│ --triangle Plot the heatmap in triangle │ +│ --fontsize Fontsize of the ticks, │ +│ default is auto │ +│ (INT) │ +│ --dpi -dpi Resolution for the image. │ +│ (INTEGER) │ +│ [default: 600] │ +│ --cmap,--colormap -cmap,-cm Colormap of heatmap. │ +│ Available values can be seen │ +│ : │ +│ https://pratiman-91.github.i… │ +│ and │ +│ http://matplotlib.org/exampl… │ +│ and whitered . │ +│ (TEXT) │ +│ [default: redp1_r] │ +│ --whitered -wr Preset of --scale none │ +│ --colormap whitered │ +│ --no-lines -nl Don't add dash line in │ +│ chromosome boundaries. │ +│ --no-ticks -nt Don't add ticks both in x │ +│ axis and y axis. │ +│ --rotate-xticks -rx Rotate the x ticks │ +│ [default: True] │ +│ --rotate-yticks -ry Rotate the x ticks │ +╰──────────────────────────────────────────────────────────────────────────╯ +╭─ Global Options ─────────────────────────────────────────────────────────╮ +│ --output -o Output path of file. │ +│ (TEXT) │ +│ [default: plot.heatmap.png] │ +│ --threads -t Number of threads. Used in matrix balance. │ +│ (INT) │ +│ [default: 8] │ +│ --help -help,-h Show this message and exit. │ +╰──────────────────────────────────────────────────────────────────────────╯ + ``` diff --git a/docs/CLI/plot.zh.md b/docs/CLI/plot.zh.md new file mode 100644 index 0000000..2dbb46f --- /dev/null +++ b/docs/CLI/plot.zh.md @@ -0,0 +1,191 @@ + + +## 示例 +### 从contig水平比对结果绘制热图 +- 将`pairs`文件转成`.cool`矩阵文件 + ```shell + cphasing-rs contigsizes contigs.fa > contigs.contigsizes + cphasing pairs2cool sample.pairs.gz contigs.contigsizes sample.10k.cool + ``` +- 根据AGP文件绘制热图 + 根据AGP文件将contig水平的`.cool`矩阵转成染色体水平,并绘制热图 + + ```shell + cphasing plot -a groups.agp -m sample.10k.cool -o groups.500k.wg.png + ``` + +!!! note + 这步会产生两个文件一个是`sample.10k.chrom.cool` and `sample.500k.chrom.cool`,后续在agp文件不变的情况下,可以使用这两个文件进行热图的调整. + + +
+ +
+ groups.500k.wg.png +
+ + +### 直接使用`.cool` 文件绘制热图 +```shell +cphasing plot -m sample.100k.chrom.cool -o groups.100k.wg.png +``` +!!!note + + 用户可以指定`--binsize`参数来调整热图的分辨率,如果输入矩阵的分辨率跟设定的分辨率不一致,程序会自动生成新的分辨率绘制热图,但需要设定的分辨率和输入矩阵的分辨率呈整数倍的关系。例如,输入100k.cool,想绘制500k分辨率的热图: + + ```shell + cphasing plot -m sample.100k.chrom.cool -o groups.500k.wg.png -bs 500k + ``` + +
+ + +
+ groups.100k.wg.png +         + groups.500k.wg.png + +
+ + + + +### 不同参数绘制的热图 + +![different_parameters](https://pic.superbed.cc/item/67921867fa9f77b4dce3785c.png) + +#### 指定染色体绘制 +![chrom_plot](https://pic.superbed.cc/item/67921640fa9f77b4dce3654f.png) +#### 将每条染色体分开绘制 +![per_chrom](https://pic.superbed.cc/item/67921812fa9f77b4dce37540.png) + +## Parameters of plot + +```shell title="cphasing plot -h" + + + Usage: cphasing plot [OPTIONS] + + Adjust or Plot the contacts matrix after assembling. + + ▌ Usage: + ▌ • adjust the matrix by agp and plot a heatmap + + + cphasing plot -a groups.agp -m sample.10000.cool -o groups.500k.wg.png + + + ▌ • adjust the matrix by agp and plot a 100k resolution heatmap + + + cphasing plot -a groups.agp \ + -m sample.10000.cool \ + -o groups.100k.wg.png \ + -bs 100k + + + ▌ • only plot a heatmap + + + cphasing plot -m sample.100k.cool -o sample.100k.png + + + ▌ • Plot some chromosomes + + + cphasing plot -m sample.100k.cool -c Chr01,Chr02 -o Chr01_Chr02.100k.png + + +╭─ Options of Matrix Operation ────────────────────────────────────────────╮ +│ * --matrix -m Contacts matrix stored by Cool format. │ +│ (COOL) │ +│ [required] │ +│ --binsize -bs Bin size of the heatmap you want to plot. │ +│ Enabled suffix with k or m. [defalt: input │ +│ matrix binsize] │ +│ (STR) │ +│ --only-coarsen Only coarsen the input matrix, do not need plot │ +│ the heatmap. │ +│ --balance balance the matrix. │ +│ --balanced Plot balanced values, which need cool have │ +│ weights columns in bins. │ +╰──────────────────────────────────────────────────────────────────────────╯ +╭─ Options of AGP Adjustment ──────────────────────────────────────────────╮ +│ --agp -a (AGP) │ +│ --only-adjust Only adjust the matrix by agp, do not need plot the │ +│ heatmap. │ +╰──────────────────────────────────────────────────────────────────────────╯ +╭─ Options of Heatmap ─────────────────────────────────────────────────────╮ +│ --chromosomes -c Chromosomes and order in │ +│ which the chromosomes should │ +│ be plotted. Comma seperated. │ +│ or a one column file │ +│ (TEXT) │ +│ --regex -r Regular expression of │ +│ chromosomes, only used for │ +│ --chromosomes, e.g. Chr01g.* │ +│ --disable-natural-sort -dns Disable natural sort of │ +│ chromosomes, only used for │ +│ --chromosomes or --only-chr │ +│ --per-chromosomes,--per-chrom… -pc Instead of plotting the whole │ +│ matrix, each chromosome is │ +│ plotted next to the other. │ +│ --only-chr -oc Only plot the chromosomes │ +│ that ignore unanchored │ +│ contigs. When --chromosomes │ +│ specifed, this parameter will │ +│ be ignored. The default use │ +│ prefix of Chr to find the │ +│ chromosomes. --chr-prefix can │ +│ be used to change this. │ +│ --chr-prefix -cp Prefix of the chromosomes, │ +│ only used for --only-chr │ +│ (STR) │ +│ [default: Chr] │ +│ --chrom-per-row -cpr Number of chromosome plot in │ +│ each row │ +│ (INTEGER) │ +│ [default: 4] │ +│ --vmin -vmin (FLOAT) │ +│ --vmax -vmax (FLOAT) │ +│ --scale -s Method of contact │ +│ normalization │ +│ (STR) │ +│ [default: log1p] │ +│ --triangle Plot the heatmap in triangle │ +│ --fontsize Fontsize of the ticks, │ +│ default is auto │ +│ (INT) │ +│ --dpi -dpi Resolution for the image. │ +│ (INTEGER) │ +│ [default: 600] │ +│ --cmap,--colormap -cmap,-cm Colormap of heatmap. │ +│ Available values can be seen │ +│ : │ +│ https://pratiman-91.github.i… │ +│ and │ +│ http://matplotlib.org/exampl… │ +│ and whitered . │ +│ (TEXT) │ +│ [default: redp1_r] │ +│ --whitered -wr Preset of --scale none │ +│ --colormap whitered │ +│ --no-lines -nl Don't add dash line in │ +│ chromosome boundaries. │ +│ --no-ticks -nt Don't add ticks both in x │ +│ axis and y axis. │ +│ --rotate-xticks -rx Rotate the x ticks │ +│ [default: True] │ +│ --rotate-yticks -ry Rotate the x ticks │ +╰──────────────────────────────────────────────────────────────────────────╯ +╭─ Global Options ─────────────────────────────────────────────────────────╮ +│ --output -o Output path of file. │ +│ (TEXT) │ +│ [default: plot.heatmap.png] │ +│ --threads -t Number of threads. Used in matrix balance. │ +│ (INT) │ +│ [default: 8] │ +│ --help -help,-h Show this message and exit. │ +╰──────────────────────────────────────────────────────────────────────────╯ + +``` diff --git a/docs/basic_usage.md b/docs/basic_usage.md index ebe0146..3edb1b6 100644 --- a/docs/basic_usage.md +++ b/docs/basic_usage.md @@ -86,4 +86,4 @@ Rename and orient chromosome according a monoploid reference (or genome of close cphasing rename -r mono.fasta -f draft.asm.fasta -a groups.review.agp -t 20 ``` !!! note - To reduce the time consumed, we only align the first haplotype (g1) to the monoploid, which the orientation among different haplotypes has already been set to the same in the `scaffolding` step. If not, you can set `—unphased` to align all haplotypes to the monoploid to adjust the orientation. \ No newline at end of file + To reduce the time consumed, we only align the first haplotype (g1) to the monoploid, which the orientation among different haplotypes has already been set to the same in the `scaffolding` step. If not, you can set `—unphased` to align all haplotypes to the monoploid to adjust the orientation. diff --git a/docs/basic_usage.zh.md b/docs/basic_usage.zh.md index 4591641..bc31fa0 100644 --- a/docs/basic_usage.zh.md +++ b/docs/basic_usage.zh.md @@ -62,7 +62,8 @@ cphasing utils agp2assembly groups.agp > groups.assembly bash ~/software/3d-dna/visualize/run-assembly-visualizer.sh sample.assembly sample.mnd.txt ``` !!! note - if chimeric corrected, please use `groups.corrected.agp` and generate a new `corrected.pairs.gz` by `cphasing-rs pairs-break` + 如果contig进行嵌合纠错,请使用`groups.corrected.agp`,并从原始的`pairs.gz`通过`cphasing-rs pairs-break` 生成一个新的 `corrected.pairs.gz`。 + - 调整完后生成agp和fasta文件 ```shell diff --git a/docs/index.md b/docs/index.md index c02d71b..79277b4 100644 --- a/docs/index.md +++ b/docs/index.md @@ -69,7 +69,7 @@ The advantages of `C-Phasing`: - [bedtools](https://bedtools.readthedocs.io/en/latest/) - [seqkit](https://bioinf.shenwei.me/seqkit/) - [pigz](https://github.com/madler/pigz) - 2. For Pore-C pipeline + 2. For Pore-C pipeline - [minimap2](https://github.com/lh3/minimap2)(>= v2.24) @@ -88,6 +88,6 @@ The advantages of `C-Phasing`: --- Tutorials to help you know how to run `cphasing`. - [:octicons-arrow-right-24:Tutorials](tutorials) + [:octicons-arrow-right-24:Tutorials](tutorials/porec/porec_decaploid.md) \ No newline at end of file diff --git a/docs/index.zh.md b/docs/index.zh.md index ab9ff61..fcef72c 100644 --- a/docs/index.zh.md +++ b/docs/index.zh.md @@ -82,7 +82,7 @@ - [bedtools](https://bedtools.readthedocs.io/en/latest/) - [seqkit](https://bioinf.shenwei.me/seqkit/) - [pigz](https://github.com/madler/pigz) - 2. Pore-C流程依赖: + 2. Pore-C流程依赖: - [minimap2](https://github.com/lh3/minimap2)(>= v2.24) @@ -101,6 +101,6 @@ --- 一些案例教你如何运行 `cphasing`. - [:octicons-arrow-right-24:Tutorials](tutorials) + [:octicons-arrow-right-24:Tutorials](tutorials/porec/porec_decaploid.md) \ No newline at end of file diff --git a/docs/tutorials/hic/hic_decaploid.md b/docs/tutorials/hic/hic_decaploid.md new file mode 100644 index 0000000..4a77ae3 --- /dev/null +++ b/docs/tutorials/hic/hic_decaploid.md @@ -0,0 +1,9 @@ +# Assemble decaploid by Hi-C + +Decaploid (2n=8x=80) + + +```shell +cphasing pipeline -f decaploid_hifi.bp.p_utg.fasta -hic1 hic_R1.fastq.gz -hic2 hic_R2.fastq.gz -t 40 -n 8:10 -hcr -p AAGCTT +``` + diff --git a/docs/tutorials/hic/hic_decaploid.zh.md b/docs/tutorials/hic/hic_decaploid.zh.md new file mode 100644 index 0000000..8a682d4 --- /dev/null +++ b/docs/tutorials/hic/hic_decaploid.zh.md @@ -0,0 +1,9 @@ +# 同源十倍体组装(Hi-C) + +同源十倍体 (2n=8x=80) + + +```shell +cphasing pipeline -f decaploid_hifi.bp.p_utg.fasta -hic1 hic_R1.fastq.gz -hic2 hic_R2.fastq.gz -t 40 -n 8:10 -hcr -p AAGCTT +``` + diff --git a/docs/tutorials/hic/hic_sugarcane.md b/docs/tutorials/hic/hic_sugarcane.md new file mode 100644 index 0000000..bf1da2f --- /dev/null +++ b/docs/tutorials/hic/hic_sugarcane.md @@ -0,0 +1,31 @@ +# Assemble hybrid sugarcane by Hi-C + + +Aneuploidy (2n=114), estimate genome size ~ 10 Gb + +## Hi-C data processing +Due to the huge porec data, we mapped each cell to fasta by submitting this command to cluster, respectively. And merged them into one file by `cphasing pairs-merge`. +!!!note + Please submit one first; wait until the index is successfully created and the remaining jobs are enabled to submit to the cluster. + +```shell +cphasing hic mapper -f sh_hifi.bp.p_utg.fasta -hic1 hic-1_R1.fastq.gz -hic1 hic-1_R1.fastq.gz -t 40 -k 27 -w 14 +``` + +```shell +cphasing hic mapper -f sh_hifi.bp.p_utg.fasta -hic1 hic-2_R1.fastq.gz -hic1 hic-2_R1.fastq.gz -t 40 -k 27 -w 14 +cphasing hic mapper -f sh_hifi.bp.p_utg.fasta -hic1 hic-3_R1.fastq.gz -hic1 hic-3_R1.fastq.gz -t 40 -k 27 -w 14 +cphasing hic mapper -f sh_hifi.bp.p_utg.fasta -hic1 hic-4_R1.fastq.gz -hic1 hic-4_R1.fastq.gz -t 40 -k 27 -w 14 +``` + +```shell +cphasing pairs-merge hic-*.pairs.gz -o hic.merge.pairs.gz +``` + + +## Assembling by `cphasing pipeline` +Modern hybrid sugarcane is an aneuploid, which contains an unequal number of chromosomes in each homologous group, so we set `-n 0:0` to automatically output cluster numbers. +```shell +cphasing pipeline -f sh_hifi.bp.p_utg.fasta -pct hic.mrege.pairs.gz -t 40 -n 0:0 -hcr -p AAGCTT +``` + diff --git a/docs/tutorials/hic/hic_sugarcane.zh.md b/docs/tutorials/hic/hic_sugarcane.zh.md new file mode 100644 index 0000000..60d8ae4 --- /dev/null +++ b/docs/tutorials/hic/hic_sugarcane.zh.md @@ -0,0 +1,32 @@ +# 现代栽培甘蔗组装(Hi-C) + + +非整倍 (2n=114), 预估基因组大小 ~ 10 Gb + +## Hi-C数据处理 +由于Hi-C数据量大,我们将每个cell的Hi-C数据分别提交到不同的节点上进行比对,然后再合并成一个文件。 + +!!!note + 因为`hic mapper`里面有一步索引构建,用户需要先提交一个任务,等到索引创建完成再提交剩余的任务。 + +```shell +cphasing hic mapper -f sh_hifi.bp.p_utg.fasta -hic1 hic-1_R1.fastq.gz -hic1 hic-1_R1.fastq.gz -t 40 -k 27 -w 14 +``` + +```shell +cphasing hic mapper -f sh_hifi.bp.p_utg.fasta -hic1 hic-2_R1.fastq.gz -hic1 hic-2_R1.fastq.gz -t 40 -k 27 -w 14 +cphasing hic mapper -f sh_hifi.bp.p_utg.fasta -hic1 hic-3_R1.fastq.gz -hic1 hic-3_R1.fastq.gz -t 40 -k 27 -w 14 +cphasing hic mapper -f sh_hifi.bp.p_utg.fasta -hic1 hic-4_R1.fastq.gz -hic1 hic-4_R1.fastq.gz -t 40 -k 27 -w 14 +``` + +```shell +cphasing pairs-merge hic-*.pairs.gz -o hic.merge.pairs.gz +``` + + +## 组装流程 +现代栽培甘蔗属于非整倍体,不同同源染色体组内的染色体数量不同,因此我们倾向于先让程序自行分组看看(`-n 0:0`)。 +```shell +cphasing pipeline -f sh_hifi.bp.p_utg.fasta -pct hic.mrege.pairs.gz -t 40 -n 0:0 -hcr -p AAGCTT +``` + diff --git a/docs/tutorials/porec/porec_decaploid.md b/docs/tutorials/porec/porec_decaploid.md new file mode 100644 index 0000000..b01fa28 --- /dev/null +++ b/docs/tutorials/porec/porec_decaploid.md @@ -0,0 +1,9 @@ +# Assemble autodecaploid by Pore-C + +Autodecaploid (2n=8x=80) + + +```shell +cphasing pipeline -f decaploid_hifi.bp.p_utg.fasta -pcd porec.fastq.gz -t 40 -n 8:10 -hcr -p AAGCTT +``` + diff --git a/docs/tutorials/porec/porec_decaploid.zh.md b/docs/tutorials/porec/porec_decaploid.zh.md new file mode 100644 index 0000000..b76bf65 --- /dev/null +++ b/docs/tutorials/porec/porec_decaploid.zh.md @@ -0,0 +1,8 @@ +# 同源十倍体组装(Pore-C) + +同源十倍体(2n=8x=80) + +```shell +cphasing pipeline -f decaploid_hifi.bp.p_utg.fasta -pcd porec.fastq.gz -t 40 -n 8:10 -hcr -p AAGCTT +``` + diff --git a/docs/tutorials/porec/porec_sugarcane.md b/docs/tutorials/porec/porec_sugarcane.md index fc6853d..00fe326 100644 --- a/docs/tutorials/porec/porec_sugarcane.md +++ b/docs/tutorials/porec/porec_sugarcane.md @@ -1 +1,26 @@ -# Assemble hybrid sugarcane by Pore-C \ No newline at end of file +# Assemble hybrid sugarcane by Pore-C + + +Aneuploidy (2n=114), estimate genome size ~ 10 Gb + +## Pore-C data processing +Due to the huge porec data, we mapped each cell to fasta by submitting this command to cluster, respectively. And merged them into one file by `cphasing porec-merge` +```shell +cphasing mapper sh_hifi.bp.p_utg.fasta porec-1.fastq.gz -t 40 +cphasing mapper sh_hifi.bp.p_utg.fasta porec-2.fastq.gz -t 40 +cphasing mapper sh_hifi.bp.p_utg.fasta porec-3.fastq.gz -t 40 +cphasing mapper sh_hifi.bp.p_utg.fasta porec-4.fastq.gz -t 40 +cphasing mapper sh_hifi.bp.p_utg.fasta porec-5.fastq.gz -t 40 +``` + +```shell +cphasing porec-merge porec-*.porec.gz -o porec.merge.porec.gz +``` + + +## Assembling by `cphasing pipeline` +Modern hybrid sugarcane is an aneuploid, which contains an unequal number of chromosomes in each homologous group, so we set `-n 0:0` to automatically output cluster numbers. +```shell +cphasing pipeline -f sh_hifi.bp.p_utg.fasta -pct porec.mrege.porec.gz -t 40 -n 0:0 -hcr -p AAGCTT +``` + diff --git a/docs/tutorials/porec/porec_sugarcane.zh.md b/docs/tutorials/porec/porec_sugarcane.zh.md new file mode 100644 index 0000000..7acd595 --- /dev/null +++ b/docs/tutorials/porec/porec_sugarcane.zh.md @@ -0,0 +1,26 @@ +# 现代栽培甘蔗组装(Pore-C) + +非整倍 (2n=114), 预估基因组大小 ~ 10 Gb + +## Pore-C数据处理 +由于Pore-C数据量大,我们将每个cell的Pore-C数据分别提交到不同的节点上进行比对,然后再合并成一个文件。 +`cphasing porec-merge` +```shell +cphasing mapper sh_hifi.bp.p_utg.fasta porec-1.fastq.gz -t 40 +cphasing mapper sh_hifi.bp.p_utg.fasta porec-2.fastq.gz -t 40 +cphasing mapper sh_hifi.bp.p_utg.fasta porec-3.fastq.gz -t 40 +cphasing mapper sh_hifi.bp.p_utg.fasta porec-4.fastq.gz -t 40 +cphasing mapper sh_hifi.bp.p_utg.fasta porec-5.fastq.gz -t 40 +``` + +```shell +cphasing porec-merge porec-*.porec.gz -o porec.merge.porec.gz +``` + + +## 组装流程 +现代栽培甘蔗属于非整倍体,不同同源染色体组内的染色体数量不同,因此我们倾向于先让程序自行分组看看(`-n 0:0`)。 +```shell +cphasing pipeline -f sh_hifi.bp.p_utg.fasta -pct porec.mrege.porec.gz -t 40 -n 0:0 -hcr -p AAGCTT +``` + diff --git a/environment.yml b/environment.yml index 38cf458..c65a6a9 100644 --- a/environment.yml +++ b/environment.yml @@ -41,4 +41,5 @@ dependencies: - wfmash 0.17.0.* - matplotlib >=3.9.3,<4 - line_profiler >=4.1.3,<5 +- crabz >=0.9.0,<0.10 diff --git a/mkdocs.yml b/mkdocs.yml index 2b2d6d9..a749c4e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -116,6 +116,8 @@ plugins: Installation: 安装 Basic Usage: 基本用法 Tutorials: 教程 + Pore-C: Pore-C数据 + Hi-C: Hi-C数据 FAQ: 常见问题 @@ -123,12 +125,13 @@ plugins: nav: - Installation: index.md - Basic Usage: basic_usage.md - - Tutorials: + - Tutorials: - Pore-C: - - tutorials/porec/porec_tetraploid.md + - tutorials/porec/porec_decaploid.md - tutorials/porec/porec_sugarcane.md - Hi-C: - - tutorials/hic/hic_tetraploid.md + - tutorials/hic/hic_decaploid.md + - tutorials/hic/hic_sugarcane.md - CLI: - CLI/mapper.md diff --git a/pixi.lock b/pixi.lock index 02fcdaf..fac6ba6 100644 --- a/pixi.lock +++ b/pixi.lock @@ -53,6 +53,7 @@ environments: - conda: https://conda.anaconda.org/conda-forge/noarch/colormaps-0.4.2-pyhd8ed1ab_0.conda - conda: https://conda.anaconda.org/conda-forge/linux-64/contourpy-1.3.1-py312h68727a3_0.conda - conda: https://conda.anaconda.org/bioconda/noarch/cooler-0.10.2-pyhdfd78af_0.tar.bz2 + - conda: https://conda.anaconda.org/conda-forge/linux-64/crabz-0.9.0-he8a937b_0.conda - conda: https://conda.anaconda.org/conda-forge/noarch/cycler-0.12.1-pyhd8ed1ab_0.conda - conda: https://conda.anaconda.org/conda-forge/linux-64/cyrus-sasl-2.1.27-h54b06d7_7.conda - conda: https://conda.anaconda.org/conda-forge/linux-64/cytoolz-1.0.0-py312h66e93f0_1.conda @@ -1450,6 +1451,20 @@ packages: license_family: BSD size: 88598 timestamp: 1718678114563 +- kind: conda + name: crabz + version: 0.9.0 + build: he8a937b_0 + subdir: linux-64 + url: https://conda.anaconda.org/conda-forge/linux-64/crabz-0.9.0-he8a937b_0.conda + sha256: 78b3b009dc6896c3b10e57f3b5584d13d00e160ffbd497cb006417c62fc4870b + md5: 4e9f28a472472bcd55d32bdde3610313 + depends: + - libgcc-ng >=12 + license: MIT + license_family: MIT + size: 841451 + timestamp: 1710711158236 - kind: conda name: cssselect2 version: 0.2.1 diff --git a/pixi.toml b/pixi.toml index 3c7490b..2b0e014 100644 --- a/pixi.toml +++ b/pixi.toml @@ -51,6 +51,7 @@ pigz = ">=2.8,<3" wfmash = "0.17.0.*" matplotlib = ">=3.9.3,<4" line_profiler = ">=4.1.3,<5" +crabz = ">=0.9.0,<0.10" [feature.docs.system-requirements] diff --git a/scripts/evaluator/stat_misjoin.py b/scripts/evaluator/stat_misjoin.py index 9de284c..38500b4 100755 --- a/scripts/evaluator/stat_misjoin.py +++ b/scripts/evaluator/stat_misjoin.py @@ -75,7 +75,7 @@ def main(args): chrom_sizes = agp_df.groupby(agp_df.index)['end'].max() chrom_sizes.to_csv(f".{output}.sizes", sep='\t', header=None, index=True) - cmd = f"grep -v -w U {args.agp} | grep '^Chr' | bedtools flank -g .{output}.sizes -i /dev/stdin -l 2000 -r 0 | bedtools slop -i /dev/stdin -g .{output}.sizes -r 2000 -l 0 | bedtools intersect -a /dev/stdin -b .{output}.misassembly.txt -wa 2>/dev/null | bedtools merge -i /dev/stdin > {output}.misassembly.txt" + cmd = f"grep -v -w U {args.agp} | bedtools flank -g .{output}.sizes -i /dev/stdin -l 2000 -r 0 | bedtools slop -i /dev/stdin -g .{output}.sizes -r 2000 -l 0 | bedtools intersect -a /dev/stdin -b .{output}.misassembly.txt -wa 2>/dev/null | bedtools merge -i /dev/stdin > {output}.misassembly.txt" os.system(cmd) diff --git a/scripts/evaluator/stat_misjoin2.py b/scripts/evaluator/stat_misjoin2.py new file mode 100755 index 0000000..d47f161 --- /dev/null +++ b/scripts/evaluator/stat_misjoin2.py @@ -0,0 +1,221 @@ +#!/usr/bin/env python +# -*- coding:utf-8 -*- + +""" +stat misjoin points from paftools.js misjoin -e +""" + +import argparse +import logging +import os +import os.path as op +import sys +import io + +import pandas as pd +import numpy as np + +from pathlib import Path +from pyranges import PyRanges +from shutil import which +from subprocess import Popen, PIPE + +from cphasing.utilities import cmd_exists, list_flatten, calculate_Nx_from_contigsizes +from cphasing.agp import import_agp + +def main(args): + p = argparse.ArgumentParser(prog=__file__, + description=__doc__, + formatter_class=argparse.RawTextHelpFormatter, + conflict_handler='resolve') + pReq = p.add_argument_group('Required arguments') + pOpt = p.add_argument_group('Optional arguments') + pReq.add_argument('tsv', + help='tsv from paftools.js misjoin -e ') + pReq.add_argument("agp", help="agp file") + pOpt.add_argument('-m', '--min_count', default=2, type=int, + help='minimum support of read count [default: %(default)s]') + pOpt.add_argument('-h', '--help', action='help', + help='show help message and exit.') + + args = p.parse_args(args) + + + flank_length = 5000 + + cmd = f"grep -v '^C' {args.tsv}" + process = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE) + stdout, stderr = process.communicate() + + csv = io.StringIO(stdout.decode()) + + try: + df = pd.read_csv(csv, sep='\t', header=None, index_col=None, + comment="#", engine="python", usecols=list(range(12))) + except pd.errors.EmptyDataError: + print(f"No. misassemblies: {0}", file=sys.stderr) + return + + df = df[df[0] != "M"] + df = df.sort_values([1, 3]) + candicate_res = [] + for read, tmp_df in df.groupby(1, sort=False): + for i in range(0, len(tmp_df), 2): + tmp_df2 = tmp_df.iloc[i:i+2] + + + if (tmp_df2[5] == '-').all(): + tmp_df2 = tmp_df2.sort_values(3, ascending=False) + + + if tmp_df2.iloc[0][6] > tmp_df2.iloc[1][6]: + res = ( + tmp_df2.iloc[1][6], tmp_df2.iloc[1][8], tmp_df2.iloc[1][8] + flank_length, + tmp_df2.iloc[0][6], tmp_df2.iloc[0][9] - flank_length, tmp_df2.iloc[0][9]) + else: + res = (tmp_df2.iloc[0][6], tmp_df2.iloc[0][9] - flank_length, tmp_df2.iloc[0][9], + tmp_df2.iloc[1][6], tmp_df2.iloc[1][8], tmp_df2.iloc[1][8] + flank_length) + candicate_res.append(res) + # candicate_res.append(res2) + + res_df = pd.DataFrame(candicate_res, columns=['chrom1', 'start1', 'end1', 'chrom2', 'start2', 'end2']) + res_df.reset_index(inplace=True) + + res_df1 = res_df[['chrom1', 'start1', 'end1', 'index']] + res_df1.columns = ["Chromosome", "Start", "End", "Index"] + + res_df2 = res_df[['chrom2', 'start2', 'end2', 'index']] + res_df2.columns = ["Chromosome", "Start", "End", "Index"] + + + # res_df = PyRanges(res_df).merge(count=True).df + # res_df = res_df.sort_values(['Chromosome', 'Start', 'End']) + # res_df = res_df[res_df['Count'] >= args.min_count] + output = Path(args.tsv).stem + # res_df.to_csv(f'.{output}.misassembly.txt', sep='\t', header=None, index=None) + + agp_df, gap_df = import_agp(args.agp) + + _agp_df = agp_df.reset_index() + + contig_size = _agp_df[['id', 'tig_end']].set_index('id') + + tig_df = _agp_df[_agp_df['chrom'] == _agp_df['id']] + unanchored_contigs = set(tig_df['chrom'].values.tolist()) + + chrom_sizes = agp_df.groupby(agp_df.index)['end'].max() + chrom_sizes.to_csv(f".{output}.sizes", sep='\t', header=None, index=True) + + def flanking(row): + row2 = row.copy() + row['start'] = row['start'] - 1 + row['end'] = row['start'] + flank_length + + if row['orientation'] == "-": + row['id'] = f"{row['id']}_R" + else: + + row['id'] = f"{row['id']}_L" + + + row2['start'] = row2['end'] - flank_length + if row2['orientation'] == "-": + row2['id'] = f"{row2['id']}_L" + else: + row2['id'] = f"{row2['id']}_R" + + row = row.rename(index = {"start": "start1", "end": "end1", "id": "id1"}) + row2 = row2.rename(index = {"start": "start2", "end": "end2", "id": "id2"}) + + return pd.concat([row[['start1', 'end1', 'id1']], row2[['start2', 'end2', 'id2']]], axis=0) + + + + target_df = agp_df.apply(flanking, axis=1) + + target_df = pd.concat([target_df[['start1', 'end1', 'id1']].rename(columns={"start1": "start", + "end1": "end", + "id1": "id"}), + target_df[['start2', 'end2', 'id2']].rename(columns={"start2": "start", + "end2": "end", + "id2": "id"})], axis=0) + target_df.reset_index(inplace=True) + + target_df.columns = ["Chromosome", "Start", "End", "ID"] + target_gr = PyRanges(target_df) + + gr1 = PyRanges(res_df1) + gr2 = PyRanges(res_df2) + gr1 = gr1.join(target_gr, report_overlap=True).df[['Chromosome', 'Start_b', "End_b", "Index", "ID", "Overlap"]] + gr2 = gr2.join(target_gr, report_overlap=True).df[['Chromosome', 'Start_b', "End_b", "Index", "ID", "Overlap"]] + + overlap = flank_length // 2 + + + gr1.columns = ['chrom1', 'start1', 'end1', 'index', 'id1', 'overlap'] + gr2.columns = ['chrom2', 'start2', 'end2', 'index', 'id2', 'overlap'] + gr1 = gr1[gr1['overlap'] > overlap] + gr2 = gr2[gr2['overlap'] > overlap] + + + gr1.set_index('index', inplace=True) + gr2.set_index('index', inplace=True) + + res_df = pd.concat([gr1, gr2], axis=1).dropna().drop('overlap', axis=1) + # res_df = res_df.drop_duplicates() + + res_df = res_df[res_df.duplicated()].drop_duplicates() + + unanchor_df = res_df[res_df['chrom1'].isin(unanchored_contigs) | res_df['chrom2'].isin(unanchored_contigs)] + un_unanchor_df = unanchor_df[unanchor_df['chrom1'].isin(unanchored_contigs) & unanchor_df['chrom2'].isin(unanchored_contigs)] + unanchor_df = unanchor_df[~(unanchor_df['chrom1'].isin(unanchored_contigs) & unanchor_df['chrom2'].isin(unanchored_contigs))] + + reported_unanchor_contigs = list_flatten([unanchor_df['chrom1'].values.tolist(), unanchor_df['chrom2'].values.tolist()]) + reported_unanchor_contigs = list(filter(lambda x: x in unanchored_contigs, reported_unanchor_contigs)) + + reported_un_unanchor_contigs = list_flatten([un_unanchor_df['chrom1'].values.tolist(), un_unanchor_df['chrom2'].values.tolist()]) + reported_un_unanchor_contigs = list(filter(lambda x: x in unanchored_contigs, reported_un_unanchor_contigs)) + + + res_df = res_df[~(res_df['chrom1'].isin(unanchored_contigs) | res_df['chrom2'].isin(unanchored_contigs))] + + res_df = res_df.astype({"start1": np.int64, "end1": np.int64, "start2": np.int64, "end2": np.int64}) + res_df.to_csv(f"{output}.misassembly.txt", sep='\t', header=None, index=None) + + res_gap_df = pd.concat([res_df[['chrom1', 'start1', 'end1', 'id1']].rename( + columns={ + "chrom1": "chrom", + "start1": "start", + "end1": "end", + "id1": "id" + } + ), + res_df[['chrom2', 'start2', 'end2', 'id2']].rename( + columns={ + "chrom2": "chrom", + "start2": "start", + "end2": "end", + "id2": "id" + } + )], axis=0).drop_duplicates() + + + contig_size = contig_size.rename(columns={"tig_end": "length"}) + misassembly_contig_n50 = calculate_Nx_from_contigsizes(contig_size.reindex(res_gap_df['id'].str.split("_", n=1).map(lambda x: x[0])), 50) + + print(f"No. of misassembly pairs: {int(len(res_df))}", file=sys.stderr) + print(f"No. of misassembly contig N50: {misassembly_contig_n50}", file=sys.stderr) + print(f"No. of misassembly gaps: {int(len(res_gap_df))}", file=sys.stderr) + print(f"No. of gaps: {int(len(gap_df))}", file=sys.stderr) + print(f"Proportion of false gaps: {int(len(res_gap_df))/int(len(gap_df)):.4f}", file=sys.stderr) + print(f"No. of supported single unanchor contigs: {int(len(reported_unanchor_contigs))}", file=sys.stderr) + print(f"Size. of supported single unanchor contigs: {contig_size.reindex(reported_unanchor_contigs).sum().values[0]}", file=sys.stderr) + print(f"No. of supported unanchor paired contigs: {int(len(reported_un_unanchor_contigs))}", file=sys.stderr) + print(f"Size. of supported unanchor paired contigs: {contig_size.reindex(reported_un_unanchor_contigs).sum().values[0]} ", file=sys.stderr) + print(f"No. of total supported unanchor contigs: {int(len(reported_un_unanchor_contigs)) + int(len(reported_unanchor_contigs))}", file=sys.stderr) + print(f"Size. of total supported unanchor contigs: {contig_size.reindex(reported_unanchor_contigs).sum().values[0] + contig_size.reindex(reported_un_unanchor_contigs).sum().values[0]}", file=sys.stderr) + print(f"No. of total unancored contigs: {int(len(unanchored_contigs))}", file=sys.stderr) + print(f"Size. of total unancored contigs: {contig_size.reindex(unanchored_contigs).sum().values[0]}", file=sys.stderr) + +if __name__ == "__main__": + main(sys.argv[1:]) \ No newline at end of file