C-Phasing

Phasing and scaffolding polyploid genomes based on Pore-C, Ultra-long, or Hi-C data

.

Introduction

One of the major problems with Hi-C scaffolding of polyploid genomes is a large proportion of ambiguous short-read mapping, leading to a high-level of switched or chimeric assemblies. Now, the long-read-based chromosome conformation capture technology, e.g., Pore-C, provides an effective way to overcome this problem. Here, we developed a new pipeline, namely C-Phasing, which is specifically tailored for polyploid phasing by leveraging the advantage of Pore-C data. It also works on Hi-C data and diploid genome assembly.

The advantages of C-Phasing:

High speed.
High anchor rate of genome.
High accuracy of polyploid phasing.

Installation

Via anaconda (recommend)

## Download C-Phasing and install all dependencies
git clone https://github.com/wangyibin/CPhasing.git
cd CPhasing
conda env create -f environment.yml
conda activate cphasing

## Add these command into .bash_profile or .bashrc
export PATH=/path/to/CPhasing/bin:$PATH
export PYTHONPATH=/path/to/CPhasing:$PYTHONPATH
## The hic pipeline require GLIBCXX_3.4.29, or you can add this to your environment (.bash_profile)
export LD_LIBRARY_PATH=/path/to/anaconda3/envs/cphasing/lib:$LD_LIBRARY_PATH

Custom install the dependencies

Install C-Phasing

## Download C-Phasing and install python dependencies
git clone https://github.com/wangyibin/CPhasing.git
cd CPhasing
pip install .

## Add these into .bash_profile or .bashrc
export PATH=/path/to/CPhasing/bin:$PATH

Dependencies

For core function
- bedtools
- seqkit
- pigz
For Pore-C pipeline
- minimap2(>= v2.24)

One command pipeline of C-Phasing

The -n 8:4 parameter of the following commands means assembling a tetraploid (4) with 8 chromosome basic numbers. If you set -n 0:0 means partition in both rounds automatically, also support it set to -n 8:0 or -n 0:4.

Start from a pore-c data

cphasing pipeline -f draft.asm.fasta -pcd sample.fastq.gz -t 10 -n 8:4

Start from multiple pore-c data: specify multiple -pcd parameters.

cphasing pipeline -f draft.asm.fasta -pcd sample1.fastq.gz -pcd sample2.fastq.gz -t 10 -n 8:4

Note: If you want to run on cluster system and submit them to multiple nodes, you can use cphasing mapper and cphasing-rs porec-merge to generate the merged porec.gz file and input it by -pct parameter.

Start from a pore-c table (porec.gz), which is generated by cphasing mapper.

cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -n 8:4

Start from a paired-end Hi-C data

cphasing pipeline -f draft.asm.fasta -hic1 Lib_R1.fastq.gz -hic2 Lib_R2.fastq.gz -t 10 -n 8:4

Note1: If you want to run multiple samples, you can use cphasing hic mapper and cphasing-rs pairs-merge to generate the merged pairs.gz file, and input it by -prs parameter.
Note2: If the total length of your input genome is larger than 8 Gb, the -hic-mapper-k 27 -hic-mapper-w 14 should be specified.

Start from a 4DN pairs file,

cphasing pipeline -f draft.asm.fasta -prs sample.pairs.gz -t 10 -n 8:4

Skip some steps

## skip steps 1.alleles and 2.prepare steps 
cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -ss 1,2

Perform only specified steps

## run 3.hyperpartition 
cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -s 3

Add the -hcr parameter to remove the greedy contacts (several regions contact with the whole genome) to improve the phasing quality.

cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -hcr

Pipeline of Ultra-long data [Optional]

C-Phasing enable to use ultra-long to correct chimeric and identify the high confidence regions (HCRs) to help assembly.

hitig tutorial

Step by step pipeline of Pore-C data

mapping

Mapping pore-c data into contig-level assembly and output the pore-c table and 4DN pairs.
```
## results are `sample.porec.gz` and `sample.pairs.gz`  
cphasing mapper draft.asm.fasta sample.fastq.gz -t 10
```
For Hi-C data please use cphasing hic mapper.

Note: If you are mapping multiple pore-c data, the multiple pairs.gz files should be merged by following steps:
```
cphasing-rs pairs-merge sample1.pairs.gz sample2.pairs.gz -o sample.merge.pairs.gz
```

0_3. hcr (Optional)

Only retain high confidence regions (HCRs) to subsequencing analysis, which will remove greedy contacts (contact to whole genome or multiple chromosomes.)

## results are `sample.hcr.bed`, `sample.hcr.porec.gz` and `sample.hcr.pairs.gz`
cphasing -pct sample.porec.gz -cs drfat.asm.contigsizes

alleles (Optional for phasing mode)

Identify the inter-allelic contig pairs.
```
## result is `draft.asm.allele.table`
cphasing alleles -f draft.asm.fasta
```
Note: the memory usage of this step is depend on the genome size of input fasta, when input the length larger 10 Gb, the peak memory will reach 130 Gb, or more. If you want to reduce the memory usage, you can increase the k and w, for example, when we benchmark a 20 Gb hexaploid, we set the k and w to 27 and 24, respectively.

prepare

Prepare some data for subsequence analysis

## results are `sample.counts_AAGCTT.txt`, `sample.clm.gz`, `sample.split.contacts`, `sample.contacts`
cphasing prepare draft.asm.fasta sample.pairs.gz

hyperpartition

## result is `output.clusters.txt`
## for haploid scaffolding 
cphasing hyperpartition sample.porec.gz draft.asm.contigsizes output.clusters.txt
## for polyploid or diploid phasing must add prune information and use the incremental partition mode
### chromsome number aware, 8:4 indicate that this is a tetraploid with 8 chromosome in each haplotype
cphasing hyperpartition sample.porec.gz draft.asm.contigsizes output.clusters.txt -pt prune.contig.table -inc -n 8:4 -t 4
### auto generate groups
cphasing hyperpartition sample.porec.gz draft.asm.contigsizes output.clusters.txt -pt prune.contig.table -inc -t 4

Note: If you are not satisfied with the number of groups, you can adjust the resolution parameters (-r1 or -r2) to make difference groups. -r1 controls the first cluster while -r2 controls the second cluster in phasing mode. Increasing the resolution can generate more groups while decreasing it will get fewer groups.
If you want to group some contigs, you can input a contig list with the -wl parameter.

scaffolding

## result is `groups.agp`
cphasing scaffolding output.clusters.txt sample.counts_AAGCTT.txt sample.clm.gz -sc sample.split.contacts -f draft.asm.fasta -t 4

## for polyploid can specified allele table to adjust the orientation of different haplotypes
cphasing scaffolding output.clusters.txt sample.counts_AAGCTT.txt sample.clm -at draft.asm.allele.table -sc sample.split.contacts -f draft.asm.fasta

plot

Adjust the contig-level contacts matrix into chromosome-level and plot a heatmap.

## result is `sample.10000.cool` and `groups.wg.png`
cphasing pairs2cool sample.pairs.gz draft.asm.contigsizes sample.10000.cool
cphasing plot -a groups.agp -m sample.10000.cool -o groups.wg.png

plot also enable only plot the heatmap from a matrix file (.cool)

## directly plot the heatmap in the input resolution, 
cphasing plot -m sample.500k.chrom.cool -o sample.500k.chrom.png

## or first coarsen small binsize to larger binsize, e.g. 10k to 500k, specify the `--coarsen` and `-k 50` parameters.

cphasing plot -m sample.10k.chrom.cool -o sample.500k.chrom.png --coarsen -k 50

Note: If the number of bins in matrix is too large (large genome with small binsize), the memory may overflow. To fix it, users can improve the binsize or specified several chromosomes with -c parameter.

cphasing plot -m sample.100k.chrom -c Chr01g1,Chr01g2,Chr01g3,Chr01g4 -o Chr01.100k.png

Curation by Juicebox

generate .assembly and .hic, depend on 3d-dna

cphasing pairs2mnd sample.pairs.gz -o sample.mnd.txt
cphasing utils agp2assembly groups.agp > groups.assembly
bash ~/software/3d-dna/visualize/run-assembly-visualizer.sh sample.assembly sample.mnd.txt

Note: if chimeric corrected, please use groups.corrected.agp and generate a new corrected.pairs.gz by cphasing-rs pairs-break

After curation

## convert assembly to agp
cphasing utils assembly2agp groups.review.assembly -n 8:4 
## or haploid or a homologous group
cphasing utils assembly2agp groups.review.assembly -n 8
## extract contigs from agp 
cphasing agp2fasta groups.review.agp draft.asm.fasta --contigs > contigs.fasta
## extract chromosome-level fasta from agp
cphasing agp2fasta groups.review.agp draft.asm.fasta > groups.review.asm.fasta

Rename

Rename and orient chromosome according a monoploid reference (or genome of closely related species).

cphasing rename -r mono.fasta -f draft.asm.fasta -a groups.review.agp -t 20

Note: To reduce the time consumed, we only align the first haplotype (g1) to the monoploid, which the orientation among different haplotypes has already been set to the same in the scaffolding step. If not, you can set —unphased to align all haplotypes to the monoploid to adjust the orientation.

FAQ

The results of the first round partition are unsatisfactory.

In our two-round partition algorithm, the first round partition depends on the h-trans errors between homologous chromosomes; if you input a contig assembly with low level switch errors or input a high accuracy pore-c data, the h-trans will be not enough to cluster all contigs to correct homologous groups, resulting in unsatisfactory results. You can set the -q1 0 for hyperpartition to increase the rate of h-trans errors. However, this parameter may raise error of out of memory when you input huge pore-c data in porec table or hic contacts in pairs file.

How to set the `-n` parameter when assembling an aneuploid genome.

The aneuploid genome, such as modern cultivated sugarcane, contains unequal homologous chromosomes. The -n parameter can be set to zero (-n 10:0) to automatically partition contigs into different chromosomes within a homologous group.
However, we also allow the user to input a file with two columns: the first column is the index(1-base) of the first round partition, and the second column is the chromosome number of each homologous. And then specified the -n 10:second.number.tsv in cphasing pipeline or cphasing hyperpartition.

second.number.tsv

Name		Name	Last commit message	Last commit date
Latest commit History 335 Commits
bin		bin
cphasing		cphasing
docs		docs
notebooks		notebooks
pictures/logo		pictures/logo
scripts		scripts
src		src
.gitignore		.gitignore
.gitmodules		.gitmodules
.pyproject.toml		.pyproject.toml
CHANGES.md		CHANGES.md
README.md		README.md
environment.yml		environment.yml
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

C-Phasing

Introduction

Installation

Via anaconda (recommend)

Custom install the dependencies

Install C-Phasing

Dependencies

One command pipeline of C-Phasing

Pipeline of Ultra-long data [Optional]

hitig tutorial

Step by step pipeline of Pore-C data

Curation by Juicebox

Rename

FAQ

The results of the first round partition are unsatisfactory.

How to set the `-n` parameter when assembling an aneuploid genome.

About

Releases

Contributors 3

Languages

wangyibin/CPhasing

Folders and files

Latest commit

History

Repository files navigation

C-Phasing

Introduction

Installation

Via anaconda (recommend)

Custom install the dependencies

Install C-Phasing

Dependencies

One command pipeline of C-Phasing

Pipeline of Ultra-long data [Optional]

hitig tutorial

Step by step pipeline of Pore-C data

Curation by Juicebox

Rename

FAQ

The results of the first round partition are unsatisfactory.

How to set the -n parameter when assembling an aneuploid genome.

About

Topics

Resources

Stars

Watchers

Forks

Releases

Contributors 3

Languages

How to set the `-n` parameter when assembling an aneuploid genome.