split large genome fasta and gtf/gff into shorter scaffolds

Chinese documentation: README.CN.md

Large genomes (e.g. wheat) contain very long individual chromosomes, which some software and file formats do not support.
.bai (BAM index file) supports only chromosomes shorter than $2^{29}-1$ bp (about 537 Mb).
.csi extends the limit to $2^{44}-1$ bp, but not all software supports .csi.
.tbi (tabix index, used for variant files) has the same limitation as .bai, and the extended .csi format is not supported by some software (such as GATK).
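
As a rough illustration of the limits above, the following Python sketch flags sequences that are too long for a .bai or .tbi index. It assumes the sequence lengths are already available as (name, length) pairs, for example read from a FASTA index; the function name `sequences_over_limit` and the example lengths are hypothetical and not part of this repository.

```python
# Index limits as stated above.
BAI_TBI_LIMIT = 2**29 - 1   # 536,870,911 bp
CSI_LIMIT = 2**44 - 1


def sequences_over_limit(lengths, limit=BAI_TBI_LIMIT):
    """Return names of sequences that are not shorter than `limit`.

    `lengths` is an iterable of (name, length_in_bp) pairs.
    """
    return [name for name, length in lengths if length >= limit]


# Hypothetical chromosome lengths, for illustration only.
example = [("chrA", 600_000_000), ("chrB", 480_000_000)]
print(sequences_over_limit(example))
```

Any sequence reported by such a check is a candidate for splitting before indexing.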

Summary

This script can:

split the genome sequences (FASTA format) into smaller scaffolds
split points always fall in a gap sequence (a run of Ns of at least a given length)
optionally convert the GTF/GFF chromosome names and feature coordinates at the same time
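
The idea of splitting only at gaps can be sketched in Python as follows. This is a minimal illustration, not the algorithm used by splitLargeGenome.pl: it assumes a greedy strategy that cuts at the start of the last gap that keeps the fragment within the requested length range, and the function names `find_gaps` and `split_at_gaps` are hypothetical.

```python
import re


def find_gaps(seq, num_n=10):
    """Return (start, end) spans (0-based, end-exclusive) of runs of >= num_n Ns."""
    return [m.span() for m in re.finditer(r"[Nn]{%d,}" % num_n, seq)]


def split_at_gaps(seq, num_n=10, min_len=300_000_000, max_len=500_000_000):
    """Greedily split `seq` at gap starts (an assumption of this sketch).

    Returns (start, end) fragment spans, 0-based and end-exclusive.
    The last fragment may be shorter than min_len, and a fragment may
    exceed max_len if no suitable gap exists.
    """
    gaps = find_gaps(seq, num_n)
    fragments = []
    start = 0
    while len(seq) - start > max_len:
        # Gap starts that give a fragment length within [min_len, max_len].
        candidates = [gs for gs, _ in gaps
                      if max(min_len, 1) <= gs - start <= max_len]
        if not candidates:
            break  # no usable gap: keep the remainder as one fragment
        cut = candidates[-1]  # longest fragment that still fits
        fragments.append((start, cut))
        start = cut
    fragments.append((start, len(seq)))
    return fragments


# Toy sequence with two gaps of 10 Ns, and small limits for illustration.
toy = "A" * 30 + "N" * 10 + "C" * 30 + "N" * 10 + "G" * 30
print(split_at_gaps(toy, num_n=10, min_len=20, max_len=50))
```

Cutting only inside runs of Ns means no known sequence, and therefore no annotated feature, is broken by a split.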

Usage

Installation

Download the repository as a zip file, then unzip it:
unzip splitLargeGenome-main.zip
Then run the script:

perl splitLargeGenome-main/splitLargeGenome.pl

Options

    -fa        <file>      required       input genome sequences file, fasta format
    -gxf       <file>      optional       gtf/gff file for the fasta file, default not set
    -out       <str>       required       output file prefix
    -numN      <num>       optional       minimum length of Ns as separator, default 10
    -minlen    <num>       optional       minimum scaffold length in output, default 300000000
    -maxlen    <num>       optional       maximum fragment length in output, default 500000000

Use example

Split a genome into smaller fragments of 300-500 Mb, using runs of at least 10 Ns as separators, and update the corresponding gene.gtf with the new coordinates:

perl  splitLargeGenome.pl  -fa genome.fa -minlen 300000000 -maxlen 500000000 -gxf gene.gtf -out genome.sep  -numN 10

Results

genome.sep.fa         : scaffold sequences with lengths between 300 and 500 Mb
genome.sep.detail.txt : details of the split positions
gene.sep.gtf          : new GTF file with positions updated according to the detail file;
                         created only when -gxf is set
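
The coordinate update for a GTF/GFF record can be sketched in Python as below. This is an illustration under stated assumptions, not the script's implementation: the layout of genome.sep.detail.txt is not described here, so the sketch takes the split positions of one original chromosome as a hypothetical list of (new_name, start, end) spans, and the fragment names such as `chr1_2` are made up.

```python
import bisect


def convert_position(pos, fragments):
    """Map a 1-based position on the original chromosome to (fragment, new_pos).

    `fragments` is a list of (name, start, end) spans on the original
    chromosome: 0-based, end-exclusive, sorted by start (hypothetical layout).
    """
    starts = [start for _, start, _ in fragments]
    i = bisect.bisect_right(starts, pos - 1) - 1
    name, start, end = fragments[i]
    if i < 0 or not (start <= pos - 1 < end):
        raise ValueError("position %d is outside all fragments" % pos)
    return name, pos - start  # still 1-based within the fragment


def convert_gxf_line(line, fragments):
    """Rewrite seqname (column 1), start (column 4) and end (column 5)."""
    fields = line.rstrip("\n").split("\t")
    name_s, new_start = convert_position(int(fields[3]), fragments)
    name_e, new_end = convert_position(int(fields[4]), fragments)
    if name_s != name_e:
        raise ValueError("feature spans a split point")
    fields[0], fields[3], fields[4] = name_s, str(new_start), str(new_end)
    return "\t".join(fields)


# Hypothetical split of one chromosome into three fragments.
frags = [("chr1_1", 0, 30), ("chr1_2", 30, 70), ("chr1_3", 70, 110)]
record = 'chr1\tsrc\tgene\t35\t60\t.\t+\t.\tgene_id "g1";'
print(convert_gxf_line(record, frags))
```

Because splits fall only in gap sequence, a feature is expected to lie entirely within one fragment; the sketch raises an error otherwise rather than guessing.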

report bugs

For suggestions or bug reports, you can:
