py-simRAD

A parallel implementation of simRAD (Lepais & Weir, 2014) in Python using BioPython (Cock et al., 2009) for simulated prediction of loci expected in RADseq.

Installation

The installation is quite easy. The only requirement is BioPython which can be installed using the BioPython official documentation. They generally recommend using the pip package manager:

pip install biopython

For use with conda, BioPython can be installed from conda packages:

conda install -c conda-forge biopython

Usage

1. Catalysis: Generating fragment positions

Before visualization, exporting, or any other summarization of the restriction fragment data, restriction fragment positions must first be generated.

python src/py_simRAD.py catalyze \
    "raw-data/Trichoderma_atroviride_genomic.fna" \
    --enzymes "HhaI;HindIII;NotI"

This will generate a directory containing files (.pos) of restriction fragment position data for each enzyme and enzyme pair. These files are meant to be used as input for each of the export or summary options.

2. Export (reference-required)

The output produced by the python program are just restriction fragment positions per chromosome. For some fancier output related to the reference sequence, there is a special export function that requires an input of a reference sequence.

2A. Export FASTA

One such type of export is FASTA export. This command will generate a multi-FASTA of fragments ranging from 300-600 bp.

python src/py_simRAD.py export \
    "raw-data/Trichoderma_atroviride_genomic.fna" \
    "output-dir" \
    "raw-data/.Trichoderma_atroviride_genomic/HhaI-HindIII.pos" \
    --min 300 \
    --max 600 \
    --type 'fasta'

The result is a FASTA file containing sequences of restriction fragments from the given restriction combination used on the given genome.

HEADER FORMAT     >{ID} {DESCRIPTION} {ENZYME_COMBINATION} {POSITIONS}
EXAMPLE           >CP084935.1 Trichoderma atroviride strain P1 chromosome 1 HhaI-HindIII 0-403

2B. Export GFF

Again, this command is meant to be performed after generation of fragment positions. These features can be filtered according to fragment lengths before export into GFF format and can be used with IGV.

python src/py_simRAD.py export \
    "raw-data/Trichoderma_atroviride_genomic.fna" \
    "output-dir" \
    "raw-data/.Trichoderma_atroviride_genomic/HhaI-HindIII.pos" \
    --min 300 \
    --max 600 \
    --type 'gff'

FORMAT            {chromosome}	{py-simRAD version}	restriction_fragment	{start} {end}	.	+	.
EXAMPLE           CP084935.1	py-simRADv4.1.2	restriction_fragment	0	403	.	+	.

3. Delimited Enzyme Prints

These outputs do not require an input of a reference sequence and print out directly to the console.

3A. Fragment numbers

Use this command to print out a tab-delimited report of number of fragments, especially from a given range of fragment sizes.

python src/py_simRAD.py summary \
    "raw-data/.Trichoderma_atroviride_genomic/HhaI-HindIII.pos" \
    --min 300 \
    --max 600 \

3B. Summary genomic representation

Use this command to print out a tab-delimited report of percent genomic representation, especially from a given range of fragment sizes.

python src/py_simRAD.py summary \
    "raw-data/.Trichoderma_atroviride_genomic/HhaI-HindIII.pos" \
    --min 300 \
    --max 600 \
    --type genomic_rep

enzmye  total repr (%)  CP084935.1      CP084936.1      CP084937.1      CP084938.1      CP084939.1      CP084940.1      CP084941.1
HhaI-HindIII    37.26   37.367  36.282  36.528  36.439  39.51   37.94   37.962

The default behaviour is to print to console with human-readable tab delimiting, but this behaviour may be changed. As delimited output, the console printout of this function can be easily redirected to a file:

python src/py_simRAD.py summary \
    "raw-data/.Trichoderma_atroviride_genomic/HhaI-HindIII.pos" \
    --min 300 \
    --max 600 \
    --delimiter ',' \
    --type genomic_rep \
    > target_file.csv

4. Batched functionality

Each of the export and summary functions have been written so that wildcard (*) file input is allowed.

Summary genomic representation

The following program takes an input of every restriction enzyme combination generated (*) and filters printout according to fragment size and percent genomic representation.

python src/py_simRAD.py summary \
    "raw-data/.Trichoderma_atroviride_genomic/*" \
    --min 300 \
    --max 600 \
    --delimiter ',' \
    --min_rep 5 \
    --max_rep 15 \
    > target_file.csv

Planned features

Maybe make it possible to pipe generated fasta file from generate_sequence.py
Iterative sequence generation and creation similar to the simRAD paper

References

Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., & de Hoon, M. J. L. (2009). Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422–1423. https://doi.org/10.1093/bioinformatics/btp163

Lepais, O., & Weir, J. T. (2014). SimRAD: an R package for simulation-based prediction of the number of loci expected in RADseq and similar genotyping by sequencing approaches. Molecular ecology resources, 14(6), 1314–1321. https://doi.org/10.1111/1755-0998.12273

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
src		src
tools		tools
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

py-simRAD

Table of contents

Installation

Usage

1. Catalysis: Generating fragment positions

2. Export (reference-required)

2A. Export FASTA

2B. Export GFF

3. Delimited Enzyme Prints

3A. Fragment numbers

3B. Summary genomic representation

4. Batched functionality

Summary genomic representation

Planned features

References

About

Releases

Packages

Languages

License

KPU-AGC/py-simRAD

Folders and files

Latest commit

History

Repository files navigation

py-simRAD

Table of contents

Installation

Usage

1. Catalysis: Generating fragment positions

2. Export (reference-required)

2A. Export FASTA

2B. Export GFF

3. Delimited Enzyme Prints

3A. Fragment numbers

3B. Summary genomic representation

4. Batched functionality

Summary genomic representation

Planned features

References

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages