A parallel implementation of simRAD (Lepais & Weir, 2014) in Python using BioPython (Cock et al., 2009) for simulated prediction of loci expected in RADseq.
The installation is quite easy. The only requirement is BioPython which can be installed using the
BioPython official documentation.
They generally recommend using the pip
package manager:
pip install biopython
For use with conda
, BioPython can be installed from conda packages:
conda install -c conda-forge biopython
Before visualization, exporting, or any other summarization of the restriction fragment data, restriction fragment positions must first be generated.
python src/py_simRAD.py catalyze \
"raw-data/Trichoderma_atroviride_genomic.fna" \
--enzymes "HhaI;HindIII;NotI"
This will generate a directory containing files (.pos
) of restriction fragment position data for each enzyme and
enzyme pair. These files are meant to be used as input for each of the export or summary options.
The output produced by the python program are just restriction fragment positions per chromosome. For some fancier output related to the reference sequence, there is a special export function that requires an input of a reference sequence.
One such type of export is FASTA export. This command will generate a multi-FASTA of fragments ranging from 300-600 bp.
python src/py_simRAD.py export \
"raw-data/Trichoderma_atroviride_genomic.fna" \
"output-dir" \
"raw-data/.Trichoderma_atroviride_genomic/HhaI-HindIII.pos" \
--min 300 \
--max 600 \
--type 'fasta'
The result is a FASTA file containing sequences of restriction fragments from the given restriction combination used on the given genome.
HEADER FORMAT >{ID} {DESCRIPTION} {ENZYME_COMBINATION} {POSITIONS}
EXAMPLE >CP084935.1 Trichoderma atroviride strain P1 chromosome 1 HhaI-HindIII 0-403
Again, this command is meant to be performed after generation of fragment positions. These features can be filtered according to fragment lengths before export into GFF format and can be used with IGV.
python src/py_simRAD.py export \
"raw-data/Trichoderma_atroviride_genomic.fna" \
"output-dir" \
"raw-data/.Trichoderma_atroviride_genomic/HhaI-HindIII.pos" \
--min 300 \
--max 600 \
--type 'gff'
FORMAT {chromosome} {py-simRAD version} restriction_fragment {start} {end} . + .
EXAMPLE CP084935.1 py-simRADv4.1.2 restriction_fragment 0 403 . + .
These outputs do not require an input of a reference sequence and print out directly to the console.
Use this command to print out a tab-delimited report of number of fragments, especially from a given range of fragment sizes.
python src/py_simRAD.py summary \
"raw-data/.Trichoderma_atroviride_genomic/HhaI-HindIII.pos" \
--min 300 \
--max 600 \
Use this command to print out a tab-delimited report of percent genomic representation, especially from a given range of fragment sizes.
python src/py_simRAD.py summary \
"raw-data/.Trichoderma_atroviride_genomic/HhaI-HindIII.pos" \
--min 300 \
--max 600 \
--type genomic_rep
enzmye total repr (%) CP084935.1 CP084936.1 CP084937.1 CP084938.1 CP084939.1 CP084940.1 CP084941.1
HhaI-HindIII 37.26 37.367 36.282 36.528 36.439 39.51 37.94 37.962
The default behaviour is to print to console with human-readable tab delimiting, but this behaviour may be changed. As delimited output, the console printout of this function can be easily redirected to a file:
python src/py_simRAD.py summary \
"raw-data/.Trichoderma_atroviride_genomic/HhaI-HindIII.pos" \
--min 300 \
--max 600 \
--delimiter ',' \
--type genomic_rep \
> target_file.csv
Each of the export and summary functions have been written so that wildcard (*
) file input is allowed.
The following program takes an input of every restriction enzyme combination generated (*
) and filters printout
according to fragment size and percent genomic representation.
python src/py_simRAD.py summary \
"raw-data/.Trichoderma_atroviride_genomic/*" \
--min 300 \
--max 600 \
--delimiter ',' \
--min_rep 5 \
--max_rep 15 \
> target_file.csv
- Maybe make it possible to pipe generated fasta file from generate_sequence.py
- Iterative sequence generation and creation similar to the simRAD paper
Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., & de Hoon, M. J. L. (2009). Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422–1423. https://doi.org/10.1093/bioinformatics/btp163
Lepais, O., & Weir, J. T. (2014). SimRAD: an R package for simulation-based prediction of the number of loci expected in RADseq and similar genotyping by sequencing approaches. Molecular ecology resources, 14(6), 1314–1321. https://doi.org/10.1111/1755-0998.12273