Usage: assemblyStats.py [options] input.fasta[.gz]
Options:
-h, --help show this help message and exit
-z, --gzip input file is gzipped
-t, --tabular write tabular output
Computes basic assembly statistics, given that input.fasta is a multi fasta file with one contig per record.
Normally, human readable output is written. With the -t option the same information is written in a tabular format that can be used for statistical evaluation of multiple assemblies.
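For example, to collect the statistics of two assemblies in one table (file names are illustrative):
python assemblyStats.py -t assembly1.fasta > stats.tsv
python assemblyStats.py -t -z assembly2.fasta.gz >> stats.tsv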
Usage: filterFasta.py [options] input.fasta [output.fasta]
Options:
-h, --help show this help message and exit
-q, --quiet do not print status messages to the screen
-u, --fastq input file is fastq
-z, --gzip input file is gzipped
-l X, --min-length=X write only sequences with length at least X
-i X, --id-list=X write only sequences with an ID from this list. List
can be a comma separated string of IDs or a path to a
file with one ID per line
-r X, --random=X randomly sample X sequences from input file
-e, --regexp use regular expression instead of exact matching for
IDs
-a, --ignore-at ignore the first letter of the query IDs if it is an @ (this is for more convenient filter list creation from fastq files)
-n, --negative do exactly the opposite of what would normally be done
Filter fasta/fastq files in different ways:
- Filter by minimum length (-l): only write sequences of at least a certain length to the output.
- Filter by list of IDs (-i): only write sequences with an ID from the given list. The list can either be given as a comma separated string of IDs or as a file with one ID per line. With the -e option the given "IDs" will be used as regular expressions (using Python regex) instead of exact matches.
- "Filter" randomly (-r): write a random subset of the sequences of the given size.
The -n option switches to negative mode, meaning the script will do exactly the opposite of what it normally does.
The -a option makes the script ignore @-signs at the beginning of IDs in the ID list.
The main use case for this is with two fastq files (A.fq and B.fq) where all your ID lines start with @M01271 (because M01271 is the serial number of your sequencer). If we want to keep in B only the sequences that are also in A, we can run the following:
grep "^@M01271" A.fq > id_list.txt
python filterFasta.py -i id_list.txt -a B.fq > B_and_A.fq
Input data can be provided as a file (first argument) or be piped in.
Usage: plotFastaLengthDist.py [options] input.fasta
Options:
-h, --help show this help message and exit
-u, --fastq input file is fastq
-z, --gzip input file is gzipped
-o OUT/PATH, --out-folder=OUT/PATH
write output files to folder OUT/PATH
-m X, --mark-value=X mark position X on the x-axis
-t, --text-output also write text output
-l, --log-yaxis create plot with logarithmic y-axis
-f FORMAT, --img-format=FORMAT set plot format to FORMAT [default: pdf]
Plot the length distribution of the sequences in the input file.
Two plots in one pdf file will be drawn.
A bar plot of the length counts (NOT a histogram, there will be no binning) and a smoothed density plot.
With the -t option the collected data (sequence ID and sequence length) will also be written to a text file.
Specify the image format with -f. Possible formats are: 'pdf', 'png', 'jpeg', 'bmp', 'postscript'.
This script needs R to do the plotting.
It will write a temporary R script and pipe the data into an R process executing this script.
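For example, to plot the length distribution of an assembly as png, mark a position of interest on the x-axis and keep the collected data (file name and value are illustrative):
python plotFastaLengthDist.py -t -f png -m 500 contigs.fasta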
Usage: removeSeqsWithN.py [options] inputfile1 [inputfile2]
Options:
-h, --help show this help message and exit
-z, --gzip input file is gzipped
-a, --fasta set input and output format to fasta [default: fastq]
-n X, --max-n=X remove all sequences with more than X Ns. If X is set to -1
only sequences that consist entirely of Ns will be removed [default: 5]
Remove sequences with more than a certain number of Ns. If the threshold (-n) is set to -1, only sequences that consist entirely of Ns will be removed.
All sequences with at most the allowed number of Ns are written to a different file.
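For example, to remove every read that contains any N at all from a read pair (file names are illustrative):
python removeSeqsWithN.py -n 0 reads_1.fastq reads_2.fastq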
Usage: spatialQuality.py [options] input.fastq[.gz] [input2.fastq[.gz] ...]
Options:
-h, --help show this help message and exit
-q, --quiet do not print status messages to the screen
-o X, --output-folder=X
write results to folder X [default: ./spatialQual]
-z, --gzip input file is gzipped
-n, --n-count also make plots for N count
-d, --detail-plot make detail plot for each tile
-p, --pdf output pdf files in addition to png files
Plot properties of reads according to their position on the flow cell.
Quality of Illumina reads can be linked to the physical position on the flow cell.
This script plots read properties according to their position on the flow cell.
By default only one quality overview per file will be plotted as a png file.
Additional options allow to also plot the N-count per read (-n), produce detail plots per tile (-d) and produce additional pdf files (-p).
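For example, to analyse a gzipped paired-end run including N-count plots (file and folder names are illustrative):
python spatialQuality.py -z -n -o run1_qual run1_R1.fastq.gz run1_R2.fastq.gz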
Usage: subtractReadsByMapping.py [options] mappingfile read1.fastx out1.fastx read2.fastx >out2.fastx
Options:
-h, --help show this help message and exit
-q, --quiet do not print status messages to the screen
-a, --fasta input file(s) is/are fasta
-m X, --mapped1=X write mapped reads from read 1 to this file
-n X, --mapped2=X write mapped reads from read 2 to this file
-z, --gzip input file(s) is/are gzipped
-y, --mapping-gzip mapping file is gzipped
-b, --blast mapping file is tabular blast output instead of sam
file
-t X, --threshold=X consider reads with an e-value lower than X as
"mapped". (Can only be used in blast mode) [default:
0.000001]
From a set of single end or paired end reads in a fasta or fastq file (or two files for paired end), remove all reads that were mapped in a mapping result given as a sam file or a blast tabular file (-b). The read files (-z) as well as the mapping file (-y) can be gzipped. For blast results an e-value threshold can be given (-t); matches with an e-value below it are considered a "mapping".
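For example, to remove all mapped read pairs from a paired-end library (file names are illustrative; note that the second output file is written via stdout, as shown in the usage line):
python subtractReadsByMapping.py mapping.sam read1.fastq out1.fastq read2.fastq > out2.fastq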
Usage: trim.py [options] input1.fastq [output1.fasta input2.fastq output2.fasta]
Options:
-h, --help show this help message and exit
-q, --quiet do not print status messages to the screen
-a, --fasta set output format to fasta [default: fastq]
-t X, --min-quality=X
quality threshold
-l X, --min-length=X minimal length for reads after trimming [default: 0]
-p X, --max-error-prob=X
maximal overall error probability in one read
-c X, --const=X remove X bases from the end of the read
-b X, --begin-const=X
remove X bases from the beginning of the read
-r X, --crop=X cut all reads to length X; if combined with -l Y, reads
shorter than Y will be discarded and reads shorter
than X but longer than Y will be padded with Ns to
length X
Trim (paired end) reads in a variety of ways. Does not yet support gzipped input.
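For example, to quality-trim a paired-end library and discard reads that end up shorter than 50 bases (file names and values are illustrative):
python trim.py -t 20 -l 50 input1.fastq output1.fastq input2.fastq output2.fastq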
usage: primerRecognition.py [-h] [-o OUT] [-g] [-p PRE] [-m MAX] length path
Try to recognize a possible primer sequence from the starting bases of a read
file in fastq[.gz] format
positional arguments:
length up to which position should be analysed for the primer
path where to look for fastq files
optional arguments:
-h, --help show this help message and exit
-o OUT, --out OUT write results to this file
-g, --gz input files are gzipped
-p PRE, --pre-computed PRE
give a file of precomputed data here
-m MAX, --max-reads MAX
maximum number of reads to read per file (0 for all)
Analyzes the first X bases of each read in multiple fastq files to guess the primers used (per sample). Primer sequences are given in IUPAC ambiguity codes and represent 9
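For example, to scan the first 20 bases of all gzipped fastq files in a folder (paths are illustrative):
python primerRecognition.py -g -o primers.txt 20 ./run1_fastq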
usage: splitFasta.py [-h] (-p X | -n X | -s) [-u] [-z] INFILE
Split fast[a|q] file into multiple files.
positional arguments:
INFILE fast[a|q] file to split
optional arguments:
-h, --help show this help message and exit
-p X, --pieces X split the file into X pieces
-n X, --number X split the file into pieces with X reads each
-s, --single-record put every record into its own file
-u, --fastq input file is in fastq format
-z, --gzip input file is gzipped
Split a multi fasta or multi fastq file into several files by a) splitting it into a certain number of pieces (-p), b) splitting it into files with a certain number of sequences (-n) or c) putting each sequence into its own file (-s).
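For example, to split a gzipped fastq file into pieces of 1000 reads each (file name is illustrative):
python splitFasta.py -n 1000 -u -z reads.fastq.gz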
Collection of tools to query NCBI. Most classes work as python dictionaries.
Base class for querying NCBI via BioPython and the NCBI web interface. Do not use directly!
Map a scientific species name to an NCBI taxonomy ID
Map an NCBI taxonomy ID to a scientific species name
Map an NCBI taxonomy node name to its ID. This is slightly different from the SpeciesName2TaxId
map: it always returns a list, which contains multiple IDs for ambiguous node names, and it also works for higher taxonomic levels and for nodes that do not have a standard rank (like sub-phylum or no rank).
Map an NCBI taxonomy ID to the complete taxonomic lineage defined by the NCBI taxonomy tree. The return value will be a list of 3-tuples representing NCBI taxonomy nodes. The tuples contain: rank, taxonomy ID, scientific name.
Map an NCBI taxonomy ID to a specific taxonomic level from the NCBI taxonomy tree. The return value will be a list of 3-tuples representing NCBI taxonomy nodes. The tuples contain: rank, taxonomy ID, scientific name.
Map an NCBI taxonomy ID to the NCBI taxonomy ID of its parent node (in the NCBI taxonomy tree).
Map the GI number of an NCBI nucleotide record to the corresponding NCBI taxonomy ID
Map the GI number of an NCBI nucleotide record to the scientific name of the corresponding species
Map the GI number of an NCBI nucleotide record to the corresponding NCBI taxonomy ID. Uses a sqlite3 database as a persistent cache to reduce requests to NCBI.
Map the GI number of an NCBI protein record to the name of the protein (NCBI calls this the definition).
Map the GI number of an NCBI protein record to the name of the protein. Uses a multi-layer cache (RAM and sqlite3 database).
A class representing the tree given by the NCBI taxonomy database. If a cache path is given to the constructor, a database at this path will be used as a persistent cache. The object can be initialized by the function of the same name, which takes a nodes.dmp file of the NCBI taxonomy file dump as input. This option is only available if a cache is used (otherwise there is no place to store the initialized data). Missing data will be loaded directly from the NCBI database via SOAP requests, but only once it is needed.
The object can be queried for information on the tree with NCBI taxonomy IDs representing nodes. The functions include: parent of a node, full path to the root, lowest common ancestor of two or more nodes, and a variant of the lowest common ancestor that, for a set of nodes, returns the lowest node for which all input nodes are either ancestors or descendants of the output node (called the lowest common node (LCN) here).
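A minimal sketch of the dictionary-style usage described above. Only the SpeciesName2TaxId name appears in this documentation; the import path and the example species are illustrative assumptions:
# Import path is an assumption; SpeciesName2TaxId is the class named above.
from ncbi import SpeciesName2TaxId

species2tax = SpeciesName2TaxId()
# Looks like a normal dict access, but triggers an NCBI web query on a cache miss.
taxid = species2tax["Escherichia coli"]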
Collection of tools to query Uniprot
Map protein IDs via the Uniprot mapping service. Uses a multi-layer cache (RAM and sqlite3 database). Can be configured to map different ID types to each other as long as they are supported by Uniprot. The sqlite3 cache can be initialized with a Uniprot flat file to reduce web requests.
Map protein IDs via the Uniprot mapping service. Will send one request per mapping to Uniprot. Can be configured to map different ID types to each other as long as they are supported by Uniprot.
Class to query the Uniprot web service. Comes with functions to read CAZy, KEGG and Gene Ontology information.
Map a Uniprot ID to the KEGG gene IDs via the Uniprot REST API. Returns a set of KEGG IDs or an empty set if no mapping was returned from Uniprot.
Map a Uniprot ID to the GO IDs the protein is annotated with via the Uniprot REST API. Will return a (possibly empty) set of GO IDs. Does not distinguish between the three GO trees.
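A minimal sketch of how such a mapping dictionary might be used. All names and parameters here are illustrative assumptions; only the dictionary-style behavior and the cache layers are documented above:
# Class name and constructor parameters are hypothetical.
from uniprot import UniprotIdMap

uniprot2kegg = UniprotIdMap(source="ACC", target="KEGG_ID", cachePath="uniprot.db")
kegg_ids = uniprot2kegg["P12345"]  # answered from RAM or sqlite3 cache if possible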
Module for multi-layer cached dictionaries. Multi-layer cache means that, besides the normal python dictionary, the dictionary also has other cache layers, which are queried one after another until a value for the key is found. Cache layers should be ordered by speed.
Base class that implements the multi-layer cache.
Cache class to use as a cache layer in a MultiCachedDict. It will use a sqlite3 database to implement a local, persistent mapping.
Same as SqliteCache, but can deal with values that are lists.
Simple cached dictionary. It uses only a SqliteCache (besides the normal RAM dictionary) to save mappings between program calls.
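A self-contained sketch of the multi-layer cache idea (illustrative, not the module's actual code): each lookup walks the layers from fastest to slowest and writes hits back to the fast layer:
class TwoLayerDict:
    # Illustrative two-layer cache: a RAM dict in front of a slow lookup,
    # e.g. a sqlite3 query or a web request.
    def __init__(self, slowLookup):
        self.ram = {}
        self.slowLookup = slowLookup

    def __getitem__(self, key):
        if key in self.ram:
            return self.ram[key]      # fast layer hit
        value = self.slowLookup(key)  # fall through to the slow layer
        self.ram[key] = value         # cache for the next access
        return value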
Simple parser for ThermonucleotideBLAST standard output files.
Generator that will yield one amplicon at a time in the form of a tuple consisting of: (NCBI GI number, start of the amplicon in the database sequence, end of the amplicon in the database sequence, amplicon sequence)
Collection of tools to query the Encyclopedia of Life
Dictionary that maps names to EOL IDs using the search EOL web service.
Will return a list of IDs.
Search parameters can be configured in the config dictionary.
Note: exact search is switched on by default.
Loads data for an EOL entry once with the page EOL web service.
Data is kept in memory and can be queried directly in the data member of this class or by specialized functions.
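A minimal sketch of the name lookup described above; the import path and class name are illustrative assumptions:
# Hypothetical names; only the dictionary behavior is documented above.
from eol import NameToEolIdMap

name2id = NameToEolIdMap()
eol_ids = name2id["Homo sapiens"]  # list of EOL IDs from the search web service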
Collection of tools to query the KEGG REST API
Abstract base class to query KEGG. Do not use directly!
Like KeggMap, but can have multiple values for one key and supports None as a value.
Maps NCBI protein GIs to KEGG gene IDs via the KEGG REST API. Uses the convert operation. Will return a KEGG gene ID (including the three letter organism prefix) or None if no mapping was found.
Maps KEGG gene IDs to KEGG pathway IDs via the KEGG REST API. Uses the link operation. Will return a set of pathway IDs without the path: prefix. Input has to be a KEGG gene ID including the three letter organism prefix.
Maps KEGG pathway IDs (without the path: prefix) to their name via the KEGG REST API.
Maps KEGG KO IDs (K[0-9]{5}) to their "definition" via the KEGG REST API.
Maps KEGG reaction IDs to the Enzyme Commission (EC) numbers of the involved enzymes. Returns a set of EC numbers as strings or None if no information was found. Input must be a KEGG reaction ID (R[0-9]{5}).
Maps KEGG protein IDs to the corresponding KEGG Orthology group(s) (KO) via the KEGG REST API. Uses the link operation. Input has to be a KEGG gene ID including the three letter organism prefix. Returns a set of KEGG Orthology IDs (ko:.*). May return an empty set if no link was found.
Maps KEGG Orthology groups (KOs) to the pathways they are part of via the KEGG REST API. Uses the link operation. The return value will be a set of the (ko) pathways the KO is part of (or an empty set if it is not part of any). The key must be a KEGG Orthology ID (ko:.*).
Maps KEGG Orthology groups (KOs) to the Enzyme Commission (EC) numbers that encode their function(s) via the KEGG REST API. Uses the link operation. Returns a set of EC numbers as strings (or an empty set if no link is found). The key must be a KEGG Orthology ID (ko:.*).
Maps Enzyme Commission (EC) numbers to KEGG pathway IDs via the KEGG REST API. Uses the link operation. Returns a set of pathway IDs as strings without the path: prefix (or an empty set if no link is found). The key must be an EC number in KEGG format (ec:[0-9]+.[0-9]+.[0-9]+.[0-9]+).
Maps Enzyme Commission (EC) numbers to KEGG Orthology (KO) IDs via the KEGG REST API. Uses the link operation. Returns a set of KO IDs as strings without the ko: prefix (or an empty set if no link is found or the EC number does not exist). The key must be an EC number in KEGG format (ec:[0-9]+.[0-9]+.[0-9]+.[0-9]+).
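A minimal sketch of the dictionary-style usage of these KEGG maps. The class name and the example gene ID are illustrative assumptions; the key and value formats follow the documentation above:
# Hypothetical class name; keys are KEGG gene IDs with organism prefix,
# values are sets of pathway IDs without the "path:" prefix.
from kegg import KeggGeneToPathwayMap

gene2pathway = KeggGeneToPathwayMap()
pathways = gene2pathway["eco:b0002"]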