Skip to content

Latest commit

 

History

History
411 lines (296 loc) · 26.1 KB

README.md

File metadata and controls

411 lines (296 loc) · 26.1 KB

Socru

Build Status License: GPL v3 codecov Docker Pulls

Contents

Introduction

Socru allows you to easily identify and communicate the order and orientation of complete genomes around ribosomal operons. These large scale structural variants have real impacts on the phenotype of the organism, and with the advent of long read sequencing, we can now start to delve into the mechanisms at work.

Installation

If you just want to quickly try out the software please try a Docker continer. This software is designed to run on Linux and OSX. It will not run on Windows.

Conda

Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge

To install Socru, first install conda with Python3 then run:

conda install -c conda-forge -c bioconda socru

Docker

Install Docker. There is a docker container which gets automatically built from the latest version of Socru. To install it:

docker pull quadraminstitute/socru

To use it you would use a command such as this (substituting in your filename/directories), using the example file in this repository:

docker run --rm -it -v /path/to/example_data:/example_data quadraminstitute/socru socru xxxxx

Dependancies

If you are performing a manual install (not recommended and we do not support it), you will need to ensure the following software is installed and available in your environment.

Python:

  • python >=3.6
  • biopython >=1.68
  • PyYAML
  • numpy
  • matplotlib

Software applications:

  • barrnap
  • blast

The Python 3 dependancies, and socru itself, can all be installed through pip:

pip3 install git+https://github.com/quadram-institute-bioscience/socru

leaving you to just install the software application dependancies.

Usage

Quick start

Given that you have an Escherichia coli complete genome (e.g. K12.fasta), see what databases are available:

socru_species

Next use one of the database names:

socru Escherichia_coli K12.fasta

which will give you output like:

K12.fasta	GS1.0	1	2	3	4	5	6	7

socru

This is the main script for the application. If you provide a complete assembly, it will give you back the orientation and order of the fragments.

usage: socru [options] species assembly.fasta

calculate the order and orientation of complete bacterial genomes

positional arguments:
  species               Species name, use socru_species to see all available
  input_files           Input FASTA files (optionally gzipped)

optional arguments:
  -h, --help            show this help message and exit
  --db_dir DB_DIR, -d DB_DIR
                        Base directory for species databases, defaults to
                        bundled (default: None)
  --threads THREADS, -t THREADS
                        No. of threads to use (default: 1)
  --output_file OUTPUT_FILE, -o OUTPUT_FILE
                        Output filename, defaults to STDOUT (default: None)
  --novel_profiles NOVEL_PROFILES, -n NOVEL_PROFILES
                        Filename for novel profiles (default:
                        profile.txt.novel)
  --new_fragments NEW_FRAGMENTS, -f NEW_FRAGMENTS
                        Filename for novel fragments (default:
                        novel_fragments.fa)
  --top_blast_hits TOP_BLAST_HITS, -b TOP_BLAST_HITS
                        Filename for top blast hits (default: None)
  --max_bases_from_ends MAX_BASES_FROM_ENDS, -m MAX_BASES_FROM_ENDS
                        Only look at this number of bases from start and end
                        of fragment (default: None)
  --not_circular, -c    Assume chromosome is not circularised (default: False)
  --min_bit_score MIN_BIT_SCORE
                        Minimum bit score (default: 100)
  --min_alignment_length MIN_ALIGNMENT_LENGTH
                        Minimum alignment length (default: 100)
  --verbose, -v         Turn on debugging (default: False)
  --version             show program's version number and exit

species: This mandatory argument is the name of the species database you wish to use. You can either create your own species database using socru_create or look up one of the bundled databases with socru_species. It normally takes the form of Genus_species, for example: "Salmonella_enterica".

input_files: This mandatory argument takes in a list of 1 or more FASTA files. Each FASTA file should be a complete assembly (chromosome in 1 contig) and never short read draft assemblies. Short read assemblies cannot resolve large repeats, such as the rrn region. The FASTA files can be optionally gzipped (compressed).

help: This will print out the extended help information, including default values, then exit.

db_dir: By default the software will look for the bundled databases. You can use this option to change the base directory for the databases and point it somewhere else, perhaps if you have a custom database you wish to use or if you wish to separate data from software on your computing system. You can use a relative or absolute path. The full database pathname is derived from joining this directory to the species argument.

threads: An integer with the number of threads to use. It defaults to 1 and you get diminishing returns with higher numbers. There's not much benefit to be had from using more than 4 threads.

output_file: By default the output is printed to STDOUT (to your terminal screen). You can specify a filename to print it to instead. The default behaviour is to create the file if it doesn't already exist, and to append to the end of the file if it already exists.

novel_profiles: Sometimes you encounter novel arrangements and orders. These will get printed to a file to allow you to update the profile.txt file in the database. If there is a new order of fragments, the first number will be 0. You will need to assign a number manually before adding it to the profile.txt. This is because you need to check to see if there is an assembly error or if it is a legitimate new pattern. If it's just a novel reorientation, the first number will have an integer of 1 or more. Please considered sending your changes back to the GitHub repository, so that the whole community will benefit from your science.

new_fragments: Any fragments that cannot be classified are written to a FASTA file for later investigation. The outcome might be that you add the fragment to the database as another representation, or exclude it as a contamination.

top_blast_hits: You can write the top BLAST hit for each input fragment against each database fragment. This is in outfmt 6 (old m8 format) which is tab delimited. You can use this as input to the socru_shrink_database command.

max_bases_from_ends: Take this number of bases from the start and end of a fragment and compare to the database. This is an experimental feature, it hasn't performed as expected and may be removed at a later point.

not_circular: Not all bacteria have circular chromosomes, or you may have an incomplete assembly. This flag tells the software not to try joining up the start and end of the largest contig. If you are using this flag, you may be attempting to use this software for a purpose for which it was never designed.

min_bit_score: Internally blastn is used and this allows you to specify the minimum bit score for a hit to be considered, since blast will throw up a lot of small hits.

min_alignment_length: Only consider blast alignment lengths above this value. Remember that there can be some very short fragments between rrns, so you'll need to know the approximate minimum fragment size (in bases) before increasing this value too high.

verbose: Print out enhanced information while the program is running.

version: Print the version of the software and exit. If the version is 'x.y.z' it probably means you haven't installed the software in a standard manner (conda/docker).

Output

Primary output

The output is printed to STDOUT or to an output file. It is tab delimited and provides the filename, the GS identifier and the order and orientation of the individual fragments, labelled from 1..n. If a fragment is reversed compared to the database reference, it is denoted prime (').

Staphylococcus/aureus/USA300.fna.gz	GS1.0	1	2	3	4	5
Staphylococcus/aureus/MOZ66.fna.gz	GS1.8	1	2	3	4'	5

Interpreting the output when all goes well is fairly straightforward. If the GS identifier is 1 or greater, then the pattern has been observed before and all should be okay. In the normal case you are looking for each fragment to occur exactly once (and only once), in an order that makes biological sense. Some species will have a variable number of fragments, which you will need to verify as being real. If a fragment cannot be classified it is flagged with a question mark (?).

Operon directions output

A file is generated called operon_directions.txt which contains the order of the fragments and the direction of the rRNA operon. Using this example:

Ty2_1.fa	--> 1 <-- 2' <-- 4 <-- 5 <-- 3(Ori) --> 7'

The first column is the filename, the arrows denote the direction of the operon (its circular so loops around) and the number denotes the fragment. The operon goes in the direction --> if 16S comes first, and <-- if 16S comes last in the operon. The fragment containing origin of replication is denoted (Ori) and it's orientation is fixed to match the orientation in the database reference. As with the primary output, a prime indicates the fragment (not the operon) is inverted. To be biologically valid, the arrows should point outwards from the Origin of replication to the terminus, if it doesn't then it's an assembly error.

Novel genome structures output

A file is generated called profile.txt.novel which contains any novel genome structures encountered in the genomes. They may contain invalid structures. You can copy and paste any valid genome structures to the database (and increment the GS number if they have none).

Visual representation of genome structure

A visual representation of the genome structure is saved as genome_structure.pdf showing the order of a circular genome and fragments.

Not all complete genomes are equal

You should be aware that not all complete assemblies are equal. In the early days, each complete reference genome was lovingly hand finished by teams of scientists at huge expense. With the advent of long read sequencing and better bioinformatics methods, it allowed a huge number of complete assemblies to be produced at a fraction of the cost. Many of these assemblies have not undergone rigorous quality checks, so may contain large structural errors. These errors may manifest as novel patterns in the output of this software. So it's useful for quality control if your input is your own assemblies.

Additionally, a common mistake people make is taking short read data, scaffolding this using a reference genome, and calling the output a "complete genome". These are not complete genomes. This would look something like:

Campylobacter_jejuni.fa	GS0.0	1

where there would normally be expected to be 3 fragments within Campylobacter jejuni. In this instance the short read assembler could only assembled a single rRNA segment as it could not unambiguously resolve the repetition. Unfortunately since short read sequencing is so cheap, researchers can pump out these erroneous genomes at a high rate, creating an overwhelming amount of noise compared to the real complete genomes.

Another example of a poor quality assembly is extra fragments. Virtually all Klebsiella pneumoniae assemblies consist of 8 fragments, ordered identically to the reference. To have regions duplicated, with an extra fragment, and an unidentified fragment rings alarm bells. This is likely to be something really really interesting, or just an assembly error.

Klebsiella/pneumoniae/TGH10/GCF_001611095.1.fna.gz	GS0.4	1	3'	2	?'	3	4	7	8	8

Sometimes fragments are missing:

Salmonella/enterica/ty3-193/GCF_001240865.2.fa	GS0.1	1'	2	3	6	5	7	

In this instance a very small fragment (4) is missing. If it's small the assembler may get a bit confused and miss it entirely.

Sometimes biologically improbable patterns appear and you need to do some further investigating. For example, if the origin and terminus are on fragments side by side, there is a high chance it's an assembly error, since this could indicate an unbalanced replichore. The terminus is always on fragment 1 and the origin varies by species (recorded in the database profile.txt.yml file).

Salmonella_enterica.fa	GS1.0	1	2	7	4	5	6	3

In this case the origin is on fragment 3, which is beside fragment 1 (circular). This is highly improbable in Salmonella.

socru_species

This will list all the species databases bundled with the software. You can then copy and paste one of the names for use with the main socru script. It doesn't take any input, instead it just prints out a sorted list of available species. If the species you want is not in the list, please create it using socru_create.

usage: socru_species [options]

List all available species

optional arguments:
  -h, --help     show this help message and exit
  --verbose, -v  Turn on debugging (default: False)
  --version      show program's version number and exit

help: This will print out the extended help information, including default values, then exit.

verbose: Print out enhanced information while the program is running.

version: Print the version of the software and exit. If the version is 'x.y.z' it probably means you havent installed the software in a standard manner (conda/pip).

Output

Acinetobacter_baumannii
Enterobacter_cloacae
Enterococcus_faecium
Klebsiella_pneumoniae
Salmonella_enterica
Staphylococcus_aureus

socru_create

You can create your own database and all you need is a single complete genome in FASTA format as input. Please consider pushing your new database back to the GitHub repository so that the whole community can benefit from your hard work.

usage: socru_create [options] output_directory assembly.fasta

create genome arrangement type scheme

positional arguments:
  output_directory      Output directory
  input_file            Input FASTA file (optionally gzipped)

optional arguments:
  -h, --help            show this help message and exit
  --max_bases_from_ends MAX_BASES_FROM_ENDS, -m MAX_BASES_FROM_ENDS
                        Only look at this number of bases from start and end
                        of fragment (default: None)
  --threads THREADS, -t THREADS
                        No. of threads to use (default: 1)
  --fragment_order FRAGMENT_ORDER, -f FRAGMENT_ORDER
                        Order of fragments, you may need to change this,
                        example 1-2-3-4-5-6-7 (default: None)
  --dnaa_fasta DNAA_FASTA, -d DNAA_FASTA
                        Location of dnaA FASTA file, defaults to bundled
                        (default: None)
  --verbose, -v         Turn on debugging [False]
  --version             show program's version number and exit

output_directory: This is the directory where your new database will live. The directory must not already exist.

input_file: This is a complete assembly file in FASTA format. You only need the chromosome, and if you provide it with more, only the largest sequence in the file will be used. This means that bacteria with more than 1 chromosome will only have their largest chromosome used by this software, sorry Vibrio.

help: This will print out the extended help information, including default values, then exit.

max_bases_from_ends: Take this number of bases from the start and end of a fragment and compare to the database. This is an experimental feature, it hasn't performed as expected and may be removed at a later point.

threads: An integer with the number of threads to use. It defaults to 1 and you get diminishing returns with higher numbers. There's not much benefit to be had from using more than 4 threads.

fragment_order: By default the software will take the largest fragment and go around in a clockwise fashion, labelling the fragments incrementally (1,2,3,4,5...). You can choose to force different numbers on the fragments, perhaps if someone has already published a particular scheme. In practice this option shouldnt be used, and if you are using it, don't ask for any support when things go wrong.

dnaa_fasta: This is the location of the FASTA file containing the dnaA sequences which are used to anchor the fragment orientations. It defaults to a bundled version, so you should never need to change it. The FASTA file containing the dnaA genes was generated by Circlator. The original file is run through cd-hit-est with default parameters to cluster similar sequences and reduce the size of the overall file.

verbose: Print out enhanced information while the program is running.

version: Print the version of the software and exit. If the version is 'x.y.z' it probably means you haven't installed the software in a standard manner (conda/pip).

Output

This command creates a directory containing the database. Each fragment is saved to a separate FASTA file, labelled 1..n. By default the full sequence is used, and it is unzipped initially to allow you to more easily look at the data while building the database. You should gzip the FASTA files once you are happy with them. Additionally there is a tab delimited profile.txt file which contains the patterns which have been previously seen (and appear valid). This consists of a GS identifier, followed by the fragment order and orientation (prime ' denotes reversed). As this is a new database the profile.txt will only contain GS1.0 with the fragments labelled 1..n in ascending order. There is a metadata file in YAML format which notes the direction of dnaA and which fragment it was found on. This information is used by socru later to decide on the orientation of the fragments it finds, since you can flip a circular bacterial chromosome.

socru_lookup

This is a utility script which will take in a fragment pattern and give you back the GS number. It's probably of limited use, but if you do find it useful and need it extended/made better, please submit an Issue on GitHub to let us know.

usage: socru_lookup[options] /path/to/database 1-2-3-4-5-6-7

Given a set of fragments, output the type

positional arguments:
  db_dir         Database directory
  fragments      Fragments such as 1-2-3-4-5-6-7 or 1'-3-4'-5'-6'-7-2

optional arguments:
  -h, --help     show this help message and exit
  --verbose, -v  Turn on debugging [False]
  --version      show program's version number and exit

db_dir: The full path to the database.

fragments: The pattern to lookup. Each number is separated by a dash (-), and reverse orientations are denoted with prime ('). The first number should be 1.

help: This will print out the extended help information, including default values, then exit.

verbose: Print out enhanced information while the program is running.

version: Print the version of the software and exit. If the version is 'x.y.z' it probably means you haven't installed the software in a standard manner (conda/pip).

socru_shrink_database

This is a utility script which will help you to shrink down a database by looking at the most conserved regions. By default a database contains all of the bases in the reference fragment, but this is way more information than needed in reality. The first step you must take is to have multiple assemblies for the species (more than 1) and run socru with the -b option. This will output blast results for each fragment to a file. The matching regions in the database from the blast results are piled up. The coverage threshold is gradually reduced until the minimum fragment size (bases) is met. These regions are then outputted to a new FASTA file in a new directory and gzipped. The profile.txt and profile.txt.yml files are also copied, so the database is ready to go. On average there is more than an 80% reduction in the amount of storage space required for a database. Small fragments are skipped over and copied in full. Some species have a lot of variation and don't work with this method. This was validated by using over 7000 samples, with identical results before shrinking and after shrinking.

usage: socru_shrink_database [options]

Admin utility to take the novel GS results and update the profile for the
database

positional arguments:
  blast_results         Blast results file from running socru -b xxx against
                        multiple assemblies
  input_database        Directory containing database to shrink
  output_database       Output directory for new database, it must not already
                        exist

optional arguments:
  -h, --help            show this help message and exit
  --min_fragment_size MIN_FRAGMENT_SIZE
                        Minimum fragment size in bases (default: 100000)
  --verbose, -v         Turn on debugging (default: False)
  --version             show program's version number and exit

blast_results: The blast results file from having run socru -b.

input_database: The full path to the directory containing the database you wish to shrink.

output_database: The full path to a directory where you wish to put the new database. It must not already exist.

help: This will print out the extended help information, including default values, then exit.

min_fragment_size: The minimum size in bases that you want in a fragment. If the fragment is less than this, it is copied in full. It may be greater than this amount depending on the coverage of the inputs.

verbose: Print out enhanced information while the program is running.

version: Print the version of the software and exit. If the version is 'x.y.z' it probably means you haven't installed the software in a standard manner (conda/pip).

Output

The output is a new directory mirroring in the input database. There will be 1 FASTA file per fragment, labelled 1..n, and these are gzipped. The profile.txt and profile.txt.yml files are copied from the input database without modification.

socru_update_profile

This utility will take the primary output of socru, where there is 1 assembly per tab delimited line, the GS identifier and the pattern. It will then look for novel patterns (GS0.X), and if they contain 1 copy of each fragment exactly once (and only once), it will assign a new number to them and output a new updated database. Multiple sets of results (multiple input assemblies) can be used at once, so it is good for the mass population of a database.

usage: socru_update_profile [options]

Admin utility to take the novel GS results and update the profile for the
database

positional arguments:
  socru_output_filename
                        Socru output file
  profile_filename      profile.txt from database

optional arguments:
  -h, --help            show this help message and exit
  --output_file OUTPUT_FILE, -o OUTPUT_FILE
                        Output filename, defaults to STDOUT (default:
                        updated_profile.txt)
  --verbose, -v         Turn on debugging (default: False)
  --version             show program's version number and exit

socru_output_filename: The output results of socru containing the tab delimited assembly name, GS identifier, and fragment pattern

help: This will print out the extended help information, including default values, then exit.

output_file: The name of the output file to use. It will contain the new profiles, and existing profiles from profile.txt in the database, so that it can be immediately dropped into the database without further modification.

verbose: Print out enhanced information while the program is running.

version: Print the version of the software and exit. If the version is 'x.y.z' it probably means you haven't installed the software in a standard manner (conda/pip).

Output

A new profile.txt file is outputted which can be dropped into the database without any further modification.

License

Socru is free software, licensed under GPLv3.

Feedback/Issues

Please report any issues or to provide feedback please go to the issues page. If you make improvements to the software, add databases or extend profiles, please send us the changes though a pull request so that the whole community may benefit from your work.

Resources required

To give you an indication of the resources required, a single 5Mbase assembly takes about 20 seconds using a single thread on an average laptop and uses no more than 250MB RAM. So overall the resource requirements are very light.

Etymology

socrú (sock-roo) is the word for arrangement in Irish (Gaeilge).

Citation

socru: typing of genome-level order and orientation around ribosomal operons in bacteria Andrew J. Page, Emma V. Ainsworth, Gemma C. Langridge, Microbial Genomics (2020) https://doi.org/10.1099/mgen.0.000396