DOC: update paper
Vini2 committed Sep 16, 2024
1 parent ea8efb8 commit 9518f38
Showing 1 changed file with 107 additions and 19 deletions.
126 changes: 107 additions & 19 deletions paper/paper.md
@@ -53,36 +53,116 @@ It is crucial to obtain accurate binning results in metagenomic studies to under
* Contigs that are too short (e.g., shorter than 1000 base pairs) can be discarded during binning as they may not capture enough genomic signatures. Such short sequences can contain important regions such as repeats.
* Contigs shared among different genomes are only placed in the bin of the most representative genome.

To address these challenges, GraphBin-Tk integrates the capabilities of GraphBin [@Mallawaarachchi1:2020], GraphBin2 [@Mallawaarachchi2:2020; @Mallawaarachchi:2021] and MetaCoAG [@Mallawaarachchi1:2022; @Mallawaarachchi2:2022], providing a comprehensive toolkit for metagenomic binning and refinement as shown in \autoref{fig1}. GraphBin-Tk unifies three state-of-the-art binning solutions in just one tool, making it easy to install and execute. It provides users with a more comprehensive set of features and capabilities, enabling them to perform a wider range of tasks related to metagenomic binning without needing additional software. This also eliminates the compatibility issues of having to run separate binning-related software and enhances the user experience by making the software easier to learn and use.
GraphBin-Tk enhances binning accuracy and addresses the aforementioned challenges by integrating the capabilities of GraphBin [@Mallawaarachchi1:2020], GraphBin2 [@Mallawaarachchi2:2020; @Mallawaarachchi:2021] and MetaCoAG [@Mallawaarachchi1:2022; @Mallawaarachchi2:2022], which leverage the connectivity information of the assembly graph. GraphBin-Tk unifies three state-of-the-art binning solutions in a comprehensive toolkit for metagenomic binning and refinement as shown in \autoref{fig1}, making it easy to install and execute. It provides users with a more comprehensive set of features and capabilities, enabling them to perform a wider range of tasks related to metagenomic binning without needing additional software. This also eliminates the compatibility issues of having to run separate binning-related software and enhances the user experience by making the software easier to learn and use.

![Example binning workflow using tools available from GraphBin-Tk.\label{fig1}](gbintk_workflow.svg){width=100%}

GraphBin-Tk can perform stand-alone metagenomic binning using MetaCoAG and bin refinement using either GraphBin or GraphBin2. Additionally, pre-processing functionalities to run these tools and post-processing functionalities to analyse the produced results are included in GraphBin-Tk. A list of the subcommands provided in GraphBin-Tk is as follows:
GraphBin-Tk can perform stand-alone metagenomic binning using MetaCoAG and bin refinement using either GraphBin or GraphBin2. Additionally, pre-processing functionalities to run these tools and post-processing functionalities to analyse the produced results are included as well. GraphBin-Tk supports metagenome assemblies generated from three popular metagenome assemblers: metaSPAdes [@Nurk:2017] and MEGAHIT [@Li:2015] for short-read sequencing data and metaFlye [@Kolmogorov:2020] for long-read sequencing data. GraphBin-Tk can be launched using the command `gbintk`. The following subsections explain the subcommands provided in GraphBin-Tk.

| Subcommand | Tool/processing functionality | Inputs required |
|:------------:|:-------------------------------------------------------------------:|:----------------------------------------------------------------------------:|
| `graphbin` | Bin refinement tool GraphBin | Contigs, assembly graph file(s)[^1], initial binning result |
| `graphbin2` | Bin refinement tool GraphBin2 | Contigs, assembly graph file(s), initial binning result, coverage of contigs |
| `metacoag` | Binning tool MetaCoAG | Contigs, assembly graph file(s), coverage of contigs |
| `prepare` | Format initial binning results for GraphBin and GraphBin2 | Folder containing the initial binning result |
| `visualise` | Visualise initial and refined binning results on the assembly graph | Assembly graph file(s), initial binning result, final binning result |
| `evaluate` | Evaluate binning results given a ground truth | Binning result, ground truth |
## `metacoag`

[^1]: The assembly graph files can vary depending on the assembler used to generate the contigs. The metaSPAdes version requires the assembly graph file in `.gfa` format and the paths file in `.paths` format. The MEGAHIT version requires the assembly graph file in `.gfa` format. The metaFlye version requires the assembly graph file in `.gfa` format and the paths file in `.txt` format.
### Tool/processing function

## Binning, preparing binning results and bin refinement
A user can start the analysis by running the `metacoag` subcommand to bin a metagenomic dataset using the metagenomic binning tool MetaCoAG [@Mallawaarachchi1:2022; @Mallawaarachchi2:2022] and obtain MAGs as shown in \autoref{fig1}.

GraphBin-Tk supports metagenome assemblies generated from three popular metagenome assemblers: metaSPAdes [@Nurk:2017] and MEGAHIT [@Li:2015] for short-read sequencing data and metaFlye [@Kolmogorov:2020] for long-read sequencing data. GraphBin-Tk can be launched using the command `gbintk`. A user can start the analysis by running the `metacoag` subcommand to bin a metagenomic dataset and obtain MAGs as shown in \autoref{fig1}. The inputs required are the contigs file, the assembly graph files and the read coverage of contigs. The read coverage of contigs can be obtained by running a coverage calculation tool such as Koverage [@Roach:2024]. The MetaCoAG binning result can be formatted using the `prepare` subcommand into a delimited text file such as `.csv` or `.tsv` that represents each contig and its bin name. This formatted binning result can be improved by providing it to either GraphBin or GraphBin2 using the subcommands `graphbin` or `graphbin2` with the contigs file, the assembly graph files and the read coverage of contigs (\autoref{fig1}).
### Inputs

## Visualisation
The following inputs are required to run the `metacoag` subcommand.
* Contigs
* Assembly graph file(s)
* Coverage of contigs - can be obtained by running a coverage calculation tool such as CoverM [https://github.com/wwood/CoverM](https://github.com/wwood/CoverM) or Koverage [@Roach:2024]

The initial MetaCoAG binning result and the refined binning result can be visualised on the assembly graph using the `visualise` subcommand (\autoref{fig1}). Users can generate images in different formats such as `png`, `eps`, `pdf` and `svg`, and customise the dimensions of the images. An example is shown in \autoref{fig2} for the Sim-5G+metaSPAdes dataset [@Mallawaarachchi2:2020; @Mallawaarachchi:2021] containing five bacterial species.
The assembly graph files can vary depending on the assembler used to generate the contigs. The metaSPAdes version requires the assembly graph file in `.gfa` format and the paths file in `.paths` format. The MEGAHIT version requires the assembly graph file in `.gfa` format. The metaFlye version requires the assembly graph file `assembly_graph.gfa` and the paths file `assembly_info.txt`.
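To illustrate the kind of connectivity information a `.gfa` file carries, the following is a minimal sketch that reads segment (`S`) and link (`L`) records into an adjacency map. It is not the parser used by GraphBin-Tk: the function name `read_gfa_links` and the example records are hypothetical, and the sketch works at the level of graph segments, ignoring the paths files that map contigs to segments for some assemblers.

```python
from typing import Dict, Iterable, Set


def read_gfa_links(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map of segments from GFA text lines."""
    adjacency: Dict[str, Set[str]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            # Segment record: S <name> <sequence> ...
            # Register the segment even if it has no links.
            adjacency.setdefault(fields[1], set())
        elif fields[0] == "L" and len(fields) >= 5:
            # Link record: L <from> <from_orient> <to> <to_orient> <overlap>
            source, target = fields[1], fields[3]
            adjacency.setdefault(source, set()).add(target)
            adjacency.setdefault(target, set()).add(source)
    return adjacency


# Hypothetical example records (tab-separated, as in GFA).
example_gfa = [
    "H\tVN:Z:1.0",
    "S\tseg1\tACGT",
    "S\tseg2\tGGCA",
    "S\tseg3\tTTAG",
    "L\tseg1\t+\tseg2\t-\t0M",
]
graph = read_gfa_links(example_gfa)
```

Here `seg1` and `seg2` end up as neighbours, while `seg3` is an isolated segment with no neighbours.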

### Outputs

The following outputs will be generated by the `metacoag` subcommand.
* A delimited text file containing the contig identifier and bin identifier for each binned contig
* `.fasta` files for each bin


## `prepare`

### Tool/processing function

If a delimited text file is not available, the MetaCoAG binning result can be formatted using the `prepare` subcommand into a delimited text file that represents each contig and its bin identifier.

### Inputs

The directory containing the initial binning is required to run the `prepare` subcommand.

### Outputs

The `prepare` subcommand will generate a delimited text file such as `.csv` or `.tsv` containing contig identifier and bin identifier for the binning result.
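The idea behind this formatting step can be sketched as follows, assuming the binning directory holds one `.fasta` file per bin and that the file name (without extension) serves as the bin identifier. This is an illustrative sketch, not the implementation of `prepare`; the function names and example file names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import List, Tuple


def collect_bin_assignments(bin_dir: Path) -> List[Tuple[str, str]]:
    """Return (contig identifier, bin identifier) pairs from per-bin FASTA files."""
    rows: List[Tuple[str, str]] = []
    for fasta_path in sorted(bin_dir.glob("*.fasta")):
        bin_id = fasta_path.stem  # assumption: the file name identifies the bin
        with fasta_path.open() as handle:
            for line in handle:
                header = line[1:].split() if line.startswith(">") else []
                if header:
                    # Contig identifier = header text up to the first whitespace.
                    rows.append((header[0], bin_id))
    return rows


def write_assignments(rows: List[Tuple[str, str]], out_path: Path) -> None:
    """Write the pairs as a comma-delimited text file."""
    with out_path.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)


# Demonstration on a temporary directory with two hypothetical bins.
with tempfile.TemporaryDirectory() as tmp:
    bin_dir = Path(tmp)
    (bin_dir / "bin_1.fasta").write_text(">contig_1\nACGT\n>contig_2 len=4\nGGCA\n")
    (bin_dir / "bin_2.fasta").write_text(">contig_3\nTTAG\n")
    assignments = collect_bin_assignments(bin_dir)
    write_assignments(assignments, bin_dir / "initial_binning.csv")
    csv_text = (bin_dir / "initial_binning.csv").read_text()
```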


## `graphbin`

### Tool/processing function

The formatted binning result from MetaCoAG can be refined by providing it to GraphBin [@Mallawaarachchi1:2020] using the subcommand `graphbin` (\autoref{fig1}).

### Inputs

The following inputs are required to run the `graphbin` subcommand.
* Contigs file
* Assembly graph file(s) - can vary depending on the assembler used to generate the contigs (refer to inputs under 'metacoag')
* A delimited text file containing the initial binning result

### Outputs

The following outputs will be generated by the `graphbin` subcommand.
* A delimited text file containing the contig identifier and bin identifier for each binned contig
* `.fasta` files for each bin

## `graphbin2`

### Tool/processing function

Alternatively, the formatted binning result can be refined by providing it to GraphBin2 [@Mallawaarachchi2:2020; @Mallawaarachchi:2021] using the subcommand `graphbin2` (\autoref{fig1}).

### Inputs

The following inputs are required to run the `graphbin2` subcommand.
* Contigs file
* Assembly graph file(s) - can vary depending on the assembler used to generate the contigs (refer to inputs under 'metacoag')
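* A delimited text file containing the initial binning result
* Coverage of contigs - can be obtained by running a coverage calculation tool such as CoverM [https://github.com/wwood/CoverM](https://github.com/wwood/CoverM) or Koverage [@Roach:2024]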

### Outputs

The following outputs will be generated by the `graphbin2` subcommand.
* A delimited text file containing the contig identifier and bin identifier for each binned contig
* `.fasta` files for each bin


## `visualise`

### Tool/processing function

The initial MetaCoAG binning result and the refined binning result can be visualised on the assembly graph using the `visualise` subcommand (\autoref{fig1}). Users can generate images in different formats such as `png`, `eps`, `pdf` and `svg`, and customise the dimensions of the images.

### Inputs

The following inputs are required to run the `visualise` subcommand.
* Contigs file
* Assembly graph file(s) - can vary depending on the assembler used to generate the contigs (refer to inputs under 'metacoag')
* A delimited text file containing the initial binning result
* A delimited text file containing the refined binning result

### Outputs

The following outputs will be generated by the `visualise` subcommand.
* Figure of the assembly graph with the initial binning result
* Figure of the assembly graph with the refined binning result

An example is shown in \autoref{fig2} for the Sim-5G+metaSPAdes dataset [@Mallawaarachchi2:2020; @Mallawaarachchi:2021] containing five bacterial species.

![Visualisation of the assembly graph with the initial binning result from MetaCoAG (left) and final binning result from GraphBin (right) for the Sim-5G+metaSPAdes dataset. The five colours represent the five bins and the white nodes represent unbinned contigs.\label{fig2}](visualisation.svg){width=100%}

## Evaluation
## `evaluate`

Finally, the produced binning results can be evaluated using the `evaluate` subcommand by providing the ground truth bins of contigs (\autoref{fig1}). This evaluation is possible only for simulated or mock metagenomes where the ground truth genomes of contigs are known. GraphBin-Tk uses four common metrics, 1) precision, 2) recall, 3) F1-score and 4) Adjusted Rand Index (ARI), that have been used in previous binning studies [@Alneberg:2014; @Meyer:2018; @Mallawaarachchi1:2020]. These metrics are calculated as follows. The binning result is denoted as a $K \times S$ matrix with $K$ bins and $S$ ground truth taxa. In this matrix, the element $a_{ks}$ denotes the number of contigs that are binned to the $k^{th}$ bin and belong to the $s^{th}$ taxon. $U$ denotes the number of unbinned contigs and $N$ denotes the total number of contigs. The following equations are used to calculate the evaluation metrics.
### Tool/processing function

The produced binning results can be evaluated using the `evaluate` subcommand by providing the ground truth bins of contigs (\autoref{fig1}). This evaluation is possible only for simulated or mock metagenomes where the ground truth genomes of contigs are known. GraphBin-Tk uses four common metrics, 1) precision, 2) recall, 3) F1-score and 4) Adjusted Rand Index (ARI), that have been used in previous binning studies [@Alneberg:2014; @Meyer:2018; @Mallawaarachchi1:2020]. These metrics are calculated as follows. The binning result is denoted as a $K \times S$ matrix with $K$ bins and $S$ ground truth taxa. In this matrix, the element $a_{ks}$ denotes the number of contigs that are binned to the $k^{th}$ bin and belong to the $s^{th}$ taxon. $U$ denotes the number of unbinned contigs and $N$ denotes the total number of contigs. The following equations are used to calculate the evaluation metrics.

__Precision__ = $\frac{\sum_{k}\max_s \{a_{ks}\}}{\sum_{k}\sum_{s}a_{ks}}$

@@ -92,7 +172,15 @@ __F1-score__ = $2 \times \frac{Precision\times Recall}{Precision+Recall}$

__ARI__ = $\frac{\sum_{k,s}\binom{a_{ks}}{2}-t_3}{\frac{1}{2}(t_1+t_2)-t_3}$, where $t_1 = \sum_{k}\binom{\sum_{s}a_{ks}}{2}$, $t_2 = \sum_{s}\binom{\sum_{k}a_{ks}}{2}$ and $t_3 = \frac{t_1t_2}{\binom{N}{2}}$
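As a worked illustration of the precision, F1-score and ARI equations shown here, the following is a minimal sketch that evaluates them on a small, made-up $2 \times 2$ matrix. The function names are hypothetical and this is not the code used by `evaluate`. Recall is taken as an input to the F1-score rather than computed, and $N$ defaults to the sum of the matrix entries, which is an assumption that holds only when every contig is binned.

```python
from math import comb
from typing import List, Optional


def precision(a: List[List[int]]) -> float:
    """Sum of per-bin maxima divided by the number of binned contigs."""
    return sum(max(row) for row in a) / sum(sum(row) for row in a)


def f1_score(precision_value: float, recall_value: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision_value * recall_value / (precision_value + recall_value)


def adjusted_rand_index(a: List[List[int]], n: Optional[int] = None) -> float:
    """ARI of a K x S matrix; n is the total number of contigs N."""
    if n is None:
        # Assumption: no unbinned contigs, so N equals the matrix total.
        n = sum(sum(row) for row in a)
    pair_sum = sum(comb(value, 2) for row in a for value in row)
    t1 = sum(comb(sum(row), 2) for row in a)            # pairs within bins
    t2 = sum(comb(sum(column), 2) for column in zip(*a))  # pairs within taxa
    t3 = t1 * t2 / comb(n, 2)
    return (pair_sum - t3) / (0.5 * (t1 + t2) - t3)


# Made-up example: rows are bins, columns are ground truth taxa.
matrix = [[5, 1],
          [0, 4]]
example_precision = precision(matrix)        # (5 + 4) / 10
example_ari = adjusted_rand_index(matrix)    # (16 - 28/3) / (20.5 - 28/3)
example_f1 = f1_score(0.9, 0.8)              # recall value supplied by hand
```

For this matrix the precision is 0.9 and the ARI is approximately 0.597; a perfectly diagonal matrix gives an ARI of 1.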

These metrics can be plotted for comparison between the initial binning result and the refined binning result as shown in \autoref{fig3}.
### Inputs

The following inputs are required to run the `evaluate` subcommand.
* Delimited text file for the ground truth
* Delimited text file for the binning result
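A minimal sketch of how two such delimited files could be combined into the $K \times S$ matrix and the unbinned count $U$ is given below. It assumes each row holds a contig identifier followed by a bin or taxon identifier, and that contigs present in the ground truth but absent from the binning result count as unbinned. The function names and example data are hypothetical, not taken from GraphBin-Tk.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_assignments(handle: TextIO, delimiter: str = ",") -> Dict[str, str]:
    """Map each contig identifier (column 1) to its label (column 2)."""
    return {row[0]: row[1]
            for row in csv.reader(handle, delimiter=delimiter)
            if len(row) >= 2}


def contingency_matrix(binning: Dict[str, str],
                       truth: Dict[str, str]) -> Tuple[List[List[int]], int]:
    """Return the K x S count matrix and the number of unbinned contigs."""
    # Only contigs with a known ground truth contribute to the matrix.
    bins = sorted({binning[c] for c in truth if c in binning})
    taxa = sorted(set(truth.values()))
    bin_index = {name: i for i, name in enumerate(bins)}
    taxon_index = {name: j for j, name in enumerate(taxa)}
    matrix = [[0] * len(taxa) for _ in bins]
    unbinned = 0
    for contig, taxon in truth.items():
        if contig in binning:
            matrix[bin_index[binning[contig]]][taxon_index[taxon]] += 1
        else:
            unbinned += 1
    return matrix, unbinned


# Made-up inputs held in memory instead of files on disk.
truth_file = io.StringIO("c1,A\nc2,A\nc3,B\nc4,B\nc5,B\n")
binning_file = io.StringIO("c1,bin_1\nc2,bin_1\nc3,bin_1\nc4,bin_2\n")
matrix, unbinned = contingency_matrix(read_assignments(binning_file),
                                      read_assignments(truth_file))
```

With these inputs, rows correspond to `bin_1` and `bin_2`, columns to taxa `A` and `B`, and contig `c5` is counted as unbinned.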

### Outputs

A text file containing the $K \times S$ matrix and the calculated evaluation metrics will be generated by the `evaluate` subcommand. These metrics can be plotted for comparison between the initial binning result and the refined binning result as shown in \autoref{fig3}.

![Comparison of evaluation metrics for the initial binning result from MetaCoAG and the refined binning result from GraphBin for the Sim-5G+metaSPAdes dataset.\label{fig3}](gbintk_metrics_comparison.svg){width=70%}

@@ -105,7 +193,7 @@ GraphBin-Tk is distributed as a Conda package available in the Bioconda channel

# Acknowledgements

This work is dedicated to the memory of the late Dr Yu Lin (The Australian National University) whose guidance and support were instrumental in shaping the original work. His wisdom and mentorship will be deeply missed.
This work is dedicated to the memory of the late Dr Yu Lin (The Australian National University) whose guidance and support were instrumental in shaping the original work. His wisdom and mentorship are deeply missed.

This work was supported by an Essential Open Source Software for Science Grant EOSS5-0000000223 from the Chan Zuckerberg Initiative. This work was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI Australia) which is supported by the Australian Government.
