docs: fix typos and grammar
matinnuhamunada authored Jun 15, 2023
1 parent 4c01fd2 commit 592fc6f
Showing 1 changed file with 29 additions and 29 deletions.
58 changes: 29 additions & 29 deletions README.md
@@ -1,16 +1,16 @@
# BGCflow
# BGCFlow
[![Snakemake](https://img.shields.io/badge/snakemake-≥7.14.0-brightgreen.svg)](https://snakemake.bitbucket.io)
[![PEP compatible](https://pepkit.github.io/img/PEP-compatible-green.svg)](https://pep.databio.org)

BGCFlow is a systematic workflow for the analysis of biosynthetic gene clusters across large collection of genomes (pangenomes) from internal & public datasets.
BGCFlow is a systematic workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes) from internal & public datasets.

## Quick Start
A quick and easy way to use BGCFlow using [`bgcflow_wrapper`](https://github.com/NBChub/bgcflow_wrapper).

1. Create a conda environment and install the [BGCFlow python wrapper](https://github.com/NBChub/bgcflow_wrapper) :

```bash
# create and activate new conda environment
# create and activate a new conda environment
conda create -n bgcflow pip -y
conda activate bgcflow

@@ -31,11 +31,11 @@ bgcflow run -n # do a dry run, remove the flag "-n" to run the example dataset
See [`README.md`](https://github.com/NBChub/bgcflow_wrapper) for more details about [`bgcflow_wrapper`](https://github.com/NBChub/bgcflow_wrapper).

## Workflow overview
The main Snakefile workflow comprise of various pipelines for data selection, functional annotation, phylogenetic analysis, genome mining, and comparative genomics for Prokaryotic datasets.
The main Snakefile workflow comprises various pipelines for data selection, functional annotation, phylogenetic analysis, genome mining, and comparative genomics for Prokaryotic datasets.

![dag](workflow/report/images/rulegraph_annotated.png)

Available pipelines in the main Snakefile can be checked using:
Available pipelines in the main Snakefile can be checked using the following command:
```
bgcflow pipelines
```
@@ -55,10 +55,10 @@ bgcflow pipelines
> ```
### Step 2: Configure the workflow
Configure the workflow according to your needs via editing the files in the `config/` folder.
Configure the workflow according to your needs by editing the files in the `config/` folder.
#### 2.1 Using template example
An example of the configuration files are provided in the `.examples` folder.
An example of the configuration files is provided in the `.examples` folder.
If you have a fresh copy of BGCFlow, you can initiate config and examples using by copying the necessary files to `config/` folder:
```shell
@@ -90,24 +90,24 @@ See [project_config.yaml](.examples/_pep_example/project_config.yaml) for an exa
> ```
##### 2.2.1 BGCFlow Format
A project can also be configured as previously described in BGCFlow version `<=0.3.3`. In the main `config/config.yaml`, each `project` starts with "`-`" and must contain the name of your project (`name`), the location of the sample file (`samples.csv`) and a rule configuration file (`project_config.csv`):
A project can also be configured as previously described in BGCFlow version `<=0.3.3`. In the main `config/config.yaml`, each `project` starts with "`-`" and must contain the name of your project (`name`), the location of the sample file (`samples.csv`), and a rule configuration file (`project_config.csv`):
```yaml
projects:
- name: example
samples: .examples/_genome_project_example/samples.csv
rules: .examples/_genome_project_example/project_config.yaml
```
Note that the location of the the sample file and the rule configuration file is relative to your `bgcflow` directory.
Note that the location of the sample file and the rule configuration file is relative to your `bgcflow` directory.

Ideally, you can organize a project as a set of genomes from a certain clade (pangenome).

See [further configuration](#further-configuration) for more details.

#### 2.2 Setting Up Your Samples Information
The variable `sample_table` (PEP) or `samples` denote the location of your `.csv` file which specify the genomes to analyse. Note that you can name the file anything as long as you define it in the `config.yaml`.
The variable `sample_table` (PEP) or `samples` denote the location of your `.csv` file which specifies the genomes to analyze. Note that you can name the file anything as long as you define it in the `config.yaml`.

Example : `samples.csv`
Example: `samples.csv`

| genome_id | source | organism | genus | species | strain |closest_placement_reference|
|----------------:|-------:|--------------------------------:|-------------:|--------:| ----------:|--------------------------:|
@@ -116,14 +116,14 @@ Example : `samples.csv`
| P8-2B-3.1 | custom | Streptomyces sp. P8-2B-3 | Streptomyces | sp. | P8-2B-3 | |

Columns description:
- **`genome_id`** _[required]_: The genome accession ids (with genome version for `ncbi` and `patric` genomes). For `custom` fasta file provided by users, it should refer to the fasta file names stored in `data/raw/fasta/` directory with `.fna` extension. **Example:** genome id P8-2B-3.1 refers to the file `data/raw/fasta/P8-2B-3.1.fna`.
- **`genome_id`** _[required]_: The genome accession ids (with genome version for `ncbi` and `patric` genomes). For `custom` fasta file provided by users, it should refer to the fasta file names stored in the `data/raw/fasta/` directory with `.fna` extension. **Example:** genome id P8-2B-3.1 refers to the file `data/raw/fasta/P8-2B-3.1.fna`.
- **`source`** _[required]_: Source of the genome to be analyzed choose one of the following: `custom`, `ncbi`, `patric`. Where:
- `custom` : for user provided genomes (`.fna`) in the `data/raw/fasta` directory with genome ids as filenames
- `ncbi` : for list of public genome accession IDs that will be downloaded from the NCBI refseq (GCF...) or genbank (GCA...) database
- `custom`: for user-provided genomes (`.fna`) in the `data/raw/fasta` directory with genome ids as filenames
- `ncbi`: for list of public genome accession IDs that will be downloaded from the NCBI refseq (GCF...) or genbank (GCA...) database
- `patric`: for list of public genome accession IDs that will be downloaded from the PATRIC database
- `organism` _[optional]_ : name of the organism that is same as in the fasta header
- `organism` _[optional]_: name of the organism that is the same as in the fasta header
- `genus` _[optional]_ : genus of the organism. Ideally identified with GTDBtk.
- `species` _[optional]_ : species epithet (the second word in a species name) of the organism. Ideally identified with GTDBtk.
- `species` _[optional]_: species epithet (the second word in a species name) of the organism. Ideally identified with GTDBtk.
- `strain` _[optional]_ : strain id of the organism
- `closest_placement_reference` _[optional]_: if known, the closest NCBI genome to the organism. Ideally identified with GTDBtk.
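
The column rules above can be checked before starting a run. The following is a minimal sketch, not part of BGCFlow or of this commit: the function name `check_samples` and its `fasta_dir` parameter are hypothetical, and it only assumes what the column description states (two required columns, three allowed `source` values, and `custom` genomes stored as `<genome_id>.fna`).

```python
import csv
import io
from pathlib import Path

# Values taken from the column description; everything else is illustrative.
ALLOWED_SOURCES = {"custom", "ncbi", "patric"}
REQUIRED_COLUMNS = ("genome_id", "source")


def check_samples(csv_text, fasta_dir="data/raw/fasta"):
    """Return a list of human-readable problems found in a samples table."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []

    # Both required columns must be present in the header.
    for column in REQUIRED_COLUMNS:
        if column not in header:
            problems.append(f"missing required column: {column}")
    if problems:
        return problems

    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        genome_id = (row["genome_id"] or "").strip()
        source = (row["source"] or "").strip()
        if not genome_id:
            problems.append(f"line {line_no}: empty genome_id")
        if source not in ALLOWED_SOURCES:
            problems.append(f"line {line_no}: unknown source {source!r}")
        elif source == "custom" and genome_id:
            # Custom genomes are expected as <fasta_dir>/<genome_id>.fna.
            fasta = Path(fasta_dir) / f"{genome_id}.fna"
            if not fasta.is_file():
                problems.append(f"line {line_no}: expected FASTA file {fasta}")
    return problems
```

Calling `check_samples(Path("config/samples.csv").read_text())` from the `bgcflow` directory would return an empty list when the table passes these checks. The sketch does not verify that `ncbi` or `patric` accessions actually exist.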

@@ -161,7 +161,7 @@ Installing Snakemake using [Mamba](https://github.com/mamba-org/mamba) is advise
You can use [`bgcflow_wrapper`](https://github.com/NBChub/bgcflow_wrapper) environment from [Quick Start](#Quick-Start) or install BGCFlow environment which contain Snakemake (`version 7.14.0`) and other dependencies with:

```bash
# create and activate new conda environment
# create and activate a new conda environment
conda create -n bgcflow pip -y
conda activate bgcflow
@@ -193,15 +193,15 @@ See the [Snakemake documentation](https://snakemake.readthedocs.io/en/stable/exe

## Further configuration
### Custom Prokka database
You can add an optional parameters: `prokka-db`, which refer to the location of a `.csv` file containing a list of your custom reference genomes for [`prokka`](https://github.com/tseemann/prokka#option---proteins) annotation:
You can add an optional parameter: `prokka-db`, which refers to the location of a `.csv` file containing a list of your custom reference genomes for [`prokka`](https://github.com/tseemann/prokka#option---proteins) annotation:
```yaml
projects:
- name: example
samples: config/samples.csv
prokka-db: config/prokka-db.csv
```

The file `prokka-db.csv` should contain a list of high quality annotated genomes that you would like to use to prioritise prokka annotations.
The file `prokka-db.csv` should contain a list of high-quality annotated genomes that you would like to use to prioritize prokka annotations.

`prokka-db.csv` example for Actinomycete group:

@@ -211,17 +211,17 @@ The file `prokka-db.csv` should contain a list of high quality annotated genomes
| GCA_000196835.1 | Amycolatopsis mediterranei U32 |

### Taxonomic Placement
The workflow will prioritize user provided taxonomic placement by adding an optional parameters: `gtdb-tax`, which refer to a similar GTDB-tk summary file, but only the "user_genome" and "classification" columns are required.
The workflow will prioritize user-provided taxonomic placement by adding an optional parameter: `gtdb-tax`, which refers to a similar GTDB-tk summary file, but only the "user_genome" and "classification" columns are required.

`gtdbtk.bac120.summary.tsv` example:

| user_genome | classification |
|------------:|---------------------------------------------------------------------------------------------------------------------------------------:|
| P8-2B-3.1 | d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Streptomycetales;f__Streptomycetaceae;g__Streptomyces;s__Streptomyces albidoflavus |

If these are not provided, the workflow will use the `closest_placement_reference` columns in the sample file (see above). Note that the value must be a valid genome accession in the latest GTDB release (currently R202), otherwise it will raise an error.
If these are not provided, the workflow will use the `closest_placement_reference` columns in the sample file (see above). Note that the value must be a valid genome accession in the latest GTDB release (currently R202), otherwise, it will raise an error.

If these information is not provided, then the workflow will guess the taxonomic placement by:
If this information is not provided, then the workflow will guess the taxonomic placement by:
1. If the `source` is `ncbi`, it will try to find the accession via GTDB API. If it doesn't find any information then,
2. It will use the `genus` table and find the parent taxonomy via GTDB API, which then results in `_genus_ sp.` preceded by the matching parent taxonomy.
3. If both option does not find any taxonomic information, then it will return empty taxonomic values.
@@ -239,10 +239,10 @@ projects:
- name: example_2
samples: config/samples_2.csv
```
Note that each `project` must have unique `name` and `samples` value.
Note that each `project` must have a unique `name` and `samples` value.

### Setting custom resources/databases folder
By default, the resources folder containing software and database dependencies are stored in the `resources/` directory.
By default, the resources folder containing software and database dependencies is stored in the `resources/` directory.

If you already have the resources folder somewhere else in your local machine, you can tell the workflow about their locations:

@@ -253,7 +253,7 @@ resources_path:
BiG-SCAPE: $HOME/your_local_directory/BiG-SCAPE
```
## List of Configurable Features
Here you can find rules keyword that you can run within BGCflow.
Here you can find rules keywords that you can run within BGCflow.
| Keywords | Description | Links |
|:---------| :------------- | :------------------------- |
| seqfu | Returns contig statistics of the genomes | [SeqFu](https://github.com/telatin/seqfu2)|
@@ -279,7 +279,7 @@ Here you can find rules keyword that you can run within BGCflow.
| cblaster-bgcs | Generate cblaster databases for bgcs in project | [cblaster](https://github.com/gamcil/cblaster) |

## Using snakemake profiles for further configurations
When using different machines, you can, for example, adapt the number of threads required for each rules using a snakemake profile. An example is given in [`config/examples/_profile_example/config.yaml`](config/examples/_profile_example/config.yaml):
When using different machines, you can, for example, adapt the number of threads required for each rule using a Snakemake profile. An example is given in [`config/examples/_profile_example/config.yaml`](config/examples/_profile_example/config.yaml):
```yaml
set-threads:
- antismash=4
@@ -288,11 +288,11 @@ set-threads:
- bigslice=16
```

You can use run a snakemake jobs with the above profile with:
You can use run a snakemake job with the above profile with:
```bash
snakemake --profile config/examples/_profile_example/ --use-conda -c $N -n # remove the dry-run parameters "-n" for actual run
snakemake --profile config/examples/_profile_example/ --use-conda -c $N -n # remove the dry-run parameters "-n" for the actual run
```
Or also with a defined `config` file:
```bash
snakemake --configfile config/examples/_config_example.yaml --profile config/examples/_profile_example/ --use-conda -c $N -n # remove the dry-run parameters "-n" for actual run
snakemake --configfile config/examples/_config_example.yaml --profile config/examples/_profile_example/ --use-conda -c $N -n # remove the dry-run parameters "-n" for the actual run
```
