Updates and BGC Feature #215

Merged: 47 commits, Jan 30, 2023
47 commits
f661583
fix: rollback bigslice to v1.1.1
matinnuhamunada Dec 16, 2022
a5e0c5f
chore: bump to v0.5.2
matinnuhamunada Dec 16, 2022
32c9a7b
feat: BiG-SCAPE update to 1.1.5 using MIBIG 3.1
matinnuhamunada Dec 21, 2022
a33c07d
chore: correct logging
matinnuhamunada Dec 21, 2022
0216adf
notebooks: add arts, cblaster, checkm, roary, and eggnog report template
matinnuhamunada Dec 21, 2022
88ca2f5
fix: correct bgc_ids from ARTS2 table
matinnuhamunada Jan 8, 2023
2dee08d
feat: extract bgc table from antismash
matinnuhamunada Jan 15, 2023
0f2c62f
notebook: update checkm figure with MIMAG classification
matinnuhamunada Jan 23, 2023
389c8b5
feat: give better flexibility to choose project types
matinnuhamunada Jan 24, 2023
4d68caf
feat: introduce a bgc project type
matinnuhamunada Jan 24, 2023
34f88bc
feat: add interproscan and clinker
matinnuhamunada Jan 24, 2023
81d3fcb
fix: correct genome_id naming
matinnuhamunada Jan 25, 2023
ed5a8da
chore: tidy logging info for gtdbtk and antismash install
matinnuhamunada Jan 25, 2023
5a35456
feat: add experimental Dockerfile
matinnuhamunada Jan 25, 2023
4879e9d
fix: add working Dockerfile and Dockerhub link
matinnuhamunada Jan 26, 2023
0ecd126
fix: enable multiple bgc_id from the same genome_id
matinnuhamunada Jan 27, 2023
56370f1
chore: use input path for variable changelog
matinnuhamunada Jan 27, 2023
8330ae5
feat: add parameter max cluster size for roary
matinnuhamunada Jan 27, 2023
527a330
test: rename to tests to accomodate unit tests
matinnuhamunada Jan 28, 2023
f77e3b5
test: add arts_extract unit test
matinnuhamunada Jan 28, 2023
066674a
test: add fix_gtdb_taxonomy test
matinnuhamunada Jan 28, 2023
0b09fc1
test: add gtdb_prep test
matinnuhamunada Jan 28, 2023
6363392
fix: add missing config files for tests
matinnuhamunada Jan 28, 2023
e3262fa
test: add mash_convert test
matinnuhamunada Jan 28, 2023
2f4dea4
test: add fastani_convert test
matinnuhamunada Jan 28, 2023
3f70103
test: add deeptfactor_to_json test
matinnuhamunada Jan 28, 2023
bbd8d68
test: add extract_ncbi_information test
matinnuhamunada Jan 28, 2023
d6e7f10
test: add antismash_overview_gather test
matinnuhamunada Jan 28, 2023
584f772
test: remove on-success and try out pytest action
matinnuhamunada Jan 28, 2023
1d2bc3a
test: add coverage report
matinnuhamunada Jan 28, 2023
1ad4d54
test: fix coverage report commentator
matinnuhamunada Jan 28, 2023
a3126c1
feat: add Snakefile to run Metabase
matinnuhamunada Jan 28, 2023
68d686b
fix: stop showing roary plot
matinnuhamunada Jan 28, 2023
9f2de0a
notebook: update eggnog reports
matinnuhamunada Jan 28, 2023
346d0b7
test: add lactobacillus minimal example
matinnuhamunada Jan 29, 2023
a41a0ca
fix: correct java and perl requirement for interproscan
matinnuhamunada Jan 29, 2023
beb4dbc
fix: don't include mibig in bigscape for BGCs
matinnuhamunada Jan 29, 2023
c1299a4
chore: use conda with JDK for metabase
matinnuhamunada Jan 29, 2023
a23398f
fix: pin gsl to 2.6 for mash
matinnuhamunada Jan 29, 2023
014a38f
chore: correct logging
matinnuhamunada Jan 30, 2023
49ab2e6
fix: correct zip location for BGC schema
matinnuhamunada Jan 30, 2023
519e1aa
notebook: update np.matrix to np.asarray and fix eggnog
matinnuhamunada Jan 30, 2023
f04e8ff
fix: rollback duckdb to version 38
matinnuhamunada Jan 30, 2023
f207901
temporary_fix: use experimental dbt branch
matinnuhamunada Jan 30, 2023
454325e
fix: make sure parquet is flat to process in duckdb
matinnuhamunada Jan 30, 2023
2d256f8
chore: pin dbt repo to v0.1.0 release
matinnuhamunada Jan 30, 2023
285dfa3
Merge branch 'main' into dev-0.6.0
matinnuhamunada Jan 30, 2023
3 changes: 3 additions & 0 deletions .examples/_config_example.yaml
Original file line number Diff line number Diff line change
@@ -24,6 +24,9 @@ projects:
  # Project 2 (PEP file)
  - name: .examples/_pep_example/project_config.yaml

bgc_projects:
  - name: config/lanthipeptide/project_config.yaml

#### GLOBAL RULE CONFIGURATION ####
# This section configures the rules to run globally.
# Use project-specific rule configurations if you want to run different rules for each project.
30 changes: 30 additions & 0 deletions .examples/lactobacillus_delbruecki/project_config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
name: Lactobacillus_delbrueckii

pep_version: 2.1.0

description: "Lactobacillus delbrueckii 27 01 2023"

sample_table: samples.csv

#### RULE CONFIGURATION ####
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  seqfu: TRUE
  mash: TRUE
  fastani: TRUE
  checkm: FALSE
  gtdbtk: FALSE
  prokka-gbk: TRUE
  antismash: TRUE
  query-bigslice: TRUE
  bigscape: TRUE
  bigslice: TRUE
  automlst-wrapper: TRUE
  arts: TRUE
  roary: TRUE
  eggnog: TRUE
  eggnog-roary: TRUE
  deeptfactor: TRUE
  deeptfactor-roary: TRUE
  cblaster-genome: TRUE
  cblaster-bgc: TRUE
5 changes: 5 additions & 0 deletions .examples/lactobacillus_delbruecki/samples.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
genome_id,source,organism,genus,species,strain,closest_placement_reference,input_file
GCA_000056065.1,ncbi,,,,,,
GCA_000182835.1,ncbi,,,,,,
GCA_000191165.1,ncbi,,,,,,
GCA_000014405.1,ncbi,,,,,,
5 changes: 5 additions & 0 deletions .examples/lanthipeptide/df_antismash_6.1.1_bgc.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
bgc_id,genome_id,region,accession,start_pos,end_pos,contig_edge,product,region_length,source,gbk_path
CR954253.1.region001,GCA_000056065.1,1.1,CR954253.1,17407,39909,False,['lanthipeptide-class-iii'],22502,bgcflow,
CR954253.1.region002,GCA_000056065.1,1.2,CR954253.1,1745672,1767868,False,['lanthipeptide-class-iv'],22196,bgcflow,
CP000156.1.region001,GCA_000191165.1,1.1,CP000156.1,1767251,1789447,False,['lanthipeptide-class-iv'],22196,bgcflow,
CP000412.1.region001,GCA_000014405.1,1.1,CP000412.1,17283,39785,False,['lanthipeptide-class-iii'],22502,bgcflow,
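The BGC sample table above can be sanity-checked with a short script. This is a minimal sketch and not part of the PR: it assumes only the column layout shown in the table, and the helper name `group_bgcs_by_genome` is hypothetical. It groups `bgc_id` values by `genome_id` (one genome may contribute several BGCs) and checks that `region_length` equals `end_pos - start_pos`.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the table above; the trailing gbk_path column is empty.
BGC_TABLE = """\
bgc_id,genome_id,region,accession,start_pos,end_pos,contig_edge,product,region_length,source,gbk_path
CR954253.1.region001,GCA_000056065.1,1.1,CR954253.1,17407,39909,False,['lanthipeptide-class-iii'],22502,bgcflow,
CR954253.1.region002,GCA_000056065.1,1.2,CR954253.1,1745672,1767868,False,['lanthipeptide-class-iv'],22196,bgcflow,
CP000156.1.region001,GCA_000191165.1,1.1,CP000156.1,1767251,1789447,False,['lanthipeptide-class-iv'],22196,bgcflow,
CP000412.1.region001,GCA_000014405.1,1.1,CP000412.1,17283,39785,False,['lanthipeptide-class-iii'],22502,bgcflow,
"""


def group_bgcs_by_genome(text):
    """Return {genome_id: [bgc_id, ...]}, validating region lengths on the way.

    Hypothetical helper for illustration; not a function from the workflow.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        length = int(row["end_pos"]) - int(row["start_pos"])
        if length != int(row["region_length"]):
            raise ValueError(f"{row['bgc_id']}: region_length mismatch")
        groups[row["genome_id"]].append(row["bgc_id"])
    return dict(groups)


if __name__ == "__main__":
    for genome_id, bgc_ids in group_bgcs_by_genome(BGC_TABLE).items():
        print(genome_id, bgc_ids)
```

Keying on `bgc_id` rather than `genome_id` is what lets two regions from `GCA_000056065.1` coexist in the same table.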
11 changes: 11 additions & 0 deletions .examples/lanthipeptide/project_config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
name: lanthipeptide_lactobacillus
pep_version: 2.1.0
description: 'A selection of lanthipeptides from Lactobacillus delbrueckii'
sample_table: df_antismash_6.1.1_bgc.csv

rules:
  bigslice: TRUE
  bigscape: TRUE
  query-bigslice: TRUE
  clinker: TRUE
  interproscan: TRUE
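To illustrate how a project file like the one above selects analyses, here is a small sketch that lists the rules set to `TRUE`. It is an assumption-laden example, not the workflow's own loader: it uses a hand-rolled parser that only handles this flat layout (a real setup would more likely use a YAML or PEP library), it assumes the usual two-space nesting under `rules:`, and the function name `enabled_rules` is hypothetical.

```python
# Project config copied from above, with two-space nesting assumed under `rules:`.
PROJECT_CONFIG = """\
name: lanthipeptide_lactobacillus
pep_version: 2.1.0
description: 'A selection of lanthipeptides from Lactobacillus delbrueckii'
sample_table: df_antismash_6.1.1_bgc.csv

rules:
  bigslice: TRUE
  bigscape: TRUE
  query-bigslice: TRUE
  clinker: TRUE
  interproscan: TRUE
"""


def enabled_rules(text):
    """Collect rule names set to TRUE under the top-level `rules:` key.

    Hypothetical helper; handles only `key: value` lines nested one level deep.
    """
    enabled = []
    in_rules = False
    for line in text.splitlines():
        # Skip blank lines and comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        # An unindented line is a top-level key: it opens or closes the block.
        if not line.startswith(" "):
            in_rules = line.strip() == "rules:"
            continue
        if in_rules:
            key, _, value = line.strip().partition(":")
            if value.strip().upper() == "TRUE":
                enabled.append(key)
    return enabled


if __name__ == "__main__":
    print(enabled_rules(PROJECT_CONFIG))
```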
22 changes: 20 additions & 2 deletions .github/workflows/push.yml
Original file line number Diff line number Diff line change
@@ -23,7 +23,7 @@ jobs:
       - name: Linting
         uses: snakemake/[email protected]
         with:
-          directory: .test
+          directory: .tests
           snakefile: workflow/Snakefile
           stagein: "conda config --set channel_priority strict"
           args: "--lint"
@@ -41,7 +41,25 @@ jobs:
       - name: Dry-run workflow
         uses: snakemake/[email protected]
         with:
-          directory: .test
+          directory: .tests
           snakefile: workflow/Snakefile
           stagein: "conda config --set channel_priority strict"
           args: "--use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache -n"
+
+  unit-test:
+    runs-on: ubuntu-latest
+    needs:
+      - dry-run
+    steps:
+      - uses: actions/checkout@v3
+      - uses: actions/setup-python@v4
+        with:
+          python-version: 3.x
+      - run: pip install git+https://github.com/NBChub/bgcflow_wrapper.git
+      - run: pip install pytest-cov
+      - name: Build coverage file
+        run: pytest --cov=.tests/unit .tests/unit/ > pytest-coverage.txt
+      - name: Comment coverage
+        uses: coroo/[email protected]
+        with:
+          pytest-coverage: pytest-coverage.txt
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -22,3 +22,5 @@ workflow/bgcflow/build*
.panoptes.db
notebooks/
*.ipynb_checkpoints/
plugins/
metabase.db*
File renamed without changes. (12 files)
62 changes: 62 additions & 0 deletions .tests/unit/antismash_overview_gather/data/config/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
# This file should contain everything to configure the workflow on a global scale.

#### PROJECT INFORMATION ####
# This section controls your project configuration.
# Projects are separated by "-".
# A project can be defined as (1) a yaml object or (2) a Portable Encapsulated Project (PEP) file.
# (1) To define a project as a yaml object, it must contain the variables "name" and "samples".
# - name : name of your project
# - samples : a csv file containing a list of genome ids for analysis with multiple sources mentioned. Genome ids must be unique.
# - rules: a yaml file containing project rule configurations. This will override global rule configuration.
# - prokka-db (optional): list of the custom accessions to use as prokka reference database.
# - gtdb-tax (optional): output summary file of GTDB-tk with "user_genome" and "classification" as the two minimum columns
# (2) To define a project using a PEP file, give only the variable "name", which points to the location of the PEP yaml file.
# - name: path to PEP .yaml file. See project example_pep for details.

projects:
# Project 1 (yaml object)
# - name: example
# samples: config/_genome_project_example/samples.csv
# rules: config/_genome_project_example/project_config.yaml
# prokka-db: config/_genome_project_example/prokka-db.csv
# gtdb-tax: config/_genome_project_example/gtdbtk.bac120.summary.tsv

# Project 2 (PEP file)
# - name: config/_pep_example/project_config.yaml
- name: config/lactobacillus_delbruecki/project_config.yaml
#### GLOBAL RULE CONFIGURATION ####
# This section configures the rules to run globally.
# Use project-specific rule configurations if you want to run different rules for each project.
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  seqfu: FALSE
  mash: FALSE
  fastani: FALSE
  checkm: FALSE
  gtdbtk: FALSE
  prokka-gbk: FALSE
  antismash: TRUE
  query-bigslice: FALSE
  bigscape: FALSE
  bigslice: FALSE
  automlst-wrapper: FALSE
  arts: FALSE
  roary: FALSE
  eggnog: FALSE
  eggnog-roary: FALSE
  deeptfactor: FALSE
  deeptfactor-roary: FALSE
  cblaster-genome: FALSE
  cblaster-bgc: FALSE

#### RESOURCES CONFIGURATION ####
# resources : the location of the resources to run the rule.
# The default location is at "resources/{resource_name}".
resources_path:
  antismash_db: resources/antismash_db
  eggnog_db: resources/eggnog_db
  BiG-SCAPE: resources/BiG-SCAPE
  bigslice: resources/bigslice
  checkm: resources/checkm
  gtdbtk: resources/gtdbtk
  #RNAmmer: resources/RNAmmer # If specified, will override Barrnap in Prokka
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
name: Lactobacillus_delbrueckii

pep_version: 2.1.0

description: "Lactobacillus delbrueckii 27 01 2023"

sample_table: samples.csv

#### RULE CONFIGURATION ####
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  seqfu: TRUE
  mash: TRUE
  fastani: TRUE
  checkm: FALSE
  gtdbtk: FALSE
  prokka-gbk: TRUE
  antismash: TRUE
  query-bigslice: TRUE
  bigscape: TRUE
  bigslice: TRUE
  automlst-wrapper: TRUE
  arts: TRUE
  roary: TRUE
  eggnog: TRUE
  eggnog-roary: TRUE
  deeptfactor: TRUE
  deeptfactor-roary: TRUE
  cblaster-genome: TRUE
  cblaster-bgc: TRUE
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
genome_id,source,organism,genus,species,strain,closest_placement_reference,input_file
GCA_000056065.1,ncbi,,,,,,
GCA_000182835.1,ncbi,,,,,,
GCA_000191165.1,ncbi,,,,,,
GCA_000014405.1,ncbi,,,,,,
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
{
"CP000412.1.region001": {
"genome_id": "GCA_000014405.1",
"region": "1.1",
"accession": "CP000412.1",
"start_pos": "17283",
"end_pos": "39785",
"contig_edge": "False",
"product": [
"lanthipeptide-class-iii"
],
"region_length": 22502
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
{
"CR954253.1.region001": {
"genome_id": "GCA_000056065.1",
"region": "1.1",
"accession": "CR954253.1",
"start_pos": "17407",
"end_pos": "39909",
"contig_edge": "False",
"product": [
"lanthipeptide-class-iii"
],
"region_length": 22502
},
"CR954253.1.region002": {
"genome_id": "GCA_000056065.1",
"region": "1.2",
"accession": "CR954253.1",
"start_pos": "1745672",
"end_pos": "1767868",
"contig_edge": "False",
"product": [
"lanthipeptide-class-iv"
],
"region_length": 22196
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
{}
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
{
"CP000156.1.region001": {
"genome_id": "GCA_000191165.1",
"region": "1.1",
"accession": "CP000156.1",
"start_pos": "1767251",
"end_pos": "1789447",
"contig_edge": "False",
"product": [
"lanthipeptide-class-iv"
],
"region_length": 22196
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
genome_id,source,organism,genus,species,strain,closest_placement_reference,input_file,ref_annotation,classification_source,sample_paths,prokka-db,gtdb_paths,name
GCA_000056065.1,ncbi,,,,,,,,ncbi,config/lactobacillus_delbruecki/samples.csv,,,Lactobacillus_delbrueckii
GCA_000182835.1,ncbi,,,,,,,,ncbi,['config/lactobacillus_delbruecki/samples.csv'],[''],[''],['Lactobacillus_delbrueckii']
GCA_000191165.1,ncbi,,,,,,,,ncbi,['config/lactobacillus_delbruecki/samples.csv'],[''],[''],['Lactobacillus_delbrueckii']
GCA_000014405.1,ncbi,,,,,,,,ncbi,['config/lactobacillus_delbruecki/samples.csv'],[''],[''],['Lactobacillus_delbrueckii']
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
{
"GCA_000014405.1": {
"CP000412.1.region001.gbk": {
"record_id": "CP000412.1",
"original_id": "CP000412.1",
"target_path": "data/interim/antismash/6.1.1/GCA_000014405.1/CP000412.1.region001.gbk",
"symlink_path": "data/interim/bgcs/Lactobacillus_delbrueckii/6.1.1/GCA_000014405.1/CP000412.1.region001.gbk"
},
"GCA_000014405.1.gbk": {
"record_id": "CP000412.1",
"original_id": "CP000412.1",
"target_path": "data/interim/antismash/6.1.1/GCA_000014405.1/GCA_000014405.1.gbk",
"symlink_path": "data/interim/bgcs/Lactobacillus_delbrueckii/6.1.1/GCA_000014405.1/GCA_000014405.1.gbk"
}
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
{
"GCA_000056065.1": {
"CR954253.1.region001.gbk": {
"record_id": "CR954253.1",
"original_id": "CR954253.1",
"target_path": "data/interim/antismash/6.1.1/GCA_000056065.1/CR954253.1.region001.gbk",
"symlink_path": "data/interim/bgcs/Lactobacillus_delbrueckii/6.1.1/GCA_000056065.1/CR954253.1.region001.gbk"
},
"CR954253.1.region002.gbk": {
"record_id": "CR954253.1",
"original_id": "CR954253.1",
"target_path": "data/interim/antismash/6.1.1/GCA_000056065.1/CR954253.1.region002.gbk",
"symlink_path": "data/interim/bgcs/Lactobacillus_delbrueckii/6.1.1/GCA_000056065.1/CR954253.1.region002.gbk"
},
"GCA_000056065.1.gbk": {
"record_id": "CR954253.1",
"original_id": "CR954253.1",
"target_path": "data/interim/antismash/6.1.1/GCA_000056065.1/GCA_000056065.1.gbk",
"symlink_path": "data/interim/bgcs/Lactobacillus_delbrueckii/6.1.1/GCA_000056065.1/GCA_000056065.1.gbk"
}
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
{
"GCA_000182835.1": {
"GCA_000182835.1.gbk": {
"record_id": "CP002342.1",
"original_id": "CP002342.1",
"target_path": "data/interim/antismash/6.1.1/GCA_000182835.1/GCA_000182835.1.gbk",
"symlink_path": "data/interim/bgcs/Lactobacillus_delbrueckii/6.1.1/GCA_000182835.1/GCA_000182835.1.gbk"
}
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
{
"GCA_000191165.1": {
"CP000156.1.region001.gbk": {
"record_id": "CP000156.1",
"original_id": "CP000156.1",
"target_path": "data/interim/antismash/6.1.1/GCA_000191165.1/CP000156.1.region001.gbk",
"symlink_path": "data/interim/bgcs/Lactobacillus_delbrueckii/6.1.1/GCA_000191165.1/CP000156.1.region001.gbk"
},
"GCA_000191165.1.gbk": {
"record_id": "CP000156.1",
"original_id": "CP000156.1",
"target_path": "data/interim/antismash/6.1.1/GCA_000191165.1/GCA_000191165.1.gbk",
"symlink_path": "data/interim/bgcs/Lactobacillus_delbrueckii/6.1.1/GCA_000191165.1/GCA_000191165.1.gbk"
}
}
}
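The per-genome change-log fixtures above map each antiSMASH output file to its symlink location. As a rough sketch of how such a mapping could be used, the code below derives the BGC region ids for each genome from the file names. It is not taken from the PR: the function name `region_bgc_ids` and the `.regionNNN.gbk` naming pattern are assumptions based only on the fixture file names shown here, and the data is trimmed to the `target_path` field.

```python
import re

# Trimmed copy of the fixtures above (only target_path kept per file).
CHANGELOG = {
    "GCA_000056065.1": {
        "CR954253.1.region001.gbk": {
            "target_path": "data/interim/antismash/6.1.1/GCA_000056065.1/CR954253.1.region001.gbk"
        },
        "CR954253.1.region002.gbk": {
            "target_path": "data/interim/antismash/6.1.1/GCA_000056065.1/CR954253.1.region002.gbk"
        },
        "GCA_000056065.1.gbk": {
            "target_path": "data/interim/antismash/6.1.1/GCA_000056065.1/GCA_000056065.1.gbk"
        },
    },
    "GCA_000182835.1": {
        "GCA_000182835.1.gbk": {
            "target_path": "data/interim/antismash/6.1.1/GCA_000182835.1/GCA_000182835.1.gbk"
        },
    },
}

# Assumed naming pattern: <accession>.region<digits>.gbk marks a BGC region file.
REGION_FILE = re.compile(r"^(?P<bgc_id>.+\.region\d+)\.gbk$")


def region_bgc_ids(changelog):
    """Return {genome_id: sorted list of BGC ids found among its files}.

    Hypothetical helper; whole-genome files such as GCA_xxx.gbk are skipped.
    """
    result = {}
    for genome_id, files in changelog.items():
        ids = []
        for filename in files:
            match = REGION_FILE.match(filename)
            if match:
                ids.append(match.group("bgc_id"))
        result[genome_id] = sorted(ids)
    return result


if __name__ == "__main__":
    for genome_id, bgc_ids in region_bgc_ids(CHANGELOG).items():
        print(genome_id, bgc_ids)
```

A genome with no region files, such as `GCA_000182835.1`, yields an empty list, which is consistent with the empty `{}` overview fixture earlier in the diff.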
1 change: 1 addition & 0 deletions .tests/unit/antismash_overview_gather/data/workflow
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
bgc_id,genome_id,region,accession,start_pos,end_pos,contig_edge,product,region_length
CR954253.1.region001,GCA_000056065.1,1.1,CR954253.1,17407,39909,False,['lanthipeptide-class-iii'],22502
CR954253.1.region002,GCA_000056065.1,1.2,CR954253.1,1745672,1767868,False,['lanthipeptide-class-iv'],22196
CP000156.1.region001,GCA_000191165.1,1.1,CP000156.1,1767251,1789447,False,['lanthipeptide-class-iv'],22196
CP000412.1.region001,GCA_000014405.1,1.1,CP000412.1,17283,39785,False,['lanthipeptide-class-iii'],22502
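The expected table above is what the `antismash_overview_gather` fixtures should produce when the per-genome JSON overviews are combined. The following is a minimal sketch of that gathering step under stated assumptions: it is not the workflow's actual script, the function name `gather_overview` is hypothetical, and the column order is simply copied from the expected CSV header.

```python
import csv
import io

# Column order copied from the expected CSV header above.
COLUMNS = [
    "bgc_id", "genome_id", "region", "accession", "start_pos",
    "end_pos", "contig_edge", "product", "region_length",
]

# Two per-genome overviews in the shape of the JSON fixtures; one is empty.
PER_GENOME = [
    {
        "CR954253.1.region001": {
            "genome_id": "GCA_000056065.1",
            "region": "1.1",
            "accession": "CR954253.1",
            "start_pos": "17407",
            "end_pos": "39909",
            "contig_edge": "False",
            "product": ["lanthipeptide-class-iii"],
            "region_length": 22502,
        }
    },
    {},  # a genome without any detected region
]


def gather_overview(per_genome):
    """Flatten a list of {bgc_id: record} dicts into CSV text.

    Hypothetical helper: list values such as `product` are written with
    Python's str(), which gives the ['...'] form seen in the expected table.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for records in per_genome:
        for bgc_id, record in records.items():
            writer.writerow({"bgc_id": bgc_id, **record})
    return buffer.getvalue()


if __name__ == "__main__":
    print(gather_overview(PER_GENOME), end="")
```

Empty overviews contribute no rows, so genomes without BGCs simply drop out of the gathered table.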