Plan for v0.6.1 #226

Merged: 53 commits, Jun 9, 2023
Changes from 27 commits

Commits
d30836d
chore: initial commit for v0.6.1
matinnuhamunada Jan 30, 2023
50503d0
fix: pin numpy for GTDBtk
matinnuhamunada Feb 18, 2023
8e9d84e
fix: correct BGC compare workflow
matinnuhamunada Mar 3, 2023
e0ed910
chore: correct automlst DAG
matinnuhamunada Mar 3, 2023
31bb9a5
chore: add library for relationship transfer to metabase
matinnuhamunada Mar 3, 2023
df5f590
add instruction to transfer relationship to metabase
matinnuhamunada Mar 3, 2023
769455c
fix: add nbextensions
matinnuhamunada Mar 3, 2023
56b7e6d
fix: change table name
matinnuhamunada Mar 4, 2023
47f01fa
feat: export all csv to parquet as warehouse
matinnuhamunada Mar 4, 2023
e0a8ba4
feat: upgrade metabase and database related functions
matinnuhamunada Mar 4, 2023
35233c0
test: fix test change and visibility
matinnuhamunada Mar 4, 2023
a28d3df
fix: correct bgc_id for data warehouse
matinnuhamunada Mar 9, 2023
adede14
fix: add figure statistics to roary report
matinnuhamunada Mar 9, 2023
a4c7756
fix: handle unusual sequence start location
matinnuhamunada Mar 12, 2023
d454447
feat: add multi-threading for autoMLST wrapper
matinnuhamunada Mar 13, 2023
ff4f5a7
chore: clean up obsolete rules
matinnuhamunada Mar 13, 2023
daa354b
chore: clean up refseq masher
matinnuhamunada Mar 13, 2023
7954ab9
feat: upgrade automlst patch 0.1.1
matinnuhamunada Mar 13, 2023
3fceee2
notebook: use plotly and pygraphviz for ARTS2 report
matinnuhamunada Mar 16, 2023
1d80370
fix: handle unusual locus tag location with characters and joints
matinnuhamunada Mar 19, 2023
929f469
fix: enforce correct strain id using input
matinnuhamunada Mar 19, 2023
5a1cc0b
notebook: use graphviz sfdp layout to process large network
matinnuhamunada Mar 19, 2023
a25988a
fix: correct genome_id preference in bgc overview
matinnuhamunada Mar 19, 2023
9b61703
fix: correct gtdb mjson when metadata and gtdb release are missing
matinnuhamunada Apr 17, 2023
4ce130e
feat: BGC comparison with mmseqs2 and minimap
matinnuhamunada Apr 17, 2023
5141da5
fix: use mix parameter for bigscape in BGC comparison
matinnuhamunada Apr 17, 2023
2375ddd
fix: update deeptfactor dependencies
matinnuhamunada Apr 17, 2023
3d412b2
fix: correct ARTS parameter to search for DUF and known resistance mo…
matinnuhamunada Apr 18, 2023
89000bd
fix: correct path typo in arts rules
matinnuhamunada Apr 20, 2023
f16294f
fix: correct outfile generation for arts report
matinnuhamunada Apr 22, 2023
eed6aa1
notebook: update arts, cblaster, gtdbtk, and prokka-gbk report
matinnuhamunada Apr 24, 2023
ed06fc2
feat: copy summary and cds tsvs
matinnuhamunada Apr 24, 2023
6638aa4
feat: add README in the processed folders
matinnuhamunada Apr 24, 2023
894ef88
fix: add prokka log, summary and cds table
matinnuhamunada Apr 27, 2023
2369041
fix: handle empty result for best k cluster
matinnuhamunada May 1, 2023
adf9f46
test: update gtdb CARD metadata
matinnuhamunada May 2, 2023
4efa2f0
fix: update pytorch dependencies for deepTF
matinnuhamunada May 2, 2023
c492d37
feat: add path to region.csv
matinnuhamunada May 3, 2023
09caa3d
test: update region format table output
matinnuhamunada May 3, 2023
148c87a
fix: enforce samples csv metadata as string
matinnuhamunada May 9, 2023
13d5da5
fix: enforce samples csv metadata as string
matinnuhamunada May 9, 2023
6424fe9
fix: avoid mistakes in cds region matching across contigs
matinnuhamunada May 10, 2023
6ce8386
fix: handle overlapping cds among two or more regions
matinnuhamunada May 11, 2023
0b3d2e5
feat: update bgc subworkflow rules
matinnuhamunada May 15, 2023
822c47a
fix: correct input preparation for bgc subworkflow
matinnuhamunada May 15, 2023
0f8b4f8
feat: include antismash region genbank name change in result
matinnuhamunada May 15, 2023
82e41fc
fix: handle missing gtdb entry release
matinnuhamunada May 18, 2023
465dcac
fix: add details in metadata for selecting genomes in gtdbtk run
matinnuhamunada May 18, 2023
e43bb6a
tests: update gtdb prep
matinnuhamunada May 18, 2023
50f833c
chore: correct steps
matinnuhamunada May 18, 2023
131f324
fix: correct GTDB version naming
matinnuhamunada May 22, 2023
0311f54
fix: handle missing strains column
matinnuhamunada May 24, 2023
3fd18df
fix: undo make temp files for prokka
matinnuhamunada May 25, 2023
2 changes: 2 additions & 0 deletions .github/workflows/push.yml
@@ -57,6 +57,8 @@ jobs:
python-version: 3.x
- run: pip install git+https://github.com/NBChub/bgcflow_wrapper.git
- run: pip install pytest-cov
- name: Test coverage
run: pytest --cov=.tests/unit .tests/unit/
- name: Build coverage file
run: pytest --cov=.tests/unit .tests/unit/ > pytest-coverage.txt
- name: Comment coverage
2 changes: 2 additions & 0 deletions .gitignore
@@ -24,3 +24,5 @@ notebooks/
*.ipynb_checkpoints/
plugins/
metabase.db*
pytest-coverage.txt
.coverage
43 changes: 23 additions & 20 deletions .tests/unit/test_antismash_overview_gather.py
@@ -1,15 +1,14 @@
import os
import sys

import subprocess as sp
from tempfile import TemporaryDirectory
import shutil
import subprocess as sp
import sys
from pathlib import Path, PurePosixPath

sys.path.insert(0, os.path.dirname(__file__))
from tempfile import TemporaryDirectory

import common

sys.path.insert(0, os.path.dirname(__file__))


def test_antismash_overview_gather():

@@ -21,24 +20,28 @@ def test_antismash_overview_gather():
shutil.copytree(data_path, workdir)

# dbg
print("data/processed/Lactobacillus_delbrueckii/tables/df_antismash_6.1.1_bgc.csv", file=sys.stderr)

# Run the test job.
sp.check_output([
"python",
"-m",
"snakemake",
print(
"data/processed/Lactobacillus_delbrueckii/tables/df_antismash_6.1.1_bgc.csv",
"-f",
"-j1",
"--keep-target-files",
file=sys.stderr,
)

"--directory",
workdir,
])
# Run the test job.
sp.check_output(
[
"python",
"-m",
"snakemake",
"data/processed/Lactobacillus_delbrueckii/tables/df_regions_antismash_6.1.1.csv",
"-f",
"-j1",
"--keep-target-files",
"--directory",
workdir,
]
)

# Check the output byte by byte using cmp.
# To modify this behavior, you can inherit from common.OutputChecker in here
# and overwrite the method `compare_files(generated_file, expected_file)`,
# also see common.py.
common.OutputChecker(data_path, expected_path, workdir).check()
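
The comment above points to `common.OutputChecker`; for readers who want content-aware comparisons instead of byte-by-byte `cmp`, a minimal sketch of overriding `compare_files` might look like this (assuming pandas is available and the outputs are CSV tables; the subclass name is hypothetical):

```python
import pandas as pd

import common


class CSVOutputChecker(common.OutputChecker):
    """Hypothetical checker that compares CSV outputs by content instead of raw bytes."""

    def compare_files(self, generated_file, expected_file):
        # Load both CSVs and compare values; check_like=True ignores column order,
        # so cosmetic reformatting of the table does not fail the test.
        generated = pd.read_csv(generated_file)
        expected = pd.read_csv(expected_file)
        pd.testing.assert_frame_equal(generated, expected, check_like=True)


# Usage would mirror the test above:
# CSVOutputChecker(data_path, expected_path, workdir).check()
```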
Empty file removed data/interim/diamond/.gitkeep
Empty file removed data/interim/mlst/.gitkeep
3 changes: 2 additions & 1 deletion workflow/BGC
@@ -140,7 +140,7 @@ DF_SAMPLES.to_csv(bgcflow_util_dir / "samples.csv", index=False)
##### 2. Generate wildcard constants #####
PROJECT_IDS = list(DF_PROJECTS.name.unique())
STRAINS = DF_SAMPLES.genome_id.to_list()
BGCS = STRAINS = DF_SAMPLES.bgc_id.to_list()
BGCS = DF_SAMPLES.bgc_id.to_list()
CUSTOM = DF_SAMPLES[DF_SAMPLES.source.eq("custom")].genome_id.to_list()
NCBI = DF_SAMPLES[DF_SAMPLES.source.eq("ncbi")].genome_id.to_list()
PATRIC = DF_SAMPLES[DF_SAMPLES.source.eq("patric")].genome_id.to_list()
@@ -179,3 +179,4 @@ include: "rules/antismash.smk"
include: "rules/bigslice.smk"
include: "rules/clinker.smk"
include: "rules/interproscan.smk"
include: "rules/mmseqs2.smk"
8 changes: 6 additions & 2 deletions workflow/Snakefile
@@ -48,11 +48,14 @@ custom_resource_dir()


##### Target rules #####

final_outputs = get_final_output(DF_SAMPLES, PEP_PROJECTS, rule_dict_path="workflow/rules.yaml")

rule all:
input:
expand("data/processed/{name}/tables/df_gtdb_meta.csv", name=PROJECT_IDS),
get_final_output(DF_SAMPLES, PEP_PROJECTS, rule_dict_path="workflow/rules.yaml"),

final_outputs,
expand("data/processed/{name}/data_warehouse/tables.log", name=PROJECT_IDS)


##### Modules #####
@@ -79,3 +82,4 @@ include: "rules/bgc.smk"
include: "rules/diamond.smk"
include: "rules/deeptfactor.smk"
include: "rules/cblaster.smk"
include: "rules/data_warehouse.smk"
3 changes: 2 additions & 1 deletion workflow/bgcflow/bgcflow/data/bgc_downstream_prep.py
@@ -10,7 +10,7 @@
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def bgc_downstream_prep(input_dir, output_dir):
def bgc_downstream_prep(input_dir, output_dir, selected_bgcs=False):
"""
Given an antiSMASH directory, check for changed name
"""
@@ -26,6 +26,7 @@ def bgc_downstream_prep(input_dir, output_dir):
change_log = {genome_id: {}}

for gbk in path.glob("*.gbk"):
logging.info(f"Parsing file: {selected_bgcs}")
logging.info(f"Parsing file: {gbk.name}")
region = SeqIO.parse(str(gbk), "genbank")
for record in region:
86 changes: 86 additions & 0 deletions workflow/bgcflow/bgcflow/data/bgc_downstream_prep_selection.py
@@ -0,0 +1,86 @@
import json
import logging
import sys
from pathlib import Path

from Bio import SeqIO

log_format = "%(levelname)-8s %(asctime)s %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def bgc_downstream_prep(input_dir, output_dir, selected_bgcs=False):
"""
Given an antiSMASH directory, check for changed name
"""
logging.info(f"Reading input directory: {input_dir}")
path = Path(input_dir)
if not path.is_dir():
raise FileNotFoundError(f"No such file or directory: {path}")

genome_id = path.name
outpath = Path(output_dir) / genome_id
outpath.mkdir(parents=True, exist_ok=True)
logging.debug(f"Deducting genome id as {genome_id}")

change_log = {genome_id: {}}
ctr = 0
matches = [Path(i).stem for i in selected_bgcs.split()]
for gbk in path.glob("*.gbk"):
if gbk.stem in matches:
logging.debug(f"MATCH: {gbk.stem}")
ctr = ctr + 1
logging.info(f"Parsing file: {gbk.name}")
region = SeqIO.parse(str(gbk), "genbank")
for record in region:
logging.info(f"{gbk} {record.id}")
record_log = {}
if "comment" in record.annotations:
filename = gbk.name
try:
original_id = record.annotations["structured_comment"][
"antiSMASH-Data"
]["Original ID"].split()[0]
except KeyError:
original_id = record.id
logging.warning(
f"Found shortened record.id: {record.id} <- {original_id}."
)

# generate symlink
new_filename = filename.replace(record.id, original_id)
target_path = Path.cwd() / gbk # target for symlink

link = outpath / new_filename

logging.info(f"Generating symlink: {link}")
try:
link.symlink_to(target_path)
except FileExistsError:
logging.warning(
f"Previous symlink exist, updating target: {link} -> {target_path}"
)
link.unlink()
link.symlink_to(target_path)

record_log["record_id"] = record.id
record_log["original_id"] = original_id
record_log["target_path"] = str(gbk)
record_log["symlink_path"] = str(link)
else:
logging.warning(f"No Comments in record: {gbk.name}")

change_log[genome_id][filename] = record_log
# assert 1+1==3
with open(
outpath / f"{genome_id}-change_log.json", "w", encoding="utf8"
) as json_file:
json.dump(change_log, json_file, indent=4)

logging.info(f"{genome_id}: Job done!\n")
return


if __name__ == "__main__":
bgc_downstream_prep(sys.argv[1], sys.argv[2], sys.argv[3])
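
A minimal sketch of how this new script is meant to be driven, assuming the third command-line argument is a whitespace-separated list of selected region GenBank paths (all paths below are illustrative):

```python
from pathlib import Path

# Illustrative value for sys.argv[3]: a whitespace-separated list of selected
# region GenBank files, as consumed by bgc_downstream_prep above.
selected_bgcs = (
    "data/interim/bgcs/NC_008054.1.region001.gbk "
    "data/interim/bgcs/NC_008054.1.region002.gbk"
)

# Only antiSMASH .gbk files whose stem appears in this list are symlinked.
matches = [Path(i).stem for i in selected_bgcs.split()]
print(matches)
# ['NC_008054.1.region001', 'NC_008054.1.region002']

# Command-line form (illustrative paths):
# python bgc_downstream_prep_selection.py <antismash_genome_dir> <output_dir> "<selected gbk paths>"
```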
10 changes: 10 additions & 0 deletions workflow/bgcflow/bgcflow/data/fix_gtdb_taxonomy.py
@@ -44,6 +44,16 @@ def summarize_gtdb_json(accession_list, df_gtdb_output):
# Getting other metadata
try:
logging.info("Getting metadata into table...")
if "metadata" not in df.columns:
logging.warning(
"metadata is not in the column information. Adding default values..."
)
df["metadata"] = [{"genome_id": genome_id} for genome_id in df.index]
if "gtdb_release" not in df.columns:
logging.warning(
"gtdb_release is not in the column information. Adding default values..."
)
df["gtdb_release"] = "unknown"
metadata = pd.DataFrame.from_dict(
{i: df.loc[i, "metadata"] for i in df.index}
).T
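
The defensive defaults added above can be reproduced on a toy table; the accessions below are made up:

```python
import pandas as pd

# Toy GTDB-style table that is missing both optional columns.
df = pd.DataFrame(index=["GCF_000001.1", "GCF_000002.1"])

if "metadata" not in df.columns:
    df["metadata"] = [{"genome_id": genome_id} for genome_id in df.index]
if "gtdb_release" not in df.columns:
    df["gtdb_release"] = "unknown"

# The downstream code can now expand the metadata column without a KeyError.
metadata = pd.DataFrame.from_dict({i: df.loc[i, "metadata"] for i in df.index}).T
print(metadata["genome_id"].to_list())
# ['GCF_000001.1', 'GCF_000002.1']
```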
28 changes: 23 additions & 5 deletions workflow/bgcflow/bgcflow/data/get_antismash_overview.py
@@ -42,11 +42,15 @@ def get_antismash_overview(json_path, outfile, genome_id=False, n_hits=1):
with open(path, "r") as f:
data = json.load(f)

logging.info(f"Processing: {json_path}, custom genome_id: {genome_id}")

if not genome_id:
genome_id = data["input_file"].strip(".gbk")
else:
pass

logging.debug(f"Genome id: {genome_id}")

# iterating over record
output = {}
for r, record in enumerate(data["records"]):
@@ -94,7 +98,7 @@

bgc_id = f"{record['id']}.region{str(c+1).zfill(3)}"
output_cluster = {
"genome_id": data["input_file"].strip(".gbk"),
"genome_id": genome_id,
"region": cluster_id,
}

Expand All @@ -106,10 +110,24 @@ def get_antismash_overview(json_path, outfile, genome_id=False, n_hits=1):
"product",
]:
output_cluster[column] = region_db[bgc_id][column]

output_cluster["region_length"] = int(output_cluster["end_pos"]) - int(
output_cluster["start_pos"]
)
try:
output_cluster["region_length"] = int(output_cluster["end_pos"]) - int(
output_cluster["start_pos"]
)
except ValueError:
logging.warning(
f'Error calculating region length. Region might be incomplete: {output_cluster["start_pos"]}:{output_cluster["end_pos"]}'
)
start_pos = "".join(
[s for s in output_cluster["start_pos"] if s.isdigit()]
)
logging.warning(
f'Correcting start position from {output_cluster["start_pos"]} to {start_pos}'
)
output_cluster["start_pos"] = start_pos
output_cluster["region_length"] = int(output_cluster["end_pos"]) - int(
output_cluster["start_pos"]
)

if len(output_hits) == 1:
for k in output_hits[0].keys():
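
The try/except added above recovers from region coordinates that are not plain integers (for example a partial start position such as `<1`); the same correction in isolation, with made-up coordinates:

```python
# Standalone illustration of the correction above; the coordinates are made up.
output_cluster = {"start_pos": "<1", "end_pos": "45021"}

try:
    region_length = int(output_cluster["end_pos"]) - int(output_cluster["start_pos"])
except ValueError:
    # Keep only the digits (dropping markers such as '<' or '>') and retry.
    start_pos = "".join(s for s in output_cluster["start_pos"] if s.isdigit())
    output_cluster["start_pos"] = start_pos
    region_length = int(output_cluster["end_pos"]) - int(output_cluster["start_pos"])

print(region_length)  # 45020
```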