Add Repeatmasking workflow #198

Merged 8 commits on Sep 21, 2023
11 changes: 11 additions & 0 deletions workflows/repeatmasking/.dockstore.yml
@@ -0,0 +1,11 @@
version: 1.2
workflows:
- name: main
  subclass: Galaxy
  publish: true
  primaryDescriptorPath: /RepeatMasking-Workflow.ga
  testParameterFiles:
  - /RepeatMasking-Workflow-tests.yml
  authors:
  - name: Romane Libouban
    email: [email protected]
5 changes: 5 additions & 0 deletions workflows/repeatmasking/.workflowhub.yml
@@ -0,0 +1,5 @@
version: '0.1'
registries:
- url: https://workflowhub.eu
  project: iwc
  workflow: RepeatMasking-Workflow./main
5 changes: 5 additions & 0 deletions workflows/repeatmasking/CHANGELOG.md
@@ -0,0 +1,5 @@
# Changelog

## [0.1]

Initial version of the RepeatMasking workflow for genomic sequencing data.
29 changes: 29 additions & 0 deletions workflows/repeatmasking/README.md
@@ -0,0 +1,29 @@
# RepeatMasking Workflow

This workflow uses RepeatModeler and RepeatMasker for genome analysis.

- RepeatModeler is a software package for de novo identification and modeling of transposable element (TE) families. At its heart are three de novo repeat-finding programs (RECON, RepeatScout and LtrHarvest/Ltr_retriever), which use complementary computational methods to identify repeat element boundaries and family relationships from sequence data.

- RepeatMasker is a program that screens DNA sequences for *interspersed repeats* and *low-complexity* DNA sequences. It produces a detailed annotation of the repeats present in the query sequence, as well as a modified version of the query sequence in which all annotated repeats are masked.

## Input dataset for RepeatModeler
- RepeatModeler requires a single input file, a genome in fasta format.


## Output datasets for RepeatModeler
- Two output files are generated:
  - a FASTA file of consensus sequences for the repeat families found
  - a Stockholm file of seed alignments for those families


## Input dataset for RepeatMasker
- RepeatMasker requires the FASTA file generated by RepeatModeler

## Output datasets for RepeatMasker
- Five output files are generated:
  - a masked FASTA file
  - a GFF file annotating the repeats
  - a table summarizing the repeat content of the analyzed sequence
  - a file with statistics on the repeat content of the analyzed sequence
  - a summary of the mutation sites found and the order of grouping
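
The workflow's RepeatMasker step sets `xsmall` to true in its tool state. Assuming the Galaxy wrapper maps this to RepeatMasker's `-xsmall` option, repeats in the FASTA output are written in lowercase rather than replaced by Ns. The sketch below is not part of the workflow: it is a minimal way to check how much of each record was soft-masked, and the helper names `read_fasta` and `soft_masked_fraction` and the toy sequences are made up for illustration.

```python
from typing import Dict, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def soft_masked_fraction(text: str) -> Dict[str, float]:
    """Return, per record, the fraction of bases written in lowercase."""
    result = {}
    for name, seq in read_fasta(text):
        masked = sum(1 for base in seq if base.islower())
        result[name] = masked / len(seq) if seq else 0.0
    return result


# Toy input (hypothetical): half of contig1 is lowercase, contig2 is unmasked.
example = ">contig1 toy record\nACGTacgtAC\nGTacgt\n>contig2\nACGT\n"
print(soft_masked_fraction(example))
```

On a real run, the same function could be applied to the text of the masked FASTA output to see roughly what share of each sequence was annotated as repeat.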

@@ -0,0 +1,42 @@
- doc: Test outline for RepeatMasking Workflow
  job:
    input:
      class: File
      location: https://zenodo.org/record/8116008/files/sequence.fasta?download=1
      filetype: fasta

  outputs:
    RepeatModeler consensus sequences:
      path: test-data/repeatmodeler_output_sequences.fasta
      compare: sim_size
      delta: 30000

    RepeatModeler seeds alignments:
      path: test-data/repeatmodeler_output_seeds.stockholm
      compare: sim_size
      delta: 90000000

    RepeatMasker masked genome:
      path: test-data/repeatmasker_output_masked_genome.fasta
      compare: sim_size
      delta: 30000

    RepeatMasker output log:
      path: test-data/repeatmasker_output_log.tabular
      compare: sim_size
      delta: 30000

    RepeatMasker repeat statistics:
      path: test-data/repeatmasker_output_table.txt
      compare: sim_size
      delta: 30000

    RepeatMasker repeat catalog:
      path: test-data/repeatmasker_output_repeat_catalog.txt
      compare: sim_size
      delta: 30000

    RepeatMasker repeat annotation:
      path: test-data/repeatmasker_output_gff.gff
      compare: sim_size
      delta: 30000
42 changes: 42 additions & 0 deletions workflows/repeatmasking/RepeatMasking-Workflow-tests.yml
@@ -0,0 +1,42 @@
- doc: Test outline for RepeatMasking Workflow
  job:
    input:
      class: File
      location: https://zenodo.org/record/8364146/files/eco.fasta?download=1
      filetype: fasta

  outputs:
    RepeatModeler consensus sequences:
      location: https://zenodo.org/record/8364146/files/repeatmodeler_output_sequences.fasta?download=1
      compare: sim_size
      delta: 30000

    RepeatModeler seeds alignments:
      location: https://zenodo.org/record/8364146/files/repeatmodeler_output_seeds.stockholm?download=1
      compare: sim_size
      delta: 90000000

    RepeatMasker masked genome:
      location: https://zenodo.org/record/8364146/files/repeatmasker_output_masked_genome.fasta?download=1
      compare: sim_size
      delta: 30000

    RepeatMasker output log:
      location: https://zenodo.org/record/8364146/files/repeatmasker_output_log.tabular?download=1
      compare: sim_size
      delta: 30000

    RepeatMasker repeat statistics:
      location: https://zenodo.org/record/8364146/files/repeatmasker_output_table.txt?download=1
      compare: sim_size
      delta: 30000

    RepeatMasker repeat catalog:
      location: https://zenodo.org/record/8364146/files/repeatmasker_output_repeat_catalog.txt?download=1
      compare: sim_size
      delta: 30000

    RepeatMasker repeat annotation:
      location: https://zenodo.org/record/8364146/files/repeatmasker_output_gff.gff?download=1
      compare: sim_size
      delta: 30000
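
Every output in this test file uses `compare: sim_size` with a `delta`, which checks only that the produced file's size is within `delta` bytes of the expected file's size, not that the contents match. The sketch below illustrates that idea in plain Python. It is not Galaxy's or Planemo's actual implementation, and the names `sim_size_ok` and `write_temp` are hypothetical.

```python
import os
import tempfile


def sim_size_ok(produced_path: str, expected_path: str, delta: int) -> bool:
    """Return True when the two file sizes differ by at most `delta` bytes."""
    produced = os.path.getsize(produced_path)
    expected = os.path.getsize(expected_path)
    return abs(produced - expected) <= delta


def write_temp(n_bytes: int) -> str:
    """Create a throwaway file of exactly n_bytes bytes and return its path."""
    handle = tempfile.NamedTemporaryFile(delete=False)
    handle.write(b"A" * n_bytes)
    handle.close()
    return handle.name


expected = write_temp(100_000)
close_enough = write_temp(120_000)   # 20,000 bytes larger than expected
too_different = write_temp(140_000)  # 40,000 bytes larger than expected

print(sim_size_ok(close_enough, expected, delta=30000))
print(sim_size_ok(too_different, expected, delta=30000))

# Clean up the temporary files created for the demonstration.
for path in (expected, close_enough, too_different):
    os.remove(path)
```

Under this reading, a `delta` of 90000000 bytes (about 90 MB) for the seeds alignment accepts almost any output smaller than that, so that entry mostly confirms the file was produced.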
175 changes: 175 additions & 0 deletions workflows/repeatmasking/RepeatMasking-Workflow.ga
@@ -0,0 +1,175 @@
{
"a_galaxy_workflow": "true",
"annotation": "",
"format-version": "0.1",
"license": "MIT",
"release": "0.1",
"name": "Repeat masking with RepeatModeler and RepeatMasker",
"creator": [
{
"class": "Person",
"email": "mailto:[email protected]",
"name": "Romane Libouban"
}
],
"steps": {
"0": {
"annotation": "",
"content_id": null,
"errors": null,
"id": 0,
"input_connections": {},
"inputs": [
{
"description": "Apply repeat masking to this fasta file",
"name": "input"
}
],
"label": "input",
"name": "Input dataset",
"outputs": [],
"position": {
"left": 10,
"top": 10
},
"tool_id": null,
"tool_state": "{\"optional\": false, \"tag\": null}",
"tool_version": null,
"type": "data_input",
"uuid": "ab5e19b0-ce35-4e54-a55e-f75243c86e3d",
"when": null,
"workflow_outputs": []
},
"1": {
"annotation": "",
"content_id": "toolshed.g2.bx.psu.edu/repos/csbl/repeatmodeler/repeatmodeler/2.0.4+galaxy1",
"errors": null,
"id": 1,
"input_connections": {
"input_file": {
"id": 0,
"output_name": "output"
}
},
"inputs": [],
"label": null,
"name": "RepeatModeler",
"outputs": [
{
"name": "sequences",
"type": "fasta"
},
{
"name": "seeds",
"type": "stockholm"
}
],
"position": {
"left": 230,
"top": 10
},
"post_job_actions": {},
"tool_id": "toolshed.g2.bx.psu.edu/repos/csbl/repeatmodeler/repeatmodeler/2.0.4+galaxy1",
"tool_shed_repository": {
"changeset_revision": "8661b2607b7e",
"name": "repeatmodeler",
"owner": "csbl",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"__input_ext\": \"input\", \"chromInfo\": \"/shared/ifbstor1/galaxy/mutable-config/tool-data/shared/ucsc/chrom/?.len\", \"input_file\": null, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
"tool_version": "2.0.4+galaxy1",
"type": "tool",
"uuid": "9312ba36-4275-4d40-8ba6-95eea1b23b11",
"when": null,
"workflow_outputs": [
{
"output_name": "sequences",
"label": "RepeatModeler consensus sequences"
},
{
"output_name": "seeds",
"label": "RepeatModeler seeds alignments"
}
]
},
"2": {
"annotation": "",
"content_id": "toolshed.g2.bx.psu.edu/repos/bgruening/repeat_masker/repeatmasker_wrapper/4.1.5+galaxy0",
"errors": null,
"id": 2,
"input_connections": {
"input_fasta": {
"id": 1,
"output_name": "sequences"
}
},
"inputs": [],
"label": null,
"name": "RepeatMasker",
"outputs": [
{
"name": "output_masked_genome",
"type": "fasta"
},
{
"name": "output_log",
"type": "tabular"
},
{
"name": "output_table",
"type": "txt"
},
{
"name": "output_repeat_catalog",
"type": "txt"
},
{
"name": "output_gff",
"type": "gff"
}
],
"position": {
"left": 450,
"top": 10
},
"post_job_actions": {},
"tool_id": "toolshed.g2.bx.psu.edu/repos/bgruening/repeat_masker/repeatmasker_wrapper/4.1.5+galaxy0",
"tool_shed_repository": {
"changeset_revision": "ba6d2c32f797",
"name": "repeat_masker",
"owner": "bgruening",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"__input_ext\": \"input\", \"advanced\": {\"is_only\": false, \"is_clip\": false, \"no_is\": false, \"rodspec\": false, \"primspec\": false, \"nolow\": false, \"noint\": false, \"norna\": false, \"alu\": false, \"div\": false, \"search_speed\": \"\", \"frag\": \"40000\", \"gc\": null, \"gccalc\": false, \"nocut\": false, \"xout\": false, \"keep_alignments\": false, \"invert_alignments\": false, \"poly\": false}, \"chromInfo\": \"/shared/ifbstor1/galaxy/mutable-config/tool-data/shared/ucsc/chrom/?.len\", \"excln\": true, \"gff\": true, \"input_fasta\": null, \"repeat_source\": {\"source_type\": \"dfam\", \"__current_case__\": 0, \"species_source\": {\"species_from_list\": \"no\", \"__current_case__\": 1, \"species_name\": \"\"}}, \"xsmall\": true, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
"tool_version": "4.1.5+galaxy0",
"type": "tool",
"uuid": "e6c8e6a1-efe8-4291-b12b-5fdb3795b6ca",
"when": null,
"workflow_outputs": [
{
"output_name": "output_masked_genome",
"label": "RepeatMasker masked genome"
},
{
"output_name": "output_log",
"label": "RepeatMasker output log"
},
{
"output_name": "output_table",
"label": "RepeatMasker repeat statistics"
},
{
"output_name": "output_repeat_catalog",
"label": "RepeatMasker repeat catalog"
},
{
"output_name": "output_gff",
"label": "RepeatMasker repeat annotation"
}
]
}
},
"tags": [],
"uuid": "f25be8fa-7823-456f-9707-a497703f48d7",
"version": 0
}
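
The `.ga` file above encodes the step graph in each step's `input_connections`, and stores tool parameters as a JSON string inside `tool_state`. The sketch below reads those two fields with Python's standard library. The `workflow` dict is a trimmed stand-in for the real file (only the keys used here, and only three of the RepeatMasker parameters), and the helpers `describe_edges` and `decode_tool_state` are hypothetical names for illustration.

```python
import json

# Trimmed stand-in for RepeatMasking-Workflow.ga; a real run could use
# json.load() on the file instead.
workflow = {
    "steps": {
        "0": {"id": 0, "name": "Input dataset", "input_connections": {},
              "tool_state": json.dumps({"optional": False, "tag": None})},
        "1": {"id": 1, "name": "RepeatModeler",
              "input_connections": {"input_file": {"id": 0, "output_name": "output"}},
              "tool_state": json.dumps({"input_file": None})},
        "2": {"id": 2, "name": "RepeatMasker",
              "input_connections": {"input_fasta": {"id": 1, "output_name": "sequences"}},
              "tool_state": json.dumps({"excln": True, "gff": True, "xsmall": True})},
    }
}


def describe_edges(wf: dict) -> list:
    """List 'source.output -> target.input' strings in step order."""
    edges = []
    for step in sorted(wf["steps"].values(), key=lambda s: s["id"]):
        for input_name, links in step["input_connections"].items():
            # Assumption: a connection may be a single dict or a list of dicts.
            if not isinstance(links, list):
                links = [links]
            for link in links:
                source = wf["steps"][str(link["id"])]
                edges.append(
                    f'{source["name"]}.{link["output_name"]} -> {step["name"]}.{input_name}'
                )
    return edges


def decode_tool_state(step: dict) -> dict:
    """Parse the JSON string stored under 'tool_state' into a dict."""
    return json.loads(step["tool_state"])


for edge in describe_edges(workflow):
    print(edge)
print(decode_tool_state(workflow["steps"]["2"]))
```

Printing the edges this way makes it easy to see which dataset each tool actually receives: here RepeatMasker's `input_fasta` is fed by RepeatModeler's `sequences` output.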