Merge pull request #1 from shiltemann/toydataset
move workflow file to repo and minor tweaks
lldelisle authored Aug 2, 2019
2 parents 03c6d23 + 2e196e9 commit 8e64f38
Showing 2 changed files with 20 additions and 14 deletions.
33 changes: 19 additions & 14 deletions topics/contributing/tutorials/create-new-tutorial/tutorial.md
@@ -16,6 +16,8 @@ key_points:
  - "Creating a new tutorial involves several steps: some are mandatory, some can be skipped even if they are recommended"
  contributors:
  - bebatut
+ - erasche
+ - shiltemann
  - lldelisle
  ---

Expand Down Expand Up @@ -145,21 +147,24 @@ Our tutorials try to follow the "learn by doing" approach; they combine both the
The first task is to select some data to use for the Hands-on sections. The selected data must be informative enough to illustrate the meaning of using a tool or a given technique, but not too big to require long waiting times for processing during a workshop. Upload and download of files into and out of Galaxy is usually quick, but the time taken for a tool to run can be long. Tool run times of no more than 10-15 mins are recommended. Typically, the selected data should be the informative subset of a full real-life dataset.
We display here two examples where a toy dataset was generated:
- When only few data are required, you can use a strategy close to this one (used in the test case of a Galaxy tool):
- Taking one 16S sequences
- Generating a reference database
- Blasting it on the NR database on [NCBI Blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome)
Below we describe two examples of how toy datasets were generated for tutorials:
- **Example 1**: creating a toy dataset from scratch
- Take one 16S sequence (for example found in the test case of a Galaxy tool):
- Generate a reference database
- Blast it on the NR database on [NCBI Blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome)
- Extracting one similar sequence found with Blast
- Searching and extracting 2 other sequences of the same species using the [NCBI Nucleotide database](https://www.ncbi.nlm.nih.gov/nuccore)
- When the experiment takes a FASTQ as input and few reads are sufficient:
- Use **seqtk_sample** {% icon tool %} to extract randomly reads from your input fastq.
- However, when it requires a lot of reads to be meaningful, you can use the following strategy (used for the ATAC-seq tutorial using [this workflow](https://raw.githubusercontent.com/lldelisle/myWorkflows/master/Galaxy-Workflow-MakeAFakeInput.ga)):
- Run the workflow until the mapping step on the full dataset (or big enough to have good results).
- Select IDs of reads which map on the smallest chromosome (for example chr22 for human data).
- In order to keep in the toy dataset enough diversity, you can also take randomly 1% of the reads IDs.
- Concatenate the two lists and remove the duplicated IDs.
- Use **seqtk_subseq** {% icon tool %} to sample your original FASTQ with the list of IDs.
- Search and extract 2 other sequences of the same species using the [NCBI Nucleotide database](https://www.ncbi.nlm.nih.gov/nuccore)
- **Example 2**: creating a toy dataset from an existing larger one
- When the experiment takes a FASTQ as input and a few reads are sufficient:
- Use **seqtk_sample** {% icon tool %} to extract randomly reads from your input fastq.
- However, when it requires a lot of reads to be meaningful, you can use the following strategy (used for the ATAC-seq tutorial using [this workflow](./workflows/Galaxy-Workflow-MakeAFakeInput.ga)):
- Run the workflow until the mapping step on the full dataset (or big enough to have good results).
- Select IDs of reads which map on the smallest chromosome (for example chr22 for human data).
- In order to keep in the toy dataset enough diversity, you can also take randomly 1% of the reads IDs.
- Concatenate the two lists and remove the duplicated IDs.
- Use **seqtk_subseq** {% icon tool %} to sample your original FASTQ with the list of IDs.
We would then develop the tutorial and test it on this toy dataset. Once we were ready to share it, we would upload the datasets on [Zenodo](https://zenodo.org/) to store them on long-term and obtain a dedicated DOI in the [Galaxy training network community](https://zenodo.org/communities/galaxy-training/?page=1&size=20).
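
As an illustration only, the read-selection strategy described in Example 2 (IDs of reads on the smallest chromosome, plus a random 1% of all read IDs, merged without duplicates, then used to filter the original FASTQ) could be sketched in plain Python as below. This is not the Galaxy workflow referenced in the diff, which relies on **seqtk_subseq** for the final step. The function names, the `(read_id, chromosome)` input shape, and the way the read ID is taken from the FASTQ header are all assumptions made for this sketch.

```python
import random
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical record shape: (header, sequence, separator, quality)
FastqRecord = Tuple[str, str, str, str]


def select_read_ids(
    mapped_reads: Iterable[Tuple[str, str]],
    chrom: str = "chr22",
    fraction: float = 0.01,
    seed: int = 0,
) -> Set[str]:
    """Return IDs of reads mapped to `chrom` plus a random `fraction` of all IDs.

    `mapped_reads` is assumed to be an iterable of (read_id, chromosome) pairs,
    for example extracted beforehand from an alignment file.
    """
    pairs = list(mapped_reads)
    # IDs of reads that map on the chosen (smallest) chromosome.
    on_chrom = {read_id for read_id, read_chrom in pairs if read_chrom == chrom}
    # Random sample of all read IDs, to keep some diversity in the toy dataset.
    all_ids = sorted({read_id for read_id, _ in pairs})
    rng = random.Random(seed)
    random_ids = set(rng.sample(all_ids, round(len(all_ids) * fraction)))
    # The set union concatenates both lists and removes duplicated IDs.
    return on_chrom | random_ids


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an iterable of text lines."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups consecutive lines by four.
    yield from zip(stripped, stripped, stripped, stripped)


def subset_fastq(
    records: Iterable[FastqRecord], keep_ids: Set[str]
) -> Iterator[FastqRecord]:
    """Yield only the records whose read ID is in `keep_ids`."""
    for record in records:
        # Assumption: the read ID is the first word after '@' in the header.
        read_id = record[0][1:].split()[0]
        if read_id in keep_ids:
            yield record
```

A fixed `seed` is used so that the same toy dataset can be regenerated; paired-end suffixes such as `/1` and `/2` are not handled in this sketch.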