Motivation for countreads rule #59
A plausible motivation for counting the reads both pre- and post-trimming is to see how many reads get discarded by the trim. As it stands, we get close to this, but in the middle of ep03, having successfully chained the trimreads and countreads rules, we pivot and start adding Kallisto rules. In ep04, the read counts are then presented as an output of the workflow and we talk about the DAG concepts. Later, we present FastQC as taking the place of the countreads rule, since it counts the reads and a lot more besides, and the old rule is dropped from the final workflow.

I don't think it's unreasonable to assume that a bioinformatician cares how many reads they are working with, but the story as it stands is pretty disjointed. How can we fix this?

Idea 1: Forget about counting reads and incorporate FastQC right away. I don't like this idea, since FastQC produces two output files and has other issues that are dealt with in ep06. Using the tool wrapper makes the rule easier to write, but it brings in the whole concept of wrappers, which we are not yet ready for.

Idea 2: Finish the story by adding a count_discarded rule. Rather than introducing Kallisto in ep04, we could finish the episode by adding a rule that subtracts the two numbers and tells us explicitly how many reads were discarded. This introduces the concept and syntax of a rule having two inputs, shortens the too-long ep03, and gives a reasonably complex DAG which can maybe then be used to cover all the points in ep04. We'd then only add Kallisto after ep05 (probably inserting a new ep06 to do so and moving everything else back).

Idea 2 has some appeal, but it's a big change to this part of the course. There are also some downsides. It delays having any "real" bioinformatics tools that will supposedly motivate our bioinformaticians. And subtracting the numbers is so quick and trivial that it seems a bit silly to talk about the advantages of Snakemake's lazy evaluation based on this example, whereas making the Kallisto index is appreciably slower.

I'll park this for now and come back to it.
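For concreteness, a count_discarded rule along the lines of Idea 2 might look roughly like the sketch below. This is only an illustration: the file naming scheme and the `sample` wildcard are assumptions rather than the lesson's actual names, and it presumes each `.count` file holds a single integer as written by a countreads-style rule.

```
# Hypothetical sketch: file names and the "sample" wildcard are assumed, not taken from the lesson.
rule count_discarded:
    input:
        # Two named inputs: the read counts before and after trimming
        raw="reads.{sample}.fastq.count",
        trimmed="trimmed.{sample}.fastq.count",
    output:
        "discarded.{sample}.fastq.count"
    shell:
        # Each count file is assumed to contain one integer; write the difference
        "echo $(( $(cat {input.raw}) - $(cat {input.trimmed}) )) > {output}"
```

Naming the inputs keeps the subtraction readable in the shell command, and requesting a `discarded.*.count` file would pull both countreads jobs and the trimreads job into the DAG.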
I implemented something like "Idea 2" with a […] The new ep04 is basically the back half of the old ep03, but I now properly introduce log outputs. Possibly the use of […] I decided to keep the old ep05 (which was ep04) "The DAG" as-is, using the Kallisto examples to illustrate the theory. The downside is that this makes things a little disjointed, as we introduce the […]
From @cmeesters