-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Duplicate genes #81
Comments
Hi Paul, Have you tried using the "-m 0" option in bacteria_tradis? This should use reads with a mapping quality of 0 in your analysis (i.e. multi-mapping reads), and is discussed in the tutorial PDF. (edit: was wrong about the -e, our current terminology is correct) |
There's a tutorial PDF?! ;-) That made a big difference. Shouldn't this be the default mode? File CP011972-tiled.fastq Total Reads 13110900 Reads Matched 13110900 % Matched 100 Reads Mapped 13110900 % Mapped 100 Unique Insertion Sites : CP011972 6504510 Seq Len/UIS : CP011972 1.00784978422664 Total Unique Insertion Sites 6504510 (length: 6555569) Total Seq Len/Total UIS 1.00784978422664 All genes now have an ins_index > 0.83. The off by one thing may still be a very minor issue. Many thanks for your help! |
Just to follow up on this briefly now that I've had some coffee: Our default parameters are for comparative experiments, i.e. between two conditions. In this case we don't consider multi-mapping reads -- since we can't uniquely identify where these reads originate, it should be generally more conservative not to consider them. For example, if we did we might falsely call all repeats of a particular gene as required when only a single copy is actually under selective pressure. In this case, your analysis would be correct, and there's approximately 6% we can't say much about -- this problem would probably be shared by any short-read sequencing technology (e.g. RNA-seq). You could potentially get some more info in these cases by looking at orthology groups, but this isn't something we've implemented. It sounds like you're trying to do an essentiality study. In this case we have a special flag (-e), which sets --smalt_r 0 -m 0, so bacteria_tradis uses multimapping reads. In this case using everything is most conservative in terms of calling essential genes - basically all identical repeated proteins will be classified as non-essential as long some of them are non-essential. I suppose there might be catastrophic cases where you have 100 copies of a gene, and only one is non-essential without sufficient insertion density to "flip" the others, but I doubt this would happen very often in reality. If you're still observing this behavior after setting --smalt_r 0 -m 0, then we have a problem. Hope this helps. |
Cool, glad it's working now! I'll take a look at the off by one thing in the insertion index calculation when I've got some time -- I don't think it should have much of an effect on downstream results, but obviously not ideal. |
fastq.stats.txt
CP011972-tiled.out.CP011972.tradis_gene_insert_sites.txt
Hi Folks,
I hope all is well in Sanger-land.
We've been working with TraDIS data in Pseudomonas syringae. Especially this beast:
https://www.ebi.ac.uk/ena/data/view/CP011972
Two of the genes stood out to my collaborators HopAM1-1 (IYO_023205) and HopAM1-2 (IYO_008385). These are 100% identical copies of a 830 nucleotide ORF. In our TraDIS experiments no insertions were found in either gene, but they aren't thought to be essential.
I ran a little test to see what happened to insertions in these genes. I simulated a fastq file that tiled 120mers from each position across the gene. I then ran the "bacteria_tradis" and "tradis_gene_insert_sites" pipelines on these. The genes both come back as having 0 insertions and "ins_index" values of zero.
100% of the reads map to the genome, it looks like SMALT does sensible things here. The failure appears to be in the way Bio-Tradis counts insertion sites. Is it ignoring multi-mapped reads, or perhaps the parser doesn't recognise them.
I've since scaled this test up to tile reads across the entire genome, then identify the which genes/regions can actually be probed by your pipeline. Please excuse my awful code, but it gets the job done.
*Make the stats file readable:
The CP011972-tiled.out.CP011972.tradis_gene_insert_sites file (attached) also revealed some interesting results.
*None of the "ins_index" values were 1.0 and "gene_length" and "ins_count" almost always differ by 1 i.e. gene_length = ins_count-1.
*335 genes had an "ins_index" less than 0.1. I think all the values should be ~1.0 in this test. There are 5883 genes in total, so roughly 6% all the PSA genes are not able to be studied with your pipeline.
It'd be great if someone could take a look at this, otherwise I'll need to find a workaround.
Best wishes,
Paul.
The text was updated successfully, but these errors were encountered: