Duplicate genes #81

ppgardne · 2017-11-28T02:11:02Z

fastq.stats.txt
CP011972-tiled.out.CP011972.tradis_gene_insert_sites.txt

Hi Folks,

I hope all is well in Sanger-land.

We've been working with TraDIS data in Pseudomonas syringae. Especially this beast:
https://www.ebi.ac.uk/ena/data/view/CP011972

Two of the genes stood out to my collaborators HopAM1-1 (IYO_023205) and HopAM1-2 (IYO_008385). These are 100% identical copies of a 830 nucleotide ORF. In our TraDIS experiments no insertions were found in either gene, but they aren't thought to be essential.
I ran a little test to see what happened to insertions in these genes. I simulated a fastq file that tiled 120mers from each position across the gene. I then ran the "bacteria_tradis" and "tradis_gene_insert_sites" pipelines on these. The genes both come back as having 0 insertions and "ins_index" values of zero.

Stats:
File	   	       	    	        hopam.fastq
Total Reads				1424
Reads Matched				1424
% Matched				100
Reads Mapped				1424
% Mapped				100
Unique Insertion Sites : CP011972	0
Seq Len/UIS : CP011972	 		NaN
Total Unique Insertion Sites		0
Total Seq Len/Total UIS			NaN

100% of the reads map to the genome, it looks like SMALT does sensible things here. The failure appears to be in the way Bio-Tradis counts insertion sites. Is it ignoring multi-mapped reads, or perhaps the parser doesn't recognise them.

I've since scaled this test up to tile reads across the entire genome, then identify the which genes/regions can actually be probed by your pipeline. Please excuse my awful code, but it gets the job done.

Generate a fastq that tiles 120mer reads across the entire genome (forward and reverse). Warning: generates a 5G file.

cat CP011972.2.fasta      | grep -v '^>' | tr -d "\n" | tr "a-z" "A-Z" | perl -lane '$l=120; $sq=$_; for($i=0; $i<(length($_)-$l+1); $i++){$s = substr($sq, $i, $l); $cnt++; printf "\@M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\nGACAGTCGAC$s\n+M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\n>1AA1111BCDBB3BBDFGB3BBB01BD3D1AAB1B1BBD2DABDA1B1//AAAB//011DDGDD2A2BDB2DDDDGHDA>//BD2D1/////11/B?>B2BB1B/>//?1BB?0/B1BBGBBBFGHH?B\n", 1000+$cnt, 1000+$cnt;}' > CP011972-tiled.fastq
revcomp CP011972.2.fasta  | grep -v '^>' | tr -d "\n" | tr "a-z" "A-Z" | perl -lane '$l=120; $sq=$_; for($i=0; $i<(length($_)-$l+1); $i++){$s = substr($sq, $i, $l); $cnt++; printf "\@M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\nGACAGTCGAC$s\n+M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\n>1AA1111BCDBB3BBDFGB3BBB01BD3D1AAB1B1BBD2DABDA1B1//AAAB//011DDGDD2A2BDB2DDDDGHDA>//BD2D1/////11/B?>B2BB1B/>//?1BB?0/B1BBGBBBFGHH?B\n", 7*(10**6)+$cnt, 7*(10**6)+$cnt;}' >> CP011972-tiled.fastq

echo "CP011972-tiled.fastq" > fastq.infiles

Run "bacteria_tradis":

bacteria_tradis -f fastq.infiles -t 'GACAGTCGAC' -r CP011972.2.fasta  --smalt_s 1 --smalt_k 10 --smalt_r 0 --smalt_y 0.8 -mm 5 -v

*Make the stats file readable:

grep ^File fastq.stats | tr "," "\n" > blah && grep -v ^File fastq.stats | tr "," "\n" > blah2 && paste blah blah2

File	                               CP011972-tiled.fastq
Total Reads			       13110900
Reads Matched			       13110900
% Matched			       100
Reads Mapped			       13110900
% Mapped			       100
Unique Insertion Sites : CP011972      6101020 (length: 6555569)
Seq Len/UIS : CP011972	 	       1.0745037715005
Total Unique Insertion Sites	       6101020
Total Seq Len/Total UIS		       1.


0745037715005

Call essential genes:

tradis_gene_insert_sites  -o tradis_gene_insert_sites CP011972.2.embl CP011972-tiled.out.CP011972.insert_site_plot

The CP011972-tiled.out.CP011972.tradis_gene_insert_sites file (attached) also revealed some interesting results.

*None of the "ins_index" values were 1.0 and "gene_length" and "ins_count" almost always differ by 1 i.e. gene_length = ins_count-1.

*335 genes had an "ins_index" less than 0.1. I think all the values should be ~1.0 in this test. There are 5883 genes in total, so roughly 6% all the PSA genes are not able to be studied with your pipeline.

It'd be great if someone could take a look at this, otherwise I'll need to find a workaround.

Best wishes,
Paul.

lbarquist · 2017-11-28T07:23:51Z

Hi Paul,

Have you tried using the "-m 0" option in bacteria_tradis? This should use reads with a mapping quality of 0 in your analysis (i.e. multi-mapping reads), and is discussed in the tutorial PDF.

(edit: was wrong about the -e, our current terminology is correct)

ppgardne · 2017-11-28T10:06:27Z

There's a tutorial PDF?! ;-)

That made a big difference. Shouldn't this be the default mode?

File	   	       	    	        CP011972-tiled.fastq
Total Reads				13110900
Reads Matched				13110900
% Matched				100
Reads Mapped				13110900
% Mapped				100
Unique Insertion Sites : CP011972	6504510
Seq Len/UIS : CP011972	 		1.00784978422664
Total Unique Insertion Sites		6504510 (length: 6555569)
Total Seq Len/Total UIS			1.00784978422664

All genes now have an ins_index > 0.83. The off by one thing may still be a very minor issue.

Many thanks for your help!
P.

lbarquist · 2017-11-28T10:10:51Z

Just to follow up on this briefly now that I've had some coffee:

Our default parameters are for comparative experiments, i.e. between two conditions. In this case we don't consider multi-mapping reads -- since we can't uniquely identify where these reads originate, it should be generally more conservative not to consider them. For example, if we did we might falsely call all repeats of a particular gene as required when only a single copy is actually under selective pressure. In this case, your analysis would be correct, and there's approximately 6% we can't say much about -- this problem would probably be shared by any short-read sequencing technology (e.g. RNA-seq). You could potentially get some more info in these cases by looking at orthology groups, but this isn't something we've implemented.

It sounds like you're trying to do an essentiality study. In this case we have a special flag (-e), which sets --smalt_r 0 -m 0, so bacteria_tradis uses multimapping reads. In this case using everything is most conservative in terms of calling essential genes - basically all identical repeated proteins will be classified as non-essential as long some of them are non-essential. I suppose there might be catastrophic cases where you have 100 copies of a gene, and only one is non-essential without sufficient insertion density to "flip" the others, but I doubt this would happen very often in reality.

If you're still observing this behavior after setting --smalt_r 0 -m 0, then we have a problem. Hope this helps.

lbarquist · 2017-11-28T10:14:32Z

Cool, glad it's working now! I'll take a look at the off by one thing in the insertion index calculation when I've got some time -- I don't think it should have much of an effect on downstream results, but obviously not ideal.

lbarquist mentioned this issue Nov 28, 2017

include last base of annotation in insertion count #82

Merged

andrewjpage closed this as completed Nov 28, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate genes #81

Duplicate genes #81

ppgardne commented Nov 28, 2017

lbarquist commented Nov 28, 2017 •

edited

Loading

ppgardne commented Nov 28, 2017

lbarquist commented Nov 28, 2017

lbarquist commented Nov 28, 2017

Duplicate genes #81

Duplicate genes #81

Comments

ppgardne commented Nov 28, 2017

lbarquist commented Nov 28, 2017 • edited Loading

ppgardne commented Nov 28, 2017

lbarquist commented Nov 28, 2017

lbarquist commented Nov 28, 2017

lbarquist commented Nov 28, 2017 •

edited

Loading