Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate genes #81

Closed
ppgardne opened this issue Nov 28, 2017 · 4 comments
Closed

Duplicate genes #81

ppgardne opened this issue Nov 28, 2017 · 4 comments

Comments

@ppgardne
Copy link

fastq.stats.txt
CP011972-tiled.out.CP011972.tradis_gene_insert_sites.txt

Hi Folks,

I hope all is well in Sanger-land.

We've been working with TraDIS data in Pseudomonas syringae. Especially this beast:
https://www.ebi.ac.uk/ena/data/view/CP011972

Two of the genes stood out to my collaborators HopAM1-1 (IYO_023205) and HopAM1-2 (IYO_008385). These are 100% identical copies of a 830 nucleotide ORF. In our TraDIS experiments no insertions were found in either gene, but they aren't thought to be essential.
I ran a little test to see what happened to insertions in these genes. I simulated a fastq file that tiled 120mers from each position across the gene. I then ran the "bacteria_tradis" and "tradis_gene_insert_sites" pipelines on these. The genes both come back as having 0 insertions and "ins_index" values of zero.

Stats:
File	   	       	    	        hopam.fastq
Total Reads				1424
Reads Matched				1424
% Matched				100
Reads Mapped				1424
% Mapped				100
Unique Insertion Sites : CP011972	0
Seq Len/UIS : CP011972	 		NaN
Total Unique Insertion Sites		0
Total Seq Len/Total UIS			NaN

100% of the reads map to the genome, it looks like SMALT does sensible things here. The failure appears to be in the way Bio-Tradis counts insertion sites. Is it ignoring multi-mapped reads, or perhaps the parser doesn't recognise them.

I've since scaled this test up to tile reads across the entire genome, then identify the which genes/regions can actually be probed by your pipeline. Please excuse my awful code, but it gets the job done.

  1. Generate a fastq that tiles 120mer reads across the entire genome (forward and reverse). Warning: generates a 5G file.
cat CP011972.2.fasta      | grep -v '^>' | tr -d "\n" | tr "a-z" "A-Z" | perl -lane '$l=120; $sq=$_; for($i=0; $i<(length($_)-$l+1); $i++){$s = substr($sq, $i, $l); $cnt++; printf "\@M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\nGACAGTCGAC$s\n+M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\n>1AA1111BCDBB3BBDFGB3BBB01BD3D1AAB1B1BBD2DABDA1B1//AAAB//011DDGDD2A2BDB2DDDDGHDA>//BD2D1/////11/B?>B2BB1B/>//?1BB?0/B1BBGBBBFGHH?B\n", 1000+$cnt, 1000+$cnt;}' > CP011972-tiled.fastq
revcomp CP011972.2.fasta  | grep -v '^>' | tr -d "\n" | tr "a-z" "A-Z" | perl -lane '$l=120; $sq=$_; for($i=0; $i<(length($_)-$l+1); $i++){$s = substr($sq, $i, $l); $cnt++; printf "\@M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\nGACAGTCGAC$s\n+M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\n>1AA1111BCDBB3BBDFGB3BBB01BD3D1AAB1B1BBD2DABDA1B1//AAAB//011DDGDD2A2BDB2DDDDGHDA>//BD2D1/////11/B?>B2BB1B/>//?1BB?0/B1BBGBBBFGHH?B\n", 7*(10**6)+$cnt, 7*(10**6)+$cnt;}' >> CP011972-tiled.fastq

echo "CP011972-tiled.fastq" > fastq.infiles 
  1. Run "bacteria_tradis":
bacteria_tradis -f fastq.infiles -t 'GACAGTCGAC' -r CP011972.2.fasta  --smalt_s 1 --smalt_k 10 --smalt_r 0 --smalt_y 0.8 -mm 5 -v

*Make the stats file readable:

grep ^File fastq.stats | tr "," "\n" > blah && grep -v ^File fastq.stats | tr "," "\n" > blah2 && paste blah blah2
File	                               CP011972-tiled.fastq
Total Reads			       13110900
Reads Matched			       13110900
% Matched			       100
Reads Mapped			       13110900
% Mapped			       100
Unique Insertion Sites : CP011972      6101020 (length: 6555569)
Seq Len/UIS : CP011972	 	       1.0745037715005
Total Unique Insertion Sites	       6101020
Total Seq Len/Total UIS		       1.


0745037715005
  1. Call essential genes:
tradis_gene_insert_sites  -o tradis_gene_insert_sites CP011972.2.embl CP011972-tiled.out.CP011972.insert_site_plot

The CP011972-tiled.out.CP011972.tradis_gene_insert_sites file (attached) also revealed some interesting results.

*None of the "ins_index" values were 1.0 and "gene_length" and "ins_count" almost always differ by 1 i.e. gene_length = ins_count-1.

*335 genes had an "ins_index" less than 0.1. I think all the values should be ~1.0 in this test. There are 5883 genes in total, so roughly 6% all the PSA genes are not able to be studied with your pipeline.

It'd be great if someone could take a look at this, otherwise I'll need to find a workaround.

Best wishes,
Paul.

@lbarquist
Copy link
Contributor

lbarquist commented Nov 28, 2017

Hi Paul,

Have you tried using the "-m 0" option in bacteria_tradis? This should use reads with a mapping quality of 0 in your analysis (i.e. multi-mapping reads), and is discussed in the tutorial PDF.

(edit: was wrong about the -e, our current terminology is correct)

@ppgardne
Copy link
Author

There's a tutorial PDF?! ;-)

That made a big difference. Shouldn't this be the default mode?

File	   	       	    	        CP011972-tiled.fastq
Total Reads				13110900
Reads Matched				13110900
% Matched				100
Reads Mapped				13110900
% Mapped				100
Unique Insertion Sites : CP011972	6504510
Seq Len/UIS : CP011972	 		1.00784978422664
Total Unique Insertion Sites		6504510 (length: 6555569)
Total Seq Len/Total UIS			1.00784978422664

All genes now have an ins_index > 0.83. The off by one thing may still be a very minor issue.

Many thanks for your help!
P.

@lbarquist
Copy link
Contributor

Just to follow up on this briefly now that I've had some coffee:

Our default parameters are for comparative experiments, i.e. between two conditions. In this case we don't consider multi-mapping reads -- since we can't uniquely identify where these reads originate, it should be generally more conservative not to consider them. For example, if we did we might falsely call all repeats of a particular gene as required when only a single copy is actually under selective pressure. In this case, your analysis would be correct, and there's approximately 6% we can't say much about -- this problem would probably be shared by any short-read sequencing technology (e.g. RNA-seq). You could potentially get some more info in these cases by looking at orthology groups, but this isn't something we've implemented.

It sounds like you're trying to do an essentiality study. In this case we have a special flag (-e), which sets --smalt_r 0 -m 0, so bacteria_tradis uses multimapping reads. In this case using everything is most conservative in terms of calling essential genes - basically all identical repeated proteins will be classified as non-essential as long some of them are non-essential. I suppose there might be catastrophic cases where you have 100 copies of a gene, and only one is non-essential without sufficient insertion density to "flip" the others, but I doubt this would happen very often in reality.

If you're still observing this behavior after setting --smalt_r 0 -m 0, then we have a problem. Hope this helps.

@lbarquist
Copy link
Contributor

Cool, glad it's working now! I'll take a look at the off by one thing in the insertion index calculation when I've got some time -- I don't think it should have much of an effect on downstream results, but obviously not ideal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants