Relationship between the STAT analysis data available on the NCBI SRA Run Browser and that on the cloud platform #36
Junna-Kawasaki,
If you look at the row for ERR979125 in the cloud table tax_analysis_info, you will find that it was an aligned human submission for which only the unaligned spots were analyzed. In such a case, Run Browser takes the total spot count (789675109), subtracts the analyzed spot count (1214952), and adds that difference (the aligned spot count) to the human spot count. The denominator used in Run Browser is therefore different: it is the identified spot count plus the additional human aligned spot count, and the aligned spots make up the majority of spots in this sample.
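The adjustment described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Run Browser's actual code: the function names are hypothetical, and the identified and human counts in the usage lines are made-up placeholders, since only the total and analyzed counts for ERR979125 are quoted in this thread.

```python
def aligned_spot_count(total_spots: int, analyzed_spots: int) -> int:
    """Spots that were aligned in the submission and therefore skipped by STAT."""
    return total_spots - analyzed_spots


def run_browser_human_fraction(total_spots: int, analyzed_spots: int,
                               identified_spots: int, human_spots: int) -> float:
    """Human fraction after crediting the aligned spots to human (assumed logic)."""
    aligned = aligned_spot_count(total_spots, analyzed_spots)
    adjusted_human = human_spots + aligned      # aligned spots are counted as human
    denominator = identified_spots + aligned    # identified spots plus aligned spots
    return adjusted_human / denominator


# Counts quoted above for ERR979125.
aligned = aligned_spot_count(789675109, 1214952)
print(aligned)  # 788460157

# Placeholder identified/human counts, for illustration only.
print(run_browser_human_fraction(1000, 100, 50, 10))
```

Because the aligned spots dominate both the numerator and the denominator, the resulting human percentage is much higher than one computed from the analyzed spots alone.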
Thank you for your prompt response. I hope it's not too much trouble, but I have a few additional questions.

I would like to know the percentage of identified_spot_count (i.e., spots matching a certain organism) and unidentified_spot_count (i.e., spots not matching any organism) for each SRA run. However, the Cloud-based Taxonomy Analysis Information Table does not provide data in the unaligned_spot_count column for most SRA runs. I therefore considered calculating the percentage of identified_spot_count per sample and assuming that the remaining spots fall under unaligned_spot_count. Is it possible to estimate the unaligned_spot_count percentage in each SRA dataset using this approach?

Your previous explanation stated that identified_spot_count does not contain spots aligned to the human genome, which leads me to believe that this method might not work well for some samples. I apologize for the numerous questions, but I would greatly appreciate any advice you could provide on calculating the percentage of unidentified spots that did not match anything using STAT.

Thank you very much for your assistance.
@Junna-Kawasaki, apologies for this tardy reply.
identified_spot_count is the number of spots where a taxon was deduced (assigned); analyzed_spot_count is the total number of spots subject to tax analysis. It is therefore not unreasonable to treat the remainder as the unidentified spots.
That calculation is simply analyzed_spot_count - identified_spot_count. However, at the moment you will find that many STAT analysis results have a null identified_spot_count. While we hope to backfill those values in the somewhat near future, you can simply sum the total_count of two specific taxa: tax_id = 131567, name = 'cellular organisms' and tax_id = 10239, name = 'Viruses'. The sum of those will equal identified_spot_count.
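A minimal Python sketch of this calculation, including the fallback for a null identified_spot_count, might look like the following. The function names and the dictionary of per-taxon total_count values are hypothetical stand-ins for however the cloud tables are queried, the example numbers are made up, and the NCBI Taxonomy ID for 'cellular organisms' is taken to be 131567.

```python
from typing import Mapping, Optional

CELLULAR_ORGANISMS_TAX_ID = 131567  # NCBI Taxonomy: 'cellular organisms'
VIRUSES_TAX_ID = 10239              # NCBI Taxonomy: 'Viruses'


def resolve_identified(identified_spot_count: Optional[int],
                       total_count_by_tax_id: Mapping[int, int]) -> int:
    """Use identified_spot_count when present, else sum the two root taxa."""
    if identified_spot_count is not None:
        return identified_spot_count
    return (total_count_by_tax_id.get(CELLULAR_ORGANISMS_TAX_ID, 0)
            + total_count_by_tax_id.get(VIRUSES_TAX_ID, 0))


def unidentified_fraction(analyzed_spot_count: int,
                          identified_spot_count: Optional[int],
                          total_count_by_tax_id: Mapping[int, int]) -> float:
    """Fraction of analyzed spots to which no taxon was assigned."""
    if analyzed_spot_count <= 0:
        raise ValueError("analyzed_spot_count must be positive")
    identified = resolve_identified(identified_spot_count, total_count_by_tax_id)
    return (analyzed_spot_count - identified) / analyzed_spot_count


# Made-up counts for a run whose identified_spot_count is null.
counts = {CELLULAR_ORGANISMS_TAX_ID: 700, VIRUSES_TAX_ID: 50}
print(unidentified_fraction(1000, None, counts))  # 0.25
```

Note that this fraction is relative to analyzed_spot_count, so for aligned submissions it describes only the spots that STAT actually processed.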
I am writing to seek your assistance with a question regarding the Cloud-based Taxonomy Analysis Information Table.
I noticed a discrepancy between the “identified_spot_count” available on the cloud platform and the "IDENTIFIED READS" displayed on the Sequence Read Archive Run Browser. For instance, in the case of ERR979125 on the Run Browser, 97.1% of the reads are listed as being of human origin (https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=ERR979125&display=analysis).
However, the data retrieved from the cloud shows the following:
Regardless of whether the denominator is the “analyzed_spot_count” or the “total_spot_count”, the percentages are significantly lower than those reported on the Run Browser (16.0% and 0.24%).
Could you kindly clarify the relationship between the data available on the Run Browser and that on the cloud platform?
I appreciate your assistance and look forward to your response.