Low alignment rate #472

Open
MarcelloMalpighi opened this issue Dec 21, 2024 · 5 comments

Comments

@MarcelloMalpighi

Hi,
I quantified a paired-end Smart-seq2 RNA-seq dataset (SRR3936136) using kallisto v0.48.0 with the GENCODE v47 transcriptome. The fragment length mean and standard deviation were estimated with CollectInsertSizeMetrics (Picard) from the STAR BAM file. The pseudoalignment rate was 34.2% in paired-end mode and 45.6%/43.6% in single-end mode (read 1/read 2), compared with the 74.93% mapping rate obtained with STAR. I wonder why there is such a notable discrepancy in the alignment rate. The relevant information is provided below.
kallisto paired-end run_info.json:

{
        "n_targets": 387944,
        "n_bootstraps": 0,
        "n_processed": 1981171,
        "n_pseudoaligned": 676920,
        "n_unique": 68993,
        "p_pseudoaligned": 34.2,
        "p_unique": 3.5,
        "kallisto_version": "0.48.0",
        "index_version": 10,
        "start_time": "Sat Dec 21 00:02:34 2024",
        "call": "kallisto quant --pseudobam --genomebam --single-overhang -l 302.431195 -s 138.561984 -i /home/Usersdata2/references/gencode/kallistoIndex/gencode.v47.primary_assembly_k15.index -o /home/Usersdata2/abundance/SRR3936136 -t 32 --gtf /home/Usersdata2/references/gencode/gencode.v47.primary_assembly.annotation.gtf --chromosomes /home/Usersdata2/references/gencode/kallistoIndex/gencode.v47.primary_assembly.annotation.chromosomeInfo.txt /home/Usersdata2/datasets/SRP079058/fastq/SRR3936136_1.fastq.gz /home/Usersdata2/datasets/SRP079058/fastq/SRR3936136_2.fastq.gz"
}

kallisto single-end run_info.json (read 1, then read 2):

{
        "n_targets": 387944,
        "n_bootstraps": 0,
        "n_processed": 1981171,
        "n_pseudoaligned": 903816,
        "n_unique": 84772,
        "p_pseudoaligned": 45.6,
        "p_unique": 4.3,
        "kallisto_version": "0.48.0",
        "index_version": 10,
        "start_time": "Sat Dec 21 19:35:34 2024",
        "call": "kallisto quant --pseudobam --genomebam --single-overhang -l 302.431 -s 138.562 -i /home/Usersdata2/references/gencode/kallistoIndex/gencode.v47.primary_assembly_k15.index -o /home/Usersdata2/kallisto/testSingle/read1 -t 16 --single --gtf /home/Usersdata2/references/gencode/gencode.v47.primary_assembly.annotation.gtf --chromosomes /home/Usersdata2/references/gencode/kallistoIndex/gencode.v47.primary_assembly.annotation.chromosomeInfo.txt /home/Usersdata2/datasets/SRP079058/fastq/SRR3936136_1.fastq.gz"
}
{
        "n_targets": 387944,
        "n_bootstraps": 0,
        "n_processed": 1981171,
        "n_pseudoaligned": 862909,
        "n_unique": 81911,
        "p_pseudoaligned": 43.6,
        "p_unique": 4.1,
        "kallisto_version": "0.48.0",
        "index_version": 10,
        "start_time": "Sat Dec 21 19:38:53 2024",
        "call": "kallisto quant --pseudobam --genomebam --single-overhang -l 302.431 -s 138.562 -i /home/Usersdata2/references/gencode/kallistoIndex/gencode.v47.primary_assembly_k15.index -o /home/Usersdata2/kallisto/testSingle/read2 -t 16 --single --gtf /home/Usersdata2/references/gencode/gencode.v47.primary_assembly.annotation.gtf --chromosomes /home/Usersdata2/references/gencode/kallistoIndex/gencode.v47.primary_assembly.annotation.chromosomeInfo.txt /home/Usersdata2/datasets/SRP079058/fastq/SRR3936136_2.fastq.gz"
}
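
As a sanity check, the reported percentages can be recomputed from the counts in the three run_info.json outputs above. This is a minimal Python sketch with the counts copied by hand from the JSON; the dictionary `runs` and the helper `rate` are illustrative names, not part of kallisto.

```python
# Counts copied from the three run_info.json outputs above.
# All three runs report the same n_processed.
runs = {
    "paired": {"n_processed": 1981171, "n_pseudoaligned": 676920, "n_unique": 68993},
    "single_read1": {"n_processed": 1981171, "n_pseudoaligned": 903816, "n_unique": 84772},
    "single_read2": {"n_processed": 1981171, "n_pseudoaligned": 862909, "n_unique": 81911},
}


def rate(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


for name, info in runs.items():
    p_aligned = rate(info["n_pseudoaligned"], info["n_processed"])
    p_unique = rate(info["n_unique"], info["n_processed"])
    print(f"{name}: pseudoaligned {p_aligned}%, unique {p_unique}%")
```

The recomputed values agree with `p_pseudoaligned` and `p_unique` in each file, so both percentages are relative to `n_processed`.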

STAR Log.final.out

                                 Started job on |       Dec 21 00:31:43
                             Started mapping on |       Dec 21 00:35:30
                                    Finished on |       Dec 21 00:36:24
       Mapping speed, Million of reads per hour |       132.08

                          Number of input reads |       1981171
                      Average input read length |       128
                                    UNIQUE READS:
                   Uniquely mapped reads number |       665253
                        Uniquely mapped reads % |       33.58%
                          Average mapped length |       127.44
                       Number of splices: Total |       142473
            Number of splices: Annotated (sjdb) |       142473
                       Number of splices: GT/AG |       140613
                       Number of splices: GC/AG |       1251
                       Number of splices: AT/AC |       64
               Number of splices: Non-canonical |       545
                      Mismatch rate per base, % |       0.37%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.44
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.25
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       819140
             % of reads mapped to multiple loci |       41.35%
        Number of reads mapped to too many loci |       3
             % of reads mapped to too many loci |       0.00%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       552
       % of reads unmapped: too many mismatches |       0.03%
            Number of reads unmapped: too short |       494129
                 % of reads unmapped: too short |       24.94%
                Number of reads unmapped: other |       2094
                     % of reads unmapped: other |       0.11%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%
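
The 74.93% figure quoted for STAR is the sum of the uniquely mapped and multi-mapping percentages in the log above. A short Python sketch, with counts copied by hand from Log.final.out (variable and function names are illustrative), shows how the categories add up:

```python
# Counts copied from the STAR Log.final.out above.
input_reads = 1981171
uniquely_mapped = 665253
multi_mapped = 819140
too_many_loci = 3
unmapped = {"too many mismatches": 552, "too short": 494129, "other": 2094}


def pct(count, total=input_reads):
    """Return count as a percentage of the input reads."""
    return 100.0 * count / total


mapped = uniquely_mapped + multi_mapped
print(f"mapped (unique + multi): {pct(mapped):.2f}%")
for reason, count in unmapped.items():
    print(f"unmapped, {reason}: {pct(count):.2f}%")

# Every input read should fall into exactly one category.
accounted = mapped + too_many_loci + sum(unmapped.values())
print("all reads accounted for:", accounted == input_reads)
```

Note that this rate counts alignments anywhere in the genome, whereas the kallisto rates above count pseudoalignments to transcripts.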

Thanks in advance.

@mschilli87

What fraction of your STAR alignments is intergenic or overlapping introns?

@MarcelloMalpighi
Author

I used the following command to count the reads that STAR aligned to the reference transcriptome, then subtracted that count from the total number of reads. The result indicates that 52.96% of the reads did not align to the annotated transcriptome, i.e. they were presumably not derived from exons. Does this imply that the alignment rate is relatively low because a significant proportion of the reads originates from non-exonic regions?
>samtools view SRR3936136_Aligned.toTranscriptome.out.bam | awk '{print $1}' | sort | uniq | wc -l
>932000
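
For readers who want to follow the calculation, here is a rough Python equivalent of the pipeline above: it counts distinct read names (column 1) in SAM-formatted text, then derives the 52.96% figure from the posted numbers. The function name and the tiny example records are made up for illustration; only 932000 and 1981171 come from this thread.

```python
def count_unique_read_names(sam_lines):
    """Count distinct QNAME values (column 1) in SAM-formatted text lines."""
    names = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        names.add(line.split("\t", 1)[0])
    return len(names)


# Made-up records: read1 has two transcriptome alignments, read2 has one.
example = [
    "@HD\tVN:1.6",
    "read1\t0\tENST_A\t10\t255\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read1\t256\tENST_B\t40\t255\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t0\tENST_A\t99\t255\t5M\t*\t0\t0\tACGTA\tIIIII",
]
print(count_unique_read_names(example))

# Numbers posted in this thread.
total_reads = 1981171
transcriptome_reads = 932000
not_in_transcriptome = 100.0 * (1 - transcriptome_reads / total_reads)
print(f"not aligned to the transcriptome: {not_in_transcriptome:.2f}%")
```

Because the denominator is the total number of input reads, the 52.96% also includes reads that STAR could not map at all.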

@mschilli87

I would assume so. Kallisto usually (pseudo)aligns to annotated spliced transcripts only. So your comparison to STAR is not really apples to apples.
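
To put the numbers on a more comparable footing, the STAR rate can be restricted to the 932000 reads counted in the toTranscriptome BAM and set against the kallisto pseudoalignment counts. This sketch uses only figures already posted in this thread; the labels and variable names are illustrative.

```python
# All counts below are copied from earlier posts in this thread.
n_reads = 1981171
counts = {
    "STAR, genome (unique + multi)": 665253 + 819140,
    "STAR, toTranscriptome BAM": 932000,
    "kallisto, paired-end": 676920,
    "kallisto, single-end read 1": 903816,
    "kallisto, single-end read 2": 862909,
}

# Express every count as a percentage of the same denominator.
rates = {label: 100.0 * n / n_reads for label, n in counts.items()}

for label, value in rates.items():
    print(f"{label:32s} {value:6.2f}%")
```

On this basis the transcriptome-restricted STAR rate is far closer to the kallisto single-end rates than the genome-wide 74.93% suggests, although the paired-end rate remains lower.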


@MarcelloMalpighi
Author

The original literature indicates that it is a standard single-cell RNA-seq library.
