GitHub - dhall1995/genotyping_scRNAseq: A selection of scripts to assign barcoded single cells to a likely donor given a sorted BAM file and a list of barcodes to consider.

Selection of scripts to genotype single cells within a BAM file created from a scRNA-seq run using barcoded cells where the donor.id of each barcode is unknown. For a given number of experiments the scripts assume that for each experiment we have a directory containing:

a BAM file named possorted_genome_bam.bam
a .tsv file containing a list of barcodes of interest

For my project at the sanger I had 3 cellranger runs: "cellranger210_count_24933_5149STDY7274846_Danio_rerio_GRCz10_91" "cellranger210_count_24933_5149STDY7274847_Danio_rerio_GRCz10_91" "cellranger210_count_24933_5149STDY7274848_Danio_rerio_GRCz10_91"

I therefore had three directories, one for each cellranger run. Within each directory I had:

a .bam file named possorted_genome_bam.bam
a list of all barcodes of interest called

e.g. cellranger210_count_24933_5149STDY7274846_Danio_rerio_GRCz10_91_barcodes.tsv

All shell scripts are intended to be run in the directory directly above the experiment directories.

To separate each BAM file into one BAM file for each barcode of interest, we first run ./Extract_barcodes.sh
We then run Variant_Calling_per_Experiment.sh. This merges the barcode BAMs for each experiment into a QC-ed BAM file for each experiment and then performs de novo variant calling on this merged BAM and filters out poor quality SNPs
We then run variant_calling_per_barcode.sh. This Uses our vcf file generated for each experiment as a reference file in order to genotype each barcode BAM at each variant site. We then filter out SNPs which are missing in too many cells and finally produce a compressed vcf for each experiment in which each barcoded cell BAM file is treated as a different sample.

The R script is currently untested as I have only run those commands directly from within the rstudio terminal. However, if the script works it should output some descriptive plots about the sparsity of the SNP matrix generated from each experiment and also attempt to run probabilistic pca on the SNP matrix, assigning each barcode to a likely cluster. The number of donors can be input directly into that script and the ppca algorithm will try and come up with the most likely partitioning of the cells into donors given the SNP profiles of each cell. I have only tested this so far with small numbers of donors and I expect that for n>4 donors the clustering may become unreliable.

The scripts require a quite large amount of storage capacity to work since it generates a lot of BAM files. If storage is low then it is safe to delete the possorted_genome_bam.bam files after running Extract_barcodes.sh. Similarly, if they are unneeded then it is safe to delete any merged BAM after variant calling has been performed.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
Clustering.R		Clustering.R
Extract_barcodes.sh		Extract_barcodes.sh
README.md		README.md
Variant_Calling_per_Experiment.sh		Variant_Calling_per_Experiment.sh
rotation3_draft.docx		rotation3_draft.docx
variant_calling_per_barcode.sh		variant_calling_per_barcode.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

dhall1995/genotyping_scRNAseq

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages