Annotation_dev #9

Open
wants to merge 40 commits into
base: main

ahmed-said-jax
Collaborator

@ahmed-said-jax ahmed-said-jax commented Apr 17, 2023

This commit allows the pipeline to annotate
the filtered_feature_bc_matrix.h5
file outputted by cellranger. It also allows
for optional RNA velocity analysis with velocyto.

ahmed-said-jax and others added 11 commits April 13, 2023 16:32
This adds some files to assets (to be ignored for now). They will
eventually be used for annotating genes and generating web summaries.

More importantly, this commit infers the species from the reference
genome path and gets the associated genes.gtf file for velocyto to use.
Using gene annotation matrices in assets, the data-matrix is annotated
and stored in an AnnData object. It also calculates doublet scores
for the matrix. If the reference genome is unsupported, gene types are
not annotated, but doublet scores are still calculated.
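
A minimal sketch of how a species could be inferred from a reference genome path, as the commit message above describes. The build tokens, the mouse dataset name, and the function name `infer_dataset` are assumptions for illustration; only `hsapiens_gene_ensembl` appears in this PR, and the pipeline's actual logic may differ.

```python
from typing import Optional

# Hypothetical mapping from genome-build tokens to Ensembl dataset names.
# Only "hsapiens_gene_ensembl" is confirmed by the PR; the rest is assumed.
DATASET_BY_BUILD = {
    "grch38": "hsapiens_gene_ensembl",
    "mm10": "mmusculus_gene_ensembl",
}


def infer_dataset(reference_path: str) -> Optional[str]:
    """Return the dataset name for a reference path, or None if unsupported."""
    lowered = reference_path.lower()
    for build, dataset in DATASET_BY_BUILD.items():
        if build in lowered:
            return dataset
    # Unsupported genome: the caller can skip gene-type annotation
    # and still compute doublet scores.
    return None
```

Returning `None` for an unknown genome keeps the "unsupported reference" case explicit, so the annotation step can be skipped without raising.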

RNA-velocity support to be added soon.

This commit allows the pipeline to annotate
the filtered_feature_bc_matrix.h5
file outputted by 'cellranger'. It also allows
for optional RNA velocity analysis with 'velocyto'.
@ahmed-said-jax ahmed-said-jax requested a review from wflynny April 17, 2023 18:36
This `Nextflow` script will now take `cellranger count` outputs and
generate `AnnData` and `Seurat` objects. They are annotated with
various QC statistics as well as gene types. The pipeline also generates
plots of the QC statistics and an `HTML` document that summarizes the
pipeline's outputs. This `HTML` summary is currently incomplete and is
likely to change substantially as the outputs are organized.
@ahmed-said-jax
Collaborator Author

Commit 43cefcf has not been tested and is nonfunctional.


# The two currently supported species have different file types
# and formats, so they need to be handled differently
if ds == "hsapiens_gene_ensembl":
Collaborator Author

this branching behavior feels ugly to me but i don't have a better solution at the moment
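
One possible alternative to the `if`/`elif` chain is a lookup table that maps each dataset name to its own loader. This is only a sketch under assumptions: the loader functions, their return values, and the mouse dataset name are hypothetical and stand in for whatever species-specific parsing the script actually does.

```python
from typing import Callable, Dict, List, Optional


def load_human(path: str) -> List[str]:
    # Placeholder for the human-specific file parsing.
    return [f"human:{path}"]


def load_mouse(path: str) -> List[str]:
    # Placeholder for the mouse-specific file parsing.
    return [f"mouse:{path}"]


# Each supported dataset gets one entry; adding a species means adding a
# loader and a row here, not another branch.
LOADERS: Dict[str, Callable[[str], List[str]]] = {
    "hsapiens_gene_ensembl": load_human,
    "mmusculus_gene_ensembl": load_mouse,  # assumed second species
}


def load_annotations(ds: str, path: str) -> Optional[List[str]]:
    loader = LOADERS.get(ds)
    if loader is None:
        return None  # unsupported dataset: caller skips annotation
    return loader(path)
```

The per-species differences still exist, but they live in separate functions, and the dispatch itself stays a single lookup.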

This commit creates a nearly fully functional annotation pipeline.
After `cellranger count` creates a `filtered_feature_bc_matrix`, `soupx`
creates its own from the `cellranger` outputs. The `soupx` matrix should
have ambient RNA filtered out. Both matrices are then annotated in
parallel for various gene types and doublets, and plots are generated
automatically, as well as a web summary including plots. The web summary
is still incomplete: it does not list the annotated genes, and it does
not create the correct directory tree. The flowchart is also incorrect
at the moment.
script:
"""
gen_plots.py
gen_summary.py --summary_dir=${summary_dir} --pubdir=${launchDir / params.pubdir}
Collaborator Author

the pubdir here might need to be changed

ahmed-said-jax and others added 6 commits May 23, 2023 09:58
The pre-analysis (annotation) pipeline now works as expected and
produces an HTML summary with correct information. It also now has
a table of annotated genes.
This commit updates the README to contain information about
pre-analysis gene/cell annotation for single cell expression.
Additionally, it updates `tools.csv` to contain the correct doublet
detection algorithm. Finally, it updates `annotate.py` to make it
case-insensitive for Ensembl gene IDs in the reference annotations, in
case a user passes in their own files with lowercased gene IDs.
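
A minimal sketch of what case-insensitive matching on Ensembl gene IDs could look like. The function names and the dictionary-based reference are assumptions for illustration and are not taken from `annotate.py`; the gene ID is just an example value.

```python
from typing import Dict, Iterable, Optional


def normalize_gene_id(gene_id: str) -> str:
    # Compare IDs in a single canonical case so "ensg..." matches "ENSG...".
    return gene_id.strip().upper()


def annotate_gene_types(
    matrix_ids: Iterable[str], reference: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Look up a gene type for each matrix ID, ignoring case in the reference."""
    normalized_ref = {normalize_gene_id(k): v for k, v in reference.items()}
    return {g: normalized_ref.get(normalize_gene_id(g)) for g in matrix_ids}


# A user-supplied reference with lowercased IDs still matches the matrix IDs.
user_reference = {"ensg00000141510": "protein_coding"}
result = annotate_gene_types(["ENSG00000141510", "ENSG00000000000"], user_reference)
```

Normalizing both sides once, before the lookup, leaves the original IDs in the matrix untouched.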
main.nf (outdated, resolved)
bin/arg_utils.py (outdated, resolved)
bin/extract_files.py (outdated, resolved)
bin/filter_ambient_rna.r (outdated, resolved)

script:
"""
mimic_cellranger.py --soupx_dir=${tool}
Collaborator

Can the functionality of mimic_cellranger.py be accomplished via bash cp commands and changing how inputs are staged?

Collaborator Author

i think this is probably the simplest way to do it, but is this safe? in particular, the rm -r step scares me
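
If the cleanup step were kept in Python rather than moved to bash, one way to reduce the risk of a recursive delete is to refuse any target that is not strictly inside a known root, such as the task's work directory. This is a hedged sketch: `safe_rmtree` and its arguments are hypothetical and do not come from `mimic_cellranger.py`.

```python
import shutil
from pathlib import Path


def safe_rmtree(target: str, allowed_root: str) -> None:
    """Recursively delete target only if it lies strictly inside allowed_root."""
    target_path = Path(target).resolve()
    root_path = Path(allowed_root).resolve()
    # Reject the root itself and anything outside it (including "..", which
    # resolve() has already collapsed).
    if root_path not in target_path.parents:
        raise ValueError(f"refusing to delete {target_path}: not inside {root_path}")
    shutil.rmtree(target_path)
```

The same idea applies to a shell version: check that the path is non-empty and under the expected directory before running `rm -r`.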

bin/annotate.py (outdated, resolved)
bin/annotate.py (resolved)
bin/annotate.py (resolved)
bin/gen_plots.py (outdated, resolved)
assets/summaries/no_rna_velo/overview.csv (outdated, resolved)
@@ -14,15 +14,21 @@ nextflow.enable.dsl = 2

params.pubdir = params.getOrDefault("pubdir", "pubdir")
Collaborator Author

is the getOrDefault thing necessary? can't we just do params.thing = 'value' for the same effect?

Suggested change
- params.pubdir = params.getOrDefault("pubdir", "pubdir")
+ params.pubdir = "pubdir"

Collaborator Author

and i mean that for all the parameters, not just the pubdir

2 participants