-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exploration notebook for creating a reference of consensus labels #936
Exploration notebook for creating a reference of consensus labels #936
Conversation
@jaclyn-taroni I'm going to request you for review since we talked about a lot of this beforehand, but please let me know if I should send it to someone else. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will leave a review on the content of the notebook soon, but I am returning these comments about the environment now. I pushed the relevant CI workflow changes to this branch since @allyhawkins is out today, and it makes sense to me that we want to find out if we have a problem sooner rather than later!
"Package": "AnnotationDbi", | ||
"Version": "1.68.0", | ||
"Source": "Bioconductor", | ||
"Repository": "Bioconductor 3.20", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a different version of Bioconductor than the one specified at line 32. I can't be 100% sure this is responsible for my problems with renv::restore()
locally, but it seems like a good candidate:
Warning: failed to find binary for 'BiocVersion 3.20.0' in package repositories
Warning: failed to find source for 'BiocVersion 3.20.0' in package repositories
Warning: error downloading 'https://bioconductor.org/packages/3.19/bioc/src/contrib/Archive/BiocVersion/BiocVersion_3.20.0.tar.gz' [error code 22]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the problem isn't just my local setup: https://github.com/AlexsLemonade/OpenScPCA-analysis/actions/runs/12354014198/job/34474381650?pr=936
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the fact that there is a mix of Bioc versions in the renv seems likely to be a problem. I'm quite surprised (but not totally shocked) that renv
allowed that, but I am quite confident it is the problem.
I think the move is most likely to revert (or delete) the renv.lock
file, make sure the proper versions are set in renv/settings.json
(they seem to be), then run renv::hydrate()
to get a consistent set of packages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just sticking an observation here: session info reports R version 4.4.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just sticking an observation here: session info reports R version
4.4.2
That's because I'm using R 4.4.2. That was the only version of R 4.4 I could find when I was setting up my new laptop that's available for arm (https://cran.r-project.org/bin/macosx/). I looked through the old versions and couldn't find a version for arm, only Intel, so maybe I'm looking in the wrong place?
Are we okay with using R 4.4.2 and BioC 3.20 for modules? The mismatching versions was because I had run the create-analysis.py
script with the --renv
flag which initiated the lock file. But when I actually created the project and started installing packages it used 3.20. I was able to fix the errors you were seeing by adjusting the lock file to have R 4.4.2
and Bioconductor 3.20
. I just want to be sure we are okay with keeping this version or should I try and change to R 4.4 and BioC 3.19?
The error I'm seeing now looks like a dependency error for igraph and is unrelated to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to fix the errors you were seeing by adjusting the lock file to have
R 4.4.2
andBioconductor 3.20
. I just want to be sure we are okay with keeping this version or should I try and change to R 4.4 and BioC 3.19?
This seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we okay with using R 4.4.2 and BioC 3.20 for modules? The mismatching versions was because I had run the
create-analysis.py
script with the--renv
flag which initiated the lock file. But when I actually created the project and started installing packages it used 3.20. I was able to fix the errors you were seeing by adjusting the lock file to haveR 4.4.2
andBioconductor 3.20
. I just want to be sure we are okay with keeping this version or should I try and change to R 4.4 and BioC 3.19?
This like a bug we should fix... If our setup isn't forcing sticking with the defined Bioconductor version, we could end up with a lot of troubled down the road.
(For the record, I am okay with R 4.4.2, but the difference between Bioc3.19 and 3.20 is potentially more significant)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This like a bug we should fix... If our setup isn't forcing sticking with the defined Bioconductor version, we could end up with a lot of troubled down the road.
This could have been a me problem. I probably didn't use renv::restore()
when I first started writing code in this module, so I installed Bioconductor myself and it used the most recent version. So then I'm assuming when I installed any other packages it used the version of Bioconductor I had installed. If we want to use 3.19 then I can go back and delete everything and start again...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to my comment about local trouble with renv
, I think it's time uncomment the parts of the workflow files that set up renv
, etc. to check if things build/work in that context.
analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd
Outdated
Show resolved
Hide resolved
@allyhawkins, you might want to take care of relevant typos in #944 in this PR :) |
If I had looked at the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly agree with your conclusions. I am concerned about eliminating epithelial cell for having too many descendants.
# list all ancestors and descendants calculate total | ||
ancestors = list(ontologyIndex::get_ancestors(cl_ont, cl_ontology)), | ||
total_ancestors = length(ancestors), | ||
descendants = list(ontologyIndex::get_descendants(cl_ont, cl_ontology)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is unlikely to matter in practice, but I wonder if we want to be setting exclude_roots = TRUE
here?
analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd
Outdated
Show resolved
Hide resolved
unique() | ||
``` | ||
|
||
The only term in this list that I would be concerned about losing is "neuron". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about epithelial cell?
dplyr::filter(cl_annotation == "blood cell") | ||
``` | ||
|
||
I think I'm in favor of not having a "blood cell" label, since I'm not sure that it's helpful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
dplyr::filter(cl_annotation == "bone cell") | ||
``` | ||
|
||
I think I would also remove bone cell, since hematopoietic stem cells and osteoclasts seem pretty different to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
dplyr::filter(cl_annotation == "myeloid leukocyte") | ||
``` | ||
|
||
I'm torn on this one, because I do think it's helpful to know if something is of the myeloid lineage, but if we aren't keeping lymphocyte then I would argue we shouldn't keep myeloid leukocyte. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
T and B cells are typically easier to tell apart based on gene expression alone than different cell types in the myeloid lineage (at least in bulk settings, in my experience), so I don't think it's quite the same thing as labeling something with lymphocyte.
|
||
I'm torn on this one, because I do think it's helpful to know if something is of the myeloid lineage, but if we aren't keeping lymphocyte then I would argue we shouldn't keep myeloid leukocyte. | ||
|
||
#### Progenitor cell |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not include this one.
|
||
Same with `progenitor cell`, I do think it could be helpful to know that something may be a progenitor cell, but when you have a cell with the label for HSC and the label for cells like monocytes or osteoblasts, then maybe we are talking about a tumor cell instead. | ||
|
||
Along those same lines, I think the below terms, `lining cell` and `supporting cell`, are too broad even though they have few descendants. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me.
|
||
I would remove `perivascular cell`, since the cell type labels from PanglaoDB and Blueprint are pretty different from each other. | ||
|
||
## Similarity index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced this is as useful as LCA + some simple rules (e.g., total descendants threshold with exceptions).
analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd
Outdated
Show resolved
Hide resolved
…tology-exploration
Co-authored-by: Jaclyn Taroni <[email protected]>
In 0e89309, I added some notes about keeping myeloid leukocyte and a section to look at the epithelial cell matches. I think the only thing remaining here is figuring out CI, which I hope will be solved by updating the Dockerfile in #950 and running the module GHA on that image. To summarize the rules we are putting in place right now:
|
This is now passing checks so should be good to go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have one comment about when epithelial cell
is an appropriate label that I'd like to see make it to the conclusions. Otherwise, LGTM.
analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd
Outdated
Show resolved
Hide resolved
``` | ||
|
||
The PanglaoDB cell types seem to be more specific than the ones present in Blueprint Encode, similar to the observation with neurons. | ||
We should keep epithelial cell. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we codify this, I would say epithelial cell is only acceptable when the Blueprint annotation is Epithelial cells
because it's likely just a matter of PanglaoDB being much more specific. When the Blueprint annotation is Keratinocytes
, things seem to get a bit weird.
|
||
- Pairs should not have more than 1 LCA, with the exception of the matches that have the label hematopoietic precursor cell. | ||
- The LCA should have equal to or less than 170 total descendants. | ||
- We whould include the term for `neuron` and `epithelial cell` even though they do not pass the threshold for number of descendants. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above about when I think epithelial cell
is acceptable vs. unacceptable.
Co-authored-by: Jaclyn Taroni <[email protected]>
I added a note underneath the table showing the combinations for epithelial cell and to the conclusion in 3676e8d. And I also noted it on #951 for when we go to create the reference. Going to go ahead and merge this in once checks pass. |
Purpose/implementation Section
Please link to the GitHub issue that this pull request addresses.
Closes #889
What is the goal of this pull request?
Here I'm adding a notebook that looks at defining consensus cell types between all possible cell type labels in the PanglaoDB and Blueprint Encode references. For each possible pair in the references, I computed the latest common ancestor (LCA), which is the latest ancestor that is shared between the two labels. Ultimately, I think we could use the LCA as the consensus label for any pairs that meet the criteria that we set (e.g., the LCA has a total number of descendants less than a set threshold). Any pairs that do not meet the criteria would get an "Unknown" label.
Briefly describe the general approach you took to achieve this goal.
lining cell
,supporting cell
).lining cell
andsupporting cell
.hematopoietic precursor cell
and looking at those pairs they do all fall in the category of hematopoietic stem and progenitor cells, which I think is a useful label to have.neuron
. This term has 500 + descendants, but I think knowing if a cell is a neuron is pretty useful information.If known, do you anticipate filing additional pull requests to complete this analysis module?
Yes
Results
What types of results does your code produce (e.g., table, figure)?
All the results are in the rendered copy of the notebook. I'm including it here for easy review:
01-reference-exploration.html.zip
What is your summary of the results?
Generally I think we can use this to guide creation of our own consensus label reference. I invision this being a table with all possible cell type label combinations between PanglaoDB and Blueprint Encode and then a column with the LCA or consensus label. Along side this, we would have a whitelist of acceptable LCA terms that either pass the threshold that we set for number of descendants or are granular enough that we would want to keep. I'm wary of just relying on the threshold since terms like neuron would get lost while terms like lining cell would be kept.
Provide directions for reviewers
What are the software and computational requirements needed to be able to run the code in this PR?
I ran this locally using the
renv
environment.Are there particularly areas you'd like reviewers to have a close look at?
Are there other metrics or items that you think we should look at?
Author checklists
Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.