Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scripts to run copykat and infercnv for a subselection of Wilms tumor from SCPCP000006 #801

Merged
merged 25 commits into from
Oct 15, 2024

Conversation

maud-p
Copy link
Contributor

@maud-p maud-p commented Oct 8, 2024

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

We follow the issue 790 and the PR 776

What is the goal of this pull request?

On a #776 (comment), I want to try to infer aneuploidy and/or CNV to help identifying normal and cancer cells.

For this, I compare the use of copykat and infercnv. For each of the method, I wanted to compare few parameters.

copykat

In copykat, few parameters can be used to fine-tuned the results. Especially, we can try running copykat with or without a set of normal cells. It is important to note that , CopyKAT had difficulty in predicting tumor and normal cells in the cases of pediatric and liquid tumors that have a few CNAs. CopyKAT provides two ways to bypass this to give certain output instead of being dead staright: 1) input a vector of cell names of known normal cells from the same dataset 2) or try to search for T cells. (see copykat). I thus tested both with and without a reference but I am quite convinced that giving few normal cells help the function.

One parameter I also wanted to test is the clustering method. In copykat, parameters for clustering include "euclidean" distance and correlational distance, ie. 1-"pearson" and "spearman" similarity. In general, corretional distances tend to favor noisy data, while euclidean distance tends to favor data with larger CN segments. I thus tested eucliedean and spearman. In our dataset, I think euclidean (default) is perfoming best.

infercnv

Another way of inferring copy number alterations from tumor single cell RNA-Seq data is using infercnv. In a previous discussion, we were not sure about the impact of the definition of the heatly reference. For that reason, I ran infercvn with no normal cells as reference or immune and/or endothelial cells as reference.

Briefly describe the general approach you took to achieve this goal.

In this PR, we run the scripts 05_copyKAT.R and 06_infercnv.R for 5 samples.

We selected previously 5 samples to test for these parameters:

  • sample SCPCS000194 has > 85 % of cells predicted as kidney and 234 + 83 endothelium and immune cells.
  • sample SCPCS000179 has > 94 % of cells predicted as kidney and 25 + 111 endothelium and immune cells.
  • sample SCPCS000184 has > 96 % of cells predicted as kidney and 39 + 70 endothelium and immune cells.
  • sample SCPCS000205 has > 89 % of cells predicted as kidney and 92 + 76 endothelium and immune cells.
  • sample SCPCS0000208 has > 95 % of cells predicted as kidney and 18 + 35 endothelium and immune cells.

This is generated outputs that are stores in results as described in the results\READ.ME

If known, do you anticipate filing additional pull requests to complete this analysis module?

Results

For each of the samples, we ran 05_copyKAT.R script, defining the distance parameters.
The outputs of copykat are saved in the results\{sample_id}\05_copykat folder.
For each of the samples, we ran 06_infercnv.R script, defining either no (none) normal cells as reference, or immune and/or endothelial cells.
The outputs of infercnv are saved in the results\{sample_id}\06_infercnv folder.

What is the name of your results bucket on S3?

researcher-008971640512-us-east-2

What types of results does your code produce (e.g., table, figure)?

For now, we mostly look at the heatmap output that is generated and saved into jpeg file.
However, for later analysis, we might need one of the table generated. For that reason, I kept every input generated so far.
We can then decide what is useful and what can be removed in a second step.

What is your summary of the results?

See next PR, the results will be explored to compare the different methods.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Analysis module and review

  • This analysis module uses the analysis template and has the expected directory structure.
  • [ NO] The analysis module README.md has been updated to reflect code changes in this pull request.
    [?] Maybe I'll update the README.md once we chose 1 or 2 methods for CNV inference?
  • The analytical code is documented and contains comments.
  • Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@maud-p maud-p requested a review from jaclyn-taroni as a code owner October 8, 2024 08:26
@maud-p
Copy link
Contributor Author

maud-p commented Oct 8, 2024

Hi,
Not sure why the docker file cannot be build now... So far as I know I just added the infercnv and copykat libraries...
Maybe you already encounter that issue? :)
Thank you!

@sjspielman sjspielman requested review from sjspielman and removed request for jaclyn-taroni October 8, 2024 11:52
@sjspielman
Copy link
Member

sjspielman commented Oct 8, 2024

Hi @maud-p , I'm looking into the Dockerfile issue, and it is indeed strange that it is failing now when it has previously built fine, since the packages you added do not appear related to the build failure.
For now, I'm going to suggest a (possibly?) temporary solution which should get it building, but we will want to circle back later and nail down what exactly the real problem is more formally (I'm going to open an issue about it that we will handle in the Data Lab). There is a known bug involved Rhtslib installation via renv in Dockerfiles, and previously we have gotten around this, e.g. in the doublet-detection module, by installing this library before restoring renv. Again, this is weird to see this bug potentially affecting your now, since that package is not actually a new dependency for you, but this seems to work in testing on my end.

Can you add this line into your Dockerfile right after the line that installs renv itself (not the renv::restore() line, but the install.packages('renv') line)? This will hopefully at least get this action passing for now.

# Install Rhtslib first to prevent package installation errors
RUN R -e 'BiocManager::install("Rhtslib")'

(by the way, while you are there, you can also update the line in the Dockerfile that installs renv to keep only the -e flag as well, and remove the --no-echo --no-restore --no-save flags which aren't strictly needed)

Also, just as an FYI: I am going to be filing a different issue soon that will be relevant to your module. We have been trying to find ways to get the OpenScPCA test data to work with Azimuth, but it has proven challenging. So, instead, we're going to propose to update your module to entirely use Seurat for the label transfer step; we in the Data Lab will take care of writing that code and updating the module, and this should not interfere with your progress! We don't anticipate this will affect change results much since you're earlier exploration previously Azimuth and Seurat they were very similar. I will tag you in the issue I file, and we can discuss more there if you like :)

@sjspielman
Copy link
Member

Second comment: In terms of the "Run module" workflow that is failing, I am pretty sure that is a system dependency problem where the jags library is needed. To fix this, you will need to modify the file in the root of the repository, in .github/workflows/run_cell-type-wilms-tumor-06.yml. On lines 52, you'll want to add jags to the line so it reads:

 sudo apt-get install -y libcurl4-openssl-dev libhdf5-dev libglpk40 jags

Please let me know if you need help with this; I can also file a PR to your branch to make this change too!

@maud-p
Copy link
Contributor Author

maud-p commented Oct 9, 2024

Can you add this line into your Dockerfile right after the line that installs renv itself (not the renv::restore() line, but the install.packages('renv') line)? This will hopefully at least get this action passing for now.

# Install Rhtslib first to prevent package installation errors
RUN R -e 'BiocManager::install("Rhtslib")'

Wooo I am impressed, it works on my side, let's see how it is on github Action. Thank you so much!!!!

@maud-p maud-p requested a review from jashapiro October 9, 2024 13:22
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Note: we'll get this PR ready to go before moving onto #802, since #802 depends on this code. Once this PR goes in, you can then update #802 to be consistent with the changes made here, and then we'll look at that one.)

It's good to see that Docker is building now without errors! But, the reason I believe this is still not running in CI is because the data for input to copykat/infercnv is not generated in CI. This code therefore needs to also be inside an if statement in 00_run_workflow.R where it's only run if we are not testing. In addition, the notebooks processing that output don't exist in this PR, so that will cause an error too.
Since we'd also like to keep 00_run_workflow.R as modular as possible, here's what I recommend:

  • Create a new R script that contains all the new code you added to 00_run_workflow.R , and call that script from 00_run_workflow.R (inside an if statement since we don't want it to run in CI)
  • But, do not include the code that runs the notebooks. You can add it back in as part of the next PR when those notebooks actually get added to the repo.
  • Note that I also included a comment to create a separate script to download the gene order file, and the step to run that script should be here too.

For the first round of review, I left some inline comments for the inferCNV script, and here is some feedback for the copyKat script. Some of the concepts here apply to the inferCNV code as well, so please read the comments here first and then the inline comments, and let me know if anything needs to be clarified!!

First, this script uses a ton of resources. You have a default of 16 cores and you are calling it with 32 is using quite a few resources (32 cpus). This needs to be documented somewhere. At this point, we should add a scripts/README.md file that briefly explains what each script in there does as well as what resources is requires. In addition, resource requirements should be noted in the overall module README file.

We also should nail down the file/directory naming more precisely. I suggest we save files as follows:

  • It is fine to have nested directories within results/{sample_id}/05_copykat for the different conditions, but we still want the file names to communicate the content without needing to know what directory it's in. So, please name files something like: 05_copykat_{sample_id}_{ref/noref}_{distance}.rds
  • As part of updating this, please make sure that all code to create directories and file names is in the same place during setup steps (in other words, not later in the script when you actually run copyKAT). You will also want to update the results README to communicate this file scheme.

In addition, rather than relying on copyKAT's automatic saving, we should more explicitly save results of interest to files. We should add code to save the literal copyKAT object created to an RDS file. (That said, please let me know if you think I'm mistaken and we actually explicitly need the files that are automatically written out, vs just saving the full object.) This way, we will always be running code from the same directory, and we will have control over how and where files are saved.

I am hoping it's possible to do something similar for inferCNV too - save all results in results/06_inferCNV, and rather than relying out automatic export, actually save the inferCNV object to a file named 06_infercnv_{sample_id}_reference-{whatever the reference is}.rds. Note that if the reference is NULL, you'll need to use a different string to communicate that since NULL won't write as a string.

analyses/cell-type-wilms-tumor-06/results/README.md Outdated Show resolved Hide resolved
analyses/cell-type-wilms-tumor-06/results/README.md Outdated Show resolved Hide resolved
analyses/cell-type-wilms-tumor-06/scripts/05_copyKAT.R Outdated Show resolved Hide resolved
analyses/cell-type-wilms-tumor-06/scripts/06_infercnv.R Outdated Show resolved Hide resolved
analyses/cell-type-wilms-tumor-06/scripts/06_infercnv.R Outdated Show resolved Hide resolved
analyses/cell-type-wilms-tumor-06/scripts/06_infercnv.R Outdated Show resolved Hide resolved
analyses/cell-type-wilms-tumor-06/scripts/06_infercnv.R Outdated Show resolved Hide resolved
analyses/cell-type-wilms-tumor-06/scripts/06_infercnv.R Outdated Show resolved Hide resolved
analyses/cell-type-wilms-tumor-06/scripts/06_infercnv.R Outdated Show resolved Hide resolved
analyses/cell-type-wilms-tumor-06/scripts/06_infercnv.R Outdated Show resolved Hide resolved
@maud-p
Copy link
Contributor Author

maud-p commented Oct 10, 2024

Hi @sjspielman , thank you for the review!

I will commit soon the changes, mostly the scripts structure and file saving + README.md for the scripts and results folder.
Let me know if I forget something or misunderstood something!

I checked that the script is running on few sample/condition, I will load these few and incomplete result on the s3 bucket so you can start assessing the code and output if you like. The copykat is extremly slow and I hope to have the entire results by tomorrow then (for the 5 selected samples).

Thank you!

@maud-p
Copy link
Contributor Author

maud-p commented Oct 10, 2024

Also, for the output of copykat, I saved two additional .txt files with the prediction and CNA value per cells, as they are easy table to work with and the notebook_template uses them, I hope this is OK :)

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had a quick peek at some of the changes you've made so far, looks on the right track!! Left a couple very small comments in the meantime :)

Comment on lines +85 to +88
# change working directory of the script to the scratch directory
# this ensures copykat files get saved to the right location
# there is no option to specify an output directory when running copykat
setwd(scratch_dir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With full file paths defined, we shouldn't need to change directories here.

Copy link
Contributor Author

@maud-p maud-p Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure, because there is no option to specify an output directory when running copykat, and it will save automatically quite some files.
Or did I missed something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are always going to copy files that we're saving, it shouldn't matter where we run the code from. For any files that copyKAT automatically saved that we don't need to keep, we can just delete those files. Are there any additional files copyKAT is saving that we are not copying into results?

Comment on lines 69 to 70
output_jpeg_ref <- file.path(result_dir, "05_copyKAT", "ref", opts$distance, glue::glue("05_copykat_",opts$sample_id,"_ref_distance-", opts$distance, "_copykat_heatmap.jpeg"))
output_jpeg_noref <- file.path(result_dir, "05_copyKAT", "noref", opts$distance, glue::glue("05_copykat_",opts$sample_id,"_noref_distance-", opts$distance, "_copykat_heatmap.jpeg"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to save these jpeg files? One reason I ask if that our pre-commit hooks currently are only allowing PNG images. Either way, unless the results are getting specifically used, you do not have to save each individual file that copyKAT outputs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to save the heatmaps to have then a quick look at the outputs with the different parameters (choice of normal and distance), as done for the Ewing sarcoma analysis (lines 116 and 120)

What about converting to png and then save the png?

Comment on lines 7 to 8
# We build the gene position file reference for infercnv ------------------------
system(command = glue::glue("Rscript ", file.path(module_base, "scripts", "06a_build-geneposition.R")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's instead call this directly from the 00_run_workflow.R script instead. This way, there are no "hidden" steps being called within scripts.

@maud-p
Copy link
Contributor Author

maud-p commented Oct 11, 2024

Hi @sjspielman thank you for the review!

I would like to share with you some thoughs, as I am not sure what to do in which order.

Now that the code is better organized, it is easy to run/change different parameter, change the saved output etc BUT it takes forever to run. What do you think if

  • for this PR we focus on only 1 sample, to make sure the code run smoothly?
  • we then go ahead with the second PR and check that with the saved output we can get the information we need to compare the different approach (copykat versus infercnv, parameters (x,y,z) versus (x',y',z'), etc.)
  • I think that the chance is quite high that we need another PR to change the cnv workflow

Maybe one example to illustrate. I was trying to check what to do with infercnv outputs and found that we can extract and plot as FeaturesPlot the most prevalent large CNV after combining with the Seurat object (https://github.com/broadinstitute/inferCNV/wiki/Extracting-features). But for that we need to run infercnv with HMM = TRUE.

I also read that using a subcluster analysis mode and HMM infercnv::run(analysis_mode='subclusters') could halp use identifying tumor cluster with common alterations.

--> I definitely would love to explore these infercnv functions and how we could use it for our question tumor/normal?

Also, I am really not convinced by copykat in Wilms tumor. Maybe you could have a look at the few notebooks in PR#802 like the 06_cnv_exploration_{sample_id}.html. In these notebooks, I simply ploted the CNV heatmaps for copykat and infercnv. My impressions from these are that:

  • copykat heatmap are very pale and it is difficult to detect CNV
  • I am not always sure how copykat decide what is diploid and aneuploid. Sometime diploid cells showed more alteration than aneuploid ones...
  • copykat is really sensitive to the distance and normal parameters.

I am also not sure what result to upload to s3 bucket as I now ran different additional parameters (like HMM) that generated different outputs.

Let me know what do you think :)

Thank you!

@maud-p
Copy link
Contributor Author

maud-p commented Oct 14, 2024

Hi @sjspielman ,
I wanted to let you know that I am uploading the results to the s3 bucket 🚀

  • copykat:

For each of the 5 selected samples and each condition (reference and distance), we saved in results/{sample_id}/05_copykat/{reference}/{distance}/:

  • the final copykat rds object in 05_copykat_{sample_id}_{reference}_distance-{selection}.rds

  • the heatmap of CNV in 05_copykat_{sample_id}_{reference}_distance-{selection}_copykat_heatmap.png

  • the prediction (aneuploid or diploid value per cell) in 05_copykat_{sample_id}_{reference}_distance-{selection}_copykat_prediction.txt

  • the CNA matrix 05_copykat_{sample_id}_{reference}_distance-{selection}_copykat_CNA_results.txt

  • infercnv:
    For each of the 5 selected samples and each condition (reference), we saved in results/{sample_id}/06_infercnv/reference-{selection}:

  • the final infercnvrds object in 06_infercnv_{sample_id}_reference-{selection}.rds

  • the heatmap of CNV in 06_infercnv_{sample_id}_reference-{selection}_heatmap.png

  • infercnv with HMM=TRUE:
    In addition, fot the condition reference = "both", I ran infercnv with HMM = TRUE.
    HMM CNV prediction methods will allow us to explore the CNV results better, with an easy merge of infercnv result with the Seurat object.
    I am pretty sure we will want to used this output for the next step, and prepared to save it in results/{sample_id}/06_infercnv_SCPCS000208_reference-both.rds directly.

With the next PR, we will chose 1 r 2 methods that are the best to explore the data. I think that the CNV heatmap of infercnv will be enough to decide on the "normal" reference to use. Then I can re-run with this reference infercnv with HMM=TRUE.

Major point is that this infecnv HMM sounds super cool, but requires lot of resource and time! It makes my RSeesion crash quite often (every 2 samples) and needs about 2h of run per sample. To be considered when we start running it for the entire cohorte 😆

I hope I am +/- clear, please do not hesitate to reach out if something isn't that clear 😄
Thank you!

@maud-p maud-p requested a review from sjspielman October 14, 2024 10:32
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @maud-p, sorry for a bit of a pause here, I was out of the office last Friday. I think the code is about there, but just a few more comments when I think about next steps which I am now realizing; sorry for not thinking of this sooner! When we eventually run copykat and/or infercnv on more samples, we'll be calling the 05 & 06 scripts directly from the script that runs the workflow. The copykat script currently assumes that it should be run under two conditions (reference and no reference), but this will not necessarily match how we will use the script next. So, let's simplify it a bit: please include an option with optparse in that script that specifies whether or not it should use the normal reference (this is similar to what you do with infercnv). This means your script will define only 1 set of output file names (which can be named depending on if using the reference is true or false), and it will only need to run copykat one time which you can do a little more simply using the ifelse function directly in the code, for example...

copykat.result<- copykat(rawmat=exp.rawdata, 
                         sam.name=opts$sample_id, 
                         distance=opts$distance, 
                         # Use ifelse to only provide normal cell names if reference is true
                         norm.cell.names=ifelse(opt$use_reference, normal_cell, ""), 
                         genome="hg20",
                         n.cores= opts$n_core, 
                         id.type = "E",
                         plot.genes = FALSE,
                         output.seg = FALSE,
                         KS.cut = 0.05)

To define file names, you can have only one if statement and then populate everything accordingly, for example:

if (opt$use_reference) {
  ref_string <- "ref"
} else {
  ref_string <- "noref"
}

Then, use the ref_string variable when defining file names.

For now, do not worry about re-running the code after making these changes, since it will take a long time. We can take care of assuring the code works as expected on our side when we go to figure the test data situation with Azimuth (still forthcoming, stay tuned!).

EDIT: note that later in discussion here we decided to leave this code as is, since we almost certainly will not be using copyKAT further.

I am now going to start looking carefully at your results and will come back and comment on that soon. First, let me just say - I don't think it's going to be worth using the HMM with the very long run times you mention, and it may start to get expensive. What I recommend doing is noting in comments in the script that the HMM is also an option, but has a very long run time so it was not chosen. This way in the future, someone can come back to try it later - the module isn't going anywhere, and it can always be further explored in the future! But for now, I don't want that run time to hold you up further.

Also, fyi - I tagged you in a separate discussion post (#808) which another OpenScPCA contributor who is working on a different set of Wilm's samples opened to discuss using CNV methods. There may be some useful insights for you there, as well as insights you might like to share based on their exploration. If you'd like to join in that discussion, please feel free!

Comment on lines +85 to +88
# change working directory of the script to the scratch directory
# this ensures copykat files get saved to the right location
# there is no option to specify an output directory when running copykat
setwd(scratch_dir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are always going to copy files that we're saving, it shouldn't matter where we run the code from. For any files that copyKAT automatically saved that we don't need to keep, we can just delete those files. Are there any additional files copyKAT is saving that we are not copying into results?

Comment on lines 124 to 125
fs::file_copy(scratch_prediction, output_prediction_noref, overwrite = TRUE)
fs::file_copy(scratch_CNA, output_CNA_noref, overwrite = TRUE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than file_copy, we probably want file_move - this way there will not still be an original copy of the file left behind.

Suggested change
fs::file_copy(scratch_prediction, output_prediction_noref, overwrite = TRUE)
fs::file_copy(scratch_CNA, output_CNA_noref, overwrite = TRUE)
fs::file_move(scratch_prediction, output_prediction_noref, overwrite = TRUE)
fs::file_move(scratch_CNA, output_CNA_noref, overwrite = TRUE)

Comment on lines 148 to 149
fs::file_copy(scratch_prediction, output_prediction_ref, overwrite = TRUE)
fs::file_copy(scratch_CNA, output_CNA_ref, overwrite = TRUE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fs::file_copy(scratch_prediction, output_prediction_ref, overwrite = TRUE)
fs::file_copy(scratch_CNA, output_CNA_ref, overwrite = TRUE)
fs::file_move(scratch_prediction, output_prediction_ref, overwrite = TRUE)
fs::file_move(scratch_CNA, output_CNA_ref, overwrite = TRUE)


# We only run the CNV HMM prediction model if the reference is "both"
HMM_logical = FALSE
if(opts$reference == "both){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if(opts$reference == "both){
if(opts$reference == "both"){

# We only run the CNV HMM prediction model if the reference is "both"
HMM_logical = FALSE
if(opts$reference == "both){
HMM_logical = TRUE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
HMM_logical = TRUE
HMM_logical <- TRUE

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's actually rename this to explore-cnv-methods.R, since if/when we run cnv methods on other samples, we'll probably run the other scripts directly since we'll be looping over samples from the module run script.

@sjspielman
Copy link
Member

@maud-p - So, I just left you a review about making the copyKAT script more flexible, and since then I've started to look more in-depth into the actual copyKAT results. I am seeing radically different results among conditions, so I am very hesitant to trust any of these results. The heatmaps do not appear to show a strong visual signal, so I went ahead and made some quick comparisons across inferences. Below, I have pasted some contingency tables comparing predictions between conditions.

Here tables comparing results from SPEARMAN distance between reference and no reference, where each table is a sample:

              aneuploid diploid not.defined
  aneuploid         256    1549           0
  diploid          1659    2199           0
  not.defined         0       0        4994
                                                                                                                   
              aneuploid diploid not.defined
  aneuploid         310     819           0
  diploid          2868    1641           0
  not.defined         0       0         720
                                                                                                                   
              aneuploid diploid not.defined
  aneuploid          21     398           0
  diploid          1210    1845           0
  not.defined         0       0          60
                                                                                                                   
              aneuploid diploid not.defined
  aneuploid        2539     503           0
  diploid          3890    2678           0
  not.defined         0       0        1202
                                                                                                                   
              aneuploid diploid not.defined
  aneuploid         837    7298           0
  diploid          1862    1981           0
  not.defined         0       0         532

Here's how predictions look between spearman and euclidean WITH NO reference:

              aneuploid diploid not.defined
  aneuploid        1711      94           0
  diploid          2482    1376           0
  not.defined         0       0        4994
                                                                                                                   
              aneuploid diploid not.defined
  aneuploid         146     983           0
  diploid          2348    2161           0
  not.defined         0       0         720
                                                                                                                   
              aneuploid diploid not.defined
  aneuploid          11     408           0
  diploid           833    2222           0
  not.defined         0       0          60
                                                                                                                   
              aneuploid diploid not.defined
  aneuploid        2539     503           0
  diploid          3890    2678           0
  not.defined         0       0        1202
                                                                                                                   
              aneuploid diploid not.defined
  aneuploid        3549    4586           0
  diploid          3453     390           0
  not.defined         0       0         532

Finally, predictions between spearman and euclidean WITH reference:

              aneuploid diploid not.defined
  aneuploid        1001     914           0
  diploid           863    2885           0
  not.defined         0       0        4994
                                                                                                                   
              aneuploid diploid not.defined
  aneuploid        1058    2120           0
  diploid           109    2351           0
  not.defined         0       0         720
                                                                                                                   
              aneuploid diploid not.defined
  aneuploid         743     488           0
  diploid            17    2226           0
  not.defined         0       0          60
                                                                                                                   
              aneuploid diploid not.defined
  aneuploid        1093    5336           0
  diploid          1768    1413           0
  not.defined         0       0        1202
                                                                                                                   
              aneuploid diploid not.defined
  aneuploid        2329     370           0
  diploid          3921    5358           0
  not.defined         0       0         532

While it is clear that all parameter settings here agree on which cells not to classify, there is serious disagreement over which cells are diploid vs. aneuploid. Given that there is such little consistency across conditions, I do not think we should proceed with copyKAT here. Therefore, it's probably not worth the effort to modify that script much more!! I'm going to check in with some of my other colleagues too to see if they have additional opinions, but please let me know also what you think!

@maud-p
Copy link
Contributor Author

maud-p commented Oct 14, 2024

Dear @sjspielman,

Thank you very much for looking into it! I didn't wanted to influence and overconclude too rapidly but I definitly think that we shouldn't spend too much time on copykat. So I agree with you! 😃

Looking at the CNV heatmaps, there is no obvious CNV and I don't see how copykat assign aneuploid or diploid status most of the time... Because I had to re-run and change the .jpeg into .png, I realized that the results, with same conditions, differ from one call to the other... 😠

We also ran internally copykat on few VISIUM data I generated and for which we have the pathologist annotations "normal", "cancer blastema" and "cancer stroma". copykat failed identifying the normal/cancer.

I am much more happier with the infercnv results, with some large CNV like 1p, 1q, chr11 that pop-up, I am more confident with this 😃 However, results are not 100% the same depending on the normal cells selected (immune, endothelium or both). I guess this is due to specific portion of chromosomes that are specific to some cell types including endothelium and some immune cells.

I am not sure if we can really avoid working with HMM, I don't know if we can extract infercnv features if we didn't run an HMM model. 😕 Maybe I will try to play with the results generated for this PR and check what would be really useful?

Would it be an alternative option to downsample the Seurat obejct per cluster, assuming that cells in a cluster might have the same CNV profile, and run infercnv grouping per clusters?

One though: looking at infercnv heatmap, it seems that all fetal nephron predicted cells and mos of the stroma predicted cells are indeed cancer cells. Normal kidney cells might actually be under-represented in Wilms tumor.

As you suggested, I let the copykat a bit on the side, reports generated for the next PR show that we can't conclude much fro it. We can keep them as an info that copykat might not be the best for Wilms tumor. I will work more on how to explore infercnv results, hopefully tomorrow 😄

Thank you!

@sjspielman
Copy link
Member

Would it be an alternative option to downsample the Seurat obejct per cluster, assuming that cells in a cluster might have the same CNV profile, and run infercnv grouping per clusters?

I think this could be an interesting alternative approach, but in order for it to be feasible and reliable, there would be quite a few more things we'd need to do. First, if clusters will truly form the basis of the results themselves, we'd want to revisit clustering in the first place to ensure they are "optimal." The quality of clusters we have was not rigorously explored, including a systematic varying of clustering parameters. (This is actually something we think about in the Data Lab a lot, since it's really hard to choose the "best" way to build clusters. In fact, I'm actually in the process of writing an analysis module that demonstrates how to use functions in package we wrote for OpenScPCA-analysis to thoroughly explore cluster quality to identify a more reliable clustering result.) Next, the success of that analysis would also depend on infercnv's power to detect signal from smaller datasets, as well as how much dataset size influences results, so we'd possibly want to do some of bootstrapping with infercnv to get reliable cluster-wide results. Plus, how much we can generalize those results will also depend on how "tight" clusters are - some clusters will be more heterogeneous than others, and those annotations would be much less reliable compared to very compact clusters where cells are more similar. All this is to say - it's definitely a very interesting idea to explore, but I think it's actually a much bigger analysis! It might be something to pursue later in general, but I don't think right now is the time for it.

I did have a look at the inferCNV heatmaps, and I agree they do look a bit better than copyKAT, but I also I'm not sure how this will work out across the whole cohort since not every sample will have a set of normal cells to use as reference. Further, as I mentioned before in the context of copyKAT, having agreement among different conditions is a good way to evaluate how reliable a method is, and if we get really different results for the same sample run with and without a reference, that makes me a bit worried. I have not personally run inferCNV unfortunately, so I don't have many specific suggestions there, but I do see that their wiki says their 3-state model is similar to a different method HoneyBADGER - I wonder if that method could give us similar information about whether cells have certain CNVs associated with WT (I assume this is what you want to explore with the HMM results, or am I wrong?). Or, it may be that a marker-genes based approach might be better for WT than CNV methods, or it could be complementary at least to the CNV methods.

Either way, let's go ahead and get this code in and move onto the next notebooks where we can explore more carefully to "officially" rule out copyKAT, and make decisions about inferCNV. For that, please go ahead and accept the suggestions I left, and also rename that script as I commented; being called "run_cnv.R" implies that it's the main script that will run CNV across samples, but it's really just meant to explore the methods to get a sense of what we should run across samples. Then, I'll approve this PR and get it merged, so that the other branch with your notebooks can be updated. Once that PR is updated so its tests pass, we'll start that review!

@maud-p
Copy link
Contributor Author

maud-p commented Oct 15, 2024

Either way, let's go ahead and get this code in and move onto the next notebooks where we can explore more carefully to "officially" rule out copyKAT, and make decisions about inferCNV. For that, please go ahead and accept the suggestions I left, and also rename that script as I commented; being called "run_cnv.R" implies that it's the main script that will run CNV across samples, but it's really just meant to explore the methods to get a sense of what we should run across samples. Then, I'll approve this PR and get it merged, so that the other branch with your notebooks can be updated. Once that PR is updated so its tests pass, we'll start that review!

I should have done the requested changes, hopefully I haven't mess up things (I somehow loosed track of my branches 😖 )

Once this is merged, I'd like to (in separated PR)

  • adapt the notebook template for copykat
  • adapt the notebook template for comparison of the infercnv methods (using maybe additional cool plots I saw in @JingxuanChen7 PR#808 and exploring at the same time the predicted label from the fetal kidney reference.

I think that the predicted cell types from the kidney atlas will be the most informative.

Just thinking, when comparing fetal with a mature kidney. I am not 100% sure but I would say a kidney from a borned child should be closer to mature, while Wilms tumor is known to be more resembling to fetal kidney. When we look at the composition of mature versus fetal kidney:
image
Cap Mesenchyme (CM, named for some reason mesenchymal cell in the fetal_kidney_reference, predicted.cell.type) is only present in fetal kidney. I checked, on completion of nephrogenesis, the cap mesenchyme is exhausted. Meaning that cells annotated as cap mesenchyme = "mesenchymal cell" should be cancer cells.

If we now look at more differentiated and functional units of kidney, they are present in different proportion in embryonic and mature kidney. Cells annotated as "kidney cells" or "podocytes" can be either cancer epithelial cells or normal kidney cells.

  • For sample 179, clusters 1, 5 and 6 would be either epithelial cancer or normal cells.
    image
  • For sample 184, cluster 9 could be a small cluster of normal cells.
    image
  • For sample 194, clusters 1, 4, 5 and 11 would be either epithelial cancer or normal cells. Looking at infercnv profile, all cells seems however to have 11q gain.
    image
    image

I'll stop here the examples but I have the impression that combining info that we already have might be helpful 😄 I'll try to generate a notebook template to explore this better.

@maud-p
Copy link
Contributor Author

maud-p commented Oct 15, 2024

I did have a look at the inferCNV heatmaps, and I agree they do look a bit better than copyKAT, but I also I'm not sure how this will work out across the whole cohort since not every sample will have a set of normal cells to use as reference. Further, as I mentioned before in the context of copyKAT, having agreement among different conditions is a good way to evaluate how reliable a method is, and if we get really different results for the same sample run with and without a reference, that makes me a bit worried. I have not personally run inferCNV unfortunately, so I don't have many specific suggestions there, but I do see that their wiki says their 3-state model is similar to a different method HoneyBADGER - I wonder if that method could give us similar information about whether cells have certain CNVs associated with WT (I assume this is what you want to explore with the HMM results, or am I wrong?). Or, it may be that a marker-genes based approach might be better for WT than CNV methods, or it could be complementary at least to the CNV methods.

Thank you for sharing, I don't know about it. I had a quick look and I am note sure if we wouldn't need the bam files to run HoneyBADGER. But maybe one idea to test (if we need it in a 2nd time) would be to run HMM i3 instead of i6 and see if it could be a bit more rapid?

@maud-p
Copy link
Contributor Author

maud-p commented Oct 15, 2024

Would it be an alternative option to downsample the Seurat obejct per cluster, assuming that cells in a cluster might have the same CNV profile, and run infercnv grouping per clusters?

I think this could be an interesting alternative approach, but in order for it to be feasible and reliable, there would be quite a few more things we'd need to do. First, if clusters will truly form the basis of the results themselves, we'd want to revisit clustering in the first place to ensure they are "optimal." The quality of clusters we have was not rigorously explored, including a systematic varying of clustering parameters. (This is actually something we think about in the Data Lab a lot, since it's really hard to choose the "best" way to build clusters. In fact, I'm actually in the process of writing an analysis module that demonstrates how to use functions in package we wrote for OpenScPCA-analysis to thoroughly explore cluster quality to identify a more reliable clustering result.) Next, the success of that analysis would also depend on infercnv's power to detect signal from smaller datasets, as well as how much dataset size influences results, so we'd possibly want to do some of bootstrapping with infercnv to get reliable cluster-wide results. Plus, how much we can generalize those results will also depend on how "tight" clusters are - some clusters will be more heterogeneous than others, and those annotations would be much less reliable compared to very compact clusters where cells are more similar. All this is to say - it's definitely a very interesting idea to explore, but I think it's actually a much bigger analysis! It might be something to pursue later in general, but I don't think right now is the time for it.

Good to know, I'll keep it in mind then, we can re-talk about it later if needed! Thank you for sharing your thoughts on it 😃

@maud-p maud-p requested a review from sjspielman October 15, 2024 13:16
@sjspielman
Copy link
Member

But maybe one idea to test (if we need it in a 2nd time) would be to run HMM i3 instead of i6 and see if it could be a bit more rapid?

Yeah, i3 would be worth trying both for speed reasons, and for "data-related" reasons. Since we expect some, but not a huge amount, of CNV in this data, I worry that an i6 model might be too parameter-rich for the level of signal in the data. The i3 model might be a better fit for the amount of information our data has. But, we'll think about that in the next PR :)

@sjspielman sjspielman merged commit ea91b17 into AlexsLemonade:main Oct 15, 2024
5 checks passed
@maud-p maud-p deleted the 05_CopyKAT branch October 15, 2024 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants