Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Level 3-5 Cell Painting Data Questions #3

Closed
gwaybio opened this issue Feb 19, 2020 · 6 comments
Closed

Adding Level 3-5 Cell Painting Data Questions #3

gwaybio opened this issue Feb 19, 2020 · 6 comments
Assignees

Comments

@gwaybio
Copy link
Member

gwaybio commented Feb 19, 2020

I am in the process of adding level 3-5 profiles to this repo (using git lfs). I will use this issue to document various questions I have about the process.

  1. Confirm what the levels actually are! 😆
  2. I assume I should add cytominer profiles here? We should consider the pycytominer-based profiles less mature (and therefore less stable)? The cytominer profiles are the ones that were originally computed.
@shntnu shntnu self-assigned this Feb 19, 2020
@shntnu
Copy link
Collaborator

shntnu commented Feb 19, 2020

  • I've addressed 1. via your comment
  • Is it blocking if I get to 2. mid next next week? I'd need to dig up a few things

@gwaybio
Copy link
Member Author

gwaybio commented Feb 19, 2020

Is it blocking if I get to 2. mid next next week? I'd need to dig up a few things

Not currently blocking 👍

@shntnu
Copy link
Collaborator

shntnu commented Feb 27, 2020

This was the workflow used (repo is private), which predates the profiling handbook.

My suggestion is to use the the Level 3 data in /home/ubuntu/bucket/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/2016_04_01_a549_48hr_batch1 and then recreate Level 4, 5 using the handbook. For 10 plates, you would also need to create the Level 3 data, because they were processed incorrectly (see below).

If you have the time, I would suggested comparing the level 3, 4, 5 data to that generated using pycytominer (assuming you already have this), and if the differences are within expected floating point errors, then using the pycytominer version instead (given that at least at the moment, cytominer is not on CRAN and therefore not easily reproducible)

A hack for figuring out which statistic was used to summarize single cell profiles:

x <-
  list.files(
    "../../backend/2016_04_01_a549_48hr_batch1/",
    recursive = TRUE,
    full.names = TRUE,
    pattern = "*augmented.csv"
  ) %>%
  map_df(function(fname) {
    read_csv(
      fname,
      col_types =
        cols_only(
          Cells_AreaShape_Area = col_double(),
          Metadata_Plate = col_character()
        )
    )
  })
x %<>%
  group_by(Metadata_Plate) %>%
  summarise(is_median = sum(is_median), n = n())
x %<>%
  mutate(is_median = ceiling(Cells_AreaShape_Area * 2) == Cells_AreaShape_Area * 2)
x %<>%
  group_by(Metadata_Plate) %>% summarise(is_median = sum(is_median), n = n())
x %<>%
  filter(is_median != n)
x %>%
  knitr::kable()

The level 3 data for these plates were created using means, not medians and should be reprocessed to using medians.

Metadata_Plate is_median n
SQ00015116 0 384
SQ00015117 0 384
SQ00015118 1 384
SQ00015119 1 384
SQ00015120 1 384
SQ00015121 1 384
SQ00015122 0 384
SQ00015123 1 384
SQ00015125 1 384
SQ00015126 2 384

@gwaybio
Copy link
Member Author

gwaybio commented Mar 6, 2020

I am working through confirming pycytominer and cytominer equivalency. I added cytomining/pycytominer#72 to mirror the cytominer "robust" function.

I tested this using one example plate (SQ00014814) and level 3 and level 4a data that were derived from pycytominer in broadinstitute/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad#3 (private repo) and cytominer, presumably from this pipeline.

I am noting here an observation and potential discrepancy in the cytominer processing details. Specifically, I noticed that the cytominer processing (link above) notes that the plate was normalized against DMSO. However, when I compare the cytominer results to the pycytominer results, the cytominer results are closer to pycytominer results when normalizing against the whole plate. Furthermore, the cytominer level 4a results are more similar to cytominer level 3 results processed with pycytominer using all samples vs. using DMSO samples.

from pycytominer.cyto_utils import infer_cp_features
from pycytominer import normalize

cytominer_df = data["level_three"]["cytominer"]
pycytominer_df = data["level_three"]["pycytominer"]
pycytominer_df.Metadata_broad_sample = pycytominer_df.Metadata_broad_sample.fillna("DMSO")

cp_cols = infer_cp_features(pycytominer_df)

# Process pycytominer level 3 data with two different normalization strategies
pycytominer_norm_all_df = normalize(
    profiles=pycytominer_df,
    features="infer",
    method="mad_robustize",
    samples="all",
    output_file="none",
).loc[:, cp_cols]

pycytominer_norm_dmso_df = normalize(
    profiles=pycytominer_df,
    features="infer",
    method="mad_robustize",
    samples="Metadata_broad_sample == 'DMSO'",
    output_file="none",
).loc[:, cp_cols]

# Process cytominer level 3 data with two different normalization strategies
cytominer_norm_all_df = normalize(
    profiles=cytominer_df,
    features="infer",
    method="mad_robustize",
    samples="all",
    output_file="none",
).loc[:, cp_cols]

cytominer_norm_dmso_df = normalize(
    profiles=cytominer_df,
    features="infer",
    method="mad_robustize",
    samples="Metadata_broad_sample == 'DMSO'",
    output_file="none",
).loc[:, cp_cols]

# Also load cytominer level four data
cytominer_level_four_df = data["level_four"]["cytominer"]

# Screenshot of results below
((cytominer_level_four_df - pycytominer_norm_all_df).sum().abs() > 1e-10).sum()
((cytominer_level_four_df - pycytominer_norm_dmso_df).sum().abs() > 1e-10).sum()
((cytominer_level_four_df - cytominer_norm_all_df).sum().abs() > 1e-10).sum()
((cytominer_level_four_df - cytominer_norm_dmso_df).sum().abs() > 1e-10).sum()

image

Summary

Cytominer normalization seems closer to whole plate normalization than DMSO normalization.

Question

Which is the most appropriate normalization strategy? All pycytominer-derived data is based on a DMSO-normalized strategy. This includes the results with batch effect observed across DMSO wells broadinstitute/cell-health#84

@gwaybio
Copy link
Member Author

gwaybio commented Mar 8, 2020

Note an update to robustize_mad in cytomining/pycytominer#74 that deals with missing values and divide by zero errors. It will however, potentially introduce exploding features (features containing extreme values)

@gwaybio
Copy link
Member Author

gwaybio commented Apr 30, 2020

closing in favor of #22

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants