Adding Level 3-5 Cell Painting Data Questions #3

gwaybio · 2020-02-19T19:47:01Z

I am in the process of adding level 3-5 profiles to this repo (using git lfs). I will use this issue to document various questions I have about the process.

1. Confirm what the levels actually are! 😆
   - Described here: Note about citing clue.io #1 (comment)
2. I assume I should add cytominer profiles here? We should consider the pycytominer-based profiles less mature (and therefore less stable)? The cytominer profiles are the ones that were originally computed.
   - Located here: /home/ubuntu/bucket/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/2016_04_01_a549_48hr_batch1
   - Were they processed using the standard profiling workflow?
     - https://github.com/cytomining/profiling-handbook

shntnu · 2020-02-19T21:14:44Z

I've addressed 1. via your comment
Is it blocking if I get to 2. mid next next week? I'd need to dig up a few things

gwaybio · 2020-02-19T21:31:14Z

> Is it blocking if I get to 2. mid next next week? I'd need to dig up a few things

Not currently blocking 👍

shntnu · 2020-02-27T14:17:33Z

This was the workflow used (repo is private), which predates the profiling handbook.

My suggestion is to use the Level 3 data in /home/ubuntu/bucket/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/2016_04_01_a549_48hr_batch1 and then recreate Levels 4 and 5 using the handbook. For 10 plates, you would also need to recreate the Level 3 data, because they were processed incorrectly (see below).

If you have the time, I would suggest comparing the level 3, 4, 5 data to that generated using pycytominer (assuming you already have this). If the differences are within expected floating point errors, then use the pycytominer version instead (given that, at least at the moment, cytominer is not on CRAN and therefore not easily reproducible).

A hack for figuring out which statistic was used to summarize single cell profiles:

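library(tidyverse) # purrr::map_df, readr::read_csv, dplyr verbs
library(magrittr)  # provides the %<>% assignment pipe

# Read one feature plus the plate ID from every augmented profile file
x <-
  list.files(
    "../../backend/2016_04_01_a549_48hr_batch1/",
    recursive = TRUE,
    full.names = TRUE,
    pattern = "augmented\\.csv$" # list.files() expects a regex, not a glob
  ) %>%
  map_df(function(fname) {
    read_csv(
      fname,
      col_types =
        cols_only(
          Cells_AreaShape_Area = col_double(),
          Metadata_Plate = col_character()
        )
    )
  })
# Cell area is an integer pixel count, so a per-well median is always a
# multiple of 0.5, while a per-well mean almost never is
x %<>%
  mutate(is_median = ceiling(Cells_AreaShape_Area * 2) == Cells_AreaShape_Area * 2)
# Count the wells per plate that pass the check
x %<>%
  group_by(Metadata_Plate) %>% summarise(is_median = sum(is_median), n = n())
# Keep only the plates where at least one well fails the check
x %<>%
  filter(is_median != n)
x %>%
  knitr::kable()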

The level 3 data for these plates were created using means, not medians, and should be reprocessed using medians. In the table, is_median is the number of wells (out of n) whose value passes the median check.

Metadata_Plate	is_median	n
SQ00015116	0	384
SQ00015117	0	384
SQ00015118	1	384
SQ00015119	1	384
SQ00015120	1	384
SQ00015121	1	384
SQ00015122	0	384
SQ00015123	1	384
SQ00015125	1	384
SQ00015126	2	384
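The same heuristic can be sketched in Python with pandas. This is a minimal illustration, not the code that produced the table above: the function name `count_median_like` and the toy data are hypothetical. It relies on the fact that a cell's area is an integer pixel count, so a median over cells is a whole or half number, whereas a mean rarely is.

```python
import pandas as pd


def count_median_like(
    profiles: pd.DataFrame,
    feature: str = "Cells_AreaShape_Area",
    plate_col: str = "Metadata_Plate",
) -> pd.DataFrame:
    """Return plates where some wells do not look median-aggregated.

    Hypothetical helper: a value is 'median-like' when doubling it
    gives a whole number.
    """
    doubled = profiles[feature] * 2
    is_median = doubled == doubled.round()
    per_plate = (
        profiles.assign(is_median=is_median)
        .groupby(plate_col)
        .agg(is_median=("is_median", "sum"), n=("is_median", "size"))
        .reset_index()
    )
    # Plates where every well passes are consistent with medians; drop them
    return per_plate[per_plate["is_median"] != per_plate["n"]]


# Toy data (made up): P1 looks median-aggregated, P2 does not
toy = pd.DataFrame(
    {
        "Metadata_Plate": ["P1", "P1", "P2", "P2"],
        "Cells_AreaShape_Area": [1500.0, 1432.5, 1498.37, 1502.0],
    }
)
flagged = count_median_like(toy)
print(flagged)
```

A mean can still land on a multiple of 0.5 by chance, which would explain small non-zero counts for plates that were aggregated with means.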

gwaybio · 2020-03-06T16:58:37Z

I am working through confirming pycytominer and cytominer equivalency. I added cytomining/pycytominer#72 to mirror the cytominer "robust" function.

I tested this using one example plate (SQ00014814) and level 3 and level 4a data that were derived from pycytominer in broadinstitute/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad#3 (private repo) and cytominer, presumably from this pipeline.

I am noting here an observation and potential discrepancy in the cytominer processing details. Specifically, I noticed that the cytominer processing (link above) notes that the plate was normalized against DMSO. However, when I compare the cytominer results to the pycytominer results, the cytominer results are closer to the pycytominer results when normalizing against the whole plate. Furthermore, the cytominer level 4a results are more similar to the cytominer level 3 results normalized with pycytominer using all samples than using only DMSO samples.

from pycytominer.cyto_utils import infer_cp_features
from pycytominer import normalize

# `data` is a nested dict of profile DataFrames loaded earlier (not shown)
cytominer_df = data["level_three"]["cytominer"]
pycytominer_df = data["level_three"]["pycytominer"]
pycytominer_df.Metadata_broad_sample = pycytominer_df.Metadata_broad_sample.fillna("DMSO")

cp_cols = infer_cp_features(pycytominer_df)

# Process pycytominer level 3 data with two different normalization strategies
pycytominer_norm_all_df = normalize(
    profiles=pycytominer_df,
    features="infer",
    method="mad_robustize",
    samples="all",
    output_file="none",
).loc[:, cp_cols]

pycytominer_norm_dmso_df = normalize(
    profiles=pycytominer_df,
    features="infer",
    method="mad_robustize",
    samples="Metadata_broad_sample == 'DMSO'",
    output_file="none",
).loc[:, cp_cols]

# Process cytominer level 3 data with two different normalization strategies
cytominer_norm_all_df = normalize(
    profiles=cytominer_df,
    features="infer",
    method="mad_robustize",
    samples="all",
    output_file="none",
).loc[:, cp_cols]

cytominer_norm_dmso_df = normalize(
    profiles=cytominer_df,
    features="infer",
    method="mad_robustize",
    samples="Metadata_broad_sample == 'DMSO'",
    output_file="none",
).loc[:, cp_cols]

# Also load cytominer level four data
cytominer_level_four_df = data["level_four"]["cytominer"]

# Screenshot of results below
((cytominer_level_four_df - pycytominer_norm_all_df).sum().abs() > 1e-10).sum()
((cytominer_level_four_df - pycytominer_norm_dmso_df).sum().abs() > 1e-10).sum()
((cytominer_level_four_df - cytominer_norm_all_df).sum().abs() > 1e-10).sum()
((cytominer_level_four_df - cytominer_norm_dmso_df).sum().abs() > 1e-10).sum()
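One caveat when reading counts like these: summing signed differences before taking the absolute value lets positive and negative differences within a feature cancel. The sketch below, on made-up data, contrasts that with an elementwise check. The helper name `count_differing_features` is hypothetical, and both frames are assumed to share the same index and columns.

```python
import pandas as pd


def count_differing_features(a: pd.DataFrame, b: pd.DataFrame, tol: float = 1e-10) -> int:
    """Count features whose largest absolute elementwise difference exceeds tol."""
    abs_diff = (a - b).abs()
    return int((abs_diff.max() > tol).sum())


# Toy frames (made up): f1 differs in both rows, but the differences cancel
a = pd.DataFrame({"f1": [1.0, 2.0], "f2": [0.5, 0.5]})
b = pd.DataFrame({"f1": [2.0, 1.0], "f2": [0.5, 0.5]})

# Sum of signed differences, then absolute value: f1 sums to zero
signed = int(((a - b).sum().abs() > 1e-10).sum())

# Absolute value first, then reduce: f1 is flagged
elementwise = count_differing_features(a, b)

print(signed, elementwise)
```

If the two approaches agree on real profiles, cancellation is not affecting the comparison.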

Summary

Cytominer normalization seems closer to whole plate normalization than DMSO normalization.

Question

Which is the most appropriate normalization strategy? All pycytominer-derived data is based on a DMSO-normalized strategy. This includes the results with the batch effect observed across DMSO wells (broadinstitute/cell-health#84).

gwaybio · 2020-03-08T17:20:05Z

Note an update to robustize_mad in cytomining/pycytominer#74 that deals with missing values and divide-by-zero errors. It will, however, potentially introduce exploding features (features containing extreme values).
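To illustrate why guarding against a zero MAD can produce extreme values, here is a minimal NumPy sketch. It is not the pycytominer implementation: the function, the `epsilon` parameter, and the 1.4826 scaling constant are assumptions made for illustration.

```python
import numpy as np


def mad_robustize(x, epsilon=0.0):
    """Center by the median and scale by the MAD, column-wise.

    Illustrative only. NaNs are ignored when computing the statistics,
    and `epsilon` (hypothetical) is added to the denominator so that a
    zero MAD does not cause a division by zero.
    """
    x = np.asarray(x, dtype=float)
    median = np.nanmedian(x, axis=0)
    mad = np.nanmedian(np.abs(x - median), axis=0)
    return (x - median) / (mad * 1.4826 + epsilon)


# Toy data (made up): column 0 is almost constant, so its MAD is 0
profiles = np.array(
    [
        [1.0, 1.0],
        [1.0, 2.0],
        [1.0, 3.0],
        [1.0, 4.0],
        [5.0, 5.0],
    ]
)

# Without a guard, column 0 yields NaN (0 / 0) and inf (4 / 0)
with np.errstate(divide="ignore", invalid="ignore"):
    unguarded = mad_robustize(profiles)

# With a small epsilon everything is finite, but the outlier explodes
guarded = mad_robustize(profiles, epsilon=1e-6)

print(unguarded[:, 0])
print(guarded[:, 0])
```

In this sketch the single outlier in the near-constant column ends up in the millions after scaling, which is the kind of feature a downstream feature-selection step would need to catch.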

gwaybio · 2020-04-30T17:42:23Z

closing in favor of #22

shntnu self-assigned this Feb 19, 2020

gwaybio mentioned this issue Mar 4, 2020

Adding Robust Normalization by MAD cytomining/pycytominer#72

Merged

gwaybio mentioned this issue Mar 7, 2020

Reprocess image-based profiles with robustize_mad broadinstitute/cell-health#113

Closed

gwaybio mentioned this issue Mar 8, 2020

Required Steps for Depositing Profiles #4

Closed

shntnu mentioned this issue Apr 29, 2020

Stated aggregation method (mean) is inconsistent with that used in other projects cytomining/profiling-handbook#53

Closed

gwaybio mentioned this issue Apr 30, 2020

Updated Strategy for Adding Profiles #22

Closed

gwaybio closed this as completed Apr 30, 2020

gwaybio mentioned this issue May 10, 2020

Comparing Cytominer and Pycytominer Profiles #28

Closed

gwaybio mentioned this issue Sep 16, 2022

Reprocess 10 plates because 36 features are missing #88

Open
