Dropping outlier features #65

shntnu · 2021-04-20T19:40:03Z

MB said:

I have found an “error” in the LINCS dataset and I was wondering if you guys knew about it and whether the pycytominer pipeline needs fixing. I am analyzing the Level 5 consensus data from here. When running the cyto eval functions on this data, I noticed some very high correlations. They come from one feature (Nuclei_AreaShape_MedianRadius) whose values are 10^13 times larger than the others. The image shows a scatter plot of two samples which have a similarity of 1.000 but are different compounds.
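A minimal sketch of the effect described above, using made-up numbers rather than the LINCS profiles: two unrelated random profiles become almost perfectly correlated once they share a single feature with an enormous value. The profile length, the random seed and the assumption that both samples carry the same extreme value are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unrelated synthetic profiles with 500 well-behaved features each
profile_a = rng.normal(size=500)
profile_b = rng.normal(size=500)
corr_before = np.corrcoef(profile_a, profile_b)[0, 1]

# Append one shared feature that is ~10^13 times larger than the rest
profile_a_outlier = np.append(profile_a, 1e13)
profile_b_outlier = np.append(profile_b, 1e13)
corr_after = np.corrcoef(profile_a_outlier, profile_b_outlier)[0, 1]

print(f"Pearson correlation without the outlier feature: {corr_before:.3f}")
print(f"Pearson correlation with the outlier feature:    {corr_after:.3f}")
```

The single extreme feature dominates both the covariance and the variances, so every other feature stops contributing to the similarity.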

This is almost definitely because the MAD (median absolute deviation) of these features is zero in DMSO (at least for the plates that those compounds come from).

https://github.com/cytomining/pycytominer/blob/a04397d9cd7e25828d2f24f986a3386a79e6193d/pycytominer/operations/transform.py#L142
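To illustrate the mechanism, here is a minimal sketch that assumes robust normalization of the form (x - median) / (MAD + epsilon), with the median and MAD estimated from DMSO wells. The function name `robust_mad_scale`, the epsilon value and the example numbers are hypothetical and not copied from pycytominer.

```python
import numpy as np


def robust_mad_scale(values, reference, epsilon=1e-18):
    """Scale `values` by the median and MAD of `reference` (hypothetical helper)."""
    median = np.median(reference)
    mad = np.median(np.abs(reference - median))
    # epsilon only guards against division by zero; it does not keep the result small
    return (values - median) / (mad + epsilon)


# More than half of the DMSO values are identical, so the MAD is exactly zero
dmso = np.array([3.0, 3.0, 3.0, 3.0, 4.0])
treated = np.array([3.0, 4.0, 5.0])

scaled = robust_mad_scale(treated, dmso)
print(scaled)
```

Any treated well that differs from the DMSO median at all ends up with an astronomically large normalized value, which matches the magnitude reported in this issue.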

Add drop_outliers to https://github.com/broadinstitute/lincs-cell-painting/blob/master/profiles/profile_cells.py
Reprocess

gwaybio · 2021-04-20T20:17:36Z

Nice - i don't think it's worth doing before the first data freeze (see #63)

But it is definitely worth noting which features this impacts - @michaelbornholdt do you have this info? Are they only the three features?

Nuclei_AreaShape_MedianRadius
Cells_Correlation_Manders_AGP_RNA and
Cells_Neighbors_NumberOfNeighbors_10

I can add a prominent note to make sure these are dropped in all downstream analyses in a README in #63

michaelbornholdt · 2021-04-20T21:31:40Z

@gwaygenomics
Here are the features that have higher values than 200:

So just to be sure, I will not do anything to the pipeline but just locally delete these features so I can carry on with my analysis. Correct? @shntnu

shntnu · 2021-04-20T23:10:04Z

> will not do anything to the pipeline but just locally delete these features so I can carry on with my analysis. Correct?

yes

shntnu · 2021-04-20T23:10:23Z

> i don't think it's worth doing before the first data freeze

yes

gwaybio · 2021-04-21T20:03:14Z

@shntnu - it turns out that I can very easily update #63 consensus and spherized profiles to add drop_outliers without having to rerun everything.

@michaelbornholdt do you recommend using 200 as a cutoff? I use 60 currently in spherized profiles, but I'd be happy to update to 200 if you have any data-driven rationale

michaelbornholdt · 2021-04-21T21:07:29Z

I can try several drop thresholds and look at precision/recall. Or do you guys have a better idea for deciding the threshold?

gwaybio · 2021-04-21T21:43:59Z

that sounds good to me. What specifically will you try? Altering outlier_num in (np.abs(df[feature]) > outlier_num) ?
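A minimal sketch of what such a threshold sweep could look like, using the `np.abs(df[feature]) > outlier_num` criterion quoted above in vectorized pandas form. The helper `features_exceeding`, the toy dataframe and the `Metadata_` prefix convention for non-feature columns are assumptions for illustration.

```python
import pandas as pd


def features_exceeding(df, features, threshold):
    """Return the features with at least one absolute value above `threshold`."""
    exceeds = (df[features].abs() > threshold).any(axis=0)
    return exceeds[exceeds].index.tolist()


# Toy profiles; the real data has many more rows and features
profiles = pd.DataFrame(
    {
        "Metadata_broad_sample": ["cpd_a", "cpd_b", "cpd_c"],
        "feature_small": [0.5, -1.2, 2.0],
        "feature_medium": [150.0, -3.0, 1.0],
        "feature_huge": [1e13, 0.1, -0.4],
    }
)
feature_cols = [c for c in profiles.columns if not c.startswith("Metadata_")]

# Count how many features each candidate threshold would remove
counts = {
    threshold: len(features_exceeding(profiles, feature_cols, threshold))
    for threshold in (60, 100, 200, 500)
}
print(counts)
```

Counting the flagged features per threshold before dropping anything makes it easy to spot a cutoff that removes far more features than intended.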

michaelbornholdt · 2021-04-21T21:47:11Z

The following is the precision at k = 5 for different threshold values. It looks like anything from 100-500 is a sensible value to use:

| threshold | precision |
| --- | --- |
| 60 | 0.776667 |
| 100 | 0.786667 |
| 200 | 0.78 |
| 500 | 0.783333 |
| 1000 | 0.783333 |
| 10000 | 0.733333 |
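For readers unfamiliar with the metric, here is a minimal sketch of one common way to compute precision at k from a pairwise similarity matrix: for each sample, take the k most similar other samples and count how many share its label. This is an assumption about the general technique, not the evaluation code used to produce the numbers above; the function name and the synthetic data are hypothetical.

```python
import numpy as np


def precision_at_k(similarity, labels, k=5):
    """Mean fraction of each sample's top-k neighbours that share its label."""
    similarity = np.asarray(similarity, dtype=float).copy()
    labels = np.asarray(labels)
    # A sample must not count as its own neighbour
    np.fill_diagonal(similarity, -np.inf)
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = labels[top_k] == labels[:, None]
    return hits.mean()


# Two well-separated synthetic groups of six profiles each
rng = np.random.default_rng(0)
pattern = np.linspace(-1.0, 1.0, 20)
group_a = pattern + rng.normal(scale=0.05, size=(6, 20))
group_b = -pattern + rng.normal(scale=0.05, size=(6, 20))
profiles = np.vstack([group_a, group_b])
labels = np.array(["a"] * 6 + ["b"] * 6)

similarity = np.corrcoef(profiles)  # pairwise Pearson correlation between rows
precision = precision_at_k(similarity, labels, k=5)
print(precision)
```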

michaelbornholdt · 2021-04-21T21:48:23Z

I haven't worked with the outlier functionality. Will need to get my head around that part of the pipeline first.
I just wrote my own function to drop the columns with the high values

gwaybio · 2021-04-21T21:51:33Z

awesome, thanks Michael!

gwaybio · 2021-04-21T21:54:48Z

The pycytominer drop outlier strategy is simple:

https://github.com/cytomining/pycytominer/blob/a04397d9cd7e25828d2f24f986a3386a79e6193d/pycytominer/cyto_utils/features.py#L141-L143

based on your code screenshot in #65 (comment) i think you're doing something very similar, if not exactly the same

michaelbornholdt · 2021-04-21T22:01:27Z

Can you update the files in the consensus then so that people don't run into the same problems?

gwaybio · 2021-04-21T22:05:10Z

yep, that is the plan in #63. I'll use 100 for the threshold.

gwaybio · 2021-04-22T15:21:14Z

alright, I tried 100 (and then bumped it up to 200). I remember now why I didn't originally do this!

Setting the threshold to 200 keeps only 15 features in one of the normalization schemes 😬

How about we use your approach instead (somehow it must be different). Can you create a .txt file with a column header: outlier_features and each of those features in that screenshot as independent rows. I can easily remove them this way.

michaelbornholdt · 2021-04-23T16:03:22Z

This is what I am using.

For a threshold of 100, this drops 32 features. Do you want me to send those to you then?

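import numpy as np

def drop_bad_feats(df_old, features_old, threshold):
    # Collect every feature with at least one absolute value above the threshold
    drop_features = []
    for feat in features_old:
        if (np.abs(df_old[feat]) > threshold).any():
            drop_features.append(feat)
    # Drop the flagged columns from the input dataframe
    df_out = df_old.drop(drop_features, axis="columns")
    print("dropped {} features".format(len(drop_features)))
    return df_out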

gwaybio · 2021-04-23T16:10:45Z

yes, that would be great.

> Can you create a .txt file with a column header: outlier_features and each of those features in that screenshot as independent rows.

There is a pycytominer function to drop custom columns - i'll just need to be careful with documentation.
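A minimal sketch of the file-based approach, using plain pandas rather than the pycytominer function mentioned above. The file layout (a single `outlier_features` header with one feature per row) follows the request in this thread; the file path, the toy dataframe and its values are hypothetical.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    outlier_file = os.path.join(tmp_dir, "outlier_features.txt")

    # Write the list: one header line, then one feature name per row
    pd.DataFrame(
        {
            "outlier_features": [
                "Nuclei_AreaShape_MedianRadius",
                "Cells_Neighbors_NumberOfNeighbors_10",
            ]
        }
    ).to_csv(outlier_file, index=False)

    # Read it back the way a downstream script might
    outlier_features = pd.read_csv(outlier_file)["outlier_features"].tolist()

# Toy consensus profiles with placeholder values
consensus = pd.DataFrame(
    {
        "Metadata_broad_sample": ["cpd_a", "cpd_b"],
        "Nuclei_AreaShape_MedianRadius": [1e13, -2e13],
        "Cells_AreaShape_Area": [0.3, -0.7],
    }
)

# errors="ignore" skips listed features that are absent from this dataframe
cleaned = consensus.drop(columns=outlier_features, errors="ignore")
print(cleaned.columns.tolist())
```

Keeping the list in a file makes the exclusion explicit and reproducible across normalization schemes, at the cost of having to document where the list came from.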

michaelbornholdt · 2021-04-23T20:43:16Z

listfile.txt

Voila

* upgrade pycytominer and pandas
* update profiling pipeline for upgraded pycytominer also adding recoded dose to all annotated profiles
* add dose utility file
* fix compression in aggregate
* add cyto_utils import and run black
* add an execution script
* remove instructions from duplicate sources
* reprocess batch 2 data
* reprocess batch 1 data
* rerun comparison module with reprocessed profiles
* update consensus signatures for data freeze
* update spherized data freeze
* update per cell line spherize
* add link to drop outlier features in #65
* update documentation adding dose info to profile readme, adding outlier feature drop to consensus readme
* add additional blocklist features
* remove outlier features using a file
* add consensus profiles
* fix spherize two ways: spherize based on both time and cell line, and perform blocklist feature selection beforehand
* add updated spherized profiles
* add umap visualization of spherized data
* track large umap pdfs

gwaybio · 2021-05-21T20:18:32Z

Addressed in #63 - thanks everyone!

gwaybio added the Version 2 Wishlist Items to process before a version 2 release label Apr 20, 2021

gwaybio mentioned this issue Apr 20, 2021

Frozen data version 1 #63

Merged

5 tasks

gwaybio changed the title ~~Drop outliers~~ Dropping outlier features Apr 21, 2021

gwaybio removed the Version 2 Wishlist Items to process before a version 2 release label Apr 22, 2021

gwaybio added a commit to gwaybio/lincs-cell-painting that referenced this issue Apr 26, 2021

add link to drop outlier features in broadinstitute#65

8a794f9

gwaybio closed this as completed May 21, 2021

shntnu mentioned this issue Nov 3, 2022

Questions about drop_outlier_features cytomining/pycytominer#237

Closed
