Dropping outlier features #65

Closed
shntnu opened this issue Apr 20, 2021 · 18 comments

Comments

@shntnu
Collaborator

shntnu commented Apr 20, 2021

MB said:

I have found an “error” in the LINCS dataset and I was wondering if you knew of this and whether the pycyto pipeline needs fixing. I am analyzing the Level 5 consensus data from here. When running the cyto eval functions on this data, I noticed some very high correlations. They come from one feature (Nuclei_AreaShape_MedianRadius) that is 10^13 times larger than the others. The image shows a scatter plot of two samples that have a similarity of 1.000 but are different compounds.

image
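
To illustrate why one feature on this scale can push the similarity of two different compounds to 1.000, here is a minimal sketch with synthetic data (not the LINCS profiles). It assumes the similarity is a Pearson correlation, which the report does not state explicitly, and the profile values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unrelated synthetic profiles with 500 features each (illustrative only).
profile_a = rng.normal(size=500)
profile_b = rng.normal(size=500)

corr_before = np.corrcoef(profile_a, profile_b)[0, 1]

# Give both profiles one shared feature that is ~10^13 times larger than the rest.
profile_a[0] = 1e13
profile_b[0] = 2e13

corr_after = np.corrcoef(profile_a, profile_b)[0, 1]

print(f"correlation without the outlier feature: {corr_before:.3f}")
print(f"correlation with the outlier feature:    {corr_after:.3f}")
```

Once the extreme feature is present, it dominates both the mean and the variance of each profile, so the remaining features contribute almost nothing to the correlation.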

This is almost certainly because the MAD of these features is zero in DMSO (at least for the plates that those compounds come from).

https://github.com/cytomining/pycytominer/blob/a04397d9cd7e25828d2f24f986a3386a79e6193d/pycytominer/operations/transform.py#L142
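
A minimal sketch of how a zero MAD can produce values this large, assuming a robust scaling of the form (x - median) / (MAD + epsilon). The function name, the epsilon value, and the toy numbers are illustrative assumptions, not pycytominer's actual code; the linked transform.py line is the real implementation.

```python
import numpy as np


def robust_mad_scale(values, reference, epsilon=1e-18):
    """Scale `values` by the median and MAD of `reference` (e.g. control wells)."""
    median = np.median(reference)
    mad = np.median(np.abs(reference - median))
    # With mad == 0 the denominator is just epsilon, so any deviation explodes.
    return (values - median) / (mad + epsilon)


# Hypothetical control (DMSO-like) values: more than half are identical, so MAD is 0.
controls = np.array([3.0, 3.0, 3.0, 3.0, 3.0, 3.2, 2.9])
treated = np.array([3.0, 3.1, 2.8])

scaled = robust_mad_scale(treated, controls)
print(scaled)
```

A feature that takes the same value in most control wells (for example a discretized count or a saturated measurement) would behave like `controls` here.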

  1. Add drop_outliers to https://github.com/broadinstitute/lincs-cell-painting/blob/master/profiles/profile_cells.py
  2. Reprocess
@gwaybio
Member

gwaybio commented Apr 20, 2021

Nice - i don't think it's worth doing before the first data freeze (see #63)

But it is definitely worth noting which features this impacts - @michaelbornholdt do you have this info? Are they only the three features?

  • Nuclei_AreaShape_MedianRadius
  • Cells_Correlation_Manders_AGP_RNA
  • Cells_Neighbors_NumberOfNeighbors_10

I can add a prominent note to make sure these are dropped in all downstream analyses in a README in #63

@gwaybio added the "Version 2 Wishlist" label (Items to process before a version 2 release) Apr 20, 2021
@gwaybio mentioned this issue Apr 20, 2021
5 tasks
@michaelbornholdt
Contributor

@gwaygenomics
Here are the features that have higher values than 200:
image

So just to be sure, I will not do anything to the pipeline but just locally delete these features so I can carry on with my analysis. Correct? @shntnu

@shntnu
Collaborator Author

shntnu commented Apr 20, 2021

> will not do anything to the pipeline but just locally delete these features so I can carry on with my analysis. Correct?

yes

@shntnu
Collaborator Author

shntnu commented Apr 20, 2021

> i don't think it's worth doing before the first data freeze

yes

@gwaybio
Member

gwaybio commented Apr 21, 2021

@shntnu - it turns out that I can very easily update #63 consensus and spherized profiles to add drop_outliers without having to rerun everything.

@michaelbornholdt do you recommend using 200 as a cutoff? I use 60 currently in spherized profiles, but I'd be happy to update to 200 if you have any data-driven rationale

@gwaybio changed the title from "Drop outliers" to "Dropping outlier features" Apr 21, 2021
@michaelbornholdt
Contributor

I can try several thresholds and look at the precision and recall. Or do you have a better idea for deciding the threshold?

@gwaybio
Member

gwaybio commented Apr 21, 2021

that sounds good to me. What specifically will you try? Altering outlier_num in (np.abs(df[feature]) > outlier_num) ?

@michaelbornholdt
Contributor

michaelbornholdt commented Apr 21, 2021

The following is the precision at k = 5 for different threshold values. It looks like anything from 100 to 500 is a sensible value to use.

threshold   precision
60          0.776667
100         0.786667
200         0.78
500         0.783333
1000        0.783333
10000       0.733333
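
For readers unfamiliar with the metric, here is a sketch of one common definition of precision at k: for each sample, the fraction of its k most similar neighbours that share its label, averaged over samples. The exact evaluation used for the numbers above is not shown in this thread, so the function name `precision_at_k`, the label grouping, and the toy similarity matrix are assumptions for illustration only.

```python
import numpy as np


def precision_at_k(similarity, labels, k=5):
    """Mean fraction of each sample's k most similar neighbours sharing its label."""
    similarity = np.asarray(similarity, dtype=float).copy()
    labels = np.asarray(labels)
    np.fill_diagonal(similarity, -np.inf)  # never count a sample as its own neighbour
    # Indices of the k highest-similarity neighbours for every row.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = labels[top_k] == labels[:, None]
    return hits.mean()


# Toy example: two groups of six samples, similar only within their own group.
labels = np.array(["a"] * 6 + ["b"] * 6)
similarity = (labels[:, None] == labels[None, :]).astype(float)

print(precision_at_k(similarity, labels, k=5))
```

In practice the similarity matrix would be computed from the profiles after dropping features at a given threshold, and the metric recomputed per threshold.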

@michaelbornholdt
Contributor

I haven't worked with the outlier functionality. Will need to get my head around that part of the pipeline first.
I just wrote my own function to drop the columns with the high values

@gwaybio
Member

gwaybio commented Apr 21, 2021

awesome, thanks Michael!

@gwaybio
Member

gwaybio commented Apr 21, 2021

The pycytominer drop outlier strategy is simple:

https://github.com/cytomining/pycytominer/blob/a04397d9cd7e25828d2f24f986a3386a79e6193d/pycytominer/cyto_utils/features.py#L141-L143

based on your code screenshot in #65 (comment) i think you're doing something very similar, if not exactly the same

@michaelbornholdt
Contributor

Can you update the files in the consensus then so that people don't run into the same problems?

@gwaybio
Member

gwaybio commented Apr 21, 2021

yep, that is the plan in #63. I'll use 100 for the threshold

@gwaybio removed the "Version 2 Wishlist" label (Items to process before a version 2 release) Apr 22, 2021
@gwaybio
Member

gwaybio commented Apr 22, 2021

alright, I tried 100 (and then bumped it up to 200). I remember now why I didn't originally do this!

Setting the threshold to 200 keeps only 15 features in one of the normalization schemes 😬

How about we use your approach instead (somehow it must be different). Can you create a .txt file with a column header: outlier_features and each of those features in that screenshot as independent rows. I can easily remove them this way.

@michaelbornholdt
Contributor

This is what I am using.

For a threshold of 100, this drops 32 features. Do you want me to send those to you then?

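import numpy as np

def drop_bad_feats(df_old, features_old, threshold):
    # Collect every feature with at least one absolute value above the threshold
    drop_features = []
    for feat in features_old:
        if (np.abs(df_old[feat]) > threshold).any():
            drop_features.append(feat)
    # Drop the flagged columns from the input frame
    df_out = df_old.drop(drop_features, axis="columns")
    print("dropped {} features".format(len(drop_features)))
    return df_out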
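
The same column-wise check can also be written without the explicit loop, which makes it easy to see how many features survive at each threshold. This is only a sketch on toy data: `find_outlier_features`, the column names, and the values are hypothetical and are not taken from the LINCS profiles or from pycytominer.

```python
import pandas as pd


def find_outlier_features(df, features, threshold):
    """Return the features whose absolute value exceeds `threshold` in any row."""
    exceeds = (df[features].abs() > threshold).any(axis=0)
    return exceeds[exceeds].index.tolist()


# Toy profiles: feature_b contains one extreme value, feature_c a moderate one.
profiles = pd.DataFrame(
    {
        "Metadata_well": ["A01", "A02", "A03"],
        "feature_a": [0.5, -1.2, 3.0],
        "feature_b": [0.1, 1e13, -0.4],
        "feature_c": [150.0, -2.0, 0.0],
    }
)
features = ["feature_a", "feature_b", "feature_c"]

for threshold in (60, 100, 200):
    outliers = find_outlier_features(profiles, features, threshold)
    kept = profiles.drop(columns=outliers)
    # Subtract 1 for the metadata column when counting remaining features.
    print(threshold, outliers, kept.shape[1] - 1, "features kept")
```

Because a feature is dropped as soon as a single row exceeds the threshold, the number of surviving features depends strongly on how each normalization scheme scales the data.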

@gwaybio
Member

gwaybio commented Apr 23, 2021

yes, that would be great.

> Can you create a .txt file with a column header: outlier_features and each of those features in that screenshot as independent rows.

There is a pycytominer function to drop custom columns - i'll just need to be careful with documentation.
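
A rough sketch of the file-based approach using plain pandas: write a one-column text file with the header outlier_features, read it back, and drop the listed columns. The pycytominer function mentioned above is not shown in this thread, so it is deliberately not used here; the file name, the metadata column, and the toy values are hypothetical, while the two feature names come from the list earlier in this issue.

```python
import tempfile
from pathlib import Path

import pandas as pd

outlier_features = [
    "Nuclei_AreaShape_MedianRadius",
    "Cells_Neighbors_NumberOfNeighbors_10",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    list_path = Path(tmp_dir) / "outlier_features.txt"  # hypothetical file name

    # One column with the header `outlier_features`, one feature per row.
    pd.DataFrame({"outlier_features": outlier_features}).to_csv(list_path, index=False)

    # Later, read the list back from disk.
    to_drop = pd.read_csv(list_path)["outlier_features"].tolist()

# Toy profiles; "Metadata_sample" is a made-up metadata column.
profiles = pd.DataFrame(
    {
        "Metadata_sample": ["compound_1", "compound_2"],
        "Nuclei_AreaShape_MedianRadius": [1e13, -4e12],
        "Cells_AreaShape_Area": [0.3, -0.7],
    }
)

# errors="ignore" skips listed features that are absent from this data frame.
cleaned = profiles.drop(columns=to_drop, errors="ignore")
print(cleaned.columns.tolist())
```

Keeping the list in a file means the same set of features is removed from every profile file, independent of how each one was normalized.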

@michaelbornholdt
Contributor

michaelbornholdt commented Apr 23, 2021

listfile.txt

Voila

gwaybio added a commit to gwaybio/lincs-cell-painting that referenced this issue Apr 26, 2021
gwaybio added a commit that referenced this issue May 21, 2021
* upgrade pycytominer and pandas

* update profiling pipeline for upgraded pycytominer

also adding recoded dose to all annotated profiles

* add dose utility file

* fix compression in aggregate

* add cyto_utils import

and run black

* add an execution script

* remove instructions from duplicate sources

* reprocess batch 2 data

* reprocess batch 1 data

* rerun comparison module with reprocessed profiles

* update consensus signatures for data freeze

* update spherized data freeze

* update per cell line spherize

* add link to drop outlier features in #65

* update documentation

adding dose info to profile readme, adding outlier feature drop to consensus readme

* add additional blocklist features

* remove outlier features using a file

* add consensus profiles

* fix spherize two ways:

spherize based on both time and cell line, and perform blocklist feature selection beforehand

* add updated spherized profiles

* add umap visualization of spherized data

* track large umap pdfs
@gwaybio
Member

gwaybio commented May 21, 2021

Addressed in #63 - thanks everyone!
