Dropping outlier features #65
Comments
Nice - I don't think it's worth doing before the first data freeze (see #63), but it is definitely worth noting which features this impacts. @michaelbornholdt do you have this info? Is it only the three features?
I can add a prominent note to a README in #63 to make sure these are dropped in all downstream analyses. |
@gwaygenomics So just to be sure, I will not do anything to the pipeline but just locally delete these features so I can carry on with my analysis. Correct? @shntnu |
yes |
yes |
@shntnu - it turns out that I can very easily update #63 consensus and spherized profiles to add @michaelbornholdt do you recommend using 200 as a cutoff? I use 60 currently in spherized profiles, but I'd be happy to update to 200 if you have any data-driven rationale |
I can try several drop thresholds and look at the precision and recall. Or do you guys have a better idea for deciding the threshold? |
that sounds good to me. What specifically will you try? Altering |
The following is the precision at k = 5 for different threshold values: threshold 60.000000 |
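For readers unfamiliar with the metric, here is a minimal sketch of how precision at k could be computed for a set of profiles. It assumes each profile carries a label (for example a mechanism of action) and that neighbours are ranked by cosine similarity. Both choices, and the names `cosine` and `precision_at_k`, are assumptions for illustration and are not taken from the analysis in this thread.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def precision_at_k(profiles, labels, k=5):
    """Mean fraction of each profile's k nearest neighbours sharing its label."""
    scores = []
    for i, query in enumerate(profiles):
        # Rank every other profile by similarity to the query, best first.
        ranked = sorted(
            (j for j in range(len(profiles)) if j != i),
            key=lambda j: cosine(query, profiles[j]),
            reverse=True,
        )
        top = ranked[:k]
        scores.append(sum(labels[j] == labels[i] for j in top) / len(top))
    return sum(scores) / len(scores)


# Toy data: two well-separated groups (hypothetical values).
profiles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
p_at_1 = precision_at_k(profiles, labels, k=1)
```

Repeating this after dropping features at each candidate threshold gives one number per threshold to compare.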
I haven't worked with the outlier functionality. Will need to get my head around that part of the pipeline first. |
awesome, thanks Michael! |
The pycytominer drop outlier strategy is simple: based on your code screenshot in #65 (comment) I think you're doing something very similar, if not exactly the same |
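As a rough illustration of a threshold-based outlier drop, the sketch below flags any feature whose largest absolute value across samples exceeds a cutoff. This is an assumed reading of the strategy discussed here, written in plain Python rather than with pycytominer, and the feature names and values are hypothetical.

```python
def find_outlier_features(profiles, cutoff):
    """Return names of features whose largest absolute value exceeds cutoff.

    `profiles` maps feature name -> list of normalized values, one per sample.
    """
    return [
        name
        for name, values in profiles.items()
        if max(abs(v) for v in values) > cutoff
    ]


# Hypothetical normalized profiles for three samples.
profiles = {
    "Cells_AreaShape_Area": [0.5, -1.2, 2.0],
    "Nuclei_Texture_Example": [0.1, 350.0, -0.3],
    "Cytoplasm_Intensity_Example": [-75.0, 0.4, 1.1],
}

# Compare how many features each candidate cutoff would remove.
for cutoff in (60, 100, 200):
    print(cutoff, find_outlier_features(profiles, cutoff))
```

A lower cutoff removes more features, which is the trade-off the thread is weighing between 60, 100, and 200.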
Can you update the files in the consensus then so that people don't run into the same problems? |
yep, that is the plan in #63. I'll use 100 for the threshold |
alright, I tried 100 (and then bumped it up to 200). I remember now why I didn't originally do this! Setting the threshold to 200 keeps only 15 features in one of the normalization schemes 😬 How about we use your approach instead (somehow it must be different). Can you create a .txt file with a column header: |
This is what I am using. For a threshold of 100, this drops 32 features. Do you want me to send those to you then?
|
yes, that would be great. Can you create a .txt file with a column header: outlier_features and each of those features in that screenshot as independent rows. There is a pycytominer function to drop custom columns - I'll just need to be careful with documentation. |
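The file layout described above (a single `outlier_features` header followed by one feature per row) could be written, read back, and applied roughly as follows. This is a standard-library sketch under assumed names: `write_outlier_file`, `read_outlier_file`, and `drop_features` are hypothetical helpers, not the pycytominer function mentioned in the comment, and the feature names are placeholders.

```python
import csv
import os
import tempfile


def write_outlier_file(path, features):
    """Write one feature per row under an `outlier_features` header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["outlier_features"])
        writer.writerows([name] for name in features)


def read_outlier_file(path):
    """Read the feature names back from the `outlier_features` column."""
    with open(path, newline="") as handle:
        return [row["outlier_features"] for row in csv.DictReader(handle)]


def drop_features(profiles, to_drop):
    """Return a copy of `profiles` without the listed feature columns."""
    blocked = set(to_drop)
    return {name: values for name, values in profiles.items() if name not in blocked}


# Round trip with placeholder feature names.
path = os.path.join(tempfile.mkdtemp(), "outlier_features.txt")
write_outlier_file(path, ["Feature_B"])
profiles = {"Feature_A": [0.1, 0.2], "Feature_B": [300.0, -0.5]}
cleaned = drop_features(profiles, read_outlier_file(path))
```

Keeping the list in a tracked file means every downstream analysis drops the same features, independent of the threshold that produced the list.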
Voila |
* upgrade pycytominer and pandas
* update profiling pipeline for upgraded pycytominer also adding recoded dose to all annotated profiles
* add dose utility file
* fix compression in aggregate
* add cyto_utils import and run black
* add an execution script
* remove instructions from duplicate sources
* reprocess batch 2 data
* reprocess batch 1 data
* rerun comparison module with reprocessed profiles
* update consensus signatures for data freeze
* update spherized data freeze
* update per cell line spherize
* add link to drop outlier features in #65
* update documentation adding dose info to profile readme, adding outlier feature drop to consensus readme
* add additional blocklist features
* remove outlier features using a file
* add consensus profiles
* fix spherize two ways: spherize based on both time and cell line, and perform blocklist feature selection beforehand
* add updated spherized profiles
* add umap visualization of spherized data
* track large umap pdfs
Addressed in #63 - thanks everyone! |
MB said:
This is almost definitely because the MAD of these features is zero in DMSO (at least for the plates that those compounds come from).
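To see why a zero MAD in the controls produces extreme values, consider a robust z-score of the form (x - median) / (MAD + epsilon), computed against the DMSO wells. The sketch below is an illustration under that assumed formula. The `epsilon` guard, the function names, and the numbers are hypothetical and do not reproduce the exact normalization used in the pipeline.

```python
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    center = median(values)
    return median(abs(v - center) for v in values)


def robustize(values, reference, epsilon=1e-6):
    """Scale `values` by the median and MAD of a reference (control) set.

    `epsilon` is an assumed small constant that avoids division by zero.
    """
    center = median(reference)
    scale = mad(reference) + epsilon
    return [(v - center) / scale for v in values]


# A feature that is almost constant in the controls (hypothetical values).
dmso = [0.0, 0.0, 0.0, 0.0, 0.01]
treated = [0.0, 0.02]

print(mad(dmso))                 # MAD of the controls is zero
scaled = robustize(treated, dmso)
print(scaled)                    # a tiny raw difference becomes a huge score
```

With the MAD at zero, the denominator collapses to `epsilon`, so even a raw difference of 0.02 is scaled far past any of the cutoffs discussed in this thread.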
https://github.com/cytomining/pycytominer/blob/a04397d9cd7e25828d2f24f986a3386a79e6193d/pycytominer/operations/transform.py#L142
drop_outliers
to https://github.com/broadinstitute/lincs-cell-painting/blob/master/profiles/profile_cells.py