CellProfiler, Image-based Profiling Pipelines, and Missing Values #79
Comments
Great points Greg! I'm glad you addressed this explicitly! Regarding subpoints of Point 2: Yes, especially since missing columns and NaN values are not part of the pytests yet. For the
@DavidStirling - do you know offhand how CP3 (and also CP4) handle missing values? (Congrats on the CP4 release btw, super exciting)
@gwaygenomics If you're running ExportToSpreadsheet there is a setting to choose whether invalid numerical values (NaN/inf) are represented with null or nan, but this has been there for a long time. However, I'm not sure what happens with missing values of other types or those that aren't recorded at all. I think it'll largely depend on what the module that created the particular measurement is set up to do.
Thanks David
This is important info - it sounds like we need to handle cases where there are multiple different kinds of missing values (nan, na, NA, NaN, null, etc.) since there is no standard. From a software perspective, CellProfiler should manage this. However, we do need to buffer against this b/c legacy data exists and people use older CellProfiler versions.
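One way to buffer against the mixed spellings is to normalize them at read time. The snippet below is a minimal sketch using only the Python standard library; the token set, the helper names, and the column name `Cells_AreaShape_Area` are assumptions made for illustration and do not come from CellProfiler, pycytominer, or cytominer-database.

```python
import csv
import io
import math

# Hypothetical token list (compared case-insensitively); extend it as new
# spellings turn up in legacy exports.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def parse_measurement(raw):
    """Return a float, or None when the cell holds a missing-value token."""
    text = raw.strip()
    if text.lower() in MISSING_TOKENS:
        return None
    value = float(text)
    # Treat non-finite numbers (nan/inf) as missing as well.
    return value if math.isfinite(value) else None


def read_column(csv_text, column):
    """Parse one measurement column from CSV text into floats and Nones."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [parse_measurement(row[column]) for row in reader]


# Hypothetical single-column export mixing several missing-value spellings.
example = "Cells_AreaShape_Area\n12.5\nNaN\nna\nNA\nnull\ninf\n3\n"
values = read_column(example, "Cells_AreaShape_Area")
```

Mapping every spelling to a single sentinel (`None` here) before ingestion means downstream code only has to reason about one kind of missing value.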
For the most part it looks like modules use numpy.NaN when a measurement is invalid, rather than specifying a particular string. Nonetheless, with so many modules that I didn't write myself, it's difficult to be sure!
got it - and it looks like the David, it is the case that the single cell csv files that are ingested into the
I believe so, ExportToDatabase can generate a
The reason to dig into this deeply is to ensure consistency between pycytominer and cytominer processing. This originally came up in the context of the analysis in the
But.... ready or not, I am now going to throw in an additional layer of complexity! 🔧 The current method of processing
Decision
We do not need to dig into how
The way forward is to make sure the pre-ingest missing values are not incorrectly converted post-ingest. @diskontinuum - what do you think about this strategy? Would it be quick to add a test to both
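A round-trip check of this kind could look roughly like the sketch below. It uses the standard-library `sqlite3` module as a stand-in for the real ingest path, so it does not exercise cytominer-database or pycytominer themselves; the table name, column names, and helper functions are hypothetical.

```python
import sqlite3


def ingest(rows):
    """Load (object number, measurement) rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cells (ObjectNumber INTEGER, Cells_AreaShape_Area REAL)"
    )
    conn.executemany("INSERT INTO cells VALUES (?, ?)", rows)
    conn.commit()
    return conn


def missing_ids(conn):
    """Return the object numbers whose measurement is stored as NULL."""
    cur = conn.execute(
        "SELECT ObjectNumber FROM cells "
        "WHERE Cells_AreaShape_Area IS NULL ORDER BY ObjectNumber"
    )
    return [row[0] for row in cur.fetchall()]


# Pre-ingest data: None marks a missing measurement, 0.0 is a genuine zero.
pre_ingest = [(1, 12.5), (2, None), (3, 0.0), (4, None)]
conn = ingest(pre_ingest)
post_ingest = conn.execute(
    "SELECT ObjectNumber, Cells_AreaShape_Area FROM cells ORDER BY ObjectNumber"
).fetchall()
```

The property worth asserting is that missing entries come back as missing and never as zero, while real zeros stay zeros. The same comparison could be applied to whatever the actual ingest and read functions return.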
Note that this is only the case for the
Thanks @diskontinuum - a couple followup questions/comments:
Cool, yeah I remember this functionality being important. However, should we be using a value other than
It looks like the default behavior in the config.ini file is set to
Adding an issue would be a great first step (in the cytominer-database repo). I want to make sure all of the knowledge you worked hard on acquiring isn't lost once you start at Google!
Yes, this can easily be done by changing the parameters in the
Great! Should I remove the string option in the next PR then?
A common issue that keeps surfacing involves how missing values are handled in CellProfiler, and, subsequently, in downstream image-based profiling pipelines.
There appear to be many somewhat independent issues around this problem. I was not sure where to file this issue, since it does seem to permeate into many other codebases. I will attempt to outline the issue here.
[...] na while < CellProfiler 2 outputs NaN
[...] cytominer-database to ingest all compartment .csvs into a single .sqlite database.
[...] .sqlite is encoding missing values
[...] parquet backend option to cytominer-database (cc @diskontinuum)
[...] pandas (python) and dplyr (R) extract missing values from the .sqlite backend.
pandas.read_sql() attempts to convert values to non-string values. This may or may not explain conversion of CellProfiler na or NaN values to zero.
As @shntnu noted, at the aggregation step the missing value problem is solved by ignoring missing values, which boils down to mean imputation. The problem remains important for single cell profiles, however.
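The equivalence between ignoring missing values during aggregation and mean imputation can be checked with a few lines of arithmetic. This is a small illustration with made-up numbers and hypothetical helper names, not the aggregation code used by cytominer or pycytominer. It also shows why a silent conversion of missing values to zero would bias an aggregated profile.

```python
from statistics import fmean


def mean_ignoring_missing(values):
    """Aggregate by averaging only the observed (non-None) values."""
    return fmean([v for v in values if v is not None])


def mean_after_mean_imputation(values):
    """Fill missing entries with the observed mean, then average everything."""
    fill = mean_ignoring_missing(values)
    return fmean([fill if v is None else v for v in values])


def mean_after_zero_fill(values):
    """Average after missing entries were (incorrectly) converted to zero."""
    return fmean([0.0 if v is None else v for v in values])


# Made-up single-cell measurements for one well; None marks a missing value.
well = [10.0, None, 14.0, None, 12.0]

ignored = mean_ignoring_missing(well)
imputed = mean_after_mean_imputation(well)
zero_filled = mean_after_zero_fill(well)
```

Here `ignored` and `imputed` agree, while `zero_filled` is pulled toward zero. At the single cell level there is no aggregation step to absorb the missing entries, so each one has to be handled explicitly.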