Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CellProfiler, Image-based Profiling Pipelines, and Missing Values #79

Open
gwaybio opened this issue May 14, 2020 · 13 comments
Open

CellProfiler, Image-based Profiling Pipelines, and Missing Values #79

gwaybio opened this issue May 14, 2020 · 13 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@gwaybio
Copy link
Member

gwaybio commented May 14, 2020

A common issue that keeps surfacing involves how missing values are handled in CellProfiler, and, subsequently, in downstream image-based profiling pipelines.

There appear to be many somewhat independent issues around this problem. I was not sure where to file this issue, since it does seem to permeate into many other codebases. I will attempt to outline the issue here.

  1. Different CellProfiler versions output missing values with different notations
    • Specifically, >= CellProfiler 3 outputs na while < CellProfiler 2 outputs NaN
    • This should be specified and deliberately coded. It is possible it is now deliberately coded in CP3, and, if so, this component should be considered solved. (cc @DavidStirling)
  2. The first step after CellProfiler in a traditional image-based profiling pipeline is to call cytominer-database to ingest all compartment .csvs into a single .sqlite database.
  3. There are additional differences in how pandas (python) and dplyr (R) extract missing values from the .sqlite backend.
  4. Pandas version 1.0 updates include updated handling of missing values

As @shntnu noted, the missing value problem is solved by the aggregation and ignoring missing values. This solution boils down to a mean imputation solution. This problem is important for single cell profiles, however.

@diskontinuum
Copy link

Great points Greg! I'm glad you addressed this explicitly!

Regarding subpoints of Point 2: Yes, especially since missing columns and NaN values are not part of the pytests yet.

For the parquet option, I added an entirely new chunk of code to deal with this scenario (which is one of the reasons why it took a while). The functions have been tested locally in the colaboratory notebook. More precisely, if the missing values manifest as missing columns, then the columns will simply be added as NaN-valued columns during the process of aligning the dataframes, before they are written to Parquet. For this we need to make sure that the writer schema encompassed all possible columns, though. That is, the Parquet writer needs to be opened with a .csv that certainly contains all columns. I build two options for this, following the suggestions of Beth and Greg: One is to put these reference .csvsinto a folder and save the relative path to the folder in the config file. Another is to set the option 'sample', which will choose the .csv file with the most columns from a subset of .csv files sampled uniformly at random (more info: readme on my branch ) .
What I have in mind to tackle this, is:

  • check how NaN v.s. na are read from .parquet or .csv back to the pandas dataframe and compare with original value and type --> include pytests for NaN values
  • include pytests for missing columns (easy for .parquet option)
  • solve the issue for the .sqlite backend (easy if we use the same pandas.dataframe.align() method and the same reference-table principle with the functions I already built for Parquet)

@gwaybio
Copy link
Member Author

gwaybio commented Sep 24, 2020

This should be specified and deliberately coded. It is possible it is now deliberately coded in CP3, and, if so, this component should be considered solved. (cc @DavidStirling)

@DavidStirling - do you know off hand how CP3 (and also CP4) handles missing values? (Congrats on CP4 release btw, super exciting)

@DavidStirling
Copy link

@gwaygenomics If you're running ExportToSpreadsheet there is a setting to choose whether invalid numerical values (Nan/inf) are represented with null or nan, but this has been there for a long time. However, I'm not sure what happens with missing values of other types or those that aren't recorded at all. I think it'll largely depend on what the module which created the particular measurement is set up to do.

@gwaybio
Copy link
Member Author

gwaybio commented Sep 24, 2020

Thanks David

I think it'll largely depend on what the module which created the particular measurement is set up to do.

This is important info - it sounds like we need to handle cases where there are multiple different kinds of missing values (nan, na, NA, NaN, null, etc.) since there is no standard.

From a software perspective, CellProfiler should manage this. However, we do need to buffer against this b/c legacy data exists and people use older CellProfiler versions.

@DavidStirling
Copy link

For the most part it looks like modules use numpy.NaN when a measurement is invalid, rather than specifying a particular string. Nonetheless with so many modules that I didn't write myself it's difficult to be sure!

@gwaybio
Copy link
Member Author

gwaybio commented Sep 24, 2020

got it - and it looks like the ExportToSpreadsheet CellProfiler module uses csv.writer() for output https://github.com/CellProfiler/CellProfiler/blob/master/cellprofiler/modules/exporttospreadsheet.py#L1080

David, it is the case that the single cell csv files that are ingested into the .sqlite file per plate are originally written using ExportToSpreadsheet?

@DavidStirling
Copy link

I believe so, ExportToDatabase can generate a .sqlite file natively but for DCP we tend to use ExportToSpreadsheet to ease combining the data from different cluster machines.

@gwaybio gwaybio added bug Something isn't working help wanted Extra attention is needed labels Sep 25, 2020
@AdeboyeML
Copy link
Contributor

  • @gwaygenomics I think the best way to check how sqlite db backend encode missing values, is to check and ascertain the number of null values before and after ingestion of the csv files into the database/parquet

  • I tried ingesting pandas df with null/nan values into sqlite on my local machine, and when I read the pandas df back from the sqlite db, I realized that the number of null values before and after ingestion still remained the same.

  • The only difference is that before ingestion, the pandas df Null values are represented as NaN/nan which is a float datatype BUT after ingestion and calling back the pandas from sqlite db, the Null values are represented as None which is a Nonetype datatype...but this doesn't affect the way pandas check for null/None/Nan values

  • For example:
    data - before ingestion, data_sql - after ingestion

image

@gwaybio
Copy link
Member Author

gwaybio commented Sep 25, 2020

The reason to dig into this deeply is to ensure consistency between pycytominer and cytominer processing.

This originally came up in the context of the analysis in the lincs-cell-painting repository, where we compared all 136 Drug Repurposing Hub plates processed with pycytominer (python) vs. cytominer (R) (see here). Briefly, we saw very small floating point differences between the two processing pipelines, and we hypothesized that the reason was because of how python and R handles missing values.

But.... ready or not, I am now going to throw in an additional layer of complexity! 🔧

The current method of processing sqlite files uses sqlalchemy, whereas the old method of processing sqlite files used odo. Here is the change that introduced this switch. It is possible that odo causes the sqlite missing value handling issue.

Decision

We do not need to dig into how odo handles missing values 😁 . odo is deprecated and we will never use it again.

The way forward is to make sure the pre-ingest missing values are not incorrectly converted post-ingest. @diskontinuum - what do you think about this strategy? Would it be quick to add a test to both sqlite and parquet options?

@diskontinuum
Copy link

diskontinuum commented Sep 28, 2020

  • Regarding value-based tests: Yes, absolutely! ETA: This week :)

  • Regarding how the current cytominer-database code deals with missing values, NaN etc. as follows:

  1. Missing columns: The code uses a reference table (either from a fixed path, or, via sampling of the largest table - as specified in the config.ini file). When tables with missing columns are appended to the opened writer-table, the missing columns are added as value None.

  2. NaN values and other type incompatibilities: The 'schema' section of the config.ini file allows to set  type_conversion  = all2string, in the case of which all values are converted to string. 

Note that this is only the case for the --parquet backend, but the code can easily be adapted to allow sampling and reference frames also for the --sqlite backend [also on my TODO list, I can add it as a new issue].

@gwaybio
Copy link
Member Author

gwaybio commented Sep 28, 2020

Thanks @diskontinuum - a couple followup questions/comments:

When tables with missing columns are appended to the opened writer-table, the missing columns are added as value None.

Cool, yeah I remember this functionality being important. However, should we be using a value other than None so that we're consistent with other missing value types coming from CellProfiler ExportToSpreadsheet? (see #79 (comment))

The 'schema' section of the config.ini file allows to set type_conversion = all2string

It looks like the default behavior in the config.ini file is set to int2float. The reason we wanted all2string was because there are multiple ways CellProfiler encoded missing values. However, as @AdeboyeML demonstrated in #79 (comment), "but this doesn't affect the way pandas check for null/None/Nan values". I now believe that the int2float default will actually solve this issue for free, in pandas, under the hood.

can easily be adapted to allow sampling and reference frames also for the --sqlite backend [also on my TODO list, I can add it as a new issue].

Adding an issue would be a great first step (in the cytominer-database repo). I want to make sure all of the knowledge you worked hard on acquiring isn't lost once you start at Google!

@diskontinuum
Copy link

diskontinuum commented Sep 29, 2020

Cool, yeah I remember this functionality being important. However, should we be using a value other than None so that we're consistent with other missing value types coming from CellProfiler ExportToSpreadsheet? (see #79 (comment))

Yes, this can easily be done by changing the parameters in the pandas.dataframe.align() method that we're using to concatenate tables with a potentially different number or order of columns. By adding the parameter fill_value we can choose the value with which missing columns are padded (instead of the default None).

It looks like the default behavior in the config.ini file is set to int2float. The reason we wanted all2string was because there are multiple ways CellProfiler encoded missing values. However, as @AdeboyeML demonstrated in #79 (comment), "but this doesn't affect the way pandas check for null/None/Nan values". I now believe that the int2float default will actually solve this issue for free, in pandas, under the hood.

Great! Should I remove the string option in the next PR then ?

@diskontinuum
Copy link

Adding an issue would be a great first step (in the cytominer-database repo).

I opened issues #127, #128, #129.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants