Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address SQLite Read Performance Issues #198

Closed
d33bs opened this issue May 21, 2022 · 12 comments
Closed

Address SQLite Read Performance Issues #198

d33bs opened this issue May 21, 2022 · 12 comments
Assignees

Comments

@d33bs
Copy link
Member

d33bs commented May 21, 2022

SQLite files used by this repo create performance challenges during runtime that may hinder or completely stall progress. This issue is dedicated to targeting SQLite read performance issues, including solutions which convert away from the format or may not use SQLite at all.

Issues which may be related or tied to this:

@d33bs
Copy link
Member Author

d33bs commented May 24, 2022

Hi @gwaygenomics - I'm noticing that within SQ00014613.sqlite (from here), there are numeric columns which specify not-null for column attributes and which sometimes include 'nan' (textual) actual values. Examples may be found in that file within the cytoplasm table and column Cytoplasm_Correlation_Costes_AGP_DNA (there are many others with this issue). The following SQL shows this as well:

select ObjectNumber, 
    Cytoplasm_Correlation_Costes_AGP_DNA,
    typeof(Cytoplasm_Correlation_Costes_AGP_DNA) from Cytoplasm
where typeof(Cytoplasm_Correlation_Costes_AGP_DNA) != 'real';

SQLite may include column values that don't match the column specification. This could be one cause of the performance issues using Pandas and SQLAlchemy, but it also can impact alternatives. Because these columns are sometimes indicated as not-null, it means we may not use NULL unless we create a new column, copying the original.

Do you know of a way we could interpret these 'nan' values correctly for the purposes of pycytominer (for example, replace with 0, filter from results in database, etc)?

d33bs added a commit to d33bs/pycytominer-performance that referenced this issue May 24, 2022
@gwaybio
Copy link
Member

gwaybio commented May 25, 2022

Sorry for the delayed reply @d33bs, I'm at a conference!

Yes, you note a very important limitation in our current setup. This is a well known issue, and nobody has seemed to be able to successfully track down its origin and fix. See #79 :)

@d33bs
Copy link
Member Author

d33bs commented May 25, 2022

No worries and thank you @gwaygenomics - I'll take a look. Enjoy the conference!

d33bs added a commit to d33bs/pycytominer-performance that referenced this issue May 27, 2022
@d33bs
Copy link
Member Author

d33bs commented May 27, 2022

Hi @gwaygenomics - I just wanted to follow up with some findings, open to any guidance or input you have along the way. I've spent some time investigating ways to speed up sqlite reads, taking note of the background you mentioned a little earlier. I don't yet have thoughts on solutions for the cytominer-database issue, as a brainstorm here it feels like pandas.dataframe.to_sql() or SQLAlchemy may have had a hand in this with built-in assumptions or type inferences.

The file I've focused on for this has so far been only SQ00014613.sqlite. For the below, I have the table cytoplasm column Cytoplasm_Correlation_Costes_AGP_DNA within SQ00014613.sqlite in mind, which includes mostly floating point number values in addition to a handful of 'nan' values.

SQLite Data Type Flexibility

SQLite is a database with sometimes uncommon flexibility when compared to other RDBMS's. One way this is true is through affinity types and storage classes (see docs here). Affinity typing enables table columns to have a "preference" for certain storage classes. Storage classes are ultimately the source of truth concerning a particular value in a column. For example, this means we can have a column of affinity type FLOAT with values 0.1 (REAL class) and 'nan' (TEXT class) coexisting in a completely valid way (where other RDBMS's may throw an error). See below for a table representing this.

SQLite Column Affinity Type SQLite Value SQLite Storage Class
FLOAT 0.1 REAL
FLOAT 'nan' TEXT

Pandas read_sql Interpretation of Mixed-Type Values

Pandas does some legwork to interpret mixed-type values within fields we send to it. In the case of the above (a single SQLite table column of affinity FLOAT with REAL and TEXT values) it will infer the column as an O (object) type and include values of float and str (string) types (version pandas==1.4.2). See below for a table representing this.

SQLite Column Affinity Type SQLite Value SQLite Storage Class Pandas DF Column Type Pandas Value Type
FLOAT 0.1 REAL Object float
FLOAT 'nan' TEXT Object str

In and of itself, this may be a bottleneck; Pandas may be performing sub-optimally due mixed typing of columns. We could modify our SQL queries to sidestep this but we'd likely be stepping into more performance concerns (in-flight SQL data transforms). Recasting types within the Pandas dataframe may also have costly impacts to performance.

What About SQLite NULL?

SQLite provides a special storage class called NULL which may be used for non-values. Let's say we replace the 'nan' value(s) with NULL within the table column. When we do this and read it using Pandas, the resulting column type is float64 and two values of numpy.float64 (SQLite NULL is interpreted as numpy.nan).

SQLite Column Affinity Type SQLite Value SQLite Storage Class Pandas DF Column Type Pandas Value Type
FLOAT 0.1 REAL float64 numpy.float64
FLOAT NULL NULL float64 numpy.float64

As mentioned above, this may lend itself to healthier performance. The catch: NULL may only be used in columns which do not have a constraint of NOT NULL. In a cytoplasm.Cytoplasm_Correlation_Costes_AGP_DNA where I'm witnessing the mixed values occurring, we indeed have a NOT NULL constraint.

Despite SQLite's flexibility with affinity and storage class, it's relatively inflexible when it comes to column constraint modifications. Changing cytoplasm.Cytoplasm_Correlation_Costes_AGP_DNA to be nullable within SQLite requires we create a brand new copy of it which does not include the constraint, removing the original after completion and ideally performing a VACUUM; at some point (quite a lift, oof!). Despite the challenges, there's some promise in doing this work - see below.

Opportunities from Switching 'nan' for NULL

I wanted to restate some of the above and include some additional items below in terms of opportunities as a summary. I ran some testing for these cases to find out more precisely whether and how these might provide you and those who use this library benefit (see test results section below).

  • Pandas memory consumption may improve (numeric types may consume less memory than objects).
  • Pandas performance may improve (merges, joins, and data changes may have greater performance due to relative data uniformity and perhaps fewer in-flight conversions).
  • We'd gain compatibility for other tooling such as ConnectorX, which is purpose-built for greater bulk read performance (see here for example and reported performance improvements).
    • Due to how it's optimized, ConnectorX is not compatible with SQLite databases that include storage class values which differ from their column affinity types (see here).

Some Test Results

Running tests using same constraints as reported and tested in #195 on a GCP e2-highmem-8 machine (8 vCPU, 64 GB Mem).

  • As-is performance with SQLite 'nan's:
  • As-is performance with SQLite 'nan's replaced by NULL:
    • Duration: ~18 minutes (down ~10 minutes from as-is)
    • Peak Memory Consumption: 46.4 GB (down 18.3 GB from as-is)
    • Link to flamegraph
  • ConnectorX reads for load_compartment and with SQLite 'nan's replaced by NULL:
    • Duration: ~6 minutes (down ~22 minutes from as-is)
    • Peak Memory Consumption: 42.4 GB (down 22.3 GB from as-is)
    • Link to flamegraph

@gwaybio
Copy link
Member

gwaybio commented May 27, 2022

Amazing, thanks for this detail @d33bs. A couple followup questions:

  • In your last two tests, how did you convert 'nan's to NULL? Are you suggesting this as a preprocessing step prior to processing? How long does this take? How much memory?
  • Does ConnectorX work with other data formats? We want to keep SQLite support in pycytominer, but we'll eventually want to move away from the format
  • The peak memory consumption drop is looking great! (down ~35%!) However, the memory leak doesn't appear to be completely solved. Do you anticipate that the SQLite nan fix + ConnectorX is only part of the solution?

Thanks again @d33bs - and I hope you have a great Memorial day weekend!

@d33bs
Copy link
Member Author

d33bs commented May 28, 2022

Thank you for the kind words and follow up's @gwaygenomics ! And wishing you a fantastic Memorial day weekend as well! 🙂

Addressing your questions below:

In your last two tests, how did you convert 'nan's to NULL? Are you suggesting this as a preprocessing step prior to processing? How long does this take? How much memory?

I converted the 'nan's to NULL using the following (non-optimal, solely for testing) Python file. I'm still learning the full lifecycle of the data involved, so apologies if I misunderstand things with what follows:

  • I haven't yet profiled the memory or resources with the Python file, and I would consider it non-optimal in it's current state. I estimate that it took about an 1.5 hours to complete and was mostly CPU intensive. There's likely a more elegant and performant method to doing this which involves more specificity at the column level, more targeted 'nan' updates (instead of the full table), or possibly in-memory databases (because we have that option with SQLite via :memory:).
  • Ideally we wouldn't need to make this conversion and the data would be added to SQLite in this format when the database is created.
  • Because there may be others like the one I tested with that include this format, I feel we could add a "post-creation pre-processing" step as you mentioned.
  • An alternative might be to make releases of these datasets as a "v2.0" or similar, retaining the originals but indicating performance increases within new versions. This might enable batch reprocessing to take place outside of pycytominer and minimize the impact to those using it.
  • As another alternative, or maybe more of a sidestep, we could do a quick "SQLite linting" check for storage class mismatches with column affinity, emitting a warning or error to the user for the concerns (and perhaps linking to documentation for updating the database).
  • Based on the above options or others, do you have a preference or feeling towards what might be best? I can create a PR towards addressing this, and defer to your guidance.

Does ConnectorX work with other data formats? We want to keep SQLite support in pycytominer, but we'll eventually want to move away from the format

  • ConnectorX works with mostly RDBMS/database connections (see here).
  • If SQLite is considered a deprecated datasource we may want to warn the user about performance issues therein. We may also want to provide upfront warnings to the user when using the SQLite format more generally, indicating any difference in supportability, perhaps as part of efforts surrounding Add logger #5 .
  • How long do you think pycytominer will support the SQLite data sources? Setting a goal post of when SQLite will be unsupported might help make some of the judgements surrounding what to do presently (through cost vs. benefit estimations, etc).

The peak memory consumption drop is looking great! (down ~35%!) However, the memory leak doesn't appear to be completely solved. Do you anticipate that the SQLite nan fix + ConnectorX is only part of the solution?

  • I believe the SQLite nan fix is only part of the solution as you mention. I feel the findings here do assist with what we might try for merges as Pandas datatype handling can sometimes effect performance.
  • Many discussions and documentation points to in-memory data copies during creation or transformation as a significant issue with Pandas (along with the related garbage collection which may be delayed or non-optimal for certain procedures/systems). I believe that some of this is happening with both the SQLite reads and the merges, and that the SQLite read memory handling could be effecting the merge performance as a result (more memory tasks to take care of, maybe compounding the issue in it's entirety).
  • The chart below from the ConnectorX test seems to show that we gradually accrued memory with the SQLite reads and then experienced a significant spike towards the end of the processing, perhaps once the process reached the final merge(s) or other embedded data transforms within merge_single_cells. Either way, I'm excited to dive into Address Pandas Dataframe Merge and Other Performance Issues #199 more to explore resolutions to the memory challenges with the merges! 👍

image
(this chart was generated from ConnectorX test memray flamegraph by clicking the line chart near the top)

@gwaybio
Copy link
Member

gwaybio commented May 31, 2022

Got it! Here are some responses:

To Converting nans to Nulls

Because there may be others like the one I tested with that include this format, I feel we could add a "post-creation pre-processing" step as you mentioned.

An alternative might be to make releases of these datasets as a "v2.0" or similar, retaining the originals but indicating performance increases within new versions. This might enable batch reprocessing to take place outside of pycytominer and minimize the impact to those using it.

I think this is the way to go. When you get a chance, can you add some version of your python file as a new PR in this repo? I think it should live in cyto_utils and be called sqlite_clean.py. Its API should enable an import (something like: from sqlite_clean import clean_sql_nulls) and we can discuss specific code updates in your PR.

Does this should like a good strategy?

Here's the first two concrete use-cases :)

Dataset Link Repo Stage
Cell Health https://nih.figshare.com/articles/dataset/Cell_Health_-_Cell_Painting_Single_Cell_Profiles/9995672/5 https://github.com/WayScience/cell-health-data Ready for action now!
LINCS Not yet available AWS link, possibly to be migrated to figshare plus https://github.com/broadinstitute/lincs-cell-painting broadinstitute/lincs-cell-painting#84

SQLite support timeline

How long do you think pycytominer will support the SQLite data sources? Setting a goal post of when SQLite will be unsupported might help make some of the judgements surrounding what to do presently (through cost vs. benefit estimations, etc).

My estimation is indefinitely, unfortunately. We might consider moving SQLite support in single_cells to some form of sqlite conversion, and we'll likely want to proceed as follows:

  1. Decide new format
  2. Create conversion functionality
  3. Migrate single_cells support to this new data type
  4. Update tutorials detailing how to handle new and old input data

We may also decide to create this conversion functionality in an entirely new repo (upstream of pycytominer in the workflow)

Scheduling a full SQLite deprecation will require other stakeholders to chime in.

Memory leak solution

Sounds great!! I'm looking forward to hearing about what you discover!

@d33bs
Copy link
Member Author

d33bs commented Jun 1, 2022

Thank you @gwaygenomics !

I think this is the way to go. When you get a chance, can you add some version of your python file as a new PR in this repo? I think it should live in cyto_utils and be called sqlite_clean.py. Its API should enable an import (something like: from sqlite_clean import clean_sql_nulls) and we can discuss specific code updates in your PR.

Does this should like a good strategy?

I like this idea and might expand a bit to include some functionality for detecting challenges with these datasets based on what I found here. More to follow in a PR for these ideas.

@gwaybio
Copy link
Member

gwaybio commented Jun 7, 2023

@d33bs - does SQLiteClean or CytoTable mean we can close this issue?

@d33bs
Copy link
Member Author

d33bs commented Jun 8, 2023

Hi @gwaybio - thanks, I think we could close this issue based on work in SQLite-clean and CytoTable as you mentioned. Maybe we could document expectations somewhere reasonable with user guidance when it comes to large SQLite data handling within Pycytominer? This could be a new issue or it could be a "closing task" for this issue.

@d33bs d33bs moved this to Paused in SET Projects Jun 9, 2023
@gwaybio
Copy link
Member

gwaybio commented Jun 9, 2023

Maybe we could document expectations somewhere reasonable with user guidance when it comes to large SQLite data handling within Pycytominer?

This sounds good. Something along the lines of:

Note: Single-cell processing speed of SQLite files depends on SQLite file size and amount of memory (RAM) on the computational environment machine.

I can add this somewhere and will tag you to review. Also feel free to modify this text here (or in the PR, whichever is useful).

I think this can be the closing task for this issue.

@d33bs
Copy link
Member Author

d33bs commented Jun 10, 2023

Sounds great!

@d33bs d33bs moved this from Paused to Up next in SET Projects Jun 14, 2023
@gwaybio gwaybio moved this to Todo in Version 1 release Jun 14, 2023
@d33bs d33bs moved this from Up next to Paused in SET Projects Jun 21, 2023
@gwaybio gwaybio moved this from Todo to In Progress in Version 1 release Jun 23, 2023
gwaybio added a commit to gwaybio/pycytominer that referenced this issue Jun 23, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Version 1 release Jun 23, 2023
@github-project-automation github-project-automation bot moved this from Paused to Done in SET Projects Jun 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

2 participants