
Improve Snowflake Generate Performance #4587

Merged: 8 commits merged into main on Apr 22, 2024

Conversation

@SteveDMurphy (Contributor) commented Feb 1, 2024

Closes PROD-1639

Description Of Changes

The Snowflake connector has some known out-of-the-box issues, and the suggested workaround carries a performance hit. This PR introduces the ability to surface the schema in parallel to reduce the time required. There is no data or functionality change, just the option to surface the schema across multiple threads.

This surfaced as a problem for larger datasets, where schema generation exceeded the FastAPI timeout (60s by default).

Code Changes

  • Include joblib
  • Separate column discovery into a new function
  • Run the discovery in parallel
  • Scale the number of threads with the number of tables
  • Ensure test coverage
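The steps above can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: `get_fields` and the table names are hypothetical stand-ins for the per-table `SHOW COLUMNS` query against Snowflake.

```python
from joblib import Parallel, delayed

def get_fields(table):
    # hypothetical stand-in for the per-table SHOW COLUMNS query
    return (table, [f"{table}_col{i}" for i in range(3)])

db_tables = [f"table_{i}" for i in range(10)]

# scale the thread count with the number of tables, mirroring this PR's heuristic
number_of_threads = 8 if len(db_tables) > 250 else 4

# the threading backend suits this I/O-bound workload: threads spend their
# time waiting on the database, so the GIL is not a bottleneck
fields = Parallel(n_jobs=number_of_threads, backend="threading")(
    delayed(get_fields)(table) for table in db_tables
)
```

`Parallel` returns results in input order, so the discovered schema lines up with `db_tables` regardless of which thread finished first.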

Steps to Confirm

  • Tested manually with our POCDB in Snowflake, bringing the time down from ~8 seconds to under 3 seconds when using 4 threads

Pre-Merge Checklist

vercel bot commented Feb 1, 2024

The latest updates on your projects.

1 Ignored Deployment
Name: fides-plus-nightly · Status: Ignored · Updated (UTC): Apr 22, 2024 6:52am

cypress bot commented Feb 1, 2024

Passing run #7354

Failures 0 · Passed 4 · Pending 0 · Skipped 0 · Flakiness 0

Details:

Merge 7708cd3 into b6a774b
Project: fides · Commit: 77e56efcc0
Status: Passed · Duration: 00:34
Started: Apr 19, 2024 9:17 PM · Ended: Apr 19, 2024 9:17 PM

Review all test suite changes for PR #4587 ↗︎

codecov bot commented Feb 1, 2024

Codecov Report

Attention: Patch coverage is 25.00000%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 86.59%. Comparing base (b6a774b) to head (7708cd3).

File: src/fides/core/dataset.py · Patch: 25.00% · Lines missing: 6 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4587      +/-   ##
==========================================
- Coverage   86.60%   86.59%   -0.02%     
==========================================
  Files         339      339              
  Lines       20100    20105       +5     
  Branches     2587     2588       +1     
==========================================
+ Hits        17407    17409       +2     
- Misses       2220     2223       +3     
  Partials      473      473              

☔ View full report in Codecov by Sentry.

@SteveDMurphy (Contributor, Author) commented:

I think I may need to add some test coverage, but I was hoping to get an early review to try to get this into an alpha release, if one of you has any time @pattisdr or @adamsachs.

It adds a new dependency (joblib), but one we already have in Fidesplus, so I don't see it as a huge trade-off (but I could be wrong!)

@adamsachs (Contributor) left a comment


nice, lookin really solid @SteveDMurphy! a couple of comments that you can feel free to take or leave, but generally this seems like a great improvement that looks straightforward enough.

i agree that some test coverage here would be very nice, though it may be a bit hard to get the parallelization functionality itself covered in automated tests. do we at least have the overall functionality of this codepath covered in existing tests, i.e. enough to ensure that this change doesn't cause regressions?

```python
    text(f'SHOW COLUMNS IN "{schema}"."{table}"')
)
columns = [row[2] for row in column_cursor]
number_of_threads = 8 if len(db_tables) > 250 else 4
```
@adamsachs (Contributor):

i think it'd be really nice to make this config-driven in some way, if that's not too crazy of a thing to do. it feels like something we'd potentially want to fine tune "on the fly" - either for pure testing or just depending on various different environmental factors. could end up saving a lot of iteration overhead, where we could simply tweak in a deployed env, rather than having to cut new releases to iterate on the values we want here.

curious to hear what you think though - maybe this would be less helpful than i'm imagining!
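For illustration, config-driven tuning of the thread count could look something like this. The `FIDES__GENERATE_THREADS` environment variable name is hypothetical, not an actual Fides setting.

```python
import os

DEFAULT_THREADS = 4

def generate_thread_count(default: int = DEFAULT_THREADS) -> int:
    """Read the thread count from an environment variable (hypothetical
    name), falling back to a default, so the value can be tweaked in a
    deployed env without cutting a new release."""
    raw = os.environ.get("FIDES__GENERATE_THREADS")
    return int(raw) if raw else default
```

A deployment could then export `FIDES__GENERATE_THREADS=8` to iterate on the value on the fly.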

@SteveDMurphy (Contributor, Author):

It felt like a bit of a pickle. I thought about it and strayed away, wanting to manage it without adding further config variables; I didn't like the idea of having to restart/re-deploy to generate a database (likely just one time, and only for Snowflake). If these values didn't work, the next step might be to raise the API timeout, or to run Generate in a separate thread entirely (similar to how we Classify today).

These aren't terribly heavy operations so much as tedious ones, so I was hesitant to add much extra, especially as it feels like there may be something better overall in a minor rewrite or another change specific to handling Snowflake.

@adamsachs (Contributor):

got it, that seems like a totally fair trade off if you don't imagine needing to fine tune the number of threads!

```python
)
columns = [row[2] for row in column_cursor]
number_of_threads = 8 if len(db_tables) > 250 else 4
fields = Parallel(n_jobs=number_of_threads, backend="threading")(
```
@adamsachs (Contributor):

not trying to make things more difficult, but could you use the builtin multiprocessing package to get the same functionality here? (i think just a Pool.map may be able to get you what you need?)

granted - i'm not a python parallelization expert, and i don't know a whole lot about joblib. it also seems pretty lightweight, so i don't think it's necessarily a problem. but i do generally think we should have compelling reasons to stray away from the standard library packages -- maybe there are some here, and i just don't know 'em!
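For reference, the stdlib alternative suggested here might look like the sketch below, using `multiprocessing.pool.ThreadPool` (threads rather than processes, to match the joblib threading backend used in the PR). `get_fields` and the table names are hypothetical stand-ins.

```python
from multiprocessing.pool import ThreadPool

def get_fields(table):
    # hypothetical stand-in for the per-table column query
    return (table, [f"{table}_col"])

db_tables = ["users", "orders", "events"]

# ThreadPool.map blocks until all workers finish and preserves input order
with ThreadPool(processes=4) as pool:
    fields = pool.map(get_fields, db_tables)
```

`ThreadPool` has the same API as `multiprocessing.Pool` but avoids pickling arguments and results, which matters when the worker closes over database connections.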

@SteveDMurphy (Contributor, Author):

to be honest, I knew we used this as a fairly simple way to improve performance in Fidesplus today with Classify, so I considered it relatively low-risk here as well, even though it is introducing something "new" (especially taking the required turn-time component into account)

@SteveDMurphy (Contributor, Author):

Will take a look now to trial multiprocessing! Would be nice to not have to deal with a new addition if we can

@adamsachs (Contributor):

ah ok totally fine if this is tried and true - doesn't seem like a heavyweight dep and yeah it's already in the plus image.

it's always nice to try and be as slim as possible, especially here in OSS where this gets packaged into the python package. but it seems like minimal impact in this case and not worth a big struggle to switch over!

@SteveDMurphy (Contributor, Author):

absolutely, would love to get rid of the requirement in both if possible!

@SteveDMurphy (Contributor, Author) commented:

> i agree that some test coverage here would be very nice, though it may be a bit hard to get the parallelization functionality itself covered in automated tests. do we at least have the overall functionality of this codepath covered in existing tests, i.e. enough to ensure that this change doesn't cause regressions?

I believe we do, in that the same output should match the results after this change. In the end, I think the only functional change is taking the parallelized output and assembling it correctly, so I was going to look at testing that distinctly, but it does feel covered. The current testing path should be redirected to hit both of these steps with Snowflake, and we should get the same output (it's how I tested it manually as well!)
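The regression property described here, that the parallelized path produces exactly the same schema as a serial pass, could be checked with a sketch like this (`get_fields` is again a hypothetical stand-in for column discovery, not the PR's code):

```python
from joblib import Parallel, delayed

def get_fields(table):
    # hypothetical stand-in for Snowflake column discovery
    return [f"{table}.c1", f"{table}.c2"]

tables = ["a", "b", "c"]

serial = [get_fields(t) for t in tables]
parallel = Parallel(n_jobs=4, backend="threading")(
    delayed(get_fields)(t) for t in tables
)

# parallelization must not change the discovered schema or its order
assert serial == parallel
```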

@adamsachs (Contributor) left a comment

some potential tweaks/improvements to look into around the edges, but overall this looks like a solid incremental update and i think it's ready for an alpha/beta deployment 👍

@SteveDMurphy SteveDMurphy marked this pull request as ready for review April 19, 2024 21:14
@SteveDMurphy SteveDMurphy merged commit 2aae779 into main Apr 22, 2024
14 checks passed
@SteveDMurphy SteveDMurphy deleted the SteveDMurphy-prod-1639-snowflake-generate branch April 22, 2024 06:53
SteveDMurphy added a commit that referenced this pull request Apr 22, 2024

2 participants