
Improve Snowflake Generate Performance #4587

Merged: 8 commits merged into main on Apr 22, 2024

Conversation

@SteveDMurphy (Contributor) commented Feb 1, 2024

Closes PROD-1639

Description Of Changes

The Snowflake connector has some known out-of-the-box issues, and the suggested workaround carries a performance hit. This PR introduces the ability to surface the schema in parallel to reduce the time required. There is no data or functionality change, just the option to surface the schema across multiple threads.

This surfaced as a problem for larger datasets, where schema generation exceeded the FastAPI timeout (60s by default).

Code Changes

  • Include joblib
  • Separate column discovery into a new function
  • Run the discovery in parallel
  • Scale the number of threads with the number of tables
  • Ensure test coverage
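The steps above can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: `get_fields` and the table names are hypothetical stand-ins for the per-table `SHOW COLUMNS` query against Snowflake.

```python
from joblib import Parallel, delayed

def get_fields(table):
    # hypothetical stand-in for the per-table SHOW COLUMNS query
    return (table, [f"{table}_col{i}" for i in range(3)])

db_tables = [f"table_{i}" for i in range(10)]

# scale the thread count with the number of tables, mirroring this PR's heuristic
number_of_threads = 8 if len(db_tables) > 250 else 4

# the threading backend suits this I/O-bound workload: threads spend their
# time waiting on the database, so the GIL is not a bottleneck
fields = Parallel(n_jobs=number_of_threads, backend="threading")(
    delayed(get_fields)(table) for table in db_tables
)
```

`Parallel` returns results in input order, so the discovered schema lines up with `db_tables` regardless of which thread finished first.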

Steps to Confirm

  • Tested manually with our POCDB in Snowflake, bringing the time down from ~8 seconds to under 3 seconds when using 4 threads

Pre-Merge Checklist

vercel bot commented Feb 1, 2024

The latest updates on your projects.

1 Ignored Deployment
Name: fides-plus-nightly · Status: Ignored · Updated (UTC): Apr 22, 2024 6:52am

cypress bot commented Feb 1, 2024

Passing run #7354

Failures 0 · Passed 4 · Pending 0 · Skipped 0 · Flakiness 0

Details:

Merge 7708cd3 into b6a774b
Project: fides · Commit: 77e56efcc0
Status: Passed · Duration: 00:34
Started: Apr 19, 2024 9:17 PM · Ended: Apr 19, 2024 9:17 PM

Review all test suite changes for PR #4587 ↗︎

codecov bot commented Feb 1, 2024

Codecov Report

Attention: Patch coverage is 25.00000%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 86.59%. Comparing base (b6a774b) to head (7708cd3).

File: src/fides/core/dataset.py · Patch: 25.00% · Lines missing: 6 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4587      +/-   ##
==========================================
- Coverage   86.60%   86.59%   -0.02%     
==========================================
  Files         339      339              
  Lines       20100    20105       +5     
  Branches     2587     2588       +1     
==========================================
+ Hits        17407    17409       +2     
- Misses       2220     2223       +3     
  Partials      473      473              

☔ View full report in Codecov by Sentry.

@SteveDMurphy (Contributor, Author) commented:

I think I may need to add some test coverage, but I was hoping to get an early review to try to get this into an alpha release, if one of you has any time @pattisdr or @adamsachs.

It adds a new dependency (joblib), but one we already have in Fidesplus, so I don't see it as a huge trade-off (but I could be wrong!)

@adamsachs (Contributor) left a comment


nice, lookin really solid @SteveDMurphy! a couple of comments that you can feel free to take or leave, but generally this seems like a great improvement that looks straightforward enough.

i agree that some test coverage here would be very nice, though it may be a bit hard to get the parallelization functionality itself covered in automated tests. do we at least have the overall functionality of this codepath covered in existing tests, i.e. enough to ensure that this change doesn't cause regressions?

```python
    text(f'SHOW COLUMNS IN "{schema}"."{table}"')
)
columns = [row[2] for row in column_cursor]
number_of_threads = 8 if len(db_tables) > 250 else 4
```
@adamsachs (Contributor):

i think it'd be really nice to make this config-driven in some way, if that's not too crazy of a thing to do. it feels like something we'd potentially want to fine tune "on the fly" - either for pure testing or just depending on various different environmental factors. could end up saving a lot of iteration overhead, where we could simply tweak in a deployed env, rather than having to cut new releases to iterate on the values we want here.

curious to hear what you think though - maybe this would be less helpful than i'm imagining!
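For illustration, config-driven tuning of the thread count could look something like this. The `FIDES__GENERATE_THREADS` environment variable name is hypothetical, not an actual Fides setting.

```python
import os

DEFAULT_THREADS = 4

def generate_thread_count(default: int = DEFAULT_THREADS) -> int:
    """Read the thread count from an environment variable (hypothetical
    name), falling back to a default, so the value can be tweaked in a
    deployed env without cutting a new release."""
    raw = os.environ.get("FIDES__GENERATE_THREADS")
    return int(raw) if raw else default
```

A deployment could then export `FIDES__GENERATE_THREADS=8` to iterate on the value on the fly.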

@SteveDMurphy (Contributor, Author):

It felt like a bit of a pickle. I thought about it and strayed away, wanting to manage it without adding further config variables; I didn't like the idea of having to restart/re-deploy to generate a database (likely just one time, and only for Snowflake). If these values didn't work, the next step might be to raise the API timeout, or to run Generate in a separate thread entirely (similar to how we Classify today).

These aren't terribly heavy operations so much as tedious ones, so I was hesitant to add much extra, especially as it feels like there may be something better overall in a minor rewrite or another change specific to handling Snowflake.

@adamsachs (Contributor):

got it, that seems like a totally fair trade off if you don't imagine needing to fine tune the number of threads!

```python
)
columns = [row[2] for row in column_cursor]
number_of_threads = 8 if len(db_tables) > 250 else 4
fields = Parallel(n_jobs=number_of_threads, backend="threading")(
```
@adamsachs (Contributor):

not trying to make things more difficult, but could you use the builtin multiprocessing package to get the same functionality here? (i think just a Pool.map may be able to get you what you need?)

granted - i'm not a python parallelization expert, and i don't know a whole lot about joblib. it also seems pretty lightweight, so i don't think it's necessarily a problem. but i do generally think we should have compelling reasons to stray away from the standard library packages -- maybe there are some here, and i just don't know 'em!
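For reference, the stdlib alternative suggested here might look like the sketch below, using `multiprocessing.pool.ThreadPool` (threads rather than processes, to match the joblib threading backend used in the PR). `get_fields` and the table names are hypothetical stand-ins.

```python
from multiprocessing.pool import ThreadPool

def get_fields(table):
    # hypothetical stand-in for the per-table column query
    return (table, [f"{table}_col"])

db_tables = ["users", "orders", "events"]

# ThreadPool.map blocks until all workers finish and preserves input order
with ThreadPool(processes=4) as pool:
    fields = pool.map(get_fields, db_tables)
```

`ThreadPool` has the same API as `multiprocessing.Pool` but avoids pickling arguments and results, which matters when the worker closes over database connections.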

@SteveDMurphy (Contributor, Author):

to be honest, I knew we used this as a fairly simple way to improve performance in Fidesplus today with Classify, so I considered it relatively low-risk here as well, even though it is introducing something "new" (especially taking the required turn-time component into account)

@SteveDMurphy (Contributor, Author):

Will take a look now to trial multiprocessing! Would be nice to not have to deal with a new addition if we can

@adamsachs (Contributor):

ah ok totally fine if this is tried and true - doesn't seem like a heavyweight dep and yeah it's already in the plus image.

it's always nice to try and be as slim as possible, especially here in OSS where this gets packaged into the python package. but it seems like minimal impact in this case and not worth a big struggle to switch over!

@SteveDMurphy (Contributor, Author):

absolutely, would love to get rid of the requirement in both if possible!

@SteveDMurphy (Contributor, Author) commented:

> i agree that some test coverage here would be very nice, though it may be a bit hard to get the parallelization functionality itself covered in automated tests. do we at least have the overall functionality of this codepath covered in existing tests, i.e. enough to ensure that this change doesn't cause regressions?

I believe we do, in that the same output should match the results after this change. In the end, I think the only functional change is taking the parallelized output and assembling it correctly, so I was going to look at testing that distinctly, but it does feel covered. The current testing path should be redirected to hit both of these steps with Snowflake, and we should get the same output (it's how I tested it manually as well!)
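The regression property described here, that the parallelized path produces exactly the same schema as a serial pass, could be checked with a sketch like this (`get_fields` is again a hypothetical stand-in for column discovery, not the PR's code):

```python
from joblib import Parallel, delayed

def get_fields(table):
    # hypothetical stand-in for Snowflake column discovery
    return [f"{table}.c1", f"{table}.c2"]

tables = ["a", "b", "c"]

serial = [get_fields(t) for t in tables]
parallel = Parallel(n_jobs=4, backend="threading")(
    delayed(get_fields)(t) for t in tables
)

# parallelization must not change the discovered schema or its order
assert serial == parallel
```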

@adamsachs (Contributor) left a comment

some potential tweaks/improvements to look into around the edges, but overall this looks like a solid incremental update and i think it's ready for an alpha/beta deployment 👍

@SteveDMurphy SteveDMurphy marked this pull request as ready for review April 19, 2024 21:14
@SteveDMurphy SteveDMurphy merged commit 2aae779 into main Apr 22, 2024
14 checks passed
@SteveDMurphy SteveDMurphy deleted the SteveDMurphy-prod-1639-snowflake-generate branch April 22, 2024 06:53
SteveDMurphy added a commit that referenced this pull request Apr 22, 2024

2 participants