Improve Snowflake Generate Performance #4587
Conversation
Codecov Report

@@ Coverage Diff @@
## main #4587 +/- ##
==========================================
- Coverage 86.60% 86.59% -0.02%
==========================================
Files 339 339
Lines 20100 20105 +5
Branches 2587 2588 +1
==========================================
+ Hits 17407 17409 +2
- Misses 2220 2223 +3
Partials 473 473
I think I may need to add some test coverage, but I was hoping to get an early review to try and get this on an alpha release if one of you has any time @pattisdr or @adamsachs. It adds a new dependency (joblib).
nice, looking really solid @SteveDMurphy! a couple of comments that you can feel free to take or leave, but generally this seems like a great improvement that looks straightforward enough.
i agree that some test coverage here would be very nice, though it may be a bit hard to get the parallelization functionality itself covered in automated tests. do we at least have the overall functionality of this codepath covered in existing tests, i.e. enough to ensure that this change doesn't cause regressions?
text(f'SHOW COLUMNS IN "{schema}"."{table}"')
)
columns = [row[2] for row in column_cursor]
number_of_threads = 8 if len(db_tables) > 250 else 4
i think it'd be really nice to make this config-driven in some way, if that's not too crazy of a thing to do. it feels like something we'd potentially want to fine tune "on the fly" - either for pure testing or just depending on various different environmental factors. could end up saving a lot of iteration overhead, where we could simply tweak in a deployed env, rather than having to cut new releases to iterate on the values we want here.
curious to hear what you think though - maybe this would be less helpful than i'm imagining!
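For illustration, a minimal sketch of what a config-driven worker count could look like, assuming an environment-variable override (FIDES__GENERATE__NUM_THREADS is an invented name, not an existing fides setting):

import os


def get_generate_thread_count(table_count: int) -> int:
    """Return the worker count, preferring a hypothetical env override."""
    # FIDES__GENERATE__NUM_THREADS is an assumed setting name for this sketch.
    override = os.environ.get("FIDES__GENERATE__NUM_THREADS")
    if override is not None:
        return int(override)
    # Fall back to the heuristic from this PR: more workers for larger schemas.
    return 8 if table_count > 250 else 4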
It felt like a bit of a pickle. I thought about it and strayed away, mostly because I wanted to manage it without adding further config variables. I didn't like the idea of having to restart/re-deploy just to generate a database (likely a one-time operation, and only for Snowflake). If these values didn't work, the next step might be to raise the API timeout or run Generate in a separate thread entirely (similar to how we run Classify today).

These aren't terribly heavy operations so much as tedious ones, so I was hesitant to add much extra, especially as it feels like there may be something better overall in a minor rewrite or other change specific to handling Snowflake.
got it, that seems like a totally fair trade off if you don't imagine needing to fine tune the number of threads!
)
columns = [row[2] for row in column_cursor]
number_of_threads = 8 if len(db_tables) > 250 else 4
fields = Parallel(n_jobs=number_of_threads, backend="threading")(
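Since the diff is truncated here, a self-contained sketch of the joblib threading pattern in question; get_table_fields and db_tables are stand-ins, not the actual fides code:

from joblib import Parallel, delayed


def get_table_fields(table: str) -> list[str]:
    # Stand-in for the per-table SHOW COLUMNS query.
    return [f"{table}.column_a", f"{table}.column_b"]


db_tables = ["users", "orders", "events"]
number_of_threads = 8 if len(db_tables) > 250 else 4

# backend="threading" keeps workers in-process, which suits I/O-bound
# database calls; results come back in the same order as db_tables.
fields = Parallel(n_jobs=number_of_threads, backend="threading")(
    delayed(get_table_fields)(table) for table in db_tables
)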
not trying to make things more difficult, but could you use the builtin multiprocessing package to get the same functionality here? (i think just a Pool.map may be able to get you what you need?)

granted - i'm not a python parallelization expert, and i don't know a whole lot about joblib. it also seems pretty lightweight, so i don't think it's necessarily a problem. but i do generally think we should have compelling reasons to stray away from the standard library packages -- maybe there are some here, and i just don't know 'em!
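For comparison, a rough stdlib equivalent of the joblib call, using multiprocessing.pool.ThreadPool (thread-based, so it mirrors backend="threading"); the helper and table list are the same stand-ins as in the sketch above:

from multiprocessing.pool import ThreadPool


def get_table_fields(table: str) -> list[str]:
    # Stand-in for the per-table SHOW COLUMNS query.
    return [f"{table}.column_a", f"{table}.column_b"]


db_tables = ["users", "orders", "events"]
number_of_threads = 8 if len(db_tables) > 250 else 4

# Pool.map also preserves input order, so downstream code is unchanged.
with ThreadPool(processes=number_of_threads) as pool:
    fields = pool.map(get_table_fields, db_tables)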
to be honest, I knew we used this as a fairly simple way to improve performance in Fidesplus today with Classify, so I considered it relatively low-risk here as well, even though it is introducing something "new" (especially taking the required turn-time component into account)
Will take a look now to trial multiprocessing! Would be nice to not have to deal with a new addition if we can.
ah ok totally fine if this is tried and true - doesn't seem like a heavyweight dep and yeah it's already in the plus image.
it's always nice to try and be as slim as possible, especially here in OSS where this gets packaged into the python package. but it seems like minimal impact in this case and not worth a big struggle to switch over!
absolutely, would love to get rid of the requirement in both if possible!
Co-authored-by: Adam Sachs <[email protected]>
I believe we do, in that the same output should match the results after this change. In the end, I think the only functional change is taking the parallelized output and outputting it correctly, so I was going to look at testing that distinctly, but it does feel covered... The current testing path should be redirected to hit both of these steps with Snowflake, and we should match the same output (it's how I tested it manually as well!)
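A sketch of the kind of regression test described here, asserting the parallelized path matches a serial loop; get_table_fields is the same hypothetical stand-in, not the real fides function:

from joblib import Parallel, delayed


def get_table_fields(table: str) -> list[str]:
    return [f"{table}.column_a", f"{table}.column_b"]


def test_parallel_fields_match_serial() -> None:
    db_tables = ["users", "orders", "events"]
    serial = [get_table_fields(table) for table in db_tables]
    # joblib preserves input order, so the parallel result should be identical.
    parallel = Parallel(n_jobs=4, backend="threading")(
        delayed(get_table_fields)(table) for table in db_tables
    )
    assert parallel == serial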
some potential tweaks/improvements to look into around the edges, but overall this looks like a solid incremental update and i think it's ready for an alpha/beta deployment 👍
…-1639-snowflake-generate
Co-authored-by: Adam Sachs <[email protected]>
Closes PROD-1639
Description Of Changes
The Snowflake connector has some known out-of-the-box issues, and the suggested workaround comes with a performance hit. This PR introduces the ability to surface the schema in parallel to reduce the time required. No data or functionality change, just the option to surface the schema in multiple threads.
This surfaced as a problem for larger datasets, which caused us to hit the FastAPI timeout (60s by default).
Code Changes
Add joblib as a dependency and use it to surface the Snowflake schema in multiple threads
Steps to Confirm
Pre-Merge Checklist
Update CHANGELOG.md