Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix run repeatability/caching issues (#1138, #1139, #1140) #1144

Merged
merged 7 commits into from
Nov 20, 2018

Conversation

beckjake
Copy link
Contributor

This fixes a number of issues discovered in 0.12.1. Two are sinter-only issues, and one is a kind of egregious (but in practice mostly benign) bug around case comparisons that almost entirely manifests in snowflake.

I had a heck of a time reproducing user reports, and the new integration tests have to really dig into the cache a bit more than I'd like, but I do believe this fixes #1140 and the tests will catch any regressions.

Basically, cache dict keys are now always lowercased, and access to fields from the inner Relation are also lowercased, but the underlying Relation representation is not.

One thing that's a bit weird is that added nodes now show up in list_relations as their inserted text (though with proper quoting fields). dbt seems to handle that well and the tests are okay, but it is a potential source of issues for things that inspect list_relations output. I don't see a way around that without trying to guess what the db would do to your relation names, which we discussed in slack and decided is not a great plan.

@beckjake beckjake requested a review from drewbanin November 19, 2018 17:00
Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this out a bunch on Snowflake and Redshift. Everything worked as expected, and the logs (with --log-cache-events) all checked out.

Just for my understanding, you mentioned:

One thing that's a bit weird is that added nodes now show up in list_relations as their inserted text (though with proper quoting fields).

Can you give me an example of what that would look like in practice? I think it's ok, but just want to confirm.

@beckjake
Copy link
Contributor Author

beckjake commented Nov 20, 2018

Sure! The easy example is on snowflake so I'll use that. Imagine you have models/model.sql and your schema is analytics. In snowflake, with quoting set to False and assuming you haven't set the server-side flag that defaults things to lowercase, the name of the relation will become ANALYTICS.MODEL on the database end when it's created.

Calling _list_relations() would then come back with schema="ANALYTICS", identifier="MODEL" with quoting set to True for schema/identifier on the resulting relation.

With this cache, when you call list_relations() and have a cache hit, you'll get schema="analytics", identifier="model" and quoting set to False. The old behavior is more precise, but the new behavior is as accurate as we can be without going out to the database and actually re-querying.

@drewbanin
Copy link
Contributor

drewbanin commented Nov 20, 2018

ok, thanks! That's really helpful. I think dbt has to work this way. It would be wholly unreasonable for us to predict the exact casing that Snowflake will pick. The user might set quoted_identifier_ignore_case to TRUE, for example, and checking before each cache entry would defeat the purpose of caching. This is great, 🚢 it

@beckjake beckjake merged commit b5aab26 into dev/grace-kelly Nov 20, 2018
@beckjake beckjake deleted the fix/run-repeatability-issues branch November 20, 2018 02:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants