-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snowflake catalog/manifest support (#832) #849
Conversation
…un + dbt docs generate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plugged this into the (presently unreleased) docs site, and the catalog worked really well :)
I think we'll probably want to be more aggressive about filtering the catalog, but keen to discuss in the comments.
|
||
from information_schema.tables | ||
|
||
union |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're in the habit of using union all
instead of union
. The only difference is that union all
preserves duplicates. I don't think that's an issue here, but when I see union
, I think "dupes are specifically not wanted here".
Via: https://github.com/fishtown-analytics/corp/blob/master/dbt_coding_conventions.md#style-guide
dbt/adapters/default/impl.py
Outdated
results = cls.run_operation(profile, project_cfg, manifest, | ||
GET_CATALOG_OPERATION_NAME, | ||
GET_CATALOG_RESULT_KEY) | ||
schemas = cls.get_existing_schemas(profile, project_cfg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The get_existing_schemas
function just returns all of the schemas in the database. I think that means that the filter on the line below doesn't actually do anything! I think we probably don't want to index every single schema in the db, just insofar as it could be a lot of data.
I imagined the catalog only being generated for the dbt-managed schemas in the database. There's an example of logic like that in the snippet over here: #830
schemas = list({node.to_dict()['schema'] for node in manifest.nodes.values()})
I think eventually, we'll want to allow users to turn on catalog generation for non-db schemas, but we're not quite there yet. Do you agree @cmcarthur?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I just migrated it into the default adapter since it isn't really intended to be postgres-specific and didn't want to make any operational changes.
Does that mean the catalog shouldn't include stuff loaded by dbt seed
, or does that count as managed by dbt?
Either way, we will want to make bigquery consistent so I'll have to change one of these!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, totally, I think this is something I missed in the PR that originally added the catalog code for Postgres.
I think it makes sense to include seed schemas in the catalog too. Seeds, like models, can be totally rebuilt from scratch, so I consider them "dbt managed" also. Further, seeds are nodes
, so their schemas will be plucked by the code sample above too.
I think that just leaves out archives. Archives are stateful, and definitely cannot be rebuilt from scratch if they are dropped, so we recommend that folks put them in totally different schemas than their models & seeds. Even further: we consider archived tables to be "source data" in that archives don't participate in the graph and cannot be ref
d. We should someday normalize their behavior, but I think it's ok to leave them out of the catalog for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, that really makes things a bit nicer anyway in my opinion, we don't have to re-generate the same data already in the manifest. 👍
Implement catalog and manifest creation for snowflake, and created/improved some tests. Implements #832.