Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2124] [Feature] --no-connection flag for dbt docs generate #6980

Closed
3 tasks done
jaypeedevlin opened this issue Feb 14, 2023 · 18 comments · Fixed by #7202
Closed
3 tasks done

[CT-2124] [Feature] --no-connection flag for dbt docs generate #6980

jaypeedevlin opened this issue Feb 14, 2023 · 18 comments · Fixed by #7202
Labels
enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!

Comments

@jaypeedevlin
Copy link
Contributor

jaypeedevlin commented Feb 14, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Provide an optional flag to be used with doc generation that skips the metadata queries and generates catalog.json based only off what it can get from compilation (and/or the existing manifest.json if --no-compile is also passed).

The problem I am solving for is around enabling the docs to be used locally for exploratory purposes during local development in large projects where building the catalog would otherwise take a long time — in the case of our project building the catalog without compile takes >30 minutes (11k nodes, Snowflake, 8 threads, 2XL).

This problem and solution is actually described exactly in #3947 (comment):

I work with a few teams in the company and, as a developer, I would love to see their DAGs and understand better how they are developing/deploying their environment. How do other teams generate certain metrics, how do they use those metrics, and what kinds of tests do they typically perform across their metrics? I know that some teams use pre-hooks and post-hooks to determine things like row counts before and after runs to ensure things look right. I'd love to be able to explore this in the dbt docs serve format.

I can manually parse through this info on my own but digging through dozens of sql files across multiple folders is... a bit intense. That said, I'm not as interested in the actual data and don't have access anyways.

I understand the huge benefit to having connectivity to the database when generating docs but it would be great if you could put a flag on the command that allows you to forego any benefits of connectivity. Something like dbt docs generate --no_connection so that I could create enough of the docs to view their processes but knowing that I wouldn't have access to some of the metadata that I might get if I didn't use the flag.

The intention would be that it's not recommended to use this flag for building production-quality docs, but it would unlock quick exploration in large projects.

Describe alternatives you've considered

  1. Using dbt Cloud's in-editor DAG
  2. Parsing the manifest manually to create an explorable DAG (maybe this could be bundled as part of a vscode extension 😉)

Who will this benefit?

Developers of large projects.

Are you interested in contributing this feature?

Provided it is within my capabilities, then yes!

Anything else?

I imagine this feature would be (ab)used by many folks who want to build their docs in CI without a DB connection. I'm not sure that this is behaviour we should actually be preventing though, as long as they understand the tradeoffs.

@jaypeedevlin jaypeedevlin added enhancement New feature or request triage labels Feb 14, 2023
@github-actions github-actions bot changed the title [Feature] --no-connection flag for dbt docs generate [CT-2124] [Feature] --no-connection flag for dbt docs generate Feb 14, 2023
@jtcohen6 jtcohen6 self-assigned this Feb 15, 2023
@manugarri
Copy link

+1 to this feature request... there was a similar discussion as back as 2021.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Feb 25, 2023

@jaypeedevlin Thanks for opening! This feels totally reasonable to me. I agree with your take that:

  • We could be clear about the intended use case
  • Even if someone were to misuse this, I don't think it's something we should actively prevent, as much as discourage / make sure we clearly communicate the trade-offs

Implementations

I believe such a flag could conditionally wrap the logic here in the GenerateTask. For example:

        # this should go at the top of the file
        import agate
        
        ...
        
        # initialize empty typed objects, based on return types of get_catalog
        catalog_table: agate.Table = agate.Table([])
        exceptions: List[Exception] = []

        if not self.args.no_connection: # or whatever we call this -- see below
            adapter = get_adapter(self.config)
            with adapter.connection_named("generate_catalog"):
                fire_event(BuildingCatalog())
                catalog_table, exceptions = adapter.get_catalog(self.manifest)
        
        # otherwise, use the empty objects, and proceed

What should the flag be called?

  • --no-connection
  • --no-catalog, --empty-catalog, --skip-catalog-generation
  • ...?

Then:

$ dbt docs generate --no-compile --no-catalog

Would produce a manifest.json without any "compiled" SQL (since that could require introspective database queries), and without any catalog entries pulled from database metadata.

Going to mark this as a good first issue!


Incidentally, I think we could also solve for this via:

$ dbt docs generate --exclude fqn:* source:*

Just sharing as an observation — I don't think that's a reason to not also offer this flag!


In the meantime, here's a fun hack, as a treat:

# write manifest.json -- starting in v1.5, no need to pass this flag, 'parse' always writes
$ dbt parse --write-manifest

# write an empty catalog.json
$ echo '{"metadata": {}, "nodes": {}, "sources": {}, "errors": null}' > target/catalog.json

$ dbt docs serve

Okay, I realize the last step required an Internet connection — but otherwise it was a totally didn't! You could even pull down whatever docs site you wanted to use, instead of the one distributed by default with dbt-core, if you were someone who wanted to build/use your own :)

@jtcohen6 jtcohen6 added good_first_issue Straightforward + self-contained changes, good for new contributors! Team:Execution and removed triage labels Feb 25, 2023
@jtcohen6 jtcohen6 removed their assignment Feb 25, 2023
@manugarri
Copy link

FYI, providing the ability to bypass connection would also be very helpful for linting with sqlfluff, right now, in order to lint and test the sql/macros before actually running them it requires an active connection (which is an antipattern in my opinion) sqlfluff/sqlfluff#4397

@jaypeedevlin
Copy link
Contributor Author

@manugarri this issue is about generating artifacts for docs, not for compilation.

At the risk of derailing this issue, the reason dbt needs a connection to compile is because jinja-sql supports functions like run_query() which call to the database at compile-time. Even if there were a mechanism for compilation without a connection, sqlfluff would not be able to support projects with this (reasonably frequent) pattern.

Disclosure, I don't work for dbt Labs and this is just my personal opinion

@manugarri
Copy link

@jaypeedevlin thanks for the clarification, I would argue that just because we want to support run_query (which might or might not be used in individual dbt projects) we are forcing everyone to setup a connection on their ci/cd. IMHO it should be the other way around, optional connection, and if a user wants to use something like run_query then that subset of users can choose to use a connection.

@jaypeedevlin
Copy link
Contributor Author

I would encourage you to open a issue (or maybe a discussion is more appropriate) around this and/or find an existing one to continue the conversation!

@AndyBys
Copy link
Contributor

AndyBys commented Mar 21, 2023

Hey there!

Unsure if issue still high in demand, but I've made a linked draft with changes discussed here. Let me know if it makes sense and I'll carry on.

@culpgrant
Copy link

I would be interested in this

@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 21, 2023

@manugarri @jaypeedevlin Regarding the other conversation, about compiling models without an active connection, this will be newly supported in v1.5:

$ dbt compile --no-introspect

Note that this will fail if you have run_query in your code:

$ dbt compile --no-introspect
17:38:54  Running with dbt=1.5.0-b4
...
dbt.exceptions.DbtRuntimeError: Runtime Error
  Runtime Error in model my_model (models/my_model.sql)
    connection already closed

Not very graceful, but it's a first cut. What really happened is, the connection was never opened!

@jaypeedevlin
Copy link
Contributor Author

@AndyBys I haven't started work on this yet so consider it yours!

@manugarri
Copy link

@jtcohen6 great! does that mean we will be able to generate docs without an active connection in the future?

@timle2
Copy link
Contributor

timle2 commented Mar 29, 2023

Big plus one to this.
Here's some more info on how it would be helpful to us. Particularly as maintainers of a huuuge dbt project.
We have a 400+ materialized models in our project. At times, our warehouse slows down (it's a shared resourced across the company, not just our team), meaning docs generation can take 10-20 minutes on really slow days (yes really). We do docs generation as part of our CICD process (and ship out the docs to a server), which means catalog generation can become a massive bottleneck.
For some idea on how I'd use this: try classic catalog generation, time out after a few minutes (warehouse congested), and fall back to this option. Right now we time out after 10 minutes and just accept that the docs will be stale. In fact, we may even do all doc generation as --no-connection just to reduce DB traffic on builds for larger projects and be ok with the loss of fidelity as a trade off for not spamming our db with 400+ connections several times a day.

@dwolfeu
Copy link

dwolfeu commented Apr 5, 2023

If this goes ahead, then dbt docs generate and dbt docs serve should no longer require a profile if --no-connection is specified, right? This is now the case for dbt deps (see this PR).

@jtcohen6
Copy link
Contributor

jtcohen6 commented Apr 5, 2023

@dwolfeu A valid profile is still required for project parsing today, because users may access values of {{ target }} within their configuration logic, and within macros such as generate_schema_name that are resolved at pares time. We had some initial conversation about what would be required to support parsing without a profile/adapter loaded:

@dwolfeu
Copy link

dwolfeu commented Apr 5, 2023

@jtcohen6 Thank you for your quick reply! From the perspective of an end user (who knows nothing of the inner workings of dbt-core): We have an image that fetches the manifest and project files from external storage and then serves the dbt docs. Having to provide a profile/connect to the database seems conceptually unnecessary and can be problematic, for instance if one wants to avoid network traffic.* Perhaps it would be possible to allow dbt docs to run without a profile and throw an error if something like {{ target }} is referenced?

*We are not the only ones: see this workaround, for instance.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Apr 5, 2023

@dwolfeu To clarify my message above: dbt docs generate will still require a connection profile, in order to access the static configuration supplied therein ({{ target }} values) — but if you were to pass both --no-compile and --empty-catalog (as proposed in #7202), dbt would not actually connect to the database and run any queries. You wouldn't even need an Internet connection.

@dwolfeu
Copy link

dwolfeu commented Apr 5, 2023

@jtcohen6 We are using this dummy profiles.yml as a workaround:

profile-name:
  target: docs
  outputs:
    docs:
      type: postgres
      host: foo
      database: foo
      schema: foo
      port: 00
      threads: 1
      user: foo
      password: foo

It works, but it's hacky. It would be nicer not to have to include a profile at all.

Some context (so you don't think I'm crazy!): We of course have a "real" profiles.yml in our main dbt repo, but we don't want to include it in our other repo whose sole purpose is to generate the docs coming from the main repo (DRY and all that).

@ntn-rjdn
Copy link

Is this --no-connection flag available in the latest version of dbt-core? I couldn't get it work. I'm using dbt-databricks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants