Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1767] [Feature] dbt parse should return a manifest #6547

Closed
Tracked by #7162 ...
jtcohen6 opened this issue Jan 8, 2023 · 7 comments · Fixed by #7314
Closed
Tracked by #7162 ...

[CT-1767] [Feature] dbt parse should return a manifest #6547

jtcohen6 opened this issue Jan 8, 2023 · 7 comments · Fixed by #7314
Assignees
Labels
enhancement New feature or request python_api Issues related to dbtRunner Python entry point

Comments

@jtcohen6
Copy link
Contributor

jtcohen6 commented Jan 8, 2023

Prerequisite: dbt parse supported as top-level command in new API (#5550)

dbt parse should return the Manifest as its result. This makes it clearer that the output of parsing is the manifest*, and it would also enable programmatic users to do handy things like:

# just for type hint below
from dbt.contracts.graph.manifest import Manifest

# initialize
from dbt.cli.main import dbtRunner
dbt = dbtRunner()

# 'parse' command
results, success = dbt.invoke(['parse'])

# hand waving this part for now -- we'd want 'results' to have a consistent type for all 'invoke' calls
manifest: Manifest = results...['manifest']

# now I can programmatically inspect the project contents!
# pseudo code: assert every non-ephemeral model has an "owner"
for model in [n for n in manifest.nodes if n.resource_type == 'model']:
    if model.materialized != 'ephemeral':
        assert (model.meta['owner'] is not None)

*There are caveats to this statement, of course—the manifest is modified during compile, and technically there is other config resolution (profile, project, packages) that happens during parsing and is not included in the manifest—but it still holds 85% true today, and I'd like to see us getting to a place where it's 95% true.

Current behavior

Currently, dbt parse returns None as its results.

It's possible to call the (undocumented, internal) get_full_manifest method directly and return a manifest. This is what dbt.lib.parse_to_manifest does:

dbt-core/core/dbt/lib.py

Lines 186 to 189 in 94d6d19

def parse_to_manifest(config):
from dbt.parser.manifest import ManifestLoader
return ManifestLoader.get_full_manifest(config)

The proposal in this issue is to expose that capability within dbt-core's documented, contracted, top-level API.

Motivation

Programmatic, read-only access to project manifests, as part of dbt-core's initial public Python API + "library" capabilities.

This would make it quite easy to write "project quality" rules, that run in pre-commit or CI. Because this would be the actual Manifest Python object, it would also makes available "helper" methods specific to each node class, which are lost during serialization to flat dictionaries (Jinja model + graph variables) and JSON (manifest.json).

With that great power, comes great responsibility: we'd need to be very clear about which classes, attributes, & methods can be considered public versus private. (To date, all dbt-core Python internals are considered private, and any direct use of them is considered undocumented & un-guaranteed functionality.) Is this an API we'd be ready & willing to stand behind? Related: #6391. If we're saying that the real contract is a proto class, and any Python dataclasses used to generate those data structures are implementation details (liable to change) rather than contracted APIs in & of themselves, that's something we'd need to document very clearly.

Alternatives

  • Parsing manifest.json after it's written by dbt parse (see [CT-1759] dbt parse should (over)write manifest.json by default #6534) — with the additional cost of writing to / reading from disk, and without the helper methods described above
  • Writing these rules in Jinja today. Because it's Jinja, it's gnarlier to implement & harder to test, but it does have the appeal of keeping it all in "dbt code"—which also means it can be installed as a dbt package, and run in dbt Cloud—without requiring additional Python scripting / infrastructure to run. The dbt Labs ProServ team has shown it's possible: https://github.com/dbt-labs/dbt-project-evaluator
@jtcohen6 jtcohen6 added enhancement New feature or request python_api Issues related to dbtRunner Python entry point Team:Execution labels Jan 8, 2023
@github-actions github-actions bot changed the title [Feature] dbt parse should return a manifest [CT-1767] [Feature] dbt parse should return a manifest Jan 8, 2023
@jtcohen6
Copy link
Contributor Author

An alternative would be to return just Manifest.flat_graph, which serializes everything to vanilla data structures (lists/dicts), and more closely matches the existing contents of manifest.json and the Jinja {{ graph }} context variable.

The downside is, we're not giving users access to the ergonomics of our built-in classes/methods. It's really no different from reading in manifest.json, just without the extra step of needing to write/read a file.

The upside would be, we wouldn't be exposing any of the internals that we might not be comfortable committing to maintaining as a public interface.

@jtcohen6
Copy link
Contributor Author

We're already a bit inconsistent as far as the return type of various commands. The "building" commands (run, test, snapshot, seed, build) all return RunExecutionResult (→ run_results.json), but not others:

  • source freshness returns a FreshnessResult (→ sources.json)
  • docs generate returns a CatalogArtifact (→ catalog.json)
>>> from dbt.cli.main import dbtRunner
>>> dbt = dbtRunner()
>>> results, success = dbt.invoke(['docs', 'generate'])
07:44:09  Found 1 model, 0 tests, 0 snapshots, 0 analyses, 401 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
07:44:09
07:44:09  Concurrency: 1 threads (target='dev')
07:44:09
07:44:09  Done.
07:44:09  Building catalog
07:44:09  Catalog written to /Users/jerco/dev/scratch/testy/target/catalog.json
>>> type(results)
<class 'dbt.contracts.results.CatalogArtifact'>

@jtcohen6
Copy link
Contributor Author

jtcohen6 commented Mar 13, 2023

The dbt parse command should return a Manifest to unblock shorter-term needs. Let's wrap the Manifest in a ParseResult, and explicitly say that we're not guaranteeing any of its underlying methods or anything about its structure in the future.

Out of scope (but worth doing later): creating a base result class, and having all commands return a subclass of that base class.

@iknox-fa
Copy link
Contributor

iknox-fa commented Apr 10, 2023

@jtcohen6 I'm a little confused by the AC for this ticket. It's easy enough to return the manifest as generated in pre-flight (here's the PR to do so), but...

Let's wrap the Manifest in a ParseResult

ParseResult hasn't been a thing for a while. The best I can tell, it was removed during the partial parsing work about 2 years ago. Do you want to bring it back? Build something else?

creating a base result class, and having all commands return a subclass of that base class.

What would be the point in that, exactly? As pointed out, we're already not at all consistent in what results means across tasks.

we'd need to be very clear about which classes, attributes, & methods can be considered public versus private. (To date, all dbt-core Python internals are considered private, and any direct use of them is considered undocumented & un-guaranteed functionality.) Is this an API we'd be ready & willing to stand behind?

All great questions, but there's zero AC to guide decision-making here.

@jtcohen6
Copy link
Contributor Author

jtcohen6 commented Apr 10, 2023

@iknox-fa The comment above was from live notes during a scoping/estimation meeting with @ChenyuLInx @aranke last month, where folks had different opinions.

Goals:

  • return a Manifest object
  • make it easy to save & reuse that object
  • communicate that the Manifest object is not fully contracted, and that the return signature of parse may change in the future

To be honest, I'm perfectly fine with just returning a Manifest object directly (as you have it in the draft PR currently), and noting in our documentation that:

  • different commands return different types
  • updating the type hint for the results here to be Optional[Any] instead of Optional[List] (the wrapper will change from a Tuple to a typed object in Handle internal exceptions in postflight #7242)
  • we're not contracting the Manifest object any more than we already have — caveat usor!

@iknox-fa
Copy link
Contributor

@jtcohen6
Gotcha-- I'll move that PR from draft to open and make one for the docs site for communicating the changes.

@jtcohen6
Copy link
Contributor Author

jtcohen6 commented Apr 10, 2023

Re: docs site: I think it will make sense to include as part of dbt-labs/docs.getdbt.com#3118

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python_api Issues related to dbtRunner Python entry point
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants