Selection method based on source freshness: new `max_loaded_at`, new data #4050
Comments
Wow, smart freshness runs. Now this answers a question I get all the time: "How do I get dbt jobs to run based on data freshness and not generically process everything?"
@anaisvaillant and I want to work on this together 🧃 @jtcohen6, let us know if this is dependent on other functionality before we start! I noticed you explicitly mentioned having the one selector to rule them all. Is that something you'd want baked in for testing, or can it wait?
Is this a good idea? A bad one? A huge step in the wrong direction? I'm open to discussion, and also experimentation. I included the big harebrained idea as a motivation for this line of thinking; it can definitely wait. @sungchun12 @anaisvaillant In that spirit, I'd be thrilled if you two want to play around with this. I imagine you could reuse some similar logic from #4016, though you'll need to find ways to handle multiple `sources.json` artifacts. @barryaron @drewbanin I'd be especially interested in your thoughts here :)
Hi Jérémy! Here is a really good paper on the matter: Data-Vault-Implementation-and-Automation-Consistency-and-Referential-Integrity by Roelant Vos, a very, very skilled data vault consultant. He wrote a paper about dbt. Did you already hear about him? His notion of "eventual consistency" is really interesting, and I wonder how it could apply to dbt, with its current threading feature, not to mention my dream of a multi-processed, partitioned DAG. In the data vault world, the timestamp at which the data was first loaded into the DWH is a first-class citizen.
This is a great idea, and I'll expand on your original description with how I see it live and breathe in practice.

Problem 1
Regarding Benn's Tweet, latency guarantees based on models to maintain (think: exposures) is a great working philosophy. Building this feature seems to combat this vision at a surface level. However, this feature goes a step further in vision: latency is defined by ALL models in scope (with freshness requirements), not only those we manually scope downstream. If we work backwards from exposures, it will require at least one more step in the above workflow. I don't see value in an extra step to make some data fresh when ALL data can be fresh just as easily, with less mental overhead.

"I want all my data to be as fresh as possible. I shouldn't need all these mental models and over-engineered project setups to feel confident of that." - Future data person

I'm excited for a future where events-based data experiences are the norm and not the exception. For your open technical implementation questions, those are for future Sung and Anais to figure out in the planning process :)
Completely agree with @sungchun12 here. This would be awesome, @jtcohen6.
Currently we are not able to do these things with dbt, and have therefore created all kinds of complex and expensive macros for checking source freshness state. A solution like the one @sungchun12 describes would be great!
I love this idea; it will be so great.
From reading the dbt documentation, I was under the impression that this functionality was already implemented. But maybe it's not, as I'm unable to get it working (I modify a source table, but it's not picked up as modified). It's totally possible I'm misunderstanding something here. Thanks for any clarity anyone can provide!
This is saying that if you change the source freshness config of a source, that is considered a modification, and all dependent models will run, starting with everything that uses that table as a source. This issue's suggestion is to only run dependencies that have new records, as determined by calculating source freshness.
Thank you Garrett, that makes sense. It seems there's an ambiguous use of "freshness property" in the documentation. The issue's suggestion is exactly the functionality I'm interested in.
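(As an aside for anyone hitting the same confusion: the sketch below, with a placeholder path, is one way to see what the documented `state:modified` behavior actually selects after such a config change.)

```shell
# Editing a source's freshness config counts as a "modification"; this lists everything
# that state:modified+ would select relative to a previous run's artifacts
dbt ls --select state:modified+ --state path/to/previous-artifacts
```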
See also: #2465 (comment), #2743 (comment), #3804 (comment), #3862 (comment), Benn's tweet
Use the `sources.json` artifact to determine which sources have new data since the last dbt build. Imagine: a selection method that picks out source tables with newly loaded data, plus everything downstream of them.
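For instance, something along these lines; the selector name on the second command is purely illustrative, since a freshness-aware selection method is exactly what this issue proposes, and the spelling of it is up for grabs:

```shell
# Run freshness checks; this writes target/sources.json, including max_loaded_at per source table
dbt source freshness

# Hypothetical: build only resources downstream of sources whose max_loaded_at advanced,
# comparing target/sources.json against the artifacts of a previous run in --state.
# "source_status:fresher" is a placeholder name, not a settled interface.
dbt build --select source_status:fresher+ --state path/to/previous-artifacts
```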
This would require handling multiple `sources.json` artifacts:
- one in the `--state` directory
- one in the `target/` directory

Then, dbt would compare the high watermarks recorded in each, and select all source tables with a greater/changed `max_loaded_at`.
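As a rough sketch of that comparison from outside of dbt, assuming both artifacts follow the `sources.json` layout, with source tables listed under `results` and each carrying a `unique_id` and `max_loaded_at`:

```shell
# Extract "unique_id max_loaded_at" pairs from the previous run's artifact (the --state
# directory) and from the current run's artifact (target/)
jq -r '.results[] | "\(.unique_id) \(.max_loaded_at)"' previous-artifacts/sources.json | sort > previous.txt
jq -r '.results[] | "\(.unique_id) \(.max_loaded_at)"' target/sources.json | sort > current.txt

# Lines that appear only in current.txt belong to source tables whose watermark is new or
# changed; the first field is the unique_id of a source table with new data
comm -13 previous.txt current.txt | cut -d' ' -f1
```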
Questions

- Should we include the `freshness` task within `dbt build`? I think it would be tricky to check source freshness, perform dynamic selection, and build those now-selected resources, all in one invocation. It would require blurring lines that are pretty well delineated today (selection → execution → artifacts).
- Should we fold `sources.json` into `run_results.json`? It seems like the separation would actually be a plus, for the sake of this feature, but I'm sure we could manage either way.
- Once we know that `source:stripe` and `source:salesforce.account` have new data, it's easy enough to then define a job with `dbt build -s source:stripe+ source:salesforce.account+`.
Big idea

Coupled with `state:modified` and `result:<status>` (#4017), you could dynamically select the part of your DAG that's relevant by comparing the current state of your sources/project and the artifacts from your last run. Schedule that to run every 5 minutes, set up solid Slack notifications on failures, and put on a good album.
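As a hedged illustration, that scheduled job could boil down to a couple of commands like the ones below; the selector names use existing syntax, the paths are placeholders, and a freshness-based method would eventually join the same selector union:

```shell
# Refresh the freshness artifact, then rebuild only what changed since the last run:
# modified nodes plus anything that errored previously (result:<status> comes from #4017).
# A freshness-based method would slot into the same selector union once it exists.
dbt source freshness
dbt build --select "state:modified+ result:error+" --state path/to/last-run-artifacts
```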
While this is very cool, it does take its cue from the source data loaders: whenever they've ingested new rows, we're going to build downstream tables. This is the inverse of defining freshness / latency expectations for specific "final" models, and using that as the primary input to deployment/scheduling. That still seems like a cool and important thing to do... perhaps via `exposures`??