-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(ingest): ensure workunits are created for all LookML views #2965
fix(ingest): ensure workunits are created for all LookML views #2965
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To answer your question - we don't need to worry about compatibility of LookerViewFileLoader
since I don't think anyone is importing it directly.
path=include, | ||
connection=model.connection, | ||
reporter=self.reporter, | ||
update_cache=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question about the approach here - I would imagine that the "loading file from filesystem" cache should be separate from the markers of "have we produced a workunit for this yet"
It seems that using the cache exclusively for tracking if the workunit has been produced will result in us reading some files from the filesystem multiple times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the "loading file from filesystem" cache should be separate from the markers of "have we produced a workunit for this yet"
I totally agree with you! I think it would be better to keep around a separate cache in the LookMLSource
that tracks "have we produced a workunit for this yet".
The only reason I didn't go with that approach in this PR was that I felt it was a more invasive change. If you want me to do the work to separate those two concerns in this PR, I'd be happy to!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the "have we produced a workunit for this yet" cache can be a local variable in the get_workunits
method, which should prevent it from being too invasive. That way, the "loading file from filesystem" cache behavior would also remain unchanged.
Would be great if you could make that edit!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the "have we produced a workunit for this yet" cache can be a local variable in the get_workunits method
ah yeah, good point. Ok for sure, I can make that change here. I like that a lot better than the current state of this PR 😀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok @hsheth2 I think this is ready for another review. The diff is a lot simpler now, thanks for the advice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for tracking this down @jameslamb!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
In metadata ingestion for LookML files, the method
LookerViewFileLoader.load_viewfile()
is responsible for reading in a.view.lkml
file, parsing it withlkml
, and returning aLookerViewFile
instance from it.https://github.com/linkedin/datahub/blob/328b098d0156f01a91a226e03003152fadc1ac03/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L207-L213
Since one view file could be matched by
include
statements in multiple model/view files, it's possible fordatahub ingest
to result in multipleload_viewfile()
calls for the same view file. To avoid re-parsing the same file multiple times, theLookerViewFileLoader
holds a cache of parsed files.https://github.com/linkedin/datahub/blob/328b098d0156f01a91a226e03003152fadc1ac03/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L203-L204
After some investigation today, I found that there is another place in LookML metadata ingestion where that cache is also being used as a source of truth for "which viewfiles have had workunits created for them".
https://github.com/linkedin/datahub/blob/328b098d0156f01a91a226e03003152fadc1ac03/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L632-L633
That can lead to some views being silently skipped (i.e., never having their metadata ingestion), if the first time they were loaded was in this other check where
load_viewfile()
is used during the process of resolving views that extend other views.https://github.com/linkedin/datahub/blob/328b098d0156f01a91a226e03003152fadc1ac03/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L372-L377
This PR proposes adding a flag to
load_viewfile()
that allows the use of that method without updating the loader's cache. In manual tests against a DataHub instance I have access to, I found that this fixed the "some views are not getting Datasets created" issue I was facing.Checklist
Notes for Reviewers
I added argument
update_cache
with a default value to make this change backwards-compatible, but I think it would be a bit safer to not have that default value at all. Do I need to care about backwards-compatibility from the perspective of people doingfrom datahub.ingestion.source.lookml import LookerViewFileLoader
?Thanks for your time and consideration.