Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ingest): ensure workunits are created for all LookML views #2965

Merged
merged 7 commits into from
Jul 29, 2021
Merged

fix(ingest): ensure workunits are created for all LookML views #2965

merged 7 commits into from
Jul 29, 2021

Conversation

jameslamb
Copy link
Contributor

In metadata ingestion for LookML files, the method LookerViewFileLoader.load_viewfile() is responsible for reading in a .view.lkml file, parsing it with lkml, and returning a LookerViewFile instance from it.

https://github.com/linkedin/datahub/blob/328b098d0156f01a91a226e03003152fadc1ac03/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L207-L213

Since one view file could be matched by include statements in multiple model/view files, it's possible for datahub ingest to result in multiple load_viewfile() calls for the same view file. To avoid re-parsing the same file multiple times, the LookerViewFileLoader holds a cache of parsed files.

https://github.com/linkedin/datahub/blob/328b098d0156f01a91a226e03003152fadc1ac03/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L203-L204

After some investigation today, I found that there is another place in LookML metadata ingestion where that cache is also being used as a source of truth for "which viewfiles have had workunits created for them".

https://github.com/linkedin/datahub/blob/328b098d0156f01a91a226e03003152fadc1ac03/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L632-L633

That can lead to some views being silently skipped (i.e., never having their metadata ingestion), if the first time they were loaded was in this other check where load_viewfile() is used during the process of resolving views that extend other views.

https://github.com/linkedin/datahub/blob/328b098d0156f01a91a226e03003152fadc1ac03/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L372-L377

This PR proposes adding a flag to load_viewfile() that allows the use of that method without updating the loader's cache. In manual tests against a DataHub instance I have access to, I found that this fixed the "some views are not getting Datasets created" issue I was facing.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

Notes for Reviewers

I added argument update_cache with a default value to make this change backwards-compatible, but I think it would be a bit safer to not have that default value at all. Do I need to care about backwards-compatibility from the perspective of people doing from datahub.ingestion.source.lookml import LookerViewFileLoader?

Thanks for your time and consideration.

Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To answer your question - we don't need to worry about compatibility of LookerViewFileLoader since I don't think anyone is importing it directly.

path=include,
connection=model.connection,
reporter=self.reporter,
update_cache=True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about the approach here - I would imagine that the "loading file from filesystem" cache should be separate from the markers of "have we produced a workunit for this yet"

It seems that using the cache exclusively for tracking if the workunit has been produced will result in us reading some files from the filesystem multiple times

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "loading file from filesystem" cache should be separate from the markers of "have we produced a workunit for this yet"

I totally agree with you! I think it would be better to keep around a separate cache in the LookMLSource that tracks "have we produced a workunit for this yet".

The only reason I didn't go with that approach in this PR was that I felt it was a more invasive change. If you want me to do the work to separate those two concerns in this PR, I'd be happy to!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "have we produced a workunit for this yet" cache can be a local variable in the get_workunits method, which should prevent it from being too invasive. That way, the "loading file from filesystem" cache behavior would also remain unchanged.

Would be great if you could make that edit!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "have we produced a workunit for this yet" cache can be a local variable in the get_workunits method

ah yeah, good point. Ok for sure, I can make that change here. I like that a lot better than the current state of this PR 😀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok @hsheth2 I think this is ready for another review. The diff is a lot simpler now, thanks for the advice!

Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks for tracking this down @jameslamb!

Copy link
Contributor

@shirshanka shirshanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@shirshanka shirshanka merged commit e88ccd9 into datahub-project:master Jul 29, 2021
@jameslamb jameslamb deleted the fix/lookml-extends branch July 29, 2021 06:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants