fix(ingest): ensure workunits are created for all LookML views #2965

jameslamb · 2021-07-28T03:45:21Z

In metadata ingestion for LookML files, the method LookerViewFileLoader.load_viewfile() is responsible for reading in a .view.lkml file, parsing it with lkml, and returning a LookerViewFile instance from it.

https://github.com/linkedin/datahub/blob/328b098d0156f01a91a226e03003152fadc1ac03/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L207-L213

Since one view file could be matched by include statements in multiple model/view files, it's possible for datahub ingest to result in multiple load_viewfile() calls for the same view file. To avoid re-parsing the same file multiple times, the LookerViewFileLoader holds a cache of parsed files.

https://github.com/linkedin/datahub/blob/328b098d0156f01a91a226e03003152fadc1ac03/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L203-L204
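For readers without the repository open, the caching behavior described above can be sketched roughly as follows. This is a simplified illustration, not the actual DataHub implementation: the class name `CachingViewFileLoader`, the `parse` callable, and the `parse_calls` counter are all hypothetical, and the real loader parses files with lkml and returns a LookerViewFile.

```python
from typing import Callable, Dict


class CachingViewFileLoader:
    """Hypothetical stand-in for a loader that memoizes parsed view files."""

    def __init__(self, parse: Callable[[str], dict]) -> None:
        self._parse = parse                      # assumed parser, injected for the sketch
        self.viewfile_cache: Dict[str, dict] = {}
        self.parse_calls = 0                     # only here to make the caching visible

    def load_viewfile(self, path: str) -> dict:
        # Return the cached result if this path was already parsed.
        if path in self.viewfile_cache:
            return self.viewfile_cache[path]
        self.parse_calls += 1
        parsed = self._parse(path)
        self.viewfile_cache[path] = parsed
        return parsed


loader = CachingViewFileLoader(parse=lambda path: {"path": path, "views": []})
first = loader.load_viewfile("orders.view.lkml")
second = loader.load_viewfile("orders.view.lkml")  # served from the cache
```

The second call returns the same object as the first, and the parser runs only once.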

After some investigation today, I found that there is another place in LookML metadata ingestion where that cache is also being used as a source of truth for "which viewfiles have had workunits created for them".

https://github.com/linkedin/datahub/blob/328b098d0156f01a91a226e03003152fadc1ac03/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L632-L633

That can lead to some views being silently skipped (i.e., never having their metadata ingested) if the first time they were loaded was in this other code path, where load_viewfile() is used while resolving views that extend other views.

https://github.com/linkedin/datahub/blob/328b098d0156f01a91a226e03003152fadc1ac03/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L372-L377
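A minimal reproduction of the failure mode, under simplifying assumptions: the `Loader` class, the `emit_workunits_buggy` function, and the `includes`/`extends` inputs are hypothetical stand-ins, and a "workunit" is reduced to a file path. The point is only that loading a parent view as a side effect of resolving `extends` makes it look "already processed".

```python
from typing import Dict, List


class Loader:
    """Hypothetical loader whose cache doubles as an 'already seen' marker."""

    def __init__(self) -> None:
        self.cache: Dict[str, dict] = {}

    def is_cached(self, path: str) -> bool:
        return path in self.cache

    def load(self, path: str) -> dict:
        if path not in self.cache:
            self.cache[path] = {"path": path}  # placeholder for real parsing
        return self.cache[path]


def emit_workunits_buggy(includes: List[str], extends: Dict[str, str]) -> List[str]:
    loader = Loader()
    emitted: List[str] = []
    for path in includes:
        if loader.is_cached(path):
            # Bug: "in the cache" is treated as "workunit already produced".
            continue
        loader.load(path)
        parent = extends.get(path)
        if parent is not None:
            # Side effect: the parent is now cached without a workunit.
            loader.load(parent)
        emitted.append(path)
    return emitted


skipped_demo = emit_workunits_buggy(
    includes=["child.view.lkml", "base.view.lkml"],
    extends={"child.view.lkml": "base.view.lkml"},
)
```

Here `base.view.lkml` never appears in `skipped_demo`, because it was first loaded while resolving the child view.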

This PR proposes adding a flag to load_viewfile() that allows the use of that method without updating the loader's cache. In manual tests against a DataHub instance I have access to, I found that this fixed the "some views are not getting Datasets created" issue I was facing.
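A rough sketch of the proposed flag, assuming a much simpler signature than the real method (which also takes arguments such as `connection` and `reporter`). The class name `FlaggedLoader` and the placeholder parsing are hypothetical; only the `update_cache` argument name comes from this PR.

```python
from typing import Dict


class FlaggedLoader:
    """Hypothetical loader illustrating an opt-out for cache updates."""

    def __init__(self) -> None:
        self.viewfile_cache: Dict[str, dict] = {}

    def load_viewfile(self, path: str, update_cache: bool = True) -> dict:
        if path in self.viewfile_cache:
            return self.viewfile_cache[path]
        parsed = {"path": path}  # placeholder for real parsing
        if update_cache:
            # Only record the file when the caller asks for it.
            self.viewfile_cache[path] = parsed
        return parsed


flagged = FlaggedLoader()
# Resolving an extended view: read the file but leave the cache untouched.
flagged.load_viewfile("base.view.lkml", update_cache=False)
cached_after_extends = "base.view.lkml" in flagged.viewfile_cache
# Regular load: the default keeps the previous behavior.
flagged.load_viewfile("base.view.lkml")
cached_after_regular = "base.view.lkml" in flagged.viewfile_cache
```

The trade-off is that a file loaded with `update_cache=False` gets parsed again the next time it is requested.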

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable)

Notes for Reviewers

I added the update_cache argument with a default value to keep this change backwards-compatible, but I think it would be a bit safer to have no default value at all. Do I need to care about backwards compatibility for people doing from datahub.ingestion.source.lookml import LookerViewFileLoader?

Thanks for your time and consideration.

hsheth2

To answer your question - we don't need to worry about compatibility of LookerViewFileLoader since I don't think anyone is importing it directly.

hsheth2 · 2021-07-28T17:35:27Z

metadata-ingestion/src/datahub/ingestion/source/lookml.py

+                    path=include,
+                    connection=model.connection,
+                    reporter=self.reporter,
+                    update_cache=True,


Question about the approach here - I would imagine that the "loading file from filesystem" cache should be separate from the markers of "have we produced a workunit for this yet"

It seems that using the cache exclusively for tracking whether a workunit has been produced will result in us reading some files from the filesystem multiple times.

the "loading file from filesystem" cache should be separate from the markers of "have we produced a workunit for this yet"

I totally agree with you! I think it would be better to keep around a separate cache in the LookMLSource that tracks "have we produced a workunit for this yet".

The only reason I didn't go with that approach in this PR was that I felt it was a more invasive change. If you want me to do the work to separate those two concerns in this PR, I'd be happy to!

I think the "have we produced a workunit for this yet" cache can be a local variable in the get_workunits method, which should prevent it from being too invasive. That way, the "loading file from filesystem" cache behavior would also remain unchanged.

Would be great if you could make that edit!
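The suggestion above might look something like the following sketch. It is an illustration under stated assumptions rather than the code that was merged: `get_workunits_sketch`, `processed_view_files`, and the `includes`/`extends` inputs are hypothetical names, and a "workunit" is reduced to a file path.

```python
from typing import Dict, List, Set, Tuple


def get_workunits_sketch(
    includes: List[str], extends: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    file_cache: Dict[str, dict] = {}      # "loading file from filesystem" cache
    parsed_paths: List[str] = []          # records each real parse, for illustration

    def load(path: str) -> dict:
        if path not in file_cache:
            parsed_paths.append(path)
            file_cache[path] = {"path": path}  # placeholder for real parsing
        return file_cache[path]

    # Local marker for "have we produced a workunit for this yet".
    processed_view_files: Set[str] = set()
    emitted: List[str] = []
    for path in includes:
        if path in processed_view_files:
            continue
        processed_view_files.add(path)
        load(path)
        parent = extends.get(path)
        if parent is not None:
            load(parent)  # populates the file cache only, not the processed set
        emitted.append(path)
    return emitted, parsed_paths


emitted, parsed_paths = get_workunits_sketch(
    includes=["child.view.lkml", "base.view.lkml", "child.view.lkml"],
    extends={"child.view.lkml": "base.view.lkml"},
)
```

With the two concerns separated, every included view gets a workunit exactly once, and each file is still parsed only once.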

I think the "have we produced a workunit for this yet" cache can be a local variable in the get_workunits method

ah yeah, good point. Ok for sure, I can make that change here. I like that a lot better than the current state of this PR 😀

ok @hsheth2 I think this is ready for another review. The diff is a lot simpler now, thanks for the advice!

hsheth2

LGTM

Thanks for tracking this down @jameslamb!

shirshanka

LGTM!

fix(ingest): ensure workunits are created for all LookML views

ef27728

This was referenced Jul 28, 2021

fix(ingest): ensure that LookML files are always parsed in the same order #2966

Merged

fix(ingest): add more debug logging to LookML metadata ingestion #2967

Merged

hsheth2 reviewed Jul 28, 2021

jameslamb added 5 commits July 28, 2021 14:42

move caching inside get_workunits()

68e8fe5

revert unnecessary changes

1b0d76a

revert more unnecessary diff

91f1bcf

remove even more unnecessary diff

89fcfbf

Merge branch 'master' into fix/lookml-extends

b024afd

hsheth2 approved these changes Jul 29, 2021

jameslamb commented Jul 29, 2021

metadata-ingestion/src/datahub/ingestion/source/lookml.py (outdated)

Update metadata-ingestion/src/datahub/ingestion/source/lookml.py

cabaa9f

shirshanka approved these changes Jul 29, 2021

shirshanka merged commit e88ccd9 into datahub-project:master Jul 29, 2021

jameslamb deleted the fix/lookml-extends branch July 29, 2021 06:18
