From a5790263116e47dbbaedf77f4efccc6cd4425875 Mon Sep 17 00:00:00 2001
From: Ellie O'Neil
Date: Tue, 8 Oct 2024 14:52:30 -0700
Subject: [PATCH 1/2] update docs

---
 .../docs/sources/datahub/datahub_pre.md | 23 +++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/metadata-ingestion/docs/sources/datahub/datahub_pre.md b/metadata-ingestion/docs/sources/datahub/datahub_pre.md
index cb1cc2c4d59036..ba9255991b281c 100644
--- a/metadata-ingestion/docs/sources/datahub/datahub_pre.md
+++ b/metadata-ingestion/docs/sources/datahub/datahub_pre.md
@@ -71,3 +71,26 @@ and [mce-consumer](../../../../metadata-jobs/mce-consumer-job/README.md))
 - Increase the number of gms pods to add redundancy and increase resilience to node evictions
 * If you are migrating large amounts of data, consider increasing elasticsearch's thread count via the `ELASTICSEARCH_THREAD_COUNT` environment variable.
+
+#### Exclusions
+You will likely want to exclude some urn types from your ingestion, as they contain instance-specific
+metadata. For example, you will likely want to start with this:
+
+```yaml
+source:
+  config:
+    urn_pattern: # URN pattern to ignore/include in the ingestion
+      deny:
+        # Ignores all datahub metadata where the urn matches the regex
+        - ^urn:li:role.* # Only exclude if you do not want to ingest roles
+        - ^urn:li:dataHubRole.* # Only exclude if you do not want to ingest roles
+        - ^urn:li:dataHubPolicy.* # Only exclude if you do not want to ingest policies
+        - ^urn:li:dataHubIngestionSource.* # Only exclude if you do not want to ingest ingestion sources
+        - ^urn:li:dataHubSecret.*
+        - ^urn:li:dataHubExecutionRequest.*
+        - ^urn:li:dataHubAccessToken.*
+        - ^urn:li:dataHubUpgrade.*
+        - ^urn:li:inviteToken.*
+        - ^urn:li:globalSettings.*
+        - ^urn:li:dataHubStepState.*
+```

From 6a7052780ac349f0916734b562db6c133f7d0c7e Mon Sep 17 00:00:00 2001
From: Ellie O'Neil
Date: Tue, 8 Oct 2024 14:54:30 -0700
Subject: [PATCH 2/2] docs

---
 metadata-ingestion/docs/sources/datahub/datahub_pre.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/metadata-ingestion/docs/sources/datahub/datahub_pre.md b/metadata-ingestion/docs/sources/datahub/datahub_pre.md
index ba9255991b281c..b35eb5811e4c9b 100644
--- a/metadata-ingestion/docs/sources/datahub/datahub_pre.md
+++ b/metadata-ingestion/docs/sources/datahub/datahub_pre.md
@@ -74,7 +74,8 @@ and [mce-consumer](../../../../metadata-jobs/mce-consumer-job/README.md))
 
 #### Exclusions
 You will likely want to exclude some urn types from your ingestion, as they contain instance-specific
-metadata. For example, you will likely want to start with this:
+metadata, such as settings, roles, policies, ingestion sources, and ingestion runs. For example, you
+will likely want to start with this:
 
 ```yaml
 source: