diff --git a/metadata-ingestion/examples/recipes/azure_ad_to_datahub.yml b/metadata-ingestion/examples/recipes/azure_ad_to_datahub.yml
index fef7000fe9fc45..d7fe7f71d73443 100644
--- a/metadata-ingestion/examples/recipes/azure_ad_to_datahub.yml
+++ b/metadata-ingestion/examples/recipes/azure_ad_to_datahub.yml
@@ -19,4 +19,4 @@ source:
 sink:
   type: "datahub-rest"
   config:
-    server: "https://autotrader.acryl.io/gms"
\ No newline at end of file
+    server: "http://localhost:8080"
\ No newline at end of file
diff --git a/metadata-ingestion/examples/recipes/bigquery_to_datahub.yml b/metadata-ingestion/examples/recipes/bigquery_to_datahub.yml
new file mode 100644
index 00000000000000..1c84ff968563e9
--- /dev/null
+++ b/metadata-ingestion/examples/recipes/bigquery_to_datahub.yml
@@ -0,0 +1,47 @@
+---
+# see https://datahubproject.io/docs/metadata-ingestion/source_docs/bigquery for complete documentation
+source:
+  type: "bigquery"
+  config:
+    ## Coordinates
+    project_id: project-id-1234567
+    ## Credentials
+    ## If the GOOGLE_APPLICATION_CREDENTIALS environment variable is not set, you can specify credentials here
+    #credential:
+    #  project_id: project-id-1234567
+    #  private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0"
+    #  private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n"
+    #  client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com"
+    #  client_id: "123456678890"
+    #include_tables: true
+    #include_views: true
+    #include_table_lineage: true
+    #start_time: 2021-12-15T20:08:23.091Z
+    #end_time: 2023-12-15T20:08:23.091Z
+    #profiling:
+    #  enabled: true
+    #  turn_off_expensive_profiling_metrics: false
+    #  query_combiner_enabled: true
+    #  max_number_of_fields_to_profile: 8
+    #  profile_table_level_only: false
+    #  include_field_null_count: true
+    #  include_field_min_value: true
+    #  include_field_max_value: true
+    #  include_field_mean_value: true
+    #  include_field_median_value: true
+    #  include_field_stddev_value: false
+    #  include_field_quantiles: false
+    #  include_field_distinct_value_frequencies: false
+    #  include_field_histogram: false
+    #  include_field_sample_values: false
+    #profile_pattern:
+    #  allow:
+    #    - "schema.table.column"
+    #  deny:
+    #    - "*.*.*"
+
+## see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
+sink:
+  type: "datahub-rest"
+  config:
+    server: "http://localhost:8080"
diff --git a/metadata-ingestion/source_docs/bigquery.md b/metadata-ingestion/source_docs/bigquery.md
index 567fecca490c85..00adabf328ab03 100644
--- a/metadata-ingestion/source_docs/bigquery.md
+++ b/metadata-ingestion/source_docs/bigquery.md
@@ -6,6 +6,64 @@ For context on getting started with ingestion, check out our [metadata ingestion
 
 To install this plugin, run `pip install 'acryl-datahub[bigquery]'`.
 
+## Prerequisites
+### Create a DataHub profile in GCP:
+1. Create a custom role for DataHub (https://cloud.google.com/iam/docs/creating-custom-roles#creating_a_custom_role)
+2. Grant the following permissions to this role (see the `gcloud` sketch after this list):
+```
+   bigquery.datasets.get
+   bigquery.datasets.getIamPolicy
+   bigquery.jobs.create
+   bigquery.jobs.list
+   bigquery.jobs.listAll
+   bigquery.models.getMetadata
+   bigquery.models.list
+   bigquery.routines.get
+   bigquery.routines.list
+   bigquery.tables.create   # Needed for profiling
+   bigquery.tables.get
+   bigquery.tables.getData  # Needed for profiling
+   bigquery.tables.list
+   logging.logEntries.list  # Needed for lineage generation
+   resourcemanager.projects.get
+```
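For reference, a custom role carrying the permissions listed above could be created with the gcloud CLI roughly as follows. This is a sketch rather than part of the change itself: the role ID `datahub_ingestion`, the title and description, and the `PROJECT_ID` value are illustrative placeholders.

```bash
# Sketch: create a project-level custom role with the permissions listed above.
# "datahub_ingestion" and PROJECT_ID are placeholders; adjust to your environment.
PROJECT_ID="project-id-1234567"

gcloud iam roles create datahub_ingestion \
  --project="$PROJECT_ID" \
  --title="DataHub Ingestion" \
  --description="Metadata, profiling, and lineage access for DataHub" \
  --permissions="bigquery.datasets.get,bigquery.datasets.getIamPolicy,bigquery.jobs.create,bigquery.jobs.list,bigquery.jobs.listAll,bigquery.models.getMetadata,bigquery.models.list,bigquery.routines.get,bigquery.routines.list,bigquery.tables.create,bigquery.tables.get,bigquery.tables.getData,bigquery.tables.list,logging.logEntries.list,resourcemanager.projects.get"
```

The resulting role can then be granted to the service account created in the next step, for example with `gcloud projects add-iam-policy-binding`.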
+### Create a service account:
+
+1. Set up a service account (https://cloud.google.com/iam/docs/creating-managing-service-accounts#iam-service-accounts-create-console) and assign the previously created role to this service account.
+2. Download a service account JSON keyfile.
+   Example credential file:
+```json
+{
+  "type": "service_account",
+  "project_id": "project-id-1234567",
+  "private_key_id": "d0121d0000882411234e11166c6aaa23ed5d74e0",
+  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----",
+  "client_email": "test@suppproject-id-1234567.iam.gserviceaccount.com",
+  "client_id": "113545814931671546333",
+  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
+  "token_uri": "https://oauth2.googleapis.com/token",
+  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
+  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%suppproject-id-1234567.iam.gserviceaccount.com"
+}
+```
+3. To provide credentials to the source, you can either:
+
+   Set an environment variable:
+
+       $ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"
+
+   *or*
+
+   Set the credential config in your source based on the credential JSON file. For example:
+
+```yml
+  credential:
+    project_id: project-id-1234567
+    private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0"
+    private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n"
+    client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com"
+    client_id: "123456678890"
+```
+
 ## Capabilities
 
 This plugin extracts the following:
@@ -44,30 +102,34 @@ Note that a `.` is used to denote nested fields in the YAML recipe.
 
 As a SQL-based service, the BigQuery integration is also supported by our SQL profiler. See [here](./sql_profiles.md) for more details on configuration.
 
-| Field        | Required | Default      | Description                                                                |
-| ------------ | -------- | ------------ | -------------------------------------------------------------------------- |
-| `project_id` |          | Autodetected | Project ID to ingest from. If not specified, will infer from environment.  |
-| `env`        |          | `"PROD"`     | Environment to use in namespace when constructing URNs.                    |
-| `options.