updated connectors documentation across all versions #19382

@@ -0,0 +1,16 @@
#### Connection Details for Azure

- **Azure Credentials**

- **Client ID** : Client ID of the Azure app registration used to access the data storage account
- **Client Secret** : Client Secret of the app registration
- **Tenant ID** : Tenant ID under which the data storage account falls
- **Account Name** : Account Name of the data storage

- **Required Roles**

Please make sure the following roles are associated with the data storage account.
- `Storage Blob Data Contributor`
- `Storage Queue Data Contributor`

The current approach for authentication is based on `app registration`. Reach out to us on [Slack](https://slack.open-metadata.org/) if you find the need for another auth system.
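
For reference, here is a minimal sketch of how app registration credentials of this shape are typically used with the Azure SDK for Python. All values are placeholders, not values from this documentation.

```python
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

# Placeholders: replace with the values from your app registration and storage account.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)

# The Account Name becomes part of the storage endpoint URL.
blob_service = BlobServiceClient(
    account_url="https://<account-name>.blob.core.windows.net",
    credential=credential,
)
```
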
@@ -0,0 +1,6 @@
#### Connection Details

{% partial file="/v1.5/connectors/database/aws.md" /%}

- **S3 Staging Directory**: The S3 staging directory is an optional parameter. Enter a staging directory to override the default staging directory for AWS Athena.
- **Athena Workgroup**: The Athena workgroup is an optional parameter. If you wish to have your Athena connection related to an existing AWS workgroup, add your workgroup name here.
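
As an illustration of how these two optional parameters behave, here is a hedged sketch using the `pyathena` client; the bucket, region, and workgroup names are placeholders.

```python
from pyathena import connect

# The staging directory and workgroup mirror the two optional fields above.
conn = connect(
    region_name="us-east-2",
    s3_staging_dir="s3://my-athena-results-bucket/staging/",  # placeholder bucket
    work_group="my_existing_workgroup",                       # placeholder workgroup
)

cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
```
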
@@ -0,0 +1,27 @@
#### Connection Options

- **Username**: Specify the User to connect to AzureSQL. It should have enough privileges to read all the metadata.
- **Password**: Password to connect to AzureSQL.
- **Host and Port**: Enter the fully qualified hostname and port number for your AzureSQL deployment in the Host and Port field.
- **Database**: The database of the data source is an optional parameter; use it if you would like to restrict the metadata reading to a single database. If left blank, OpenMetadata ingestion attempts to scan all the databases.
- **Driver**: Connecting to AzureSQL requires an ODBC driver to be installed. Specify the ODBC driver name in this field.
You can download the ODBC driver from [here](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver16). In Docker or Kubernetes deployments, this driver comes out of the box as `ODBC Driver 18 for SQL Server`.

**Authentication Mode**:

- **Authentication**:
- The `authentication` parameter determines the method of authentication when connecting to AzureSQL using ODBC (Open Database Connectivity).
- If you select **"Active Directory Password"**, you'll need to provide the password associated with your Azure Active Directory account.
- Alternatively, if you choose **"Active Directory Integrated"**, the connection will use the credentials of the currently logged-in user. This mode ensures secure and seamless connections with AzureSQL.

- **Encrypt**:
- The `encrypt` setting in the connection string pertains to data encryption during communication with AzureSQL.
- When enabled, it ensures that data exchanged between your application and the database is encrypted, enhancing security.

- **Trust Server Certificate**:
- The `trustServerCertificate` option also relates to security.
- When set to true, your application will trust the server's SSL certificate without validation. Use this cautiously, as it bypasses certificate validation checks.

- **Connection Timeout**:
- The `connectionTimeout` parameter specifies the maximum time (in seconds) that your application will wait while attempting to establish a connection to AzureSQL.
- If the connection cannot be established within this timeframe, an error will be raised.
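
To illustrate how these settings combine, here is a hedged sketch of an ODBC connection string using `pyodbc`; the server, database, and credential values are placeholders.

```python
import pyodbc

# Placeholders throughout; Authentication, Encrypt, TrustServerCertificate and
# Connection Timeout mirror the options described above.
conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net,1433;"
    "DATABASE=mydb;"
    "UID=user@example.onmicrosoft.com;"
    "PWD=<password>;"
    "Authentication=ActiveDirectoryPassword;"
    "Encrypt=yes;"
    "TrustServerCertificate=no;"
    "Connection Timeout=30;"
)
connection = pyodbc.connect(conn_str)
```
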
@@ -0,0 +1,20 @@
#### Connection Options

**Host and Port**: BigQuery APIs URL. By default, the API URL is `bigquery.googleapis.com`. You can modify this if you have a custom implementation of BigQuery.

{% partial file="/v1.5/connectors/database/gcp.md" /%}

**Taxonomy Project ID (Optional)**: BigQuery uses taxonomies to create hierarchical groups of policy tags. To apply access controls to BigQuery columns, tag the columns with policy tags. Learn more about how you can create policy tags and set up column-level access control [here](https://cloud.google.com/bigquery/docs/column-level-security).

If you have attached policy tags to the columns of a table available in BigQuery, then OpenMetadata will fetch those tags and attach them to the respective columns.

In this field, specify the ID of the project in which the taxonomy was created.

**Taxonomy Location (Optional)**: BigQuery uses taxonomies to create hierarchical groups of policy tags. To apply access controls to BigQuery columns, tag the columns with policy tags. Learn more about how you can create policy tags and set up column-level access control [here](https://cloud.google.com/bigquery/docs/column-level-security).

If you have attached policy tags to the columns of a table available in BigQuery, then OpenMetadata will fetch those tags and attach them to the respective columns.

In this field, specify the location/region in which the taxonomy was created.

**Usage Location (Optional)**:
Location used to query `INFORMATION_SCHEMA.JOBS_BY_PROJECT` to fetch usage data. You can pass multi-regions, such as `us` or `eu`, or your specific region such as `us-east1`. Australia and Asia multi-regions are not yet supported.
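
For context, usage queries against this view look roughly like the following sketch with the `google-cloud-bigquery` client; the project ID and the `us` region are placeholders, and this is not the exact query used by the ingestion.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project-id")  # placeholder project

# The Usage Location ("us" here) selects the region qualifier of the
# INFORMATION_SCHEMA.JOBS_BY_PROJECT view.
query = """
    SELECT job_id, user_email, total_bytes_processed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""
for row in client.query(query).result():
    print(row.job_id, row.user_email, row.total_bytes_processed)
```
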
@@ -0,0 +1,3 @@
#### Connection Options

{% partial file="/v1.5/connectors/database/gcp.md" /%}
@@ -0,0 +1,8 @@
#### Connection Options

- **Username**: Specify the User to connect to Clickhouse. It should have enough privileges to read all the metadata.
- **Password**: Password to connect to Clickhouse.
- **Host and Port**: Enter the fully qualified hostname and port number for your Clickhouse deployment in the Host and Port field.
- **Use HTTPS Protocol**: Enable this flag when the Clickhouse instance is hosted via the HTTPS protocol. This flag is useful when you are using the `clickhouse+http` connection scheme.
- **Secure Connection**: Establish a secure connection with ClickHouse. ClickHouse supports secure communication over SSL/TLS to protect data in transit; checking this option establishes a secure connection with ClickHouse. This flag is useful when you are using the `clickhouse+native` connection scheme.
- **Key File**: The key file path is the location where ClickHouse looks for a file containing the private key needed for secure communication over SSL/TLS. By default, ClickHouse looks for the key file in the `/etc/clickhouse-server` directory, with the file name `server.key`. However, this can be customized in the ClickHouse configuration file (`config.xml`). This option is useful when you are using the `clickhouse+native` connection scheme and the secure connection flag is enabled.
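
A hedged sketch of how the two connection schemes mentioned above typically look as SQLAlchemy URLs (requires a ClickHouse SQLAlchemy dialect to be installed); hosts, ports, and credentials are placeholders, and the exact query parameters depend on your driver version.

```python
from sqlalchemy import create_engine

# HTTPS over the HTTP interface -- pairs with the "Use HTTPS Protocol" flag.
http_engine = create_engine(
    "clickhouse+http://user:password@clickhouse.example.com:8443/default?protocol=https"
)

# TLS over the native interface -- pairs with the "Secure Connection" flag.
native_engine = create_engine(
    "clickhouse+native://user:password@clickhouse.example.com:9440/default?secure=True"
)
```
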
@@ -0,0 +1,6 @@
#### Connection Details

- **Username**: Username to connect to Couchbase.
- **Password**: Password to connect to Couchbase.
- **Hostport**: If Couchbase is hosted on the cloud, the hostport parameter specifies the connection string; if you are using a self-managed Couchbase Server, it specifies the hostname of the Couchbase instance. This should be specified as a string in the format `hostname` or `xyz.cloud.couchbase.com`, e.g., `localhost`. Both forms are illustrated in the sketch after this list.
- **bucketName**: Optional name to give to the bucket in OpenMetadata. If left blank, we will ingest all the bucket names.
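
As an illustration of the hostport formats above, a minimal sketch with the Couchbase Python SDK; hostnames, credentials, and the bucket name are placeholders.

```python
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

auth = PasswordAuthenticator("username", "password")  # placeholders

# Self-managed server: plain hostname. Cloud-hosted Couchbase: the connection
# string, e.g. "couchbases://xyz.cloud.couchbase.com".
cluster = Cluster("couchbase://localhost", ClusterOptions(auth))
bucket = cluster.bucket("travel-sample")  # placeholder bucket name
```
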
@@ -0,0 +1,8 @@
#### Connection Details

- **Host and Port**: Enter the fully qualified hostname and port number for your Databricks deployment in the Host and Port field.
- **Token**: Generated Token to connect to Databricks.
- **HTTP Path**: Databricks compute resources URL.
- **connectionTimeout**: The maximum amount of time (in seconds) to wait for a successful connection to the data source. If the connection attempt takes longer than this timeout period, an error will be returned.
- **Catalog**: Catalog of the data source (for example, `hive_metastore`). This is an optional parameter; use it if you would like to restrict the metadata reading to a single catalog. When left blank, OpenMetadata Ingestion attempts to scan all the catalogs.
- **DatabaseSchema**: databaseSchema of the data source. This is an optional parameter; use it if you would like to restrict the metadata reading to a single databaseSchema. When left blank, OpenMetadata Ingestion attempts to scan all the databaseSchemas.
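
For reference, these values map onto a SQL connection roughly as in this sketch using the `databricks-sql-connector` package; the hostname, HTTP path, and token are placeholders.

```python
from databricks import sql

connection = sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder Host
    http_path="/sql/1.0/warehouses/abcdef1234567890",              # placeholder HTTP Path
    access_token="<personal-access-token>",                        # placeholder Token
)

with connection.cursor() as cursor:
    cursor.execute("SHOW CATALOGS")
    print(cursor.fetchall())
```
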
@@ -0,0 +1,8 @@
#### Connection Details

- **Username**: Specify the User to connect to DB2. It should have enough privileges to read all the metadata.
- **Password**: Password to connect to DB2.
- **database**: Database of the data source.
- **Host and Port**: Enter the fully qualified hostname and port number for your DB2 deployment in the Host and Port field.
- **License File Name**: License file name in case the license is required for connection.
- **License**: Contents of your license file, if applicable. Make sure to replace new lines with `\n` before pasting it here.
@@ -0,0 +1,48 @@
#### Connection Details For MetastoreConfig

- **Metastore Host Port**: Enter the Host & Port of Hive Metastore Service to configure the Spark Session. Either
of `metastoreHostPort`, `metastoreDb` or `metastoreFilePath` is required.
- **Metastore File Path**: Enter the file path to local Metastore in case Spark cluster is running locally. Either
of `metastoreHostPort`, `metastoreDb` or `metastoreFilePath` is required.
- **Metastore DB**: The JDBC connection to the underlying Hive metastore DB. Either
of `metastoreHostPort`, `metastoreDb` or `metastoreFilePath` is required.
- **appName (Optional)**: Enter the app name of the Spark session.
- **Connection Arguments (Optional)**: Key-Value pairs that will be used to pass extra `config` elements to the Spark Session builder.

We are internally running with `pyspark` 3.X and `delta-lake` 2.0.0. This means that we need to consider Spark configuration options for 3.X.

**Metastore Host Port**

When connecting to an External Metastore passing the parameter `Metastore Host Port`, we will be preparing a Spark Session with the configuration

```
.config("hive.metastore.uris", "thrift://{connection.metastoreHostPort}")
```

Then, we will be using the `catalog` functions from the Spark Session to pick up the metadata exposed by the Hive Metastore.
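
Put together, the resulting Spark Session looks roughly like this sketch; the app name and metastore host are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("openmetadata-deltalake")          # placeholder app name
    .config("hive.metastore.uris", "thrift://metastore-host:9083")  # placeholder host:port
    .enableHiveSupport()
    .getOrCreate()
)

# The catalog API is then used to list the metadata exposed by the metastore.
print(spark.catalog.listDatabases())
```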

**Metastore File Path**

If instead we use a local file path that contains the metastore information (e.g., for local testing with the default `metastore_db` directory), we will set

```
.config("spark.driver.extraJavaOptions", "-Dderby.system.home={connection.metastoreFilePath}")
```

to update the `Derby` information. More information about this can be found in a great [SO thread](https://stackoverflow.com/questions/38377188/how-to-get-rid-of-derby-log-metastore-db-from-spark-shell).

- You can find all supported configurations [here](https://spark.apache.org/docs/latest/configuration.html)
- If you need further information regarding the Hive metastore, you can find it [here](https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html),
and in The Internals of Spark SQL [book](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-hive-metastore.html).

**Metastore Database**

You can also connect to the metastore by directly pointing to the Hive Metastore db, e.g., `jdbc:mysql://localhost:3306/demo_hive`.

Here, we will need to inform all the common database settings (url, username, password), and the driver class name for JDBC metastore.

You will need to provide the driver to the ingestion image, and pass the `classpath` which will be used in the Spark Configuration under `spark.driver.extraClassPath`.
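
A hedged sketch of what such a configuration can look like; the `spark.hadoop.javax.jdo.option.*` keys, credentials, and driver path below are illustrative assumptions, not the exact keys used by the ingestion.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("openmetadata-deltalake")
    # Standard Hive metastore JDBC settings, passed through the spark.hadoop prefix.
    .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:mysql://localhost:3306/demo_hive")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.cj.jdbc.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive_user")      # placeholder
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "hive_password")  # placeholder
    # The JDBC driver jar must be available on the driver classpath.
    .config("spark.driver.extraClassPath", "/path/to/mysql-connector-j.jar")
    .enableHiveSupport()
    .getOrCreate()
)
```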

#### Connection Details for StorageConfig - S3

{% partial file="/v1.5/connectors/database/aws.md" /%}
@@ -0,0 +1,7 @@
#### Connection Details

- **Client ID**: Client ID for DOMO Database.
- **Secret Token**: Secret Token to connect to the DOMO Database.
- **Access Token**: Access Token to connect to the DOMO Database.
- **Api Host**: API Host to connect to the DOMO Database instance.
- **Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.
@@ -0,0 +1,10 @@
#### Connection Details

- **Username**: Specify the User to connect to Doris. It should have enough privileges to read all the metadata.
- **Password**: Password to connect to Doris.
- **Host and Port**: Enter the fully qualified hostname and port number for your Doris deployment in the Host and Port field.
- **databaseName**: Optional name to give to the database in OpenMetadata. If left blank, we will use `default` as the database name.
- **databaseSchema**: databaseSchema of the data source. This is an optional parameter; use it if you would like to restrict the metadata reading to a single databaseSchema. When left blank, OpenMetadata Ingestion attempts to scan all the databaseSchemas.
- **caCertificate**: Provide the path to the SSL CA file.
- **sslCertificate**: Provide the path to the SSL client certificate file (ssl_cert).
- **sslKey**: Provide the path to the SSL client key file (ssl_key).
@@ -0,0 +1,6 @@
#### Connection Details

- **Username**: Specify the User to connect to Druid. It should have enough privileges to read all the metadata.
- **Password**: Password to connect to Druid.
- **Host and Port**: Enter the fully qualified hostname and port number for your Druid deployment in the Host and Port field.
- **Database Name**: Optional name to give to the database in OpenMetadata. If left blank, we will use default as the database name.
@@ -0,0 +1,3 @@
#### Connection Details

{% partial file="/v1.5/connectors/database/aws.md" /%}
@@ -0,0 +1,22 @@
#### Connection Details for GCS

- **Bucket Name**: A bucket name in DataLake is a unique identifier used to organize and store data objects.
It's similar to a folder name, but it's used for object storage rather than file storage.

- **Prefix**: The prefix of a data source in datalake refers to the first part of the data path that identifies the source or origin of the data. It's used to organize and categorize data within the datalake, and can help users easily locate and access the data they need.

**GCS Credentials**

We support two ways of authenticating to GCS:

1. Passing the raw credential values of the GCP service account. This requires us to provide the following information, all available in the service account key file:
1. Credentials type, e.g. `service_account`.
2. Project ID
3. Private Key ID
4. Private Key
5. Client Email
6. Client ID
7. Auth URI, [https://accounts.google.com/o/oauth2/auth](https://accounts.google.com/o/oauth2/auth) by default
8. Token URI, [https://oauth2.googleapis.com/token](https://oauth2.googleapis.com/token) by default
9. Authentication Provider X509 Certificate URL, [https://www.googleapis.com/oauth2/v1/certs](https://www.googleapis.com/oauth2/v1/certs) by default
10. Client X509 Certificate URL
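
These fields mirror the shape of a standard GCP service account key file. A minimal sketch of how such values are loaded with `google-auth`; every value is a placeholder and must be replaced with real key material before the code will run.

```python
from google.oauth2 import service_account

# All values are placeholders following the shape of a service account key file.
credentials = service_account.Credentials.from_service_account_info(
    {
        "type": "service_account",
        "project_id": "my-project-id",
        "private_key_id": "<private-key-id>",
        "private_key": "-----BEGIN PRIVATE KEY-----\n<key-material>\n-----END PRIVATE KEY-----\n",
        "client_email": "my-sa@my-project-id.iam.gserviceaccount.com",
        "client_id": "<client-id>",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "<client-cert-url>",
    }
)
```
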
@@ -0,0 +1,3 @@
#### Connection Details

{% partial file="/v1.5/connectors/database/aws.md" /%}
@@ -0,0 +1,25 @@
#### Connection Details

- **Username**: Specify the User to connect to Greenplum. It should have enough privileges to read all the metadata.
- **Auth Type**: Basic Auth or IAM-based auth to connect to instances / cloud RDS.
- **Basic Auth**:
- **Password**: Password to connect to Greenplum
- **IAM Based Auth**:
{% partial file="/v1.5/connectors/database/aws.md" /%}

- **Host and Port**: Enter the fully qualified hostname and port number for your Greenplum deployment in the Host and Port field.


**SSL Modes**

Greenplum supports several SSL modes, which can be added to ConnectionArguments. They are as follows:
- **disable**: SSL is disabled and the connection is not encrypted.
- **allow**: SSL is used if the server requires it.
- **prefer**: SSL is used if the server supports it.
- **require**: SSL is required.
- **verify-ca**: SSL must be used and the server certificate must be verified.
- **verify-full**: SSL must be used. The server certificate must be verified, and the server hostname must match the hostname attribute on the certificate.

**SSL Configuration**

In order to integrate SSL in the Metadata Ingestion Config, add the SSL config under `sslConfig`, which is placed under `source`.
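
As a rough illustration of how the SSL mode and certificate paths translate into connection arguments, here is a sketch with SQLAlchemy and `psycopg2`; the host, credentials, and file paths are placeholders, and the exact keys under `sslConfig` in your ingestion config may differ.

```python
from sqlalchemy import create_engine

# Placeholders throughout; Greenplum speaks the PostgreSQL wire protocol, so the
# libpq SSL arguments below correspond to the SSL modes listed above.
engine = create_engine(
    "postgresql+psycopg2://gp_user:gp_password@greenplum-host:5432/analytics",
    connect_args={
        "sslmode": "verify-full",
        "sslrootcert": "/path/to/ca.pem",
        "sslcert": "/path/to/client-cert.pem",
        "sslkey": "/path/to/client-key.pem",
    },
)

with engine.connect() as conn:
    print(conn.exec_driver_sql("SELECT version()").scalar())
```
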
@@ -0,0 +1,30 @@
#### Connection Details

- **Username**: Specify the User to connect to Hive. It should have enough privileges to read all the metadata.
- **Password**: Password to connect to Hive.
- **Host and Port**: This parameter specifies the host and port of the Hive server instance. This should be specified as a string in the format `hostname:port`. For example, you might set the hostPort parameter to `myhivehost:10000`.
- **Auth Options (Optional)**: The auth parameter specifies the authentication method to use when connecting to the Hive server. Possible values are `LDAP`, `NONE`, `CUSTOM`, or `KERBEROS`. If you are using Kerberos authentication, you should set auth to `KERBEROS`. If you are using custom authentication, you should set auth to `CUSTOM` and provide additional options in the `authOptions` parameter.
- **Kerberos Service Name**: This parameter specifies the Kerberos service name to use for authentication. This should only be specified if using Kerberos authentication. The default value is `hive`.
- **Database Schema**: Schema of the data source. This is an optional parameter; use it if you would like to restrict the metadata reading to a single schema. When left blank, OpenMetadata Ingestion attempts to scan all the schemas.
- **Database Name**: Optional name to give to the database in OpenMetadata. If left blank, we will use default as the database name.
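
To illustrate how the host, port, and auth options fit together, here is a hedged sketch using `PyHive`; the host and username are placeholders.

```python
from pyhive import hive

# Kerberos example: no password is passed; the Kerberos ticket cache is used instead.
connection = hive.Connection(
    host="myhivehost",             # placeholder host
    port=10000,
    username="hive_user",          # placeholder user
    auth="KERBEROS",
    kerberos_service_name="hive",  # default service name mentioned above
)

cursor = connection.cursor()
cursor.execute("SHOW DATABASES")
print(cursor.fetchall())
```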


#### For MySQL Metastore Connection

You can also ingest the metadata using a MySQL metastore. This step is optional; if metastore details are not provided, then we will query the Hive server directly.

- **Username**: Specify the User to connect to MySQL Metastore. It should have enough privileges to read all the metadata.
- **Password**: Password to connect to MySQL.
- **Host and Port**: Enter the fully qualified hostname and port number for your MySQL Metastore deployment in the Host and Port field in the format `hostname:port`.
- **databaseSchema**: Enter the database schema which is associated with the metastore.

{% partial file="/v1.5/connectors/database/advanced-configuration.md" /%}

#### For Postgres Metastore Connection

You can also ingest the metadata using a Postgres metastore. This step is optional; if metastore details are not provided, then we will query the Hive server directly.

- **Username**: Specify the User to connect to Postgres Metastore. It should have enough privileges to read all the metadata.
- **Password**: Password to connect to Postgres.
- **Host and Port**: Enter the fully qualified hostname and port number for your Postgres deployment in the Host and Port field in the format `hostname:port`.
- **Database**: Initial Postgres database to connect to. Specify the name of the database associated with the metastore instance.