diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md
index f0cab6da677..96a3c47b2a6 100644
--- a/docs/hadoop-catalog-with-adls.md
+++ b/docs/hadoop-catalog-with-adls.md
@@ -10,8 +10,11 @@ This document describes how to configure a Hadoop catalog with ADLS (Azure Blob
 
 ## Prerequisites
 
-In order to create a Hadoop catalog with ADLS, you need to place [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) in Gravitino Hadoop catalog classpath located
-at `${GRAVITINO_HOME}/catalogs/hadoop/libs//`. After that, start Gravitino server with the following command:
+To set up a Hadoop catalog with ADLS, follow these steps:
+
+1. Download the [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) file.
+2. Place the downloaded file into the Gravitino Hadoop catalog classpath at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
+3. Start the Gravitino server by running the following command:
 
 ```bash
 $ bin/gravitino-server.sh start
@@ -21,7 +24,7 @@ $ bin/gravitino-server.sh start
 
 The rest of this document shows how to use the Hadoop catalog with ADLS in Gravitino with a full example.
 
-### Create a ADLS Hadoop catalog
+### Configuration for an ADLS Hadoop catalog
 
 Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS:
@@ -32,18 +35,20 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./
 | `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating |
 | `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating |
 
-### Create a schema
+### Configuration for a schema
 
 Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details.
 
-### Create a fileset
+### Configuration for a fileset
 
 Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details.
 
 ## Using Hadoop catalog with ADLS
 
-### Create a Hadoop catalog/schema/fileset with ADLS
+This section demonstrates how to use the Hadoop catalog with ADLS in Gravitino, with a complete example.
+
+### Step 1: Create a Hadoop catalog with ADLS
 
 First, you need to create a Hadoop catalog with ADLS. The following example shows how to create a Hadoop catalog with ADLS:
@@ -113,9 +118,9 @@ adls_properties = gravitino_client.create_catalog(name="example_catalog",
 
-Then create a schema and fileset in the catalog created above.
+### Step 2: Create a schema
 
-Using the following code to create a schema and a fileset:
+Once the catalog is created, you can create a schema. The following example shows how to create a schema:
 
@@ -163,6 +168,10 @@ catalog.as_schemas().create_schema(name="test_schema",
 
+### Step 3: Create a fileset
+
+After creating the schema, you can create a fileset. The following example shows how to create a fileset:
+
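For reference, creating the ADLS-backed fileset described in this step with the Gravitino Python client might look roughly like the sketch below. The server URI, metalake, catalog, schema, and fileset names, and the `abfss://` container path are placeholders, and keyword names such as `type=` may differ slightly between client versions:

```python
from gravitino import Fileset, GravitinoClient, NameIdentifier

# Placeholder endpoint and names; replace them with your own deployment values.
gravitino_client = GravitinoClient(uri="http://localhost:8090", metalake_name="example_metalake")
catalog = gravitino_client.load_catalog(name="example_catalog")

# Create a managed fileset whose files live under an ADLS (abfss://) location.
catalog.as_fileset_catalog().create_fileset(
    ident=NameIdentifier.of("test_schema", "example_fileset"),
    type=Fileset.Type.MANAGED,  # some client versions name this keyword fileset_type
    comment="Example fileset stored in ADLS",
    storage_location="abfss://container@account.dfs.core.windows.net/path/example_fileset",
    properties={"k1": "v1"})
```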
@@ -221,6 +230,8 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema
 
+## Accessing a fileset with ADLS
+
 ### Using Spark to access the fileset
 
 The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset:
diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md
index 1466cb21754..ce23fdd1f3b 100644
--- a/docs/hadoop-catalog-with-gcs.md
+++ b/docs/hadoop-catalog-with-gcs.md
@@ -21,8 +21,7 @@ $ bin/gravitino-server.sh start
 
 The rest of this document shows how to use the Hadoop catalog with GCS in Gravitino with a full example.
 
-
-### Create a GCS Hadoop catalog
+### Configuration for a GCS Hadoop catalog
 
 Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS:
@@ -32,17 +31,19 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./
 | `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating |
 | `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset. | 0.7.0-incubating |
 
-### Create a schema
+### Configuration for a schema
 
 Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details.
 
-### Create a fileset
+### Configuration for a fileset
 
 Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details.
 
 ## Using Hadoop catalog with GCS
 
-### Create a Hadoop catalog/schema/fileset with GCS
+This section shows how to use the Hadoop catalog with GCS in Gravitino, including detailed examples.
+
+### Step 1: Create a Hadoop catalog with GCS
 
 First, you need to create a Hadoop catalog with GCS. The following example shows how to create a Hadoop catalog with GCS:
@@ -109,9 +110,9 @@ gcs_properties = gravitino_client.create_catalog(name="test_catalog",
 
-Then create a schema and a fileset in the catalog created above.
+### Step 2: Create a schema
 
-Using the following code to create a schema and a fileset:
+Once you have created a Hadoop catalog with GCS, you can create a schema. The following example shows how to create a schema:
 
@@ -159,6 +160,11 @@ catalog.as_schemas().create_schema(name="test_schema",
 
+
+### Step 3: Create a fileset
+
+After creating a schema, you can create a fileset. The following example shows how to create a fileset:
+
@@ -217,6 +223,8 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema
 
+## Accessing a fileset with GCS
+
 ### Using Spark to access the fileset
 
 The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset:
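As a rough illustration of the Spark access pattern that these documents describe, a PySpark session configured for GVFS can read and write fileset paths directly. The server URI, metalake name, and path components below are placeholders, the `fs.gravitino.client.metalake` key and the `gvfs://fileset/...` layout are assumptions based on the GVFS documentation, and the Gravitino and cloud-storage bundle jars still need to be on the Spark classpath:

```python
from pyspark.sql import SparkSession

# Placeholder URI and names; replace them with your own deployment values.
spark = (
    SparkSession.builder
    .appName("gvfs_fileset_example")
    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090")
    .config("spark.hadoop.fs.gravitino.client.metalake", "example_metalake")
    .getOrCreate()
)

# Virtual paths follow gvfs://fileset/{catalog}/{schema}/{fileset}/..., and Gravitino
# resolves them to the real storage location (GCS, ADLS, OSS, or S3).
df = spark.range(0, 100)
df.write.mode("overwrite").parquet("gvfs://fileset/test_catalog/test_schema/example_fileset/numbers")
spark.read.parquet("gvfs://fileset/test_catalog/test_schema/example_fileset/numbers").show(5)
```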
diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md
index 9afdb2e9c79..15e481087c9 100644
--- a/docs/hadoop-catalog-with-oss.md
+++ b/docs/hadoop-catalog-with-oss.md
@@ -6,22 +6,26 @@ keyword: Hadoop catalog OSS
 license: "This software is licensed under the Apache License version 2."
 ---
 
-This document describes how to configure a Hadoop catalog with Aliyun OSS.
+This document explains how to configure a Hadoop catalog with Aliyun OSS (Object Storage Service) in Gravitino.
 
 ## Prerequisites
 
-In order to create a Hadoop catalog with OSS, you need to place [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) in Gravitino Hadoop catalog classpath located
-at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. After that, start Gravitino server with the following command:
+To set up a Hadoop catalog with OSS, follow these steps:
+
+1. Download the [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) file.
+2. Place the downloaded file into the Gravitino Hadoop catalog classpath at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
+3. Start the Gravitino server by running the following command:
 
 ```bash
 $ bin/gravitino-server.sh start
 ```
+Once the server is up and running, you can proceed to configure the Hadoop catalog with OSS.
 
 ## Create a Hadoop Catalog with OSS
 
-### Create an OSS Hadoop catalog
+### Configuration for an OSS Hadoop catalog
 
-Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS:
+In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS:
 
 | Configuration item | Description | Default value | Required | Since version |
 |-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------|------------------|
@@ -31,22 +35,21 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./
 | `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating |
 | `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating |
 
-### Create a schema
-
-Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details.
+### Configuration for a schema
 
-### Create a fileset
+To create a schema, refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations).
 
-Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details.
+### Configuration for a fileset
+For instructions on how to create a fileset, refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations).
 
 ## Using Hadoop catalog with OSS
 
-The rest of this document shows how to use the Hadoop catalog with OSS in Gravitino with a full example.
+This section shows how to use the Hadoop catalog with OSS in Gravitino, including detailed examples.
 
-### Create a Hadoop catalog/schema/fileset with OSS
+### Step 1: Create a Hadoop catalog with OSS
 
-First, you need to create a Hadoop catalog with OSS. The following example shows how to create a Hadoop catalog with OSS:
+First, you need to create a Hadoop catalog for OSS. The following examples demonstrate how to create a Hadoop catalog with OSS:
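For orientation, an abbreviated Python-client sketch of this catalog-creation step might look like the following; the endpoint, credentials, bucket, and names are placeholders, and the property keys follow the OSS catalog property table (adjust them if your version differs):

```python
from gravitino import Catalog, GravitinoClient

# Placeholder endpoint and names; replace them with your own deployment values.
gravitino_client = GravitinoClient(uri="http://localhost:8090", metalake_name="example_metalake")

oss_catalog = gravitino_client.create_catalog(
    name="test_catalog",
    catalog_type=Catalog.Type.FILESET,
    provider="hadoop",
    comment="Hadoop catalog backed by Aliyun OSS",
    properties={
        "location": "oss://example-bucket/catalog",
        "oss-endpoint": "oss-cn-hangzhou.aliyuncs.com",  # region endpoint of the bucket
        "oss-access-key-id": "<access-key-id>",
        "oss-secret-access-key": "<secret-access-key>",
        "filesystem-providers": "oss",  # enables the OSS file system implementation
    })
```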
@@ -117,9 +120,9 @@ oss_catalog = gravitino_client.create_catalog(name="test_catalog",
 
-Then create a schema and a fileset in the catalog created above.
+### Step 2: Create a schema
 
-Using the following code to create a schema and a fileset:
+Once the Hadoop catalog with OSS is created, you can create a schema inside that catalog. Below are examples of how to do this:
 
@@ -167,6 +170,12 @@ catalog.as_schemas().create_schema(name="test_schema",
 
+
+### Step 3: Create a fileset
+
+Now that the schema is created, you can create a fileset inside it. The following examples show how to do this:
+
+
@@ -225,6 +234,8 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema
 
+## Accessing a fileset with OSS
+
 ### Using Spark to access the fileset
 
 The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset:
@@ -432,7 +443,7 @@ Spark:
 
 ```python
 spark = SparkSession.builder
-    .appName("oss_fielset_test")
+    .appName("oss_fileset_test")
     .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
    .config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_IP:PORT}")
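Both this OSS walkthrough and the S3 one that follows create a schema before creating filesets; a minimal Python-client sketch of that shared step might look like this, with all names as placeholders:

```python
from gravitino import GravitinoClient

# Placeholder endpoint and names; replace them with your own deployment values.
gravitino_client = GravitinoClient(uri="http://localhost:8090", metalake_name="example_metalake")
catalog = gravitino_client.load_catalog(name="test_catalog")

# Create a schema in the Hadoop catalog; properties may carry an optional location override.
catalog.as_schemas().create_schema(
    name="test_schema",
    comment="Schema for example filesets",
    properties={})
```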
diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md
index 3928f67dc55..c475dd9f29f 100644
--- a/docs/hadoop-catalog-with-s3.md
+++ b/docs/hadoop-catalog-with-s3.md
@@ -6,22 +6,28 @@ keyword: Hadoop catalog S3
 license: "This software is licensed under the Apache License version 2."
 ---
 
-This document describes how to configure a Hadoop catalog with S3.
+This document explains how to configure a Hadoop catalog with S3 in Gravitino.
 
 ## Prerequisites
 
-In order to create a Hadoop catalog with S3, you need to place [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) in Gravitino Hadoop catalog classpath located
-at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. After that, start Gravitino server with the following command:
+To create a Hadoop catalog with S3, follow these steps:
+
+1. Download the [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) file.
+2. Place this file in the Gravitino Hadoop catalog classpath at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
+3. Start the Gravitino server using the following command:
 
 ```bash
 $ bin/gravitino-server.sh start
 ```
+Once the server is running, you can proceed to create the Hadoop catalog with S3.
+
+
 
 ## Create a Hadoop Catalog with S3
 
-### Create a S3 Hadoop catalog
+### Configuration for an S3 Hadoop catalog
 
-Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with S3:
+In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are necessary to configure a Hadoop catalog with S3:
 
 | Configuration item | Description | Default value | Required | Since version |
 |-------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|------------------|
@@ -31,20 +37,20 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./
 | `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |
 | `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |
 
-### Create a schema
+### Configuration for a schema
 
-Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details.
+To learn how to create a schema, refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations).
 
-### Create a fileset
+### Configuration for a fileset
 
-Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details.
+For more details on creating a fileset, refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations).
 
-## Using Hadoop catalog with S3
+## Using the Hadoop catalog with S3
 
-The rest of this document shows how to use the Hadoop catalog with S3 in Gravitino with a full example.
+This section demonstrates how to use the Hadoop catalog with S3 in Gravitino, with a complete example.
 
-### Create a Hadoop catalog/schema/fileset with S3
+### Step 1: Create a Hadoop catalog with S3
 
 First of all, you need to create a Hadoop catalog with S3. The following example shows how to create a Hadoop catalog with S3:
@@ -118,12 +124,12 @@ s3_catalog = gravitino_client.create_catalog(name="test_catalog",
 
 :::note
-The value of location should always start with `s3a` NOT `s3` for AWS S3, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported due to the limitation of the hadoop-aws library.
+When using S3 with Hadoop, ensure that the location value starts with `s3a://` (not `s3://`) for AWS S3. For example, use `s3a://bucket/root`; the `s3://` format is not supported by the hadoop-aws library.
 :::
 
-Then create a schema and a fileset in the catalog created above.
+### Step 2: Create a schema
 
-Using the following code to create a schema and a fileset:
+Once your Hadoop catalog with S3 is created, you can create a schema under the catalog. Here are examples of how to do that:
 
@@ -172,6 +178,10 @@ catalog.as_schemas().create_schema(name="test_schema",
 
+### Step 3: Create a fileset
+
+After creating the schema, you can create a fileset. Here are examples for creating a fileset:
+
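Echoing the `s3a://` note above, a minimal Python-client sketch of this fileset step might look like the following; the names and bucket path are placeholders, and keyword names such as `type=` may differ slightly between client versions:

```python
from gravitino import Fileset, GravitinoClient, NameIdentifier

# Placeholder endpoint and names; replace them with your own deployment values.
gravitino_client = GravitinoClient(uri="http://localhost:8090", metalake_name="example_metalake")
catalog = gravitino_client.load_catalog(name="test_catalog")

# The storage location uses the s3a:// scheme; s3:// is not supported by hadoop-aws.
catalog.as_fileset_catalog().create_fileset(
    ident=NameIdentifier.of("test_schema", "example_fileset"),
    type=Fileset.Type.MANAGED,  # some client versions name this keyword fileset_type
    comment="Example fileset stored in S3",
    storage_location="s3a://example-bucket/root/test_schema/example_fileset",
    properties={"k1": "v1"})
```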
@@ -230,10 +240,11 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "e
 
+## Accessing a fileset with S3
 
 ### Using Spark to access the fileset
 
-The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset:
+The following Python code demonstrates how to use **PySpark 3.1.3 with Hadoop environment (Hadoop 3.2.0)** to access the fileset:
 
 ```python
 import logging