
Commit

Optimize the docs
yuqi1129 committed Jan 7, 2025
1 parent ab07455 commit 6c1aac3
Showing 4 changed files with 94 additions and 47 deletions.
33 changes: 25 additions & 8 deletions docs/hadoop-catalog-with-adls.md
@@ -10,8 +10,17 @@ This document describes how to configure a Hadoop catalog with ADLS (Azure Blob

## Prerequisites

In order to create a Hadoop catalog with ADLS, you need to place [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) in Gravitino Hadoop catalog classpath located
at `${GRAVITINO_HOME}/catalogs/hadoop/libs//`. After that, start Gravitino server with the following command:
To set up a Hadoop catalog with ADLS, follow these steps:

1. Download the [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) file.
2. Place the downloaded file into the Gravitino Hadoop catalog classpath at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
3. Start the Gravitino server by running the following command:

```bash
$ bin/gravitino-server.sh start
```
Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS.


@@ -21,7 +30,7 @@ $ bin/gravitino-server.sh start

The rest of this document shows how to use the Hadoop catalog with ADLS in Gravitino with a full example.

### Create a ADLS Hadoop catalog
### Configuration for an ADLS Hadoop catalog

Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS:

@@ -32,18 +41,20 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./
| `azure-storage-account-name` | The account name of Azure Blob Storage. | (none) | Yes if it's an Azure Blob Storage fileset. | 0.8.0-incubating |
| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's an Azure Blob Storage fileset. | 0.8.0-incubating |
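
As a quick illustration, the sketch below shows how these properties might be passed when creating the catalog with the Python client used later in this document. Treat the import path, the client constructor, the `catalog_type` keyword, the `filesystem-providers` value, and every endpoint, account, and metalake name as assumptions or placeholders to adapt to your environment.

```python
from gravitino import GravitinoClient, Catalog

# Placeholder server URI and metalake -- replace with your own.
gravitino_client = GravitinoClient(uri="http://localhost:8090", metalake_name="example_metalake")

adls_properties = {
    "location": "abfss://container@account.dfs.core.windows.net/path",  # placeholder ADLS location
    "azure-storage-account-name": "${ABS_ACCOUNT_NAME}",
    "azure-storage-account-key": "${ABS_ACCOUNT_KEY}",
    "filesystem-providers": "abs",  # assumption: the Azure Blob Storage provider name
}

adls_catalog = gravitino_client.create_catalog(
    name="example_catalog",
    catalog_type=Catalog.Type.FILESET,  # keyword name may differ by client version (assumption)
    provider="hadoop",
    comment="Hadoop catalog backed by ADLS",
    properties=adls_properties,
)
```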

### Create a schema
### Configuration for a schema

Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details.

### Create a fileset
### Configuration for a fileset

Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details.


## Using Hadoop catalog with ADLS

### Create a Hadoop catalog/schema/fileset with ADLS
This section demonstrates how to use the Hadoop catalog with ADLS in Gravitino, with a complete example.

### Step 1: Create a Hadoop catalog with ADLS

First, you need to create a Hadoop catalog with ADLS. The following example shows how to create a Hadoop catalog with ADLS:

@@ -113,9 +124,9 @@ adls_properties = gravitino_client.create_catalog(name="example_catalog",
</TabItem>
</Tabs>

Then create a schema and fileset in the catalog created above.
### Step 2: Create a schema

Using the following code to create a schema and a fileset:
Once the catalog is created, you can create a schema. The following example shows how to create a schema:

<Tabs groupId="language" queryString>
<TabItem value="shell" label="Shell">
@@ -163,6 +174,10 @@ catalog.as_schemas().create_schema(name="test_schema",
</TabItem>
</Tabs>

### Step 3: Create a fileset

After creating the schema, you can create a fileset. The following example shows how to create a fileset:

<Tabs groupId="language" queryString>
<TabItem value="shell" label="Shell">

@@ -221,6 +236,8 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema",
</TabItem>
</Tabs>

## Accessing a fileset with ADLS

### Using Spark to access the fileset

The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment (Hadoop 3.2.0)** to access the fileset:
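
A minimal PySpark sketch of this pattern follows. The three `fs.*` settings mirror the GVFS configuration shown for OSS later on this page; the metalake key, the `azure-storage-account-*` Spark options, and the catalog/schema/fileset names are assumptions or placeholders, and the Gravitino GVFS runtime jar plus the Azure bundle jar must be on Spark's classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("adls_fileset_example")
    # GVFS wiring for Gravitino-managed filesets
    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090")
    .config("spark.hadoop.fs.gravitino.client.metalake", "example_metalake")  # assumption
    # ADLS credentials, assumed to reuse the property names from the table above
    .config("spark.hadoop.azure-storage-account-name", "${ABS_ACCOUNT_NAME}")
    .config("spark.hadoop.azure-storage-account-key", "${ABS_ACCOUNT_KEY}")
    .getOrCreate()
)

# gvfs:// paths address a fileset as gvfs://fileset/{catalog}/{schema}/{fileset}/...
path = "gvfs://fileset/example_catalog/test_schema/example_fileset/people"
spark.range(10).write.mode("overwrite").parquet(path)
spark.read.parquet(path).show()
```
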
22 changes: 15 additions & 7 deletions docs/hadoop-catalog-with-gcs.md
@@ -21,8 +21,7 @@ $ bin/gravitino-server.sh start

The rest of this document shows how to use the Hadoop catalog with GCS in Gravitino with a full example.


### Create a GCS Hadoop catalog
### Configuration for a GCS Hadoop catalog

Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS:

@@ -32,17 +31,19 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./
| `default-filesystem-provider` | The name of the default filesystem provider for this Hadoop catalog, used when the URI does not specify a scheme. The default value is `builtin-local`; for GCS, setting this to the GCS provider lets you omit the `gs://` prefix in the location. | `builtin-local` | No | 0.7.0-incubating |
| `gcs-service-account-file` | The path of the GCS service account JSON file. | (none) | Yes if it's a GCS fileset. | 0.7.0-incubating |
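
As a brief sketch, these properties might feed into catalog creation with the Python client as shown below; the import path, the `catalog_type` keyword, the `filesystem-providers` value, and all paths and names are assumptions or placeholders.

```python
from gravitino import GravitinoClient, Catalog

gravitino_client = GravitinoClient(uri="http://localhost:8090", metalake_name="example_metalake")

gcs_properties = {
    "location": "gs://example-bucket/path",                       # placeholder GCS location
    "gcs-service-account-file": "/path/to/service-account.json",
    "filesystem-providers": "gcs",                                # assumption: the GCS provider name
}

gcs_catalog = gravitino_client.create_catalog(
    name="test_catalog",
    catalog_type=Catalog.Type.FILESET,  # keyword name may differ by client version (assumption)
    provider="hadoop",
    comment="Hadoop catalog backed by GCS",
    properties=gcs_properties,
)
```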

### Create a schema
### Configuration for a schema

Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details.

### Create a fileset
### Configuration for a fileset

Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details.

## Using Hadoop catalog with GCS

### Create a Hadoop catalog/schema/fileset with GCS
This section will show you how to use the Hadoop catalog with GCS in Gravitino, including detailed examples.

### Step 1: Create a Hadoop catalog with GCS

First, you need to create a Hadoop catalog with GCS. The following example shows how to create a Hadoop catalog with GCS:

@@ -109,9 +110,9 @@ gcs_properties = gravitino_client.create_catalog(name="test_catalog",
</TabItem>
</Tabs>

Then create a schema and a fileset in the catalog created above.
### Step 2: Create a schema

Using the following code to create a schema and a fileset:
Once you have created a Hadoop catalog with GCS, you can create a schema. The following example shows how to create a schema:

<Tabs groupId="language" queryString>
<TabItem value="shell" label="Shell">
@@ -159,6 +160,11 @@ catalog.as_schemas().create_schema(name="test_schema",
</TabItem>
</Tabs>


### Step 3: Create a fileset

After creating a schema, you can create a fileset. The following example shows how to create a fileset:

<Tabs groupId="language" queryString>
<TabItem value="shell" label="Shell">

@@ -217,6 +223,8 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema",
</TabItem>
</Tabs>

## Accessing a fileset with GCS

### Using Spark to access the fileset

The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment (Hadoop 3.2.0)** to access the fileset:
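
A condensed PySpark sketch, parallel to the ADLS one earlier on this page; the metalake key and the `gcs-service-account-file` Spark option are assumptions, and all names and paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gcs_fileset_example")
    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090")
    .config("spark.hadoop.fs.gravitino.client.metalake", "example_metalake")           # assumption
    .config("spark.hadoop.gcs-service-account-file", "/path/to/service-account.json")  # assumption
    .getOrCreate()
)

# Read data back through the virtual fileset path (names are placeholders).
spark.read.parquet("gvfs://fileset/test_catalog/test_schema/example_fileset/data").show()
```
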
43 changes: 27 additions & 16 deletions docs/hadoop-catalog-with-oss.md
@@ -6,22 +6,26 @@ keyword: Hadoop catalog OSS
license: "This software is licensed under the Apache License version 2."
---

This document describes how to configure a Hadoop catalog with Aliyun OSS.
This document explains how to configure a Hadoop catalog with Aliyun OSS (Object Storage Service) in Gravitino.

## Prerequisites

In order to create a Hadoop catalog with OSS, you need to place [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) in Gravitino Hadoop catalog classpath located
at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. After that, start Gravitino server with the following command:
To set up a Hadoop catalog with OSS, follow these steps:

1. Download the [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) file.
2. Place the downloaded file into the Gravitino Hadoop catalog classpath at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
3. Start the Gravitino server by running the following command:

```bash
$ bin/gravitino-server.sh start
```
Once the server is up and running, you can proceed to configure the Hadoop catalog with OSS.

## Create a Hadoop Catalog with OSS

### Create an OSS Hadoop catalog
### Configuration for an OSS Hadoop catalog

Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS:
In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS:

| Configuration item | Description | Default value | Required | Since version |
|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------|------------------|
@@ -31,22 +35,21 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./
| `oss-access-key-id` | The access key of Aliyun OSS. | (none) | Yes if it's an OSS fileset. | 0.7.0-incubating |
| `oss-secret-access-key` | The secret key of Aliyun OSS. | (none) | Yes if it's an OSS fileset. | 0.7.0-incubating |
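
A short sketch of how these properties might be used when creating the catalog with the Python client; the endpoint key, the provider value, the keyword names, and all names and paths are assumptions or placeholders.

```python
from gravitino import GravitinoClient, Catalog

gravitino_client = GravitinoClient(uri="http://localhost:8090", metalake_name="example_metalake")

oss_properties = {
    "location": "oss://example-bucket/root",                # placeholder OSS location
    "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",  # assumption: the endpoint of your OSS region
    "oss-access-key-id": "${OSS_ACCESS_KEY_ID}",
    "oss-secret-access-key": "${OSS_SECRET_ACCESS_KEY}",
    "filesystem-providers": "oss",                          # assumption: the OSS provider name
}

oss_catalog = gravitino_client.create_catalog(
    name="test_catalog",
    catalog_type=Catalog.Type.FILESET,  # keyword name may differ by client version (assumption)
    provider="hadoop",
    comment="Hadoop catalog backed by Aliyun OSS",
    properties=oss_properties,
)
```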

### Create a schema

Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details.
### Configuration for a schema

### Create a fileset
To create a schema, refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations).

Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details.
### Configuration for a fileset

For instructions on how to create a fileset, refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations).

## Using Hadoop catalog with OSS

The rest of this document shows how to use the Hadoop catalog with OSS in Gravitino with a full example.
This section will show you how to use the Hadoop catalog with OSS in Gravitino, including detailed examples.

### Create a Hadoop catalog/schema/fileset with OSS
### Step 1: Create a Hadoop catalog with OSS

First, you need to create a Hadoop catalog with OSS. The following example shows how to create a Hadoop catalog with OSS:
First, you need to create a Hadoop catalog for OSS. The following examples demonstrate how to create a Hadoop catalog with OSS:

<Tabs groupId="language" queryString>
<TabItem value="shell" label="Shell">
@@ -117,9 +120,9 @@ oss_catalog = gravitino_client.create_catalog(name="test_catalog",
</TabItem>
</Tabs>

Then create a schema and a fileset in the catalog created above.
### Step 2: Create a schema

Using the following code to create a schema and a fileset:
Once the Hadoop catalog with OSS is created, you can create a schema inside that catalog. Below are examples of how to do this:

<Tabs groupId="language" queryString>
<TabItem value="shell" label="Shell">
@@ -167,6 +170,12 @@ catalog.as_schemas().create_schema(name="test_schema",
</TabItem>
</Tabs>


### Step 3: Create a fileset

Now that the schema is created, you can create a fileset inside it. Here’s how:


<Tabs groupId="language" queryString>
<TabItem value="shell" label="Shell">

@@ -225,6 +234,8 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema",
</TabItem>
</Tabs>

## Accessing a fileset with OSS

### Using Spark to access the fileset

The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment (Hadoop 3.2.0)** to access the fileset:
@@ -432,7 +443,7 @@ Spark:

```python
spark = SparkSession.builder
.appName("oss_fielset_test")
.appName("oss_fileset_test")
.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
.config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_IP:PORT}")
43 changes: 27 additions & 16 deletions docs/hadoop-catalog-with-s3.md
@@ -6,22 +6,28 @@ keyword: Hadoop catalog S3
license: "This software is licensed under the Apache License version 2."
---

This document describes how to configure a Hadoop catalog with S3.
This document explains how to configure a Hadoop catalog with S3 in Gravitino.

## Prerequisites

In order to create a Hadoop catalog with S3, you need to place [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) in Gravitino Hadoop catalog classpath located
at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. After that, start Gravitino server with the following command:
To create a Hadoop catalog with S3, follow these steps:

1. Download the [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) file.
2. Place this file in the Gravitino Hadoop catalog classpath at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
3. Start the Gravitino server using the following command:

```bash
$ bin/gravitino-server.sh start
```

Once the server is running, you can proceed to create the Hadoop catalog with S3.


## Create a Hadoop Catalog with S3

### Create a S3 Hadoop catalog
### Configuration for an S3 Hadoop catalog

Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with S3:
In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are necessary to configure a Hadoop catalog with S3:

| Configuration item | Description | Default value | Required | Since version |
|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|------------------|
@@ -31,20 +37,20 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./
| `s3-access-key-id` | The access key of AWS S3. | (none) | Yes if it's an S3 fileset. | 0.7.0-incubating |
| `s3-secret-access-key` | The secret key of AWS S3. | (none) | Yes if it's an S3 fileset. | 0.7.0-incubating |
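
For orientation, a sketch of how these properties might be passed when creating the catalog with the Python client; the endpoint key, the provider value, the keyword names, and all names are assumptions or placeholders. Note the `s3a://` location scheme, which is explained further below.

```python
from gravitino import GravitinoClient, Catalog

gravitino_client = GravitinoClient(uri="http://localhost:8090", metalake_name="example_metalake")

s3_properties = {
    "location": "s3a://example-bucket/root",                  # use the s3a scheme, see the note below
    "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",  # assumption: your S3/region endpoint
    "s3-access-key-id": "${S3_ACCESS_KEY_ID}",
    "s3-secret-access-key": "${S3_SECRET_ACCESS_KEY}",
    "filesystem-providers": "s3",                             # assumption: the S3 provider name
}

s3_catalog = gravitino_client.create_catalog(
    name="test_catalog",
    catalog_type=Catalog.Type.FILESET,  # keyword name may differ by client version (assumption)
    provider="hadoop",
    comment="Hadoop catalog backed by AWS S3",
    properties=s3_properties,
)
```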

### Create a schema
### Configuration for a schema

Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details.
To learn how to create a schema, refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations).

### Create a fileset
### Configuration for a fileset

Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details.
For more details on creating a fileset, refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations).


## Using Hadoop catalog with S3
## Using the Hadoop catalog with S3

The rest of this document shows how to use the Hadoop catalog with S3 in Gravitino with a full example.
This section demonstrates how to use the Hadoop catalog with S3 in Gravitino, with a complete example.

### Create a Hadoop catalog/schema/fileset with S3
### Step 1: Create a Hadoop catalog with S3

First, you need to create a Hadoop catalog with S3. The following example shows how to create one:

@@ -118,12 +124,12 @@ s3_catalog = gravitino_client.create_catalog(name="test_catalog",
</Tabs>

:::note
The value of location should always start with `s3a` NOT `s3` for AWS S3, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported due to the limitation of the hadoop-aws library.
When using S3 with Hadoop, ensure that the location value starts with `s3a://` (not `s3://`) for AWS S3. For example, use `s3a://bucket/root`, as the `s3://` format is not supported by the `hadoop-aws` library.
:::
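
To make the note concrete, here is a hypothetical pair of location values (the bucket name is a placeholder):

```python
supported_location = "s3a://example-bucket/root"    # s3a scheme: works with hadoop-aws
unsupported_location = "s3://example-bucket/root"   # s3 scheme: not supported by hadoop-aws
```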

Then create a schema and a fileset in the catalog created above.
### Step 2: Create a schema

Using the following code to create a schema and a fileset:
Once your Hadoop catalog with S3 is created, you can create a schema under the catalog. Here are examples of how to do that:

<Tabs groupId="language" queryString>
<TabItem value="shell" label="Shell">
@@ -172,6 +178,10 @@ catalog.as_schemas().create_schema(name="test_schema",
</TabItem>
</Tabs>

### Step 3: Create a fileset

After creating the schema, you can create a fileset. Here are examples for creating a fileset:

<Tabs groupId="language" queryString>
<TabItem value="shell" label="Shell">

@@ -230,10 +240,11 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "e
</TabItem>
</Tabs>

## Accessing a fileset with S3

### Using Spark to access the fileset

The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset:
The following Python code demonstrates how to use **PySpark 3.1.3 with Hadoop environment (Hadoop 3.2.0)** to access the fileset:

```python
import logging
