From c4fb29a2f3589a998b6a8d8436202618340a1f72 Mon Sep 17 00:00:00 2001 From: yuqi Date: Thu, 2 Jan 2025 12:54:12 +0800 Subject: [PATCH 01/39] fix --- docs/cloud-storage-fileset-example.md | 678 ++++++++++++++++++++++++++ docs/hadoop-catalog.md | 14 +- docs/how-to-use-gvfs.md | 74 +-- 3 files changed, 721 insertions(+), 45 deletions(-) create mode 100644 docs/cloud-storage-fileset-example.md diff --git a/docs/cloud-storage-fileset-example.md b/docs/cloud-storage-fileset-example.md new file mode 100644 index 00000000000..17d6d24ff8c --- /dev/null +++ b/docs/cloud-storage-fileset-example.md @@ -0,0 +1,678 @@ +--- +title: "How to use cloud storage fileset" +slug: /how-to-use-cloud-storage-fileset +keyword: fileset S3 GCS ADLS OSS +license: "This software is licensed under the Apache License version 2." +--- + +This document aims to provide a comprehensive guide on how to use cloud storage fileset created by Gravitino, it usually contains the following sections: + +## Necessary steps in Gravitino server + +### Start up Gravitino server + +Before running the Gravitino server, you need to put the following jars into the fileset catalog classpath located at `${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you need to put `gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the fileset catalog classpath in `${GRAVITINO_HOME}/catalogs/hadoop/libs`. + +| Storage type | Description | Jar file | Since Version | +|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------| +| Local | The local file system. | (none) | 0.5.0 | +| HDFS | HDFS file system. | (none) | 0.5.0 | +| S3 | AWS S3. | [gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle) | 0.8.0-incubating | +| GCS | Google Cloud Storage. | [gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle) | 0.8.0-incubating | +| OSS | Aliyun OSS. | [gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle) | 0.8.0-incubating | +| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2) | [gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle) | 0.8.0-incubating | + +After adding the jars into the fileset catalog classpath, you can start up the Gravitino server by running the following command: + +```shell +cd ${GRAVITINO_HOME} +bin/gravitino.sh start +``` + +### Bundle jars + +Gravitino bundles jars are jars that are used to access the cloud storage, they are divided into two categories: + +- `gravitino-${aws,gcp,aliyun,azure}-bundle-{gravitino-version}.jar` are the jars that contain all the necessary dependencies to access the corresponding cloud storages. For instance, `gravitino-aws-bundle-${gravitino-version}.jar` contains the all necessary classes including `hadoop-common`(hadoop-3.3.1) and `hadoop-aws` to access the S3 storage. +They are used in the scenario where there is no hadoop environment in the runtime. + +- If there is already hadoop environment in the runtime, you can use the `gravitino-${aws,gcp,aliyun,azure}-${gravitino-version}.jar` that does not contain the cloud storage classes (like hadoop-aws) and hadoop environment. Alternatively, you can manually add the necessary jars to the classpath. 
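For example, assuming an existing Hadoop 3.x client and an S3 fileset, one way to put the thin jars on the classpath is through `HADOOP_CLASSPATH`. The sketch below is illustrative only; the jar paths and versions are placeholders that depend on your deployment:

```shell
# Illustrative only: add the Gravitino thin jars and the Hadoop tools jars
# (hadoop-aws, aws-java-sdk, etc.) to the client classpath for an S3 fileset.
export HADOOP_CLASSPATH=/path/to/gravitino-aws-{gravitino-version}.jar:\
/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar:\
${HADOOP_HOME}/share/hadoop/tools/lib/*:${HADOOP_CLASSPATH}
```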
+ +The following table demonstrates which jars are necessary for different cloud storage filesets: + +| Hadoop runtime version | S3 | GCS | OSS | ABS | +|------------------------|------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------| +| No Hadoop environment | `gravitino-aws-bundle-${gravitino-version}.jar` | `gravitino-gcp-bundle-${gravitino-version}.jar` | `gravitino-aliyun-bundle-${gravitino-version}.jar` | `gravitino-azure-bundle-${gravitino-version}.jar` | +| 2.x, 3.x | `gravitino-aws-${gravitino-version}.jar`, `hadoop-aws-${hadoop-version}.jar`, `aws-sdk-java-${version}` and other necessary dependencies | `gravitino-gcp-{gravitino-version}.jar`, `gcs-connector-${hadoop-version}`.jar, other necessary dependencies | `gravitino-aliyun-{gravitino-version}.jar`, hadoop-aliyun-{hadoop-version}.jar, aliyun-sdk-java-{version} and other necessary dependencies | `gravitino-azure-${gravitino-version}.jar`, `hadoop-azure-${hadoop-version}.jar`, and other necessary dependencies | + +For `hadoop-aws-${hadoop-version}.jar`, `hadoop-azure-${hadoop-version}.jar` and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies, you can get them from `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. +For `gcs-connector`, you can download it from the [GCS connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases) for hadoop2 or hadoop3. + +If there still have some issues, please report it to the Gravitino community and create an issue. + +## Create fileset catalogs + +Once the Gravitino server is started, you can create the corresponding fileset by the following sentence: + + +### Create a S3 fileset catalog + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "catalog", + "type": "FILESET", + "comment": "comment", + "provider": "hadoop", + "properties": { + "location": "s3a://bucket/root", + "s3-access-key-id": "access_key", + "s3-secret-access-key": "secret_key", + "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com", + "filesystem-providers": "s3" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +s3Properties = ImmutableMap.builder() + .put("location", "s3a://bucket/root") + .put("s3-access-key-id", "access_key") + .put("s3-secret-access-key", "secret_key") + .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") + .put("filesystem-providers", "s3") + .build(); + +Catalog s3Catalog = gravitinoClient.createCatalog("catalog", + Type.FILESET, + "hadoop", // provider, Gravitino only supports "hadoop" for now. + "This is a S3 fileset catalog", + s3Properties); +// ... 
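// The returned handle can later be used to manage schemas and filesets under this
// catalog, e.g. via s3Catalog.asSchemas() and s3Catalog.asFilesetCatalog(), as shown
// in the sections below.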
+ +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") +s3_properties = { + "location": "s3a://bucket/root", + "s3-access-key-id": "access_key" + "s3-secret-access-key": "secret_key", + "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com" +} + +s3_catalog = gravitino_client.create_catalog(name="catalog", + type=Catalog.Type.FILESET, + provider="hadoop", + comment="This is a S3 fileset catalog", + properties=s3_properties) + +``` + + + + +:::note +The value of location should always start with `s3a` NOT `s3` for AWS S3, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported due to the limitation of the hadoop-aws library. +::: + +### Create a GCS fileset catalog + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "catalog", + "type": "FILESET", + "comment": "comment", + "provider": "hadoop", + "properties": { + "location": "gs://bucket/root", + "gcs-service-account-file": "path_of_gcs_service_account_file", + "filesystem-providers": "gcs" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +gcsProperties = ImmutableMap.builder() + .put("location", "gs://bucket/root") + .put("gcs-service-account-file", "path_of_gcs_service_account_file") + .put("filesystem-providers", "gcs") + .build(); + +Catalog gcsCatalog = gravitinoClient.createCatalog("catalog", + Type.FILESET, + "hadoop", // provider, Gravitino only supports "hadoop" for now. + "This is a GCS fileset catalog", + gcsProperties); +// ... + +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") + +gcs_properties = { + "location": "gcs://bucket/root", + "gcs_service_account_file": "path_of_gcs_service_account_file" +} + +s3_catalog = gravitino_client.create_catalog(name="catalog", + type=Catalog.Type.FILESET, + provider="hadoop", + comment="This is a GCS fileset catalog", + properties=gcs_properties) + +``` + + + + +:::note +The prefix of a GCS location should always start with `gs` for instance, `gs://bucket/root`. +::: + +### Create an OSS fileset catalog + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "catalog", + "type": "FILESET", + "comment": "comment", + "provider": "hadoop", + "properties": { + "location": "oss://bucket/root", + "oss-access-key-id": "access_key", + "oss-secret-access-key": "secret_key", + "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com", + "filesystem-providers": "oss" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +ossProperties = ImmutableMap.builder() + .put("location", "oss://bucket/root") + .put("oss-access-key-id", "access_key") + .put("oss-secret-access-key", "secret_key") + .put("oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com") + .put("filesystem-providers", "oss") + .build(); + +Catalog ossProperties = gravitinoClient.createCatalog("catalog", + Type.FILESET, + "hadoop", // provider, Gravitino only supports "hadoop" for now. + "This is a OSS fileset catalog", + ossProperties); +// ... 
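// Note that OSS locations use the oss:// scheme, e.g. oss://bucket/root as
// configured above.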
+ +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") +oss_properties = { + "location": "oss://bucket/root", + "oss-access-key-id": "access_key" + "oss-secret-access-key": "secret_key", + "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com" +} + +oss_catalog = gravitino_client.create_catalog(name="catalog", + type=Catalog.Type.FILESET, + provider="hadoop", + comment="This is a OSS fileset catalog", + properties=oss_properties) + +``` + +### Create an ABS (Azure Blob Storage or ADLS) fileset catalog + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "catalog", + "type": "FILESET", + "comment": "comment", + "provider": "hadoop", + "properties": { + "location": "abfss://container/root", + "abs-account-name": "The account name of the Azure Blob Storage", + "abs-account-key": "The account key of the Azure Blob Storage", + "filesystem-providers": "abs" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +absProperties = ImmutableMap.builder() + .put("location", "abfss://container/root") + .put("abs-account-name", "The account name of the Azure Blob Storage") + .put("abs-account-key", "The account key of the Azure Blob Storage") + .put("filesystem-providers", "abs") + .build(); + +Catalog gcsCatalog = gravitinoClient.createCatalog("catalog", + Type.FILESET, + "hadoop", // provider, Gravitino only supports "hadoop" for now. + "This is a Azure Blob storage fileset catalog", + absProperties); +// ... + +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") + +abs_properties = { + "location": "gcs://bucket/root", + "abs_account_name": "The account name of the Azure Blob Storage", + "abs_account_key": "The account key of the Azure Blob Storage" +} + +abs_catalog = gravitino_client.create_catalog(name="catalog", + type=Catalog.Type.FILESET, + provider="hadoop", + comment="This is a Azure Blob Storage fileset catalog", + properties=abs_properties) + +``` + + + + +note::: +The prefix of an ABS (Azure Blob Storage or ADLS (v2)) location should always start with `abfss` NOT `abfs`, for instance, `abfss://container/root`. Value like `abfs://container/root` is not supported. +::: + + +## Create fileset schema + +This part is the same for all cloud storage filesets, you can create the schema by the following sentence: + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "schema", + "comment": "comment", + "properties": { + "location": "file:///tmp/root/schema" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +// Assuming you have just created a Hadoop catalog named `catalog` +Catalog catalog = gravitinoClient.loadCatalog("catalog"); + +SupportsSchemas supportsSchemas = catalog.asSchemas(); + +Map schemaProperties = ImmutableMap.builder() + // Property "location" is optional, if specified all the managed fileset without + // specifying storage location will be stored under this location. 
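    // For a cloud storage catalog this would typically be a path under the catalog
    // location instead, e.g. s3a://bucket/root/schema for the S3 catalog above.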
+ .put("location", "file:///tmp/root/schema") + .build(); +Schema schema = supportsSchemas.createSchema("schema", + "This is a schema", + schemaProperties +); +// ... +``` + + + + +You can change the value of property `location` according to which catalog you are using, moreover, if we have set the `location` property in the catalog, we can omit the `location` property in the schema. + +## Create filesets + +The following sentences can be used to create a fileset in the schema: + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "example_fileset", + "comment": "This is an example fileset", + "type": "MANAGED", + "storageLocation": "s3a://bucket/root/schema/example_fileset", + "properties": { + "k1": "v1" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +Catalog catalog = gravitinoClient.loadCatalog("catalog"); +FilesetCatalog filesetCatalog = catalog.asFilesetCatalog(); + +Map propertiesMap = ImmutableMap.builder() + .put("k1", "v1") + .build(); + +filesetCatalog.createFileset( + NameIdentifier.of("schema", "example_fileset"), + "This is an example fileset", + Fileset.Type.MANAGED, + "s3a://bucket/root/schema/example_fileset", + propertiesMap, +); +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") + +catalog: Catalog = gravitino_client.load_catalog(name="catalog") +catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"), + type=Fileset.Type.MANAGED, + comment="This is an example fileset", + storage_location="s3a://bucket/root/schema/example_fileset", + properties={"k1": "v1"}) +``` + + + + +Similar to schema, the `storageLocation` is optional if you have set the `location` property in the schema or catalog. Please change the value of +`location` as the actual location you want to store the fileset. + +The example above is for S3 fileset, you can replace the `storageLocation` with the actual location of the GCS, OSS, or ABS fileset. 
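As an illustration, a fileset on GCS could be created the same way, only with a `gs://` storage location; the bucket and path below are placeholders:

```shell
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "example_fileset",
  "comment": "This is an example GCS fileset",
  "type": "MANAGED",
  "storageLocation": "gs://bucket/root/schema/example_fileset",
  "properties": {
    "k1": "v1"
  }
}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
```

The OSS and ABS cases only differ in the scheme, e.g. `oss://bucket/root/schema/example_fileset` and `abfss://container/root/schema/example_fileset` respectively.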
+ + +## Using Spark to access the fileset + +The following code snippet shows how to use **PySpark 3.1.3 with hadoop environment(hadoop 3.2.0)** to access the fileset: + +```python +import logging +from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient +from pyspark.sql import SparkSession +import os + +gravitino_url = "http://localhost:8090" +metalake_name = "test" + +catalog_name = "s3_catalog" +schema_name = "schema" +fileset_name = "example" + +## this is for S3 +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) + .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) + .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() + +### this is for GCS +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() + +### this is for OSS +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/aliyun-sdk-oss-2.8.3.jar,/path/to/hadoop-aliyun-3.2.0.jar,/path/to/jdom-1.1.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) + .config("spark.hadoop.oss-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) + .config("spark.hadoop.oss-endpoint", "https://oss-cn-shanghai.aliyuncs.com") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() +spark.sparkContext.setLogLevel("DEBUG") + +### this is for ABS +os.environ["PYSPARK_SUBMIT_ARGS"] = 
"--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.azure-storage-account-name", "azure_account_name") + .config("spark.hadoop.azure-storage-account-key", "azure_account_name") + .config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() + +data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] +columns = ["Name", "Age"] +spark_df = spark.createDataFrame(data, schema=columns) +gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" + +spark_df.coalesce(1).write + .mode("overwrite") + .option("header", "true") + .csv(gvfs_path) + +``` + +If your Spark without Hadoop environment, you can use the following code snippet to access the fileset: + +```python +## replace the env PYSPARK_SUBMIT_ARGS variable in the code above with the following content: +### S3 +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell" +### GCS +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell" +### OSS +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell" +#### Azure Blob Storage +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell" +``` + +:::note +**In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly.** +::: + +## Using fileset with hadoop fs command + +The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. + +1. 
Adding the following contents to the `${HADOOP_HOME}/etc/hadoop/core-site.xml` file: + +```xml + + fs.AbstractFileSystem.gvfs.impl + org.apache.gravitino.filesystem.hadoop.Gvfs + + + + fs.gvfs.impl + org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem + + + + fs.gravitino.server.uri + http://192.168.50.188:8090 + + + + fs.gravitino.client.metalake + test + + + + + s3-endpoint + http://s3.ap-northeast-1.amazonaws.com + + + s3-access-key-id + access-key + + + s3-secret-access-key + secret-key + + + + + oss-endpoint + https://oss-cn-shanghai.aliyuncs.com + + + oss-access-key-id + access_key + + + oss-secret-access-key + secret_key + + + + + gcs-service-account-file + /path/your-service-account-file.json + + + + + azure-storage-account-name + account_name + + + azure-storage-account-key + account_key + + + +``` + +2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. + +Copy the corresponding jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. For example, if you are using S3, you need to copy `gravitino-aws-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +then copy hadoop-aws-{version}.jar and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, for simple you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. + +More detail, please refer to the [Bundle jars](#bundle-jars) section. + + +3. Run the following command to access the fileset: + +```shell +hadoop dfs -ls gvfs://fileset/s3_catalog/schema/example +hadoop dfs -put /path/to/local/file gvfs://fileset/s3_catalog/schema/example +``` + +### Using fileset with pandas + +The following are examples of how to use the pandas library to access the S3 fileset + +```python +import pandas as pd + +storage_options = { + "server_uri": "http://localhost:8090", + "metalake_name": "test", + "options": { + "s3_access_key_id": "access_key", + "s3_secret_access_key": "secret_key", + "s3_endpoint": "http://s3.ap-northeast-1.amazonaws.com" + } +} +ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name}/people/part-00000-51d366e2-d5eb-448d-9109-32a96c8a14dc-c000.csv", + storage_options=storage_options) +ds.head() +``` + + diff --git a/docs/hadoop-catalog.md b/docs/hadoop-catalog.md index 9048556ffa5..cf86fde06e4 100644 --- a/docs/hadoop-catalog.md +++ b/docs/hadoop-catalog.md @@ -9,9 +9,9 @@ license: "This software is licensed under the Apache License version 2." ## Introduction Hadoop catalog is a fileset catalog that using Hadoop Compatible File System (HCFS) to manage -the storage location of the fileset. Currently, it supports local filesystem and HDFS. For -object storage like S3, GCS, Azure Blob Storage and OSS, you can put the hadoop object store jar like -`gravitino-aws-bundle-{gravitino-version}.jar` into the `$GRAVITINO_HOME/catalogs/hadoop/libs` directory to enable the support. +the storage location of the fileset. Currently, it supports the local filesystem and HDFS. Since 0.7.0-incubating, Gravitino supports S3, GCS, OSS and Azure Blob Storage fileset through Hadoop catalog. + +The rest of this document will use HDFS or local file as an example to illustrate how to use the Hadoop catalog. For S3, GCS, OSS and Azure Blob Storage, the configuration is similar to HDFS, but more properties need to be set. 
We will use [separate sections](./cloud-storage-fileset-example.md) to introduce how to use of S3, GCS, OSS and Azure Blob Storage. Note that Gravitino uses Hadoop 3 dependencies to build Hadoop catalog. Theoretically, it should be compatible with both Hadoop 2.x and 3.x, since Gravitino doesn't leverage any new features in @@ -50,8 +50,6 @@ Apart from the above properties, to access fileset like HDFS, S3, GCS, OSS or cu | `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | | `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | -At the same time, you need to place the corresponding bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`. - #### GCS fileset | Configuration item | Description | Default value | Required | Since version | @@ -60,8 +58,6 @@ At the same time, you need to place the corresponding bundle jar [`gravitino-aws | `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating | | `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset. | 0.7.0-incubating | -In the meantime, you need to place the corresponding bundle jar [`gravitino-gcp-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`. - #### OSS fileset | Configuration item | Description | Default value | Required | Since version | @@ -72,9 +68,6 @@ In the meantime, you need to place the corresponding bundle jar [`gravitino-gcp- | `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | | `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | -In the meantime, you need to place the corresponding bundle jar [`gravitino-aliyun-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aliyun-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`. - - #### Azure Blob Storage fileset | Configuration item | Description | Default value | Required | Since version | @@ -84,7 +77,6 @@ In the meantime, you need to place the corresponding bundle jar [`gravitino-aliy | `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | | `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | -Similar to the above, you need to place the corresponding bundle jar [`gravitino-azure-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-azure-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`. :::note - Gravitino contains builtin file system providers for local file system(`builtin-local`) and HDFS(`builtin-hdfs`), that is to say if `filesystem-providers` is not set, Gravitino will still support local file system and HDFS. 
Apart from that, you can set the `filesystem-providers` to support other file systems like S3, GCS, OSS or custom file system. diff --git a/docs/how-to-use-gvfs.md b/docs/how-to-use-gvfs.md index 0dbfd867a3d..9f34f45d072 100644 --- a/docs/how-to-use-gvfs.md +++ b/docs/how-to-use-gvfs.md @@ -43,7 +43,7 @@ the path mapping and convert automatically. ### Prerequisites + A Hadoop environment with HDFS running. GVFS has been tested against - Hadoop 3.1.0. It is recommended to use Hadoop 3.1.0 or later, but it should work with Hadoop 2. + Hadoop 3.3.0. It is recommended to use Hadoop 3.3.0 or later, but it should work with Hadoop 2. x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues. @@ -71,51 +71,51 @@ Apart from the above properties, to access fileset like S3, GCS, OSS and custom #### S3 fileset -| Configuration item | Description | Default value | Required | Since version | -|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|--------------------------|------------------| -| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating | -| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating | -| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|------------------------|-------------------------------|---------------|--------------------------|------------------| +| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating | +| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating | +| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating | At the same time, you need to add the corresponding bundle jar -1. [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/) in the classpath if no hadoop environment is available, or -2. [`gravitino-aws-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws/) and hadoop-aws jar and other necessary dependencies in the classpath. +1. [`gravitino-aws-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/) in the classpath if no Hadoop environment is available, or +2. [`gravitino-aws-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws/) and `hadoop-aws-${hadoop-version}.jar` and other necessary dependencies (They are usually located at `${HADOOP_HOME}/share/hadoop/tools/lib`) in the classpath. #### GCS fileset -| Configuration item | Description | Default value | Required | Since version | -|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|---------------------------|------------------| -| `gcs-service-account-file` | The path of GCS service account JSON file. 
| (none) | Yes if it's a GCS fileset.| 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|----------------------------|--------------------------------------------|---------------|---------------------------|------------------| +| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset.| 0.7.0-incubating | In the meantime, you need to add the corresponding bundle jar -1. [`gravitino-gcp-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp-bundle/) in the classpath if no hadoop environment is available, or -2. or [`gravitino-gcp-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp/) and [gcs-connector jar](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases) and other necessary dependencies in the classpath. +1. [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp-bundle/) in the classpath if no hadoop environment is available, or +2. [`gravitino-gcp-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp/) and [gcs-connector jar](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases) and other necessary dependencies in the classpath. #### OSS fileset -| Configuration item | Description | Default value | Required | Since version | -|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|---------------------------|------------------| -| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes if it's a OSS fileset.| 0.7.0-incubating | -| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset.| 0.7.0-incubating | -| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset.| 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|-------------------------|-----------------------------------|---------------|---------------------------|------------------| +| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes if it's a OSS fileset.| 0.7.0-incubating | +| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset.| 0.7.0-incubating | +| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset.| 0.7.0-incubating | In the meantime, you need to place the corresponding bundle jar -1. [`gravitino-aliyun-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aliyun-bundle/) in the classpath if no hadoop environment is available, or -2. [`gravitino-aliyun-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aliyun/) and hadoop-aliyun jar and other necessary dependencies in the classpath. +1. [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aliyun-bundle/) in the classpath if no hadoop environment is available, or +2. 
[`gravitino-aliyun-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aliyun/) and `hadoop-aliyun-${hadoop-version}.jar` and other necessary dependencies (They are usually located at `${HADOOP_HOME}/share/hadoop/tools/lib`) in the classpath. #### Azure Blob Storage fileset -| Configuration item | Description | Default value | Required | Since version | -|-----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|-------------------------------------------|------------------| -| `azure-storage-account-name` | The account name of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | -| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|------------------------------|-----------------------------------------|---------------|-------------------------------------------|------------------| +| `azure-storage-account-name` | The account name of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | +| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | Similar to the above, you need to place the corresponding bundle jar -1. [`gravitino-azure-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-azure-bundle/) in the classpath if no hadoop environment is available, or -2. [`gravitino-azure-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-azure/) and hadoop-azure jar and other necessary dependencies in the classpath. +1. [`gravitino-azure-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-azure-bundle/) in the classpath if no hadoop environment is available, or +2. [`gravitino-azure-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-azure/) and `hadoop-azure-${hadoop-version}.jar` and other necessary dependencies (They are usually located at `${HADOOP_HOME}/share/hadoop/tools/lib) in the classpath. #### Custom fileset Since 0.7.0-incubating, users can define their own fileset type and configure the corresponding properties, for more, please refer to [Custom Fileset](./hadoop-catalog.md#how-to-custom-your-own-hcfs-file-system-fileset). @@ -146,13 +146,8 @@ You can configure these properties in two ways: ``` :::note -If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle jars in the Hadoop environment. -For example, if you want to access the S3 fileset, you need to place -1. The aws hadoop bundle jar [`gravitino-aws-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/) -2. or [`gravitino-aws-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws/), and hadoop-aws jar and other necessary dependencies - -to the classpath, it typically locates in `${HADOOP_HOME}/share/hadoop/common/lib/`). 
- +If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle jars in the Hadoop environment, For bundles jar and +cloud storage fileset configuration example, please refer to [cloud storage fileset example](./cloud-storage-fileset-example.md). ::: 2. Configure the properties in the `core-site.xml` file of the Hadoop environment: @@ -209,6 +204,10 @@ two ways: ```shell ./gradlew :clients:filesystem-hadoop3-runtime:build -x test ``` +:::note +For cloud storage fileset, some extra steps should be added, please refer to [cloud storage fileset example](./cloud-storage-fileset-example.md). +::: + #### Via Hadoop shell command @@ -226,7 +225,6 @@ cp gravitino-filesystem-hadoop3-runtime-{version}.jar ${HADOOP_HOME}/share/hadoo # You need to ensure that the Kerberos has permission on the HDFS directory. kinit -kt your_kerberos.keytab your_kerberos@xxx.com - # 4. Copy other dependencies to the Hadoop environment if you want to access the S3 fileset via GVFS cp bundles/aws-bundle/build/libs/gravitino-aws-bundle-{version}.jar ${HADOOP_HOME}/share/hadoop/common/lib/ cp clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-{version}-SNAPSHOT.jar ${HADOOP_HOME}/share/hadoop/common/lib/ @@ -236,6 +234,8 @@ cp ${HADOOP_HOME}/share/hadoop/tools/lib/* ${HADOOP_HOME}/share/hadoop/common/li ./${HADOOP_HOME}/bin/hadoop dfs -ls gvfs://fileset/test_catalog/test_schema/test_fileset_1 ``` +Full example to access S3, GCS, OSS fileset via Hadoop shell command, please refer to [cloud storage fileset example](./cloud-storage-fileset-example.md). + #### Via Java code You can also perform operations on the files or directories managed by fileset through Java code. @@ -285,6 +285,9 @@ FileSystem fs = filesetPath.getFileSystem(conf); fs.getFileStatus(filesetPath); ``` +Full example to access S3, GCS, OSS fileset via Hadoop shell command, please refer to [cloud storage fileset example](./cloud-storage-fileset-example.md). + + #### Via Apache Spark 1. Add the GVFS runtime jar to the Spark environment. @@ -324,6 +327,7 @@ fs.getFileStatus(filesetPath); rdd.foreach(println) ``` +Full example to access S3, GCS, OSS fileset via Spark, please refer to [cloud storage fileset example](./cloud-storage-fileset-example.md). #### Via Tensorflow @@ -521,6 +525,8 @@ options = { fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) ``` +Full Python example to access S3, GCS, OSS fileset via GVFS, please refer to [cloud storage fileset example](./cloud-storage-fileset-example.md). + :::note Gravitino python client does not support customized filesets defined by users due to the limit of `fsspec` library. From baf42e19dd4a13dd571dd7c41452b639f5074391 Mon Sep 17 00:00:00 2001 From: yuqi Date: Fri, 3 Jan 2025 10:19:43 +0800 Subject: [PATCH 02/39] fix --- docs/cloud-storage-fileset-example.md | 29 +++++++++++++++++---------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/docs/cloud-storage-fileset-example.md b/docs/cloud-storage-fileset-example.md index 17d6d24ff8c..71e76c73a6d 100644 --- a/docs/cloud-storage-fileset-example.md +++ b/docs/cloud-storage-fileset-example.md @@ -5,7 +5,7 @@ keyword: fileset S3 GCS ADLS OSS license: "This software is licensed under the Apache License version 2." 
--- -This document aims to provide a comprehensive guide on how to use cloud storage fileset created by Gravitino, it usually contains the following sections: +This document aims to provide a comprehensive guide on how to use cloud storage fileset created by Gravitino, it usually contains the following sections. ## Necessary steps in Gravitino server @@ -31,24 +31,31 @@ bin/gravitino.sh start ### Bundle jars -Gravitino bundles jars are jars that are used to access the cloud storage, they are divided into two categories: +Gravitino bundles jars are used to access the cloud storage. They are divided into two categories: - `gravitino-${aws,gcp,aliyun,azure}-bundle-{gravitino-version}.jar` are the jars that contain all the necessary dependencies to access the corresponding cloud storages. For instance, `gravitino-aws-bundle-${gravitino-version}.jar` contains the all necessary classes including `hadoop-common`(hadoop-3.3.1) and `hadoop-aws` to access the S3 storage. They are used in the scenario where there is no hadoop environment in the runtime. - If there is already hadoop environment in the runtime, you can use the `gravitino-${aws,gcp,aliyun,azure}-${gravitino-version}.jar` that does not contain the cloud storage classes (like hadoop-aws) and hadoop environment. Alternatively, you can manually add the necessary jars to the classpath. -The following table demonstrates which jars are necessary for different cloud storage filesets: +If the Hadoop environment is available, you can use the following jars to access the cloud storage fileset: -| Hadoop runtime version | S3 | GCS | OSS | ABS | -|------------------------|------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------| -| No Hadoop environment | `gravitino-aws-bundle-${gravitino-version}.jar` | `gravitino-gcp-bundle-${gravitino-version}.jar` | `gravitino-aliyun-bundle-${gravitino-version}.jar` | `gravitino-azure-bundle-${gravitino-version}.jar` | -| 2.x, 3.x | `gravitino-aws-${gravitino-version}.jar`, `hadoop-aws-${hadoop-version}.jar`, `aws-sdk-java-${version}` and other necessary dependencies | `gravitino-gcp-{gravitino-version}.jar`, `gcs-connector-${hadoop-version}`.jar, other necessary dependencies | `gravitino-aliyun-{gravitino-version}.jar`, hadoop-aliyun-{hadoop-version}.jar, aliyun-sdk-java-{version} and other necessary dependencies | `gravitino-azure-${gravitino-version}.jar`, `hadoop-azure-${hadoop-version}.jar`, and other necessary dependencies | +- S3: `gravitino-aws-${gravitino-version}.jar`, `hadoop-aws-${hadoop-version}.jar`, `aws-sdk-java-${version}` and other necessary dependencies +- GCS: `gravitino-gcp-{gravitino-version}.jar`, `gcs-connector-${hadoop-version}`.jar, other necessary dependencies +- OSS: `gravitino-aliyun-{gravitino-version}.jar`, hadoop-aliyun-{hadoop-version}.jar, aliyun-sdk-java-{version} and other necessary dependencies +- ABS: `gravitino-azure-${gravitino-version}.jar`, `hadoop-azure-${hadoop-version}.jar`, and other necessary dependencies + +If there is no Hadoop environment, you can use the following jars to access the cloud storage fileset: + +- S3: 
`gravitino-aws-bundle-${gravitino-version}.jar` +- GCS: `gravitino-gcp-bundle-${gravitino-version}.jar` +- OSS: `gravitino-aliyun-bundle-${gravitino-version}.jar` +- ABS: `gravitino-azure-bundle-${gravitino-version}.jar` For `hadoop-aws-${hadoop-version}.jar`, `hadoop-azure-${hadoop-version}.jar` and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies, you can get them from `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. For `gcs-connector`, you can download it from the [GCS connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases) for hadoop2 or hadoop3. -If there still have some issues, please report it to the Gravitino community and create an issue. +If there are some issues, please consider [fill in an issue](https://github.com/apache/gravitino/issues/new/choose). ## Create fileset catalogs @@ -197,7 +204,7 @@ s3_catalog = gravitino_client.create_catalog(name="catalog", :::note -The prefix of a GCS location should always start with `gs` for instance, `gs://bucket/root`. +The prefix of a GCS location should always start with `gs`, for instance, `gs://bucket/root`. ::: ### Create an OSS fileset catalog @@ -389,7 +396,7 @@ Schema schema = supportsSchemas.createSchema("schema", -You can change the value of property `location` according to which catalog you are using, moreover, if we have set the `location` property in the catalog, we can omit the `location` property in the schema. +You can change `location` value based on the catalog you are using. If the `location` property is specified in the catalog, we can omit it in the schema. ## Create filesets @@ -562,7 +569,7 @@ os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gra ``` :::note -**In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly.** +In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. ::: ## Using fileset with hadoop fs command From b7eb62109c1efc7505fcd00ce8875284b17bf21d Mon Sep 17 00:00:00 2001 From: yuqi Date: Fri, 3 Jan 2025 19:35:02 +0800 Subject: [PATCH 03/39] fix --- docs/cloud-storage-fileset-example.md | 685 -------------------------- docs/hadoop-catalog-with-adls.md | 357 ++++++++++++++ docs/hadoop-catalog-with-gcs.md | 345 +++++++++++++ docs/hadoop-catalog-with-oss.md | 368 ++++++++++++++ docs/hadoop-catalog-with-s3.md | 372 ++++++++++++++ docs/hadoop-catalog.md | 4 +- docs/how-to-use-gvfs.md | 15 +- 7 files changed, 1445 insertions(+), 701 deletions(-) delete mode 100644 docs/cloud-storage-fileset-example.md create mode 100644 docs/hadoop-catalog-with-adls.md create mode 100644 docs/hadoop-catalog-with-gcs.md create mode 100644 docs/hadoop-catalog-with-oss.md create mode 100644 docs/hadoop-catalog-with-s3.md diff --git a/docs/cloud-storage-fileset-example.md b/docs/cloud-storage-fileset-example.md deleted file mode 100644 index 71e76c73a6d..00000000000 --- a/docs/cloud-storage-fileset-example.md +++ /dev/null @@ -1,685 +0,0 @@ ---- -title: "How to use cloud storage fileset" -slug: /how-to-use-cloud-storage-fileset -keyword: fileset S3 GCS ADLS OSS -license: "This software is licensed under the Apache License version 2." 
---- - -This document aims to provide a comprehensive guide on how to use cloud storage fileset created by Gravitino, it usually contains the following sections. - -## Necessary steps in Gravitino server - -### Start up Gravitino server - -Before running the Gravitino server, you need to put the following jars into the fileset catalog classpath located at `${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you need to put `gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the fileset catalog classpath in `${GRAVITINO_HOME}/catalogs/hadoop/libs`. - -| Storage type | Description | Jar file | Since Version | -|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------| -| Local | The local file system. | (none) | 0.5.0 | -| HDFS | HDFS file system. | (none) | 0.5.0 | -| S3 | AWS S3. | [gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle) | 0.8.0-incubating | -| GCS | Google Cloud Storage. | [gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle) | 0.8.0-incubating | -| OSS | Aliyun OSS. | [gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle) | 0.8.0-incubating | -| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2) | [gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle) | 0.8.0-incubating | - -After adding the jars into the fileset catalog classpath, you can start up the Gravitino server by running the following command: - -```shell -cd ${GRAVITINO_HOME} -bin/gravitino.sh start -``` - -### Bundle jars - -Gravitino bundles jars are used to access the cloud storage. They are divided into two categories: - -- `gravitino-${aws,gcp,aliyun,azure}-bundle-{gravitino-version}.jar` are the jars that contain all the necessary dependencies to access the corresponding cloud storages. For instance, `gravitino-aws-bundle-${gravitino-version}.jar` contains the all necessary classes including `hadoop-common`(hadoop-3.3.1) and `hadoop-aws` to access the S3 storage. -They are used in the scenario where there is no hadoop environment in the runtime. - -- If there is already hadoop environment in the runtime, you can use the `gravitino-${aws,gcp,aliyun,azure}-${gravitino-version}.jar` that does not contain the cloud storage classes (like hadoop-aws) and hadoop environment. Alternatively, you can manually add the necessary jars to the classpath. 
- -If the Hadoop environment is available, you can use the following jars to access the cloud storage fileset: - -- S3: `gravitino-aws-${gravitino-version}.jar`, `hadoop-aws-${hadoop-version}.jar`, `aws-sdk-java-${version}` and other necessary dependencies -- GCS: `gravitino-gcp-{gravitino-version}.jar`, `gcs-connector-${hadoop-version}`.jar, other necessary dependencies -- OSS: `gravitino-aliyun-{gravitino-version}.jar`, hadoop-aliyun-{hadoop-version}.jar, aliyun-sdk-java-{version} and other necessary dependencies -- ABS: `gravitino-azure-${gravitino-version}.jar`, `hadoop-azure-${hadoop-version}.jar`, and other necessary dependencies - -If there is no Hadoop environment, you can use the following jars to access the cloud storage fileset: - -- S3: `gravitino-aws-bundle-${gravitino-version}.jar` -- GCS: `gravitino-gcp-bundle-${gravitino-version}.jar` -- OSS: `gravitino-aliyun-bundle-${gravitino-version}.jar` -- ABS: `gravitino-azure-bundle-${gravitino-version}.jar` - -For `hadoop-aws-${hadoop-version}.jar`, `hadoop-azure-${hadoop-version}.jar` and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies, you can get them from `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. -For `gcs-connector`, you can download it from the [GCS connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases) for hadoop2 or hadoop3. - -If there are some issues, please consider [fill in an issue](https://github.com/apache/gravitino/issues/new/choose). - -## Create fileset catalogs - -Once the Gravitino server is started, you can create the corresponding fileset by the following sentence: - - -### Create a S3 fileset catalog - - - - -```shell -curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ --H "Content-Type: application/json" -d '{ - "name": "catalog", - "type": "FILESET", - "comment": "comment", - "provider": "hadoop", - "properties": { - "location": "s3a://bucket/root", - "s3-access-key-id": "access_key", - "s3-secret-access-key": "secret_key", - "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com", - "filesystem-providers": "s3" - } -}' http://localhost:8090/api/metalakes/metalake/catalogs -``` - - - - -```java -GravitinoClient gravitinoClient = GravitinoClient - .builder("http://localhost:8090") - .withMetalake("metalake") - .build(); - -s3Properties = ImmutableMap.builder() - .put("location", "s3a://bucket/root") - .put("s3-access-key-id", "access_key") - .put("s3-secret-access-key", "secret_key") - .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") - .put("filesystem-providers", "s3") - .build(); - -Catalog s3Catalog = gravitinoClient.createCatalog("catalog", - Type.FILESET, - "hadoop", // provider, Gravitino only supports "hadoop" for now. - "This is a S3 fileset catalog", - s3Properties); -// ... - -``` - - - - -```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") -s3_properties = { - "location": "s3a://bucket/root", - "s3-access-key-id": "access_key" - "s3-secret-access-key": "secret_key", - "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com" -} - -s3_catalog = gravitino_client.create_catalog(name="catalog", - type=Catalog.Type.FILESET, - provider="hadoop", - comment="This is a S3 fileset catalog", - properties=s3_properties) - -``` - - - - -:::note -The value of location should always start with `s3a` NOT `s3` for AWS S3, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported due to the limitation of the hadoop-aws library. 
-::: - -### Create a GCS fileset catalog - - - - -```shell -curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ --H "Content-Type: application/json" -d '{ - "name": "catalog", - "type": "FILESET", - "comment": "comment", - "provider": "hadoop", - "properties": { - "location": "gs://bucket/root", - "gcs-service-account-file": "path_of_gcs_service_account_file", - "filesystem-providers": "gcs" - } -}' http://localhost:8090/api/metalakes/metalake/catalogs -``` - - - - -```java -GravitinoClient gravitinoClient = GravitinoClient - .builder("http://localhost:8090") - .withMetalake("metalake") - .build(); - -gcsProperties = ImmutableMap.builder() - .put("location", "gs://bucket/root") - .put("gcs-service-account-file", "path_of_gcs_service_account_file") - .put("filesystem-providers", "gcs") - .build(); - -Catalog gcsCatalog = gravitinoClient.createCatalog("catalog", - Type.FILESET, - "hadoop", // provider, Gravitino only supports "hadoop" for now. - "This is a GCS fileset catalog", - gcsProperties); -// ... - -``` - - - - -```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") - -gcs_properties = { - "location": "gcs://bucket/root", - "gcs_service_account_file": "path_of_gcs_service_account_file" -} - -s3_catalog = gravitino_client.create_catalog(name="catalog", - type=Catalog.Type.FILESET, - provider="hadoop", - comment="This is a GCS fileset catalog", - properties=gcs_properties) - -``` - - - - -:::note -The prefix of a GCS location should always start with `gs`, for instance, `gs://bucket/root`. -::: - -### Create an OSS fileset catalog - - - - -```shell -curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ --H "Content-Type: application/json" -d '{ - "name": "catalog", - "type": "FILESET", - "comment": "comment", - "provider": "hadoop", - "properties": { - "location": "oss://bucket/root", - "oss-access-key-id": "access_key", - "oss-secret-access-key": "secret_key", - "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com", - "filesystem-providers": "oss" - } -}' http://localhost:8090/api/metalakes/metalake/catalogs -``` - - - - -```java -GravitinoClient gravitinoClient = GravitinoClient - .builder("http://localhost:8090") - .withMetalake("metalake") - .build(); - -ossProperties = ImmutableMap.builder() - .put("location", "oss://bucket/root") - .put("oss-access-key-id", "access_key") - .put("oss-secret-access-key", "secret_key") - .put("oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com") - .put("filesystem-providers", "oss") - .build(); - -Catalog ossProperties = gravitinoClient.createCatalog("catalog", - Type.FILESET, - "hadoop", // provider, Gravitino only supports "hadoop" for now. - "This is a OSS fileset catalog", - ossProperties); -// ... 
- -``` - - - - -```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") -oss_properties = { - "location": "oss://bucket/root", - "oss-access-key-id": "access_key" - "oss-secret-access-key": "secret_key", - "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com" -} - -oss_catalog = gravitino_client.create_catalog(name="catalog", - type=Catalog.Type.FILESET, - provider="hadoop", - comment="This is a OSS fileset catalog", - properties=oss_properties) - -``` - -### Create an ABS (Azure Blob Storage or ADLS) fileset catalog - - - - -```shell -curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ --H "Content-Type: application/json" -d '{ - "name": "catalog", - "type": "FILESET", - "comment": "comment", - "provider": "hadoop", - "properties": { - "location": "abfss://container/root", - "abs-account-name": "The account name of the Azure Blob Storage", - "abs-account-key": "The account key of the Azure Blob Storage", - "filesystem-providers": "abs" - } -}' http://localhost:8090/api/metalakes/metalake/catalogs -``` - - - - -```java -GravitinoClient gravitinoClient = GravitinoClient - .builder("http://localhost:8090") - .withMetalake("metalake") - .build(); - -absProperties = ImmutableMap.builder() - .put("location", "abfss://container/root") - .put("abs-account-name", "The account name of the Azure Blob Storage") - .put("abs-account-key", "The account key of the Azure Blob Storage") - .put("filesystem-providers", "abs") - .build(); - -Catalog gcsCatalog = gravitinoClient.createCatalog("catalog", - Type.FILESET, - "hadoop", // provider, Gravitino only supports "hadoop" for now. - "This is a Azure Blob storage fileset catalog", - absProperties); -// ... - -``` - - - - -```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") - -abs_properties = { - "location": "gcs://bucket/root", - "abs_account_name": "The account name of the Azure Blob Storage", - "abs_account_key": "The account key of the Azure Blob Storage" -} - -abs_catalog = gravitino_client.create_catalog(name="catalog", - type=Catalog.Type.FILESET, - provider="hadoop", - comment="This is a Azure Blob Storage fileset catalog", - properties=abs_properties) - -``` - - - - -note::: -The prefix of an ABS (Azure Blob Storage or ADLS (v2)) location should always start with `abfss` NOT `abfs`, for instance, `abfss://container/root`. Value like `abfs://container/root` is not supported. -::: - - -## Create fileset schema - -This part is the same for all cloud storage filesets, you can create the schema by the following sentence: - - - - -```shell -curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ --H "Content-Type: application/json" -d '{ - "name": "schema", - "comment": "comment", - "properties": { - "location": "file:///tmp/root/schema" - } -}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas -``` - - - - -```java -GravitinoClient gravitinoClient = GravitinoClient - .builder("http://localhost:8090") - .withMetalake("metalake") - .build(); - -// Assuming you have just created a Hadoop catalog named `catalog` -Catalog catalog = gravitinoClient.loadCatalog("catalog"); - -SupportsSchemas supportsSchemas = catalog.asSchemas(); - -Map schemaProperties = ImmutableMap.builder() - // Property "location" is optional, if specified all the managed fileset without - // specifying storage location will be stored under this location. 
- .put("location", "file:///tmp/root/schema") - .build(); -Schema schema = supportsSchemas.createSchema("schema", - "This is a schema", - schemaProperties -); -// ... -``` - - - - -You can change `location` value based on the catalog you are using. If the `location` property is specified in the catalog, we can omit it in the schema. - -## Create filesets - -The following sentences can be used to create a fileset in the schema: - - - - -```shell -curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ --H "Content-Type: application/json" -d '{ - "name": "example_fileset", - "comment": "This is an example fileset", - "type": "MANAGED", - "storageLocation": "s3a://bucket/root/schema/example_fileset", - "properties": { - "k1": "v1" - } -}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets -``` - - - - -```java -GravitinoClient gravitinoClient = GravitinoClient - .builder("http://localhost:8090") - .withMetalake("metalake") - .build(); - -Catalog catalog = gravitinoClient.loadCatalog("catalog"); -FilesetCatalog filesetCatalog = catalog.asFilesetCatalog(); - -Map propertiesMap = ImmutableMap.builder() - .put("k1", "v1") - .build(); - -filesetCatalog.createFileset( - NameIdentifier.of("schema", "example_fileset"), - "This is an example fileset", - Fileset.Type.MANAGED, - "s3a://bucket/root/schema/example_fileset", - propertiesMap, -); -``` - - - - -```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") - -catalog: Catalog = gravitino_client.load_catalog(name="catalog") -catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"), - type=Fileset.Type.MANAGED, - comment="This is an example fileset", - storage_location="s3a://bucket/root/schema/example_fileset", - properties={"k1": "v1"}) -``` - - - - -Similar to schema, the `storageLocation` is optional if you have set the `location` property in the schema or catalog. Please change the value of -`location` as the actual location you want to store the fileset. - -The example above is for S3 fileset, you can replace the `storageLocation` with the actual location of the GCS, OSS, or ABS fileset. 
- - -## Using Spark to access the fileset - -The following code snippet shows how to use **PySpark 3.1.3 with hadoop environment(hadoop 3.2.0)** to access the fileset: - -```python -import logging -from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient -from pyspark.sql import SparkSession -import os - -gravitino_url = "http://localhost:8090" -metalake_name = "test" - -catalog_name = "s3_catalog" -schema_name = "schema" -fileset_name = "example" - -## this is for S3 -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell" -spark = SparkSession.builder - .appName("s3_fielset_test") - .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") - .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") - .config("spark.hadoop.fs.gravitino.client.metalake", "test") - .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) - .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) - .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") - .config("spark.driver.memory", "2g") - .config("spark.driver.port", "2048") - .getOrCreate() - -### this is for GCS -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell" -spark = SparkSession.builder - .appName("s3_fielset_test") - .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") - .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") - .config("spark.hadoop.fs.gravitino.client.metalake", "test") - .config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") - .config("spark.driver.memory", "2g") - .config("spark.driver.port", "2048") - .getOrCreate() - -### this is for OSS -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/aliyun-sdk-oss-2.8.3.jar,/path/to/hadoop-aliyun-3.2.0.jar,/path/to/jdom-1.1.jar --master local[1] pyspark-shell" -spark = SparkSession.builder - .appName("s3_fielset_test") - .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") - .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") - .config("spark.hadoop.fs.gravitino.client.metalake", "test") - .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) - .config("spark.hadoop.oss-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) - .config("spark.hadoop.oss-endpoint", "https://oss-cn-shanghai.aliyuncs.com") - .config("spark.driver.memory", "2g") - .config("spark.driver.port", "2048") - .getOrCreate() -spark.sparkContext.setLogLevel("DEBUG") - -### this is for ABS -os.environ["PYSPARK_SUBMIT_ARGS"] = 
"--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell" -spark = SparkSession.builder - .appName("s3_fielset_test") - .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") - .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") - .config("spark.hadoop.fs.gravitino.client.metalake", "test") - .config("spark.hadoop.azure-storage-account-name", "azure_account_name") - .config("spark.hadoop.azure-storage-account-key", "azure_account_name") - .config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") - .config("spark.driver.memory", "2g") - .config("spark.driver.port", "2048") - .getOrCreate() - -data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] -columns = ["Name", "Age"] -spark_df = spark.createDataFrame(data, schema=columns) -gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" - -spark_df.coalesce(1).write - .mode("overwrite") - .option("header", "true") - .csv(gvfs_path) - -``` - -If your Spark without Hadoop environment, you can use the following code snippet to access the fileset: - -```python -## replace the env PYSPARK_SUBMIT_ARGS variable in the code above with the following content: -### S3 -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell" -### GCS -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell" -### OSS -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell" -#### Azure Blob Storage -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell" -``` - -:::note -In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. -::: - -## Using fileset with hadoop fs command - -The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. - -1. 
Adding the following contents to the `${HADOOP_HOME}/etc/hadoop/core-site.xml` file: - -```xml - - fs.AbstractFileSystem.gvfs.impl - org.apache.gravitino.filesystem.hadoop.Gvfs - - - - fs.gvfs.impl - org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem - - - - fs.gravitino.server.uri - http://192.168.50.188:8090 - - - - fs.gravitino.client.metalake - test - - - - - s3-endpoint - http://s3.ap-northeast-1.amazonaws.com - - - s3-access-key-id - access-key - - - s3-secret-access-key - secret-key - - - - - oss-endpoint - https://oss-cn-shanghai.aliyuncs.com - - - oss-access-key-id - access_key - - - oss-secret-access-key - secret_key - - - - - gcs-service-account-file - /path/your-service-account-file.json - - - - - azure-storage-account-name - account_name - - - azure-storage-account-key - account_key - - - -``` - -2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. - -Copy the corresponding jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. For example, if you are using S3, you need to copy `gravitino-aws-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. -then copy hadoop-aws-{version}.jar and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, for simple you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. - -More detail, please refer to the [Bundle jars](#bundle-jars) section. - - -3. Run the following command to access the fileset: - -```shell -hadoop dfs -ls gvfs://fileset/s3_catalog/schema/example -hadoop dfs -put /path/to/local/file gvfs://fileset/s3_catalog/schema/example -``` - -### Using fileset with pandas - -The following are examples of how to use the pandas library to access the S3 fileset - -```python -import pandas as pd - -storage_options = { - "server_uri": "http://localhost:8090", - "metalake_name": "test", - "options": { - "s3_access_key_id": "access_key", - "s3_secret_access_key": "secret_key", - "s3_endpoint": "http://s3.ap-northeast-1.amazonaws.com" - } -} -ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name}/people/part-00000-51d366e2-d5eb-448d-9109-32a96c8a14dc-c000.csv", - storage_options=storage_options) -ds.head() -``` - - diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md new file mode 100644 index 00000000000..e77d2c80465 --- /dev/null +++ b/docs/hadoop-catalog-with-adls.md @@ -0,0 +1,357 @@ +--- +title: "Hadoop catalog with ADLS" +slug: /hadoop-catalog-with-adls +date: 2025-01-03 +keyword: Hadoop catalog ADLS +license: "This software is licensed under the Apache License version 2." +--- + +This document describes how to configure a Hadoop catalog with ADLS (Azure Blob Storage). + +## Prerequisites + +In order to create a Hadoop catalog with ADLS, you need to place [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) in Gravitino Hadoop classpath located +at `${HADOOP_HOME}/share/hadoop/common/lib/`. 
After that, start Gravitino server with the following command: + +```bash +$ bin/gravitino-server.sh start +``` + +## Create a Hadoop Catalog with ADLS in Gravitino + +### Catalog a catalog + +Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS: + +| Configuration item | Description | Default value | Required | Since version | +|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|-------------------------------------------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `abs` if it's a Azure Blob Storage fileset, or a comma separated string that contains `abs` like `oss,abs,s3` to support multiple kinds of fileset including `abs`. | (none) | Yes | 0.8.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for Azure Blob Storage, if we set this value, we can omit the prefix 'abfss://' in the location. | `builtin-local` | No | 0.8.0-incubating | +| `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | +| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | + +### Create a schema + +Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details. + +### Create a fileset + +Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. + + +## Using Hadoop catalog with ADLS + +### Create a Hadoop catalog/schema/file set with ADLS + +First, you need to create a Hadoop catalog with ADLS. The following example shows how to create a Hadoop catalog with ADLS: + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "catalog", + "type": "FILESET", + "comment": "comment", + "provider": "hadoop", + "properties": { + "location": "abfss://container@account-name.dfs.core.windows.net/path", + "azure-storage-account-name": "The account name of the Azure Blob Storage", + "azure-storage-account-key": "The account key of the Azure Blob Storage", + "filesystem-providers": "abs" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +adlsProperties = ImmutableMap.builder() + .put("location", "abfss://container@account-name.dfs.core.windows.net/path") + .put("azure-storage-account-name", "azure storage account name") + .put("azure-storage-account-key", "azure storage account key") + .put("filesystem-providers", "abs") + .build(); + +Catalog adlsCatalog = gravitinoClient.createCatalog("catalog", + Type.FILESET, + "hadoop", // provider, Gravitino only supports "hadoop" for now. + "This is a ADLS fileset catalog", + adlsProperties); +// ... 
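+// The property keys above mirror the REST example; replace the placeholder account
+// name and key with real credentials. The created catalog can later be fetched again
+// with gravitinoClient.loadCatalog("catalog").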
+ +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") +adls_properties = { + "location": "abfss://container@account-name.dfs.core.windows.net/path", + "azure_storage_account_name": "azure storage account name", + "azure_storage_account_key": "azure storage account key" +} + +adls_properties = gravitino_client.create_catalog(name="catalog", + type=Catalog.Type.FILESET, + provider="hadoop", + comment="This is a ADLS fileset catalog", + properties=adls_properties) + +``` + + + + +Then create a schema and fileset in the catalog created above. + +Using the following code to create a schema and fileset: + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "schema", + "comment": "comment", + "properties": { + "location": "abfss://container@account-name.dfs.core.windows.net/path" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas +``` + + + + +```java +// Assuming you have just created a Hive catalog named `hive_catalog` +Catalog catalog = gravitinoClient.loadCatalog("hive_catalog"); + +SupportsSchemas supportsSchemas = catalog.asSchemas(); + +Map schemaProperties = ImmutableMap.builder() + .put("location", "abfss://container@account-name.dfs.core.windows.net/path") + .build(); +Schema schema = supportsSchemas.createSchema("schema", + "This is a schema", + schemaProperties +); +// ... +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") +catalog: Catalog = gravitino_client.load_catalog(name="hive_catalog") +catalog.as_schemas().create_schema(name="schema", + comment="This is a schema", + properties={"location": "abfss://container@account-name.dfs.core.windows.net/path"}) +``` + + + + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "example_fileset", + "comment": "This is an example fileset", + "type": "MANAGED", + "storageLocation": "abfss://container@account-name.dfs.core.windows.net/path/example_fileset", + "properties": { + "k1": "v1" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +Catalog catalog = gravitinoClient.loadCatalog("catalog"); +FilesetCatalog filesetCatalog = catalog.asFilesetCatalog(); + +Map propertiesMap = ImmutableMap.builder() + .put("k1", "v1") + .build(); + +filesetCatalog.createFileset( + NameIdentifier.of("schema", "example_fileset"), + "This is an example fileset", + Fileset.Type.MANAGED, + "abfss://container@account-name.dfs.core.windows.net/path/example_fileset", + propertiesMap, +); +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") + +catalog: Catalog = gravitino_client.load_catalog(name="catalog") +catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"), + type=Fileset.Type.MANAGED, + comment="This is an example fileset", + storage_location="abfss://container@account-name.dfs.core.windows.net/path/example_fileset", + properties={"k1": "v1"}) +``` + + + + +## Using Spark to access the fileset + +The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the 
fileset:
+
+```python
+import logging
+from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient
+from pyspark.sql import SparkSession
+import os
+
+gravitino_url = "http://localhost:8090"
+metalake_name = "test"
+
+catalog_name = "your_adls_catalog"
+schema_name = "your_adls_schema"
+fileset_name = "your_adls_fileset"
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("adls_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.hadoop.azure-storage-account-name", "azure_account_name") \
+    .config("spark.hadoop.azure-storage-account-key", "azure_account_key") \
+    .config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+
+data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
+columns = ["Name", "Age"]
+spark_df = spark.createDataFrame(data, schema=columns)
+gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"
+
+spark_df.coalesce(1).write \
+    .mode("overwrite") \
+    .option("header", "true") \
+    .csv(gvfs_path)
+```
+
+If your Spark runs **without a Hadoop environment**, you can use the following code snippet to access the fileset:
+
+```python
+## Replace the PYSPARK_SUBMIT_ARGS setting in the code above with the following; the rest of the code stays the same.
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
+```
+
+- [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar that bundles the Hadoop environment and the `hadoop-azure` jar.
+- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is the Gravitino ADLS jar without the Hadoop environment and the `hadoop-azure` jar.
+
+Please choose the correct jar according to your environment.
+
+:::note
+In some Spark versions, the Hadoop environment is needed by the driver; adding the bundle jars with '--jars' may not work. In this case, you should add the jars to the Spark classpath directly.
+:::
+
+## Using fileset with hadoop fs command
+
+The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3.
+
+1. Add the following contents to the `${HADOOP_HOME}/etc/hadoop/core-site.xml` file:
+
+```xml
+<property>
+  <name>fs.AbstractFileSystem.gvfs.impl</name>
+  <value>org.apache.gravitino.filesystem.hadoop.Gvfs</value>
+</property>
+
+<property>
+  <name>fs.gvfs.impl</name>
+  <value>org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem</value>
+</property>
+
+<property>
+  <name>fs.gravitino.server.uri</name>
+  <value>http://192.168.50.188:8090</value>
+</property>
+
+<property>
+  <name>fs.gravitino.client.metalake</name>
+  <value>test</value>
+</property>
+
+<property>
+  <name>azure-storage-account-name</name>
+  <value>account_name</value>
+</property>
+
+<property>
+  <name>azure-storage-account-key</name>
+  <value>account_key</value>
+</property>
+```
+
+2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory.
+ +Copy the corresponding jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. For ADLS, you need to copy `gravitino-azure-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +then copy `hadoop-azure-${version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, for simple you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. + + +3. Run the following command to access the fileset: + +```shell +hadoop dfs -ls gvfs://fileset/adls_catalog/schema/example +hadoop dfs -put /path/to/local/file gvfs://fileset/adls_catalog/schema/example +``` + +### Using fileset with pandas + +The following are examples of how to use the pandas library to access the ADLS fileset + +```python +import pandas as pd + +storage_options = { + "server_uri": "http://localhost:8090", + "metalake_name": "test", + "options": { + "azure_storage_account_name": "azure_account_name", + "azure_storage_account_key": "azure_account_key" + } +} +ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name}/people/part-00000-51d366e2-d5eb-448d-9109-32a96c8a14dc-c000.csv", + storage_options=storage_options) +ds.head() +``` + +## Fileset with credential + +If the catalog has been configured with credential, you can access ADLS fileset without setting `azure-storage-account-name` and `azure-storage-account-key` in the properties via GVFS. More detail can be seen [here](./security/credential-vending.md#adls-credentials). + + + diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md new file mode 100644 index 00000000000..6dc6bb1c732 --- /dev/null +++ b/docs/hadoop-catalog-with-gcs.md @@ -0,0 +1,345 @@ +--- +title: "Hadoop catalog with GCS" +slug: /hadoop-catalog-with-gcs +date: 2024-01-03 +keyword: Hadoop catalog GCS +license: "This software is licensed under the Apache License version 2." +--- + +This document describes how to configure a Hadoop catalog with GCS. + +## Prerequisites + +In order to create a Hadoop catalog with GCS, you need to place [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) in Gravitino Hadoop classpath located +at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server with the following command: + +```bash +$ bin/gravitino-server.sh start +``` + +## Create a Hadoop Catalog with GCS in Gravitino + +### Catalog a catalog + +Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS: + +| Configuration item | Description | Default value | Required | Since version | +|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `gs` if it's a GCS fileset, a comma separated string that contains `gs` like `gs,s3` to support multiple kinds of fileset including `gs`. 
| (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset. | 0.7.0-incubating | + +### Create a schema + +Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details. + +### Create a fileset + +Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. + + +## Using Hadoop catalog with GCS + +### Create a Hadoop catalog/schema/file set with GCS + +First, you need to create a Hadoop catalog with GCS. The following example shows how to create a Hadoop catalog with GCS: + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "catalog", + "type": "FILESET", + "comment": "comment", + "provider": "hadoop", + "properties": { + "location": "gs://bucket/root", + "gcs-service-account-file": "path_of_gcs_service_account_file", + "filesystem-providers": "gcs" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +gcsProperties = ImmutableMap.builder() + .put("location", "gs://bucket/root") + .put("gcs-service-account-file", "path_of_gcs_service_account_file") + .put("filesystem-providers", "gcs") + .build(); + +Catalog gcsCatalog = gravitinoClient.createCatalog("catalog", + Type.FILESET, + "hadoop", // provider, Gravitino only supports "hadoop" for now. + "This is a GCS fileset catalog", + gcsProperties); +// ... + +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") +gcs_properties = { + "location": "gs://bucket/root", + "gcs-service-account-file": "path_of_gcs_service_account_file" +} + +gcs_properties = gravitino_client.create_catalog(name="catalog", + type=Catalog.Type.FILESET, + provider="hadoop", + comment="This is a GCS fileset catalog", + properties=gcs_properties) + +``` + + + + +Then create a schema and fileset in the catalog created above. + +Using the following code to create a schema and fileset: + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "schema", + "comment": "comment", + "properties": { + "location": "gs://bucket/root/schema" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas +``` + + + + +```java +// Assuming you have just created a Hive catalog named `hive_catalog` +Catalog catalog = gravitinoClient.loadCatalog("hive_catalog"); + +SupportsSchemas supportsSchemas = catalog.asSchemas(); + +Map schemaProperties = ImmutableMap.builder() + .put("location", "gs://bucket/root/schema") + .build(); +Schema schema = supportsSchemas.createSchema("schema", + "This is a schema", + schemaProperties +); +// ... 
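+// The "location" property on the schema is optional; when it is set, managed filesets
+// created without an explicit storage location are stored under this path.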
+``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") +catalog: Catalog = gravitino_client.load_catalog(name="hive_catalog") +catalog.as_schemas().create_schema(name="schema", + comment="This is a schema", + properties={"location": "gs://bucket/root/schema"}) +``` + + + + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "example_fileset", + "comment": "This is an example fileset", + "type": "MANAGED", + "storageLocation": "gs://bucket/root/schema/example_fileset", + "properties": { + "k1": "v1" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +Catalog catalog = gravitinoClient.loadCatalog("catalog"); +FilesetCatalog filesetCatalog = catalog.asFilesetCatalog(); + +Map propertiesMap = ImmutableMap.builder() + .put("k1", "v1") + .build(); + +filesetCatalog.createFileset( + NameIdentifier.of("schema", "example_fileset"), + "This is an example fileset", + Fileset.Type.MANAGED, + "gs://bucket/root/schema/example_fileset", + propertiesMap, +); +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") + +catalog: Catalog = gravitino_client.load_catalog(name="catalog") +catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"), + type=Fileset.Type.MANAGED, + comment="This is an example fileset", + storage_location="gs://bucket/root/schema/example_fileset", + properties={"k1": "v1"}) +``` + + + + +## Using Spark to access the fileset + +The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: + +```python +import logging +from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient +from pyspark.sql import SparkSession +import os + +gravitino_url = "http://localhost:8090" +metalake_name = "test" + +catalog_name = "your_gcs_catalog" +schema_name = "your_gcs_schema" +fileset_name = "your_gcs_fileset" + +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell" +spark = SparkSession.builder +.appName("gcs_fielset_test") +.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") +.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") +.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") +.config("spark.hadoop.fs.gravitino.client.metalake", "test") +.config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") +.config("spark.driver.memory", "2g") +.config("spark.driver.port", "2048") +.getOrCreate() + +data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] +columns = ["Name", "Age"] +spark_df = spark.createDataFrame(data, schema=columns) +gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" + +spark_df.coalesce(1).write +.mode("overwrite") +.option("header", "true") +.csv(gvfs_path) +``` + +If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: + 
+```python
+## Replace the PYSPARK_SUBMIT_ARGS setting in the code above with the following; the rest of the code stays the same.
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
+```
+
+- [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCS jar that bundles the Hadoop environment and the `gcs-connector` jar.
+- [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is the Gravitino GCS jar without the Hadoop environment and the `gcs-connector` jar.
+
+Please choose the correct jar according to your environment.
+
+:::note
+In some Spark versions, the Hadoop environment is needed by the driver; adding the bundle jars with '--jars' may not work. In this case, you should add the jars to the Spark classpath directly.
+:::
+
+## Using fileset with hadoop fs command
+
+The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3.
+
+1. Add the following contents to the `${HADOOP_HOME}/etc/hadoop/core-site.xml` file:
+
+```xml
+<property>
+  <name>fs.AbstractFileSystem.gvfs.impl</name>
+  <value>org.apache.gravitino.filesystem.hadoop.Gvfs</value>
+</property>
+
+<property>
+  <name>fs.gvfs.impl</name>
+  <value>org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem</value>
+</property>
+
+<property>
+  <name>fs.gravitino.server.uri</name>
+  <value>http://192.168.50.188:8090</value>
+</property>
+
+<property>
+  <name>fs.gravitino.client.metalake</name>
+  <value>test</value>
+</property>
+
+<property>
+  <name>gcs-service-account-file</name>
+  <value>/path/your-service-account-file.json</value>
+</property>
+```
+
+2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory.
+
+Copy the corresponding jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. For GCS, you need to copy `gravitino-gcp-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory,
+then copy the `gcs-connector` jar matching your Hadoop version, which can be downloaded from the [GCS connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases) releases page, to the same directory.
+
+
+3. Run the following command to access the fileset:
+
+```shell
+hadoop dfs -ls gvfs://fileset/gcs_catalog/schema/example
+hadoop dfs -put /path/to/local/file gvfs://fileset/gcs_catalog/schema/example
+```
+
+## Using fileset with pandas
+
+The following are examples of how to use the pandas library to access the GCS fileset:
+
+```python
+import pandas as pd
+
+storage_options = {
+    "server_uri": "http://localhost:8090",
+    "metalake_name": "test",
+    "options": {
+        "gcs_service_account_file": "path_of_gcs_service_account_file.json",
+    }
+}
+ds = pd.read_csv(f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people/part-00000-51d366e2-d5eb-448d-9109-32a96c8a14dc-c000.csv",
+                 storage_options=storage_options)
+ds.head()
+```
+
+
+## Fileset with credential
+
+If the catalog has been configured with credentials, you can access the GCS fileset without setting `gcs-service-account-file` in the properties via GVFS. More detail can be seen [here](./security/credential-vending.md#gcs-credentials).
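+
+For reference, a minimal sketch of what this looks like with pandas is shown below. It assumes credential vending has already been configured for the catalog as described in the linked document; the file path is only an illustrative placeholder:
+
+```python
+import pandas as pd
+
+# A sketch assuming the catalog vends GCS credentials: no gcs_service_account_file
+# is passed from the client side, only the server URI and metalake name.
+storage_options = {
+    "server_uri": "http://localhost:8090",
+    "metalake_name": "test"
+}
+ds = pd.read_csv("gvfs://fileset/gcs_catalog/schema/example/people.csv",
+                 storage_options=storage_options)
+ds.head()
+```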
+ diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md new file mode 100644 index 00000000000..c968901e03a --- /dev/null +++ b/docs/hadoop-catalog-with-oss.md @@ -0,0 +1,368 @@ +--- +title: "Hadoop catalog with OSS" +slug: /hadoop-catalog-with-oss +date: 2025-01-03 +keyword: Hadoop catalog OSS +license: "This software is licensed under the Apache License version 2." +--- + +This document describes how to configure a Hadoop catalog with Aliyun OSS. + +## Prerequisites + +In order to create a Hadoop catalog with OSS, you need to place [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) in Gravitino Hadoop classpath located +at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server with the following command: + +```bash +$ bin/gravitino-server.sh start +``` + +## Create a Hadoop Catalog with OSS in Gravitino + +### Catalog a catalog + +Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS: + +| Configuration item | Description | Default value | Required | Since version | +|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `oss` if it's a OSS fileset, or a comma separated string that contains `oss` like `oss,gs,s3` to support multiple kinds of fileset including `oss`. | (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for OSS, if we set this value, we can omit the prefix 'oss://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | +| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | +| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | + +### Create a schema + +Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details. + +### Create a fileset + +Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. + + +## Using Hadoop catalog with OSS + +### Create a Hadoop catalog/schema/file set with OSS + +First, you need to create a Hadoop catalog with OSS. 
The following example shows how to create a Hadoop catalog with OSS: + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "catalog", + "type": "FILESET", + "comment": "comment", + "provider": "hadoop", + "properties": { + "location": "oss://bucket/root", + "oss-access-key-id": "access_key", + "oss-secret-access-key": "secret_key", + "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com", + "filesystem-providers": "oss" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +ossProperties = ImmutableMap.builder() + .put("location", "oss://bucket/root") + .put("oss-access-key-id", "access_key") + .put("oss-secret-access-key", "secret_key") + .put("oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com") + .put("filesystem-providers", "oss") + .build(); + +Catalog ossCatalog = gravitinoClient.createCatalog("catalog", + Type.FILESET, + "hadoop", // provider, Gravitino only supports "hadoop" for now. + "This is a OSS fileset catalog", + ossProperties); +// ... + +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") +oss_properties = { + "location": "oss://bucket/root", + "oss-access-key-id": "access_key" + "oss-secret-access-key": "secret_key", + "oss-endpoint": "ossProperties" +} + +oss_catalog = gravitino_client.create_catalog(name="catalog", + type=Catalog.Type.FILESET, + provider="hadoop", + comment="This is a OSS fileset catalog", + properties=oss_properties) + +``` + + + + +Then create a schema and fileset in the catalog created above. + +Using the following code to create a schema and fileset: + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "schema", + "comment": "comment", + "properties": { + "location": "oss://bucket/root/schema" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas +``` + + + + +```java +// Assuming you have just created a Hive catalog named `hive_catalog` +Catalog catalog = gravitinoClient.loadCatalog("hive_catalog"); + +SupportsSchemas supportsSchemas = catalog.asSchemas(); + +Map schemaProperties = ImmutableMap.builder() + .put("location", "oss://bucket/root/schema") + .build(); +Schema schema = supportsSchemas.createSchema("schema", + "This is a schema", + schemaProperties +); +// ... 
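+// The schema location uses the oss:// scheme here; the endpoint and access credentials
+// used to reach it come from the catalog properties configured above.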
+``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") +catalog: Catalog = gravitino_client.load_catalog(name="hive_catalog") +catalog.as_schemas().create_schema(name="schema", + comment="This is a schema", + properties={"location": "oss://bucket/root/schema"}) +``` + + + + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "example_fileset", + "comment": "This is an example fileset", + "type": "MANAGED", + "storageLocation": "oss://bucket/root/schema/example_fileset", + "properties": { + "k1": "v1" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +Catalog catalog = gravitinoClient.loadCatalog("catalog"); +FilesetCatalog filesetCatalog = catalog.asFilesetCatalog(); + +Map propertiesMap = ImmutableMap.builder() + .put("k1", "v1") + .build(); + +filesetCatalog.createFileset( + NameIdentifier.of("schema", "example_fileset"), + "This is an example fileset", + Fileset.Type.MANAGED, + "oss://bucket/root/schema/example_fileset", + propertiesMap, +); +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") + +catalog: Catalog = gravitino_client.load_catalog(name="catalog") +catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"), + type=Fileset.Type.MANAGED, + comment="This is an example fileset", + storage_location="oss://bucket/root/schema/example_fileset", + properties={"k1": "v1"}) +``` + + + + +## Using Spark to access the fileset + +The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: + +```python +import logging +from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient +from pyspark.sql import SparkSession +import os + +gravitino_url = "http://localhost:8090" +metalake_name = "test" + +catalog_name = "your_oss_catalog" +schema_name = "your_oss_schema" +fileset_name = "your_oss_fileset" + +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/aliyun-sdk-oss-2.8.3.jar,/path/to/hadoop-aliyun-3.2.0.jar,/path/to/jdom-1.1.jar --master local[1] pyspark-shell" +spark = SparkSession.builder +.appName("oss_fielset_test") +.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") +.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") +.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") +.config("spark.hadoop.fs.gravitino.client.metalake", "test") +.config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) +.config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"]) +.config("spark.hadoop.oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com") +.config("spark.driver.memory", "2g") +.config("spark.driver.port", "2048") +.getOrCreate() + +data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] +columns = ["Name", "Age"] +spark_df = spark.createDataFrame(data, schema=columns) +gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" + 
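+# The virtual path always follows gvfs://fileset/<catalog>/<schema>/<fileset>/<sub-path>,
+# so the job never references the underlying oss:// location directly.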
+spark_df.coalesce(1).write +.mode("overwrite") +.option("header", "true") +.csv(gvfs_path) +``` + +If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: + +```python +## Replace the following code snippet with the above code snippet with the same environment variables + +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell" +``` + +- [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) is the Gravitino Aliyun jar with Hadoop environment and `hadoop-oss` jar. +- [`gravitino-aliyun-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun) is the Gravitino OSS jar without Hadoop environment and `hadoop-oss` jar. + +Please choose the correct jar according to your environment. + +:::note +In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. +::: + +## Using fileset with hadoop fs command + +The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. + +1. Adding the following contents to the `${HADOOP_HOME}/etc/hadoop/core-site.xml` file: + +```xml + + fs.AbstractFileSystem.gvfs.impl + org.apache.gravitino.filesystem.hadoop.Gvfs + + + + fs.gvfs.impl + org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem + + + + fs.gravitino.server.uri + http://192.168.50.188:8090 + + + + fs.gravitino.client.metalake + test + + + + oss-endpoint + http://oss-cn-hangzhou.aliyuncs.com + + + + oss-access-key-id + access-key + + + + oss-secret-access-key + secret-key + +``` + +2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. + +Copy the corresponding jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. For OSS, you need to copy `gravitino-aliyun-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +then copy hadoop-aliyun-{version}.jar and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, for simple you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. + + +3. 
Run the following command to access the fileset: + +```shell +hadoop dfs -ls gvfs://fileset/oss_catalog/schema/example +hadoop dfs -put /path/to/local/file gvfs://fileset/oss_catalog/schema/example +``` + +## Using fileset with pandas + +The following are examples of how to use the pandas library to access the OSS fileset + +```python +import pandas as pd + +storage_options = { + "server_uri": "http://localhost:8090", + "metalake_name": "test", + "options": { + "oss_access_key_id": "access_key", + "oss_secret_access_key": "secret_key", + "oss_endpoint": "http://oss-cn-hangzhou.aliyuncs.com" + } +} +ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name}/people/part-00000-51d366e2-d5eb-448d-9109-32a96c8a14dc-c000.csv", + storage_options=storage_options) +ds.head() +``` + +## Fileset with credential + +If the catalog has been configured with credential, you can access S3 fileset without setting `oss-access-key-id` and `oss-secret-access-key` in the properties via GVFS. More detail can be seen [here](./security/credential-vending.md#oss-credentials). + + + diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md new file mode 100644 index 00000000000..260a036f9d3 --- /dev/null +++ b/docs/hadoop-catalog-with-s3.md @@ -0,0 +1,372 @@ +--- +title: "Hadoop catalog with S3" +slug: /hadoop-catalog-with-s3 +date: 2025-01-03 +keyword: Hadoop catalog S3 +license: "This software is licensed under the Apache License version 2." +--- + +This document describes how to configure a Hadoop catalog with S3. + +## Prerequisites + +In order to create a Hadoop catalog with S3, you need to place [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) in Gravitino Hadoop classpath located +at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server with the following command: + +```bash +$ bin/gravitino-server.sh start +``` + +## Create a Hadoop Catalog with S3 in Gravitino + +### Catalog a catalog + +Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with S3: + +| Configuration item | Description | Default value | Required | Since version | +|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `s3` if it's a S3 fileset, or a comma separated string that contains `s3` like `gs,s3` to support multiple kinds of fileset including `s3`. | (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for S3, if we set this value, we can omit the prefix 's3a://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | +| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | +| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset. 
| 0.7.0-incubating | + +### Create a schema + +Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details. + +### Create a fileset + +Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. + + +## Using Hadoop catalog with S3 + +### Create a Hadoop catalog/schema/file set with S3 + +First of all, you need to create a Hadoop catalog with S3. The following example shows how to create a Hadoop catalog with S3: + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "catalog", + "type": "FILESET", + "comment": "comment", + "provider": "hadoop", + "properties": { + "location": "s3a://bucket/root", + "s3-access-key-id": "access_key", + "s3-secret-access-key": "secret_key", + "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com", + "filesystem-providers": "s3" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +s3Properties = ImmutableMap.builder() + .put("location", "s3a://bucket/root") + .put("s3-access-key-id", "access_key") + .put("s3-secret-access-key", "secret_key") + .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") + .put("filesystem-providers", "s3") + .build(); + +Catalog s3Catalog = gravitinoClient.createCatalog("catalog", + Type.FILESET, + "hadoop", // provider, Gravitino only supports "hadoop" for now. + "This is a S3 fileset catalog", + s3Properties); +// ... + +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") +s3_properties = { + "location": "s3a://bucket/root", + "s3-access-key-id": "access_key" + "s3-secret-access-key": "secret_key", + "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com" +} + +s3_catalog = gravitino_client.create_catalog(name="catalog", + type=Catalog.Type.FILESET, + provider="hadoop", + comment="This is a S3 fileset catalog", + properties=s3_properties) + +``` + + + + +:::note +The value of location should always start with `s3a` NOT `s3` for AWS S3, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported due to the limitation of the hadoop-aws library. +::: + +Then create a schema and fileset in the catalog created above. + +Using the following code to create a schema and fileset: + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "schema", + "comment": "comment", + "properties": { + "location": "s3a://bucket/root/schema" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas +``` + + + + +```java +// Assuming you have just created a Hive catalog named `hive_catalog` +Catalog catalog = gravitinoClient.loadCatalog("hive_catalog"); + +SupportsSchemas supportsSchemas = catalog.asSchemas(); + +Map schemaProperties = ImmutableMap.builder() + .put("location", "s3a://bucket/root/schema") + .build(); +Schema schema = supportsSchemas.createSchema("schema", + "This is a schema", + schemaProperties +); +// ... 
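+// The schema location sits under the catalog-level location ("s3a://bucket/root")
+// configured earlier; it could be omitted here since the catalog already defines one.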
+``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") +catalog: Catalog = gravitino_client.load_catalog(name="hive_catalog") +catalog.as_schemas().create_schema(name="schema", + comment="This is a schema", + properties={"location": "s3a://bucket/root/schema"}) +``` + + + + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "name": "example_fileset", + "comment": "This is an example fileset", + "type": "MANAGED", + "storageLocation": "s3a://bucket/root/schema/example_fileset", + "properties": { + "k1": "v1" + } +}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://localhost:8090") + .withMetalake("metalake") + .build(); + +Catalog catalog = gravitinoClient.loadCatalog("catalog"); +FilesetCatalog filesetCatalog = catalog.asFilesetCatalog(); + +Map propertiesMap = ImmutableMap.builder() + .put("k1", "v1") + .build(); + +filesetCatalog.createFileset( + NameIdentifier.of("schema", "example_fileset"), + "This is an example fileset", + Fileset.Type.MANAGED, + "s3a://bucket/root/schema/example_fileset", + propertiesMap, +); +``` + + + + +```python +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") + +catalog: Catalog = gravitino_client.load_catalog(name="catalog") +catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"), + type=Fileset.Type.MANAGED, + comment="This is an example fileset", + storage_location="s3a://bucket/root/schema/example_fileset", + properties={"k1": "v1"}) +``` + + + + +## Using Spark to access the fileset + +The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: + +```python +import logging +from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient +from pyspark.sql import SparkSession +import os + +gravitino_url = "http://localhost:8090" +metalake_name = "test" + +catalog_name = "your_s3_catalog" +schema_name = "your_s3_schema" +fileset_name = "your_s3_fileset" + +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) + .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) + .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() + +data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] +columns = ["Name", "Age"] +spark_df = spark.createDataFrame(data, schema=columns) +gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" + 
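# Write through the gvfs:// virtual path; GVFS resolves it to the fileset's actual
# storage location in S3 that was configured when the fileset was created.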
+spark_df.coalesce(1).write +.mode("overwrite") +.option("header", "true") +.csv(gvfs_path) +``` + +If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: + +```python +## Replace the following code snippet with the above code snippet with the same environment variables +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell" +``` + +- [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) is the Gravitino AWS jar with Hadoop environment and `hadoop-aws` jar. +- [`gravitino-aws-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws) is the Gravitino AWS jar without Hadoop environment and `hadoop-aws` jar. + +Please choose the correct jar according to your environment. + +:::note +In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. +::: + +## Using fileset with hadoop fs command + +The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. + +1. Adding the following contents to the `${HADOOP_HOME}/etc/hadoop/core-site.xml` file: + +```xml + + fs.AbstractFileSystem.gvfs.impl + org.apache.gravitino.filesystem.hadoop.Gvfs + + + + fs.gvfs.impl + org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem + + + + fs.gravitino.server.uri + http://192.168.50.188:8090 + + + + fs.gravitino.client.metalake + test + + + + s3-endpoint + http://s3.ap-northeast-1.amazonaws.com + + + + s3-access-key-id + access-key + + + + s3-secret-access-key + secret-key + +``` + +2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. + +Copy the corresponding jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. For S3, you need to copy `gravitino-aws-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +then copy hadoop-aws-{version}.jar and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, for simple you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. + + +3. 
Run the following command to access the fileset: + +```shell +hadoop dfs -ls gvfs://fileset/s3_catalog/schema/example +hadoop dfs -put /path/to/local/file gvfs://fileset/s3_catalog/schema/example +``` + +## Using fileset with pandas + +The following are examples of how to use the pandas library to access the S3 fileset + +```python +import pandas as pd + +storage_options = { + "server_uri": "http://localhost:8090", + "metalake_name": "test", + "options": { + "s3_access_key_id": "access_key", + "s3_secret_access_key": "secret_key", + "s3_endpoint": "http://s3.ap-northeast-1.amazonaws.com" + } +} +ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name}/people/part-00000-51d366e2-d5eb-448d-9109-32a96c8a14dc-c000.csv", + storage_options=storage_options) +ds.head() +``` + +## Fileset with credential + +If the catalog has been configured with credential, you can access S3 fileset without setting `s3-access-key-id` and `s3-secret-access-key` in the properties via GVFS. More detail can be seen [here](./security/credential-vending.md#s3-credentials). + + + + diff --git a/docs/hadoop-catalog.md b/docs/hadoop-catalog.md index cf86fde06e4..57da399b12c 100644 --- a/docs/hadoop-catalog.md +++ b/docs/hadoop-catalog.md @@ -9,9 +9,9 @@ license: "This software is licensed under the Apache License version 2." ## Introduction Hadoop catalog is a fileset catalog that using Hadoop Compatible File System (HCFS) to manage -the storage location of the fileset. Currently, it supports the local filesystem and HDFS. Since 0.7.0-incubating, Gravitino supports S3, GCS, OSS and Azure Blob Storage fileset through Hadoop catalog. +the storage location of the fileset. Currently, it supports the local filesystem and HDFS. Since 0.7.0-incubating, Gravitino supports [S3](hadoop-catalog-with-S3.md), [GCS](hadoop-catalog-with-gcs.md), [OSS](hadoop-catalog-with-oss.md) and [Azure Blob Storage](hadoop-catalog-with-adls.md) through Hadoop catalog. -The rest of this document will use HDFS or local file as an example to illustrate how to use the Hadoop catalog. For S3, GCS, OSS and Azure Blob Storage, the configuration is similar to HDFS, but more properties need to be set. We will use [separate sections](./cloud-storage-fileset-example.md) to introduce how to use of S3, GCS, OSS and Azure Blob Storage. +The rest of this document will use HDFS or local file as an example to illustrate how to use the Hadoop catalog. For S3, GCS, OSS and Azure Blob Storage, the configuration is similar to HDFS, please refer to the corresponding document for more details. Note that Gravitino uses Hadoop 3 dependencies to build Hadoop catalog. Theoretically, it should be compatible with both Hadoop 2.x and 3.x, since Gravitino doesn't leverage any new features in diff --git a/docs/how-to-use-gvfs.md b/docs/how-to-use-gvfs.md index a14d09794a3..6ac5079a6b3 100644 --- a/docs/how-to-use-gvfs.md +++ b/docs/how-to-use-gvfs.md @@ -146,8 +146,7 @@ You can configure these properties in two ways: ``` :::note -If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle jars in the Hadoop environment, For bundles jar and -cloud storage fileset configuration example, please refer to [cloud storage fileset example](./cloud-storage-fileset-example.md). +If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle jars in the Hadoop environment. ::: 2. 
Configure the properties in the `core-site.xml` file of the Hadoop environment: @@ -204,10 +203,6 @@ two ways: ```shell ./gradlew :clients:filesystem-hadoop3-runtime:build -x test ``` -:::note -For cloud storage fileset, some extra steps should be added, please refer to [cloud storage fileset example](./cloud-storage-fileset-example.md). -::: - #### Via Hadoop shell command @@ -234,8 +229,6 @@ cp ${HADOOP_HOME}/share/hadoop/tools/lib/* ${HADOOP_HOME}/share/hadoop/common/li ./${HADOOP_HOME}/bin/hadoop dfs -ls gvfs://fileset/test_catalog/test_schema/test_fileset_1 ``` -Full example to access S3, GCS, OSS fileset via Hadoop shell command, please refer to [cloud storage fileset example](./cloud-storage-fileset-example.md). - #### Via Java code You can also perform operations on the files or directories managed by fileset through Java code. @@ -285,9 +278,6 @@ FileSystem fs = filesetPath.getFileSystem(conf); fs.getFileStatus(filesetPath); ``` -Full example to access S3, GCS, OSS fileset via Hadoop shell command, please refer to [cloud storage fileset example](./cloud-storage-fileset-example.md). - - #### Via Apache Spark 1. Add the GVFS runtime jar to the Spark environment. @@ -327,8 +317,6 @@ Full example to access S3, GCS, OSS fileset via Hadoop shell command, please ref rdd.foreach(println) ``` -Full example to access S3, GCS, OSS fileset via Spark, please refer to [cloud storage fileset example](./cloud-storage-fileset-example.md). - #### Via Tensorflow For Tensorflow to support GVFS, you need to recompile the [tensorflow-io](https://github.com/tensorflow/io) module. @@ -523,7 +511,6 @@ options = { fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) ``` -Full Python example to access S3, GCS, OSS fileset via GVFS, please refer to [cloud storage fileset example](./cloud-storage-fileset-example.md). :::note From 1ecc3785209b4db34d20f112d0c78f680e8db37f Mon Sep 17 00:00:00 2001 From: yuqi Date: Sat, 4 Jan 2025 14:51:50 +0800 Subject: [PATCH 04/39] update the docs --- docs/hadoop-catalog-with-adls.md | 72 ++++++++++++++++++++++++++++- docs/hadoop-catalog-with-gcs.md | 72 ++++++++++++++++++++++++++++- docs/hadoop-catalog-with-oss.md | 78 +++++++++++++++++++++++++++++++- docs/hadoop-catalog-with-s3.md | 75 +++++++++++++++++++++++++++++- 4 files changed, 292 insertions(+), 5 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index e77d2c80465..a54a3baf4ef 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -279,6 +279,22 @@ Please choose the correct jar according to your environment. In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. 
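For example (an illustrative sketch only; the jar paths and application name are placeholders you need to adjust), you can put the bundle jar on the driver classpath explicitly instead of relying on `--jars`:

```shell
spark-submit \
  --driver-class-path /path/to/gravitino-azure-bundle-${gravitino-version}.jar \
  --jars /path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar \
  your_application.py
```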
::: +## Using Gravitino virual file system Java client to access the fileset + +```java +Configuration conf = new Configuration(); +conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.client.metalake","test_metalake"); +conf.set("azure-storage-account-name", "account_name_of_adls"); +conf.set("azure-storage-account-key", "account_key_of_adls"); +Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); +FileSystem fs = filesetPath.getFileSystem(conf); +fs.mkdirs(filesetPath); +... +``` + ## Using fileset with hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. @@ -329,6 +345,22 @@ hadoop dfs -ls gvfs://fileset/adls_catalog/schema/example hadoop dfs -put /path/to/local/file gvfs://fileset/adls_catalog/schema/example ``` +## Using Gravitino virtual file system Python client + +```python +from gravitino import gvfs +options = { + "cache_size": 20, + "cache_expired_time": 3600, + "auth_type": "simple", + "azure_storage_account_name": "azure_account_name", + "azure_storage_account_key": "azure_account_key" +} +fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) +fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") +``` + + ### Using fileset with pandas The following are examples of how to use the pandas library to access the ADLS fileset @@ -351,7 +383,45 @@ ds.head() ## Fileset with credential -If the catalog has been configured with credential, you can access ADLS fileset without setting `azure-storage-account-name` and `azure-storage-account-key` in the properties via GVFS. More detail can be seen [here](./security/credential-vending.md#adls-credentials). +Since 0.8.0-incubating, Gravitino supports credential vending for ADLS fileset. If the catalog has been configured with credential, you can access ADLS fileset without providing authentication information like `azure-storage-account-name` and `azure-storage-account-key` in the properties. + +### How to create an ADLS Hadoop catalog with credential enabled +Apart from configuration method in [create-adls-hadoop-catalog](#catalog-a-catalog), properties needed by [adls-credential](./security/credential-vending.md#adls-credentials) should also be set to enable credential vending for ADLSfileset. + +### How to access ADLS fileset with credential + +If the catalog has been configured with credential, you can access ADLS fileset without providing authentication information via GVFS. Let's see how to access ADLS fileset with credential: + +GVFS Java client: + +```java +Configuration conf = new Configuration(); +conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.client.metalake","test_metalake"); +// No need to set azure-storage-account-name and azure-storage-account-name +Path filesetPath = new Path("gvfs://fileset/adls_test_catalog/test_schema/test_fileset/new_dir"); +FileSystem fs = filesetPath.getFileSystem(conf); +fs.mkdirs(filesetPath); +... 
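// With credential vending enabled on the catalog, the GVFS client obtains the credentials
// from the Gravitino server, so no ADLS account name or account key is configured here.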
+``` + +Spark: + +```python +spark = SparkSession.builder + .appName("adls_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + # No need to set azure-storage-account-name and azure-storage-account-name + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() +``` +Python client and Hadoop command are similar to the above examples. diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 6dc6bb1c732..d94473f5d16 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -273,6 +273,21 @@ Please choose the correct jar according to your environment. In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. ::: +## Using Gravitino virual file system Java client to access the fileset + +```java +Configuration conf = new Configuration(); +conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.client.metalake","test_metalake"); +conf.set("gcs-service-account-file", "/path/your-service-account-file.json"); +Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); +FileSystem fs = filesetPath.getFileSystem(conf); +fs.mkdirs(filesetPath); +... +``` + ## Using fileset with hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. @@ -319,6 +334,21 @@ hadoop dfs -ls gvfs://fileset/gcs_catalog/schema/example hadoop dfs -put /path/to/local/file gvfs://fileset/gcs_catalog/schema/example ``` + +## Using Gravitino virtual file system Python client + +```python +from gravitino import gvfs +options = { + "cache_size": 20, + "cache_expired_time": 3600, + "auth_type": "simple", + "gcs_service_account_file": "path_of_gcs_service_account_file.json", +} +fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) +fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") +``` + ## Using fileset with pandas The following are examples of how to use the pandas library to access the GCS fileset @@ -338,8 +368,46 @@ ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name} ds.head() ``` - ## Fileset with credential -If the catalog has been configured with credential, you can access S3 fileset without setting `gcs-service-account-file` in the properties via GVFS. More detail can be seen [here](./security/credential-vending.md#gcs-credentials). +Since 0.8.0-incubating, Gravitino supports credential vending for GCS fileset. If the catalog has been configured with credential, you can access GCS fileset without providing authentication information like `gcs-service-account-file` in the properties. 
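As a quick sketch of what this can look like when creating the catalog (the `credential-providers` property and the `gcs-token` provider name are assumptions here; see [GCS credentials](./security/credential-vending.md#gcs-credentials) and the sections below for the authoritative configuration):

```shell
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "test_catalog",
  "type": "FILESET",
  "comment": "comment",
  "provider": "hadoop",
  "properties": {
    "location": "gs://bucket/root",
    "gcs-service-account-file": "path_of_gcs_service_account_file",
    "filesystem-providers": "gcs",
    "credential-providers": "gcs-token"
  }
}' http://localhost:8090/api/metalakes/metalake/catalogs
```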
+ +### How to create a GCS Hadoop catalog with credential enabled + +Apart from configuration method in [create-gcs-hadoop-catalog](#catalog-a-catalog), properties needed by [gcs-credential](./security/credential-vending.md#gcs-credentials) should also be set to enable credential vending for GCS fileset. + +### How to access GCS fileset with credential + +If the catalog has been configured with credential, you can access GCS fileset without providing authentication information via GVFS. Let's see how to access GCS fileset with credential: + +GVFS Java client: + +```java +Configuration conf = new Configuration(); +conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.client.metalake","test_metalake"); +// No need to set gcs-service-account-file +Path filesetPath = new Path("gvfs://fileset/gcs_test_catalog/test_schema/test_fileset/new_dir"); +FileSystem fs = filesetPath.getFileSystem(conf); +fs.mkdirs(filesetPath); +... +``` + +Spark: + +```python +spark = SparkSession.builder + .appName("gcs_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + # No need to set gcs-service-account-file + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() +``` +Python client and Hadoop command are similar to the above examples. diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index c968901e03a..1da09507afa 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -283,6 +283,24 @@ Please choose the correct jar according to your environment. In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. ::: +## Using Gravitino virual file system Java client to access the fileset + +```java +Configuration conf = new Configuration(); +conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.client.metalake","test_metalake"); +conf.set("oss-endpoint", "http://localhost:9000"); +conf.set("oss-access-key-id", "minio"); +conf.set("oss-secret-access-key", "minio123"); +Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); +FileSystem fs = filesetPath.getFileSystem(conf); +fs.mkdirs(filesetPath); +... +``` + + ## Using fileset with hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. 
@@ -339,6 +357,25 @@ hadoop dfs -ls gvfs://fileset/oss_catalog/schema/example hadoop dfs -put /path/to/local/file gvfs://fileset/oss_catalog/schema/example ``` + +## Using Gravitino virtual file system Python client + +```python +from gravitino import gvfs +options = { + "cache_size": 20, + "cache_expired_time": 3600, + "auth_type": "simple", + "oss_endpoint": "http://localhost:9000", + "oss_access_key_id": "minio", + "oss_secret_access_key": "minio123" +} +fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) + +fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") +``` + + ## Using fileset with pandas The following are examples of how to use the pandas library to access the OSS fileset @@ -362,7 +399,46 @@ ds.head() ## Fileset with credential -If the catalog has been configured with credential, you can access S3 fileset without setting `oss-access-key-id` and `oss-secret-access-key` in the properties via GVFS. More detail can be seen [here](./security/credential-vending.md#oss-credentials). +Since 0.8.0-incubating, Gravitino supports credential vending for OSS fileset. If the catalog has been configured with credential, you can access OSS fileset without providing authentication information like `oss-access-key-id` and `oss-secret-access-key` in the properties. + +### How to create a OSS Hadoop catalog with credential enabled + +Apart from configuration method in [create-oss-hadoop-catalog](#catalog-a-catalog), properties needed by [oss-credential](./security/credential-vending.md#oss-credentials) should also be set to enable credential vending for OSS fileset. + +### How to access OSS fileset with credential + +If the catalog has been configured with credential, you can access OSS fileset without providing authentication information via GVFS. Let's see how to access OSS fileset with credential: + +GVFS Java client: + +```java +Configuration conf = new Configuration(); +conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.client.metalake","test_metalake"); +// No need to set oss-access-key-id and oss-secret-access-key +Path filesetPath = new Path("gvfs://fileset/oss_test_catalog/test_schema/test_fileset/new_dir"); +FileSystem fs = filesetPath.getFileSystem(conf); +fs.mkdirs(filesetPath); +... +``` + +Spark: + +```python +spark = SparkSession.builder + .appName("oss_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + # No need to set oss-access-key-id and oss-secret-access-key + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() +``` +Python client and Hadoop command are similar to the above examples. diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 260a036f9d3..e81359e172d 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -286,6 +286,25 @@ Please choose the correct jar according to your environment. 
In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. ::: +## Using Gravitino virual file system Java client to access the fileset + +```java +Configuration conf = new Configuration(); +conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.client.metalake","test_metalake"); + +conf.set("s3-endpoint", "http://localhost:9000"); +conf.set("s3-access-key-id", "minio"); +conf.set("s3-secret-access-key", "minio123"); + +Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); +FileSystem fs = filesetPath.getFileSystem(conf); +fs.mkdirs(filesetPath); +... +``` + ## Using fileset with hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. @@ -342,6 +361,22 @@ hadoop dfs -ls gvfs://fileset/s3_catalog/schema/example hadoop dfs -put /path/to/local/file gvfs://fileset/s3_catalog/schema/example ``` +## Using Gravitino virtual file system Python client + +```python +from gravitino import gvfs +options = { + "cache_size": 20, + "cache_expired_time": 3600, + "auth_type": "simple", + "s3_endpoint": "http://localhost:9000", + "s3_access_key_id": "minio", + "s3_secret_access_key": "minio123" +} +fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) +fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") ") +``` + ## Using fileset with pandas The following are examples of how to use the pandas library to access the S3 fileset @@ -365,8 +400,46 @@ ds.head() ## Fileset with credential -If the catalog has been configured with credential, you can access S3 fileset without setting `s3-access-key-id` and `s3-secret-access-key` in the properties via GVFS. More detail can be seen [here](./security/credential-vending.md#s3-credentials). +Since 0.8.0-incubating, Gravitino supports credential vending for S3 fileset. If the catalog has been configured with credential, you can access S3 fileset without providing authentication information like `s3-access-key-id` and `s3-secret-access-key` in the properties. + +### How to create a S3 Hadoop catalog with credential enabled + +Apart from configuration method in [create-s3-hadoop-catalog](#catalog-a-catalog), properties needed by [s3-credential](./security/credential-vending.md#s3-credentials) should also be set to enable credential vending for S3 fileset. + +### How to access S3 fileset with credential +If the catalog has been configured with credential, you can access S3 fileset without providing authentication information via GVFS. 
Let's see how to access S3 fileset with credential: + +GVFS Java client: + +```java +Configuration conf = new Configuration(); +conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.client.metalake","test_metalake"); +// No need to set s3-access-key-id and s3-secret-access-key +Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); +FileSystem fs = filesetPath.getFileSystem(conf); +fs.mkdirs(filesetPath); +... +``` + +Spark: + +```python +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + # No need to set s3-access-key-id and s3-secret-access-key + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() +``` +Python client and Hadoop command are similar to the above examples. From d232e92316735ca78d3ebb63b09bcdaf3a865437 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 6 Jan 2025 08:48:23 +0800 Subject: [PATCH 05/39] polish document again. --- docs/hadoop-catalog-with-adls.md | 14 +++++++++----- docs/hadoop-catalog-with-gcs.md | 16 ++++++++++------ docs/hadoop-catalog-with-oss.md | 14 ++++++++------ docs/hadoop-catalog-with-s3.md | 14 +++++++++----- 4 files changed, 36 insertions(+), 22 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index a54a3baf4ef..2c4b2667ed9 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -19,6 +19,8 @@ $ bin/gravitino-server.sh start ## Create a Hadoop Catalog with ADLS in Gravitino +The rest of this document shows how to use the Hadoop catalog with ADLS in Gravitino with a full example. + ### Catalog a catalog Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS: @@ -41,7 +43,7 @@ Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#filese ## Using Hadoop catalog with ADLS -### Create a Hadoop catalog/schema/file set with ADLS +### Create a Hadoop catalog/schema/fileset with ADLS First, you need to create a Hadoop catalog with ADLS. The following example shows how to create a Hadoop catalog with ADLS: @@ -220,7 +222,7 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "e -## Using Spark to access the fileset +### Using Spark to access the fileset The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: @@ -279,7 +281,7 @@ Please choose the correct jar according to your environment. In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. 
::: -## Using Gravitino virual file system Java client to access the fileset +### Using Gravitino virtual file system Java client to access the fileset ```java Configuration conf = new Configuration(); @@ -295,7 +297,7 @@ fs.mkdirs(filesetPath); ... ``` -## Using fileset with hadoop fs command +### Using fileset with hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. @@ -345,7 +347,7 @@ hadoop dfs -ls gvfs://fileset/adls_catalog/schema/example hadoop dfs -put /path/to/local/file gvfs://fileset/adls_catalog/schema/example ``` -## Using Gravitino virtual file system Python client +### Using Gravitino virtual file system Python client ```python from gravitino import gvfs @@ -381,6 +383,8 @@ ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name} ds.head() ``` +For other use cases, please refer to the [Gravitino Virtual File System](./how-to-use-gvfs.md) document. + ## Fileset with credential Since 0.8.0-incubating, Gravitino supports credential vending for ADLS fileset. If the catalog has been configured with credential, you can access ADLS fileset without providing authentication information like `azure-storage-account-name` and `azure-storage-account-key` in the properties. diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index d94473f5d16..e2d129e8545 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -19,6 +19,9 @@ $ bin/gravitino-server.sh start ## Create a Hadoop Catalog with GCS in Gravitino +The rest of this document shows how to use the Hadoop catalog with GCS in Gravitino with a full example. + + ### Catalog a catalog Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS: @@ -37,7 +40,6 @@ Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema- Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. - ## Using Hadoop catalog with GCS ### Create a Hadoop catalog/schema/file set with GCS @@ -216,7 +218,7 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "e -## Using Spark to access the fileset +### Using Spark to access the fileset The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: @@ -273,7 +275,7 @@ Please choose the correct jar according to your environment. In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. ::: -## Using Gravitino virual file system Java client to access the fileset +### Using Gravitino virtual file system Java client to access the fileset ```java Configuration conf = new Configuration(); @@ -288,7 +290,7 @@ fs.mkdirs(filesetPath); ... ``` -## Using fileset with hadoop fs command +### Using fileset with hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. 
@@ -335,7 +337,7 @@ hadoop dfs -put /path/to/local/file gvfs://fileset/gcs_catalog/schema/example ``` -## Using Gravitino virtual file system Python client +### Using Gravitino virtual file system Python client ```python from gravitino import gvfs @@ -349,7 +351,7 @@ fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalak fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") ``` -## Using fileset with pandas +### Using fileset with pandas The following are examples of how to use the pandas library to access the GCS fileset @@ -368,6 +370,8 @@ ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name} ds.head() ``` +For other use cases, please refer to the [Gravitino Virtual File System](./how-to-use-gvfs.md) document. + ## Fileset with credential Since 0.8.0-incubating, Gravitino supports credential vending for GCS fileset. If the catalog has been configured with credential, you can access GCS fileset without providing authentication information like `gcs-service-account-file` in the properties. diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 1da09507afa..326c370064d 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -42,6 +42,8 @@ Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#filese ## Using Hadoop catalog with OSS +The rest of this document shows how to use the Hadoop catalog with OSS in Gravitino with a full example. + ### Create a Hadoop catalog/schema/file set with OSS First, you need to create a Hadoop catalog with OSS. The following example shows how to create a Hadoop catalog with OSS: @@ -224,7 +226,7 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "e -## Using Spark to access the fileset +### Using Spark to access the fileset The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: @@ -283,7 +285,7 @@ Please choose the correct jar according to your environment. In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. ::: -## Using Gravitino virual file system Java client to access the fileset +### Using Gravitino virtual file system Java client to access the fileset ```java Configuration conf = new Configuration(); @@ -300,8 +302,7 @@ fs.mkdirs(filesetPath); ... ``` - -## Using fileset with hadoop fs command +### Using fileset with hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. @@ -358,7 +359,7 @@ hadoop dfs -put /path/to/local/file gvfs://fileset/oss_catalog/schema/example ``` -## Using Gravitino virtual file system Python client +### Using Gravitino virtual file system Python client ```python from gravitino import gvfs @@ -376,7 +377,7 @@ fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") ``` -## Using fileset with pandas +### Using fileset with pandas The following are examples of how to use the pandas library to access the OSS fileset @@ -396,6 +397,7 @@ ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name} storage_options=storage_options) ds.head() ``` +For other use cases, please refer to the [Gravitino Virtual File System](./how-to-use-gvfs.md) document. 
## Fileset with credential diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index e81359e172d..0c184d7f387 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -42,6 +42,8 @@ Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#filese ## Using Hadoop catalog with S3 +The rest of this document shows how to use the Hadoop catalog with S3 in Gravitino with a full example. + ### Create a Hadoop catalog/schema/file set with S3 First of all, you need to create a Hadoop catalog with S3. The following example shows how to create a Hadoop catalog with S3: @@ -228,7 +230,8 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "e -## Using Spark to access the fileset + +### Using Spark to access the fileset The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: @@ -286,7 +289,7 @@ Please choose the correct jar according to your environment. In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. ::: -## Using Gravitino virual file system Java client to access the fileset +### Using Gravitino virtual file system Java client to access the fileset ```java Configuration conf = new Configuration(); @@ -305,7 +308,7 @@ fs.mkdirs(filesetPath); ... ``` -## Using fileset with hadoop fs command +### Using fileset with hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. @@ -361,7 +364,7 @@ hadoop dfs -ls gvfs://fileset/s3_catalog/schema/example hadoop dfs -put /path/to/local/file gvfs://fileset/s3_catalog/schema/example ``` -## Using Gravitino virtual file system Python client +### Using Gravitino virtual file system Python client ```python from gravitino import gvfs @@ -377,7 +380,7 @@ fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalak fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") ") ``` -## Using fileset with pandas +### Using fileset with pandas The following are examples of how to use the pandas library to access the S3 fileset @@ -397,6 +400,7 @@ ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name} storage_options=storage_options) ds.head() ``` +For other use cases, please refer to the [Gravitino Virtual File System](./how-to-use-gvfs.md) document. ## Fileset with credential From fbd57ba1e864e483d646fd7c0de54477fa216884 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 6 Jan 2025 09:25:34 +0800 Subject: [PATCH 06/39] Again --- docs/hadoop-catalog-with-adls.md | 2 ++ docs/hadoop-catalog-with-gcs.md | 4 +++- docs/hadoop-catalog-with-oss.md | 2 ++ docs/hadoop-catalog-with-s3.md | 2 ++ 4 files changed, 9 insertions(+), 1 deletion(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 2c4b2667ed9..40b4c14f767 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -297,6 +297,8 @@ fs.mkdirs(filesetPath); ... ``` +Similar to Spark configurations, you need to add ADLS bundle jars to the classpath according to your environment. + ### Using fileset with hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. 
diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index e2d129e8545..4b29db5c016 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -290,6 +290,8 @@ fs.mkdirs(filesetPath); ... ``` +Similar to Spark configurations, you need to add GCS bundle jars to the classpath according to your environment. + ### Using fileset with hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. @@ -403,7 +405,7 @@ Spark: ```python spark = SparkSession.builder - .appName("gcs_fielset_test") + .appName("gcs_fileset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 326c370064d..ecc89edd901 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -302,6 +302,8 @@ fs.mkdirs(filesetPath); ... ``` +Similar to Spark configurations, you need to add OSS bundle jars to the classpath according to your environment. + ### Using fileset with hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 0c184d7f387..94cc12b3d95 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -308,6 +308,8 @@ fs.mkdirs(filesetPath); ... ``` +Similar to Spark configurations, you need to add S3 bundle jars to the classpath according to your environment. + ### Using fileset with hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. From 4fb6e798146ebcee0298ec81517572f12eec6d23 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 6 Jan 2025 19:57:04 +0800 Subject: [PATCH 07/39] fix --- docs/hadoop-catalog-with-adls.md | 41 +++++++++++++------------- docs/hadoop-catalog-with-gcs.md | 49 ++++++++++++++++---------------- docs/hadoop-catalog-with-oss.md | 43 ++++++++++++++-------------- docs/hadoop-catalog-with-s3.md | 36 +++++++++++------------ 4 files changed, 83 insertions(+), 86 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 40b4c14f767..679a3d631b0 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -21,7 +21,7 @@ $ bin/gravitino-server.sh start The rest of this document shows how to use the Hadoop catalog with ADLS in Gravitino with a full example. -### Catalog a catalog +### Catalog a Hadoop catalog with ADLS Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS: @@ -53,7 +53,7 @@ First, you need to create a Hadoop catalog with ADLS. 
The following example show ```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ - "name": "catalog", + "name": "example_catalog", "type": "FILESET", "comment": "comment", "provider": "hadoop", @@ -82,7 +82,7 @@ adlsProperties = ImmutableMap.builder() .put("filesystem-providers", "abs") .build(); -Catalog adlsCatalog = gravitinoClient.createCatalog("catalog", +Catalog adlsCatalog = gravitinoClient.createCatalog("example_catalog", Type.FILESET, "hadoop", // provider, Gravitino only supports "hadoop" for now. "This is a ADLS fileset catalog", @@ -102,7 +102,7 @@ adls_properties = { "azure_storage_account_key": "azure storage account key" } -adls_properties = gravitino_client.create_catalog(name="catalog", +adls_properties = gravitino_client.create_catalog(name="example_catalog", type=Catalog.Type.FILESET, provider="hadoop", comment="This is a ADLS fileset catalog", @@ -123,27 +123,26 @@ Using the following code to create a schema and fileset: ```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ - "name": "schema", + "name": "test_schema", "comment": "comment", "properties": { "location": "abfss://container@account-name.dfs.core.windows.net/path" } -}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas +}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas ``` ```java -// Assuming you have just created a Hive catalog named `hive_catalog` -Catalog catalog = gravitinoClient.loadCatalog("hive_catalog"); +Catalog catalog = gravitinoClient.loadCatalog("test_catalog"); SupportsSchemas supportsSchemas = catalog.asSchemas(); Map schemaProperties = ImmutableMap.builder() .put("location", "abfss://container@account-name.dfs.core.windows.net/path") .build(); -Schema schema = supportsSchemas.createSchema("schema", +Schema schema = supportsSchemas.createSchema("test_schema", "This is a schema", schemaProperties ); @@ -155,8 +154,8 @@ Schema schema = supportsSchemas.createSchema("schema", ```python gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") -catalog: Catalog = gravitino_client.load_catalog(name="hive_catalog") -catalog.as_schemas().create_schema(name="schema", +catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") +catalog.as_schemas().create_schema(name="test_schema", comment="This is a schema", properties={"location": "abfss://container@account-name.dfs.core.windows.net/path"}) ``` @@ -177,7 +176,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "properties": { "k1": "v1" } -}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets +}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets ``` @@ -189,7 +188,7 @@ GravitinoClient gravitinoClient = GravitinoClient .withMetalake("metalake") .build(); -Catalog catalog = gravitinoClient.loadCatalog("catalog"); +Catalog catalog = gravitinoClient.loadCatalog("test_catalog"); FilesetCatalog filesetCatalog = catalog.asFilesetCatalog(); Map propertiesMap = ImmutableMap.builder() @@ -197,7 +196,7 @@ Map propertiesMap = ImmutableMap.builder() .build(); filesetCatalog.createFileset( - NameIdentifier.of("schema", "example_fileset"), + NameIdentifier.of("test_schema", "example_fileset"), "This is an example fileset", Fileset.Type.MANAGED, "abfss://container@account-name.dfs.core.windows.net/path/example_fileset", @@ -211,8 +210,8 @@ 
filesetCatalog.createFileset( ```python gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") -catalog: Catalog = gravitino_client.load_catalog(name="catalog") -catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"), +catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") +catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema", "example_fileset"), type=Fileset.Type.MANAGED, comment="This is an example fileset", storage_location="abfss://container@account-name.dfs.core.windows.net/path/example_fileset", @@ -244,7 +243,7 @@ spark = SparkSession.builder .appName("adls_fileset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") -.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") +.config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_URL}") .config("spark.hadoop.fs.gravitino.client.metalake", "test") .config("spark.hadoop.azure-storage-account-name", "azure_account_name") .config("spark.hadoop.azure-storage-account-key", "azure_account_name") @@ -273,12 +272,12 @@ os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gra ``` - [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar with Hadoop environment and `hadoop-azure` jar. -- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is the Gravitino ADLS jar without Hadoop environment and `hadoop-azure` jar. +- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without Hadoop environment and `hadoop-azure` jar. Please choose the correct jar according to your environment. :::note -In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. +In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. ::: ### Using Gravitino virtual file system Java client to access the fileset @@ -299,7 +298,7 @@ fs.mkdirs(filesetPath); Similar to Spark configurations, you need to add ADLS bundle jars to the classpath according to your environment. -### Using fileset with hadoop fs command +### Accessing a fileset using the Hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. 
@@ -349,7 +348,7 @@ hadoop dfs -ls gvfs://fileset/adls_catalog/schema/example hadoop dfs -put /path/to/local/file gvfs://fileset/adls_catalog/schema/example ``` -### Using Gravitino virtual file system Python client +### Using the Gravitino virtual file system Python client to access a fileset ```python from gravitino import gvfs diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 4b29db5c016..0e409bc4016 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -22,13 +22,13 @@ $ bin/gravitino-server.sh start The rest of this document shows how to use the Hadoop catalog with GCS in Gravitino with a full example. -### Catalog a catalog +### Catalog a Hadoop catalog with GCS Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS: | Configuration item | Description | Default value | Required | Since version | |-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `gs` if it's a GCS fileset, a comma separated string that contains `gs` like `gs,s3` to support multiple kinds of fileset including `gs`. | (none) | Yes | 0.7.0-incubating | +| `filesystem-providers` | The file system providers to add. Set it to `gcs` if it's a GCS fileset, a comma separated string that contains `gcs` like `gcs,s3` to support multiple kinds of fileset including `gcs`. | (none) | Yes | 0.7.0-incubating | | `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating | | `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset. | 0.7.0-incubating | @@ -52,7 +52,7 @@ First, you need to create a Hadoop catalog with GCS. The following example shows ```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ - "name": "catalog", + "name": "test_catalog", "type": "FILESET", "comment": "comment", "provider": "hadoop", @@ -79,7 +79,7 @@ gcsProperties = ImmutableMap.builder() .put("filesystem-providers", "gcs") .build(); -Catalog gcsCatalog = gravitinoClient.createCatalog("catalog", +Catalog gcsCatalog = gravitinoClient.createCatalog("test_catalog", Type.FILESET, "hadoop", // provider, Gravitino only supports "hadoop" for now. "This is a GCS fileset catalog", @@ -98,7 +98,7 @@ gcs_properties = { "gcs-service-account-file": "path_of_gcs_service_account_file" } -gcs_properties = gravitino_client.create_catalog(name="catalog", +gcs_properties = gravitino_client.create_catalog(name="test_catalog", type=Catalog.Type.FILESET, provider="hadoop", comment="This is a GCS fileset catalog", @@ -109,7 +109,7 @@ gcs_properties = gravitino_client.create_catalog(name="catalog", -Then create a schema and fileset in the catalog created above. +Then create a schema and a fileset in the catalog created above. 
Using the following code to create a schema and fileset: @@ -119,27 +119,26 @@ Using the following code to create a schema and fileset: ```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ - "name": "schema", + "name": "test_schema", "comment": "comment", "properties": { "location": "gs://bucket/root/schema" } -}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas +}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas ``` ```java -// Assuming you have just created a Hive catalog named `hive_catalog` -Catalog catalog = gravitinoClient.loadCatalog("hive_catalog"); +Catalog catalog = gravitinoClient.loadCatalog("test_catalog"); SupportsSchemas supportsSchemas = catalog.asSchemas(); Map schemaProperties = ImmutableMap.builder() .put("location", "gs://bucket/root/schema") .build(); -Schema schema = supportsSchemas.createSchema("schema", +Schema schema = supportsSchemas.createSchema("test_schema", "This is a schema", schemaProperties ); @@ -151,8 +150,8 @@ Schema schema = supportsSchemas.createSchema("schema", ```python gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") -catalog: Catalog = gravitino_client.load_catalog(name="hive_catalog") -catalog.as_schemas().create_schema(name="schema", +catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") +catalog.as_schemas().create_schema(name="test_schema", comment="This is a schema", properties={"location": "gs://bucket/root/schema"}) ``` @@ -173,7 +172,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "properties": { "k1": "v1" } -}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets +}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets ``` @@ -185,7 +184,7 @@ GravitinoClient gravitinoClient = GravitinoClient .withMetalake("metalake") .build(); -Catalog catalog = gravitinoClient.loadCatalog("catalog"); +Catalog catalog = gravitinoClient.loadCatalog("test_catalog"); FilesetCatalog filesetCatalog = catalog.asFilesetCatalog(); Map propertiesMap = ImmutableMap.builder() @@ -193,7 +192,7 @@ Map propertiesMap = ImmutableMap.builder() .build(); filesetCatalog.createFileset( - NameIdentifier.of("schema", "example_fileset"), + NameIdentifier.of("test_schema", "example_fileset"), "This is an example fileset", Fileset.Type.MANAGED, "gs://bucket/root/schema/example_fileset", @@ -207,8 +206,8 @@ filesetCatalog.createFileset( ```python gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") -catalog: Catalog = gravitino_client.load_catalog(name="catalog") -catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"), +catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") +catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema", "example_fileset"), type=Fileset.Type.MANAGED, comment="This is an example fileset", storage_location="gs://bucket/root/schema/example_fileset", @@ -240,8 +239,8 @@ spark = SparkSession.builder .appName("gcs_fielset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") -.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") 
-.config("spark.hadoop.fs.gravitino.client.metalake", "test") +.config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_URL}") +.config("spark.hadoop.fs.gravitino.client.metalake", "test_metalake") .config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") .config("spark.driver.memory", "2g") .config("spark.driver.port", "2048") @@ -266,13 +265,13 @@ If your Spark **without Hadoop environment**, you can use the following code sni os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell" ``` -- [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCS jar with Hadoop environment and `gcs-connector` jar. -- [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is the Gravitino GCS jar without Hadoop environment and `gcs-connector` jar. +- [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCP jar with Hadoop environment and `gcs-connector`. +- [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is a condensed version of the Gravitino GCP bundle jar without Hadoop environment and `gcs-connector`. Please choose the correct jar according to your environment. :::note -In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. +In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. ::: ### Using Gravitino virtual file system Java client to access the fileset @@ -292,7 +291,7 @@ fs.mkdirs(filesetPath); Similar to Spark configurations, you need to add GCS bundle jars to the classpath according to your environment. -### Using fileset with hadoop fs command +### Accessing a fileset using the Hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. @@ -339,7 +338,7 @@ hadoop dfs -put /path/to/local/file gvfs://fileset/gcs_catalog/schema/example ``` -### Using Gravitino virtual file system Python client +### Using the Gravitino virtual file system Python client to access a fileset ```python from gravitino import gvfs diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index ecc89edd901..5b118836e95 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -44,7 +44,7 @@ Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#filese The rest of this document shows how to use the Hadoop catalog with OSS in Gravitino with a full example. -### Create a Hadoop catalog/schema/file set with OSS +### Create a Hadoop catalog/schema/fileset with OSS First, you need to create a Hadoop catalog with OSS. The following example shows how to create a Hadoop catalog with OSS: @@ -54,7 +54,7 @@ First, you need to create a Hadoop catalog with OSS. 
The following example shows ```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ - "name": "catalog", + "name": "test_catalog", "type": "FILESET", "comment": "comment", "provider": "hadoop", @@ -85,7 +85,7 @@ ossProperties = ImmutableMap.builder() .put("filesystem-providers", "oss") .build(); -Catalog ossCatalog = gravitinoClient.createCatalog("catalog", +Catalog ossCatalog = gravitinoClient.createCatalog("test_catalog", Type.FILESET, "hadoop", // provider, Gravitino only supports "hadoop" for now. "This is a OSS fileset catalog", @@ -106,7 +106,7 @@ oss_properties = { "oss-endpoint": "ossProperties" } -oss_catalog = gravitino_client.create_catalog(name="catalog", +oss_catalog = gravitino_client.create_catalog(name="test_catalog", type=Catalog.Type.FILESET, provider="hadoop", comment="This is a OSS fileset catalog", @@ -117,7 +117,7 @@ oss_catalog = gravitino_client.create_catalog(name="catalog", -Then create a schema and fileset in the catalog created above. +Then create a schema and a fileset in the catalog created above. Using the following code to create a schema and fileset: @@ -127,27 +127,26 @@ Using the following code to create a schema and fileset: ```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ - "name": "schema", + "name": "test_schema", "comment": "comment", "properties": { "location": "oss://bucket/root/schema" } -}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas +}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas ``` ```java -// Assuming you have just created a Hive catalog named `hive_catalog` -Catalog catalog = gravitinoClient.loadCatalog("hive_catalog"); +Catalog catalog = gravitinoClient.loadCatalog("test_catalog"); SupportsSchemas supportsSchemas = catalog.asSchemas(); Map schemaProperties = ImmutableMap.builder() .put("location", "oss://bucket/root/schema") .build(); -Schema schema = supportsSchemas.createSchema("schema", +Schema schema = supportsSchemas.createSchema("test_schema", "This is a schema", schemaProperties ); @@ -159,8 +158,8 @@ Schema schema = supportsSchemas.createSchema("schema", ```python gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") -catalog: Catalog = gravitino_client.load_catalog(name="hive_catalog") -catalog.as_schemas().create_schema(name="schema", +catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") +catalog.as_schemas().create_schema(name="test_schema", comment="This is a schema", properties={"location": "oss://bucket/root/schema"}) ``` @@ -181,7 +180,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "properties": { "k1": "v1" } -}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets +}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets ``` @@ -193,7 +192,7 @@ GravitinoClient gravitinoClient = GravitinoClient .withMetalake("metalake") .build(); -Catalog catalog = gravitinoClient.loadCatalog("catalog"); +Catalog catalog = gravitinoClient.loadCatalog("test_catalog"); FilesetCatalog filesetCatalog = catalog.asFilesetCatalog(); Map propertiesMap = ImmutableMap.builder() @@ -201,7 +200,7 @@ Map propertiesMap = ImmutableMap.builder() .build(); filesetCatalog.createFileset( - NameIdentifier.of("schema", "example_fileset"), + NameIdentifier.of("test_schema", "example_fileset"), "This is an example fileset", 
Fileset.Type.MANAGED, "oss://bucket/root/schema/example_fileset", @@ -215,8 +214,8 @@ filesetCatalog.createFileset( ```python gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") -catalog: Catalog = gravitino_client.load_catalog(name="catalog") -catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"), +catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") +catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema", "example_fileset"), type=Fileset.Type.MANAGED, comment="This is an example fileset", storage_location="oss://bucket/root/schema/example_fileset", @@ -248,7 +247,7 @@ spark = SparkSession.builder .appName("oss_fielset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") -.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") +.config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_URL}") .config("spark.hadoop.fs.gravitino.client.metalake", "test") .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) .config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"]) @@ -277,12 +276,12 @@ os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-bundle-{gr ``` - [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) is the Gravitino Aliyun jar with Hadoop environment and `hadoop-oss` jar. -- [`gravitino-aliyun-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun) is the Gravitino OSS jar without Hadoop environment and `hadoop-oss` jar. +- [`gravitino-aliyun-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun) is a condensed version of the Gravitino Aliyun bundle jar without Hadoop environment and `hadoop-aliyun` jar. Please choose the correct jar according to your environment. :::note -In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. +In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. ::: ### Using Gravitino virtual file system Java client to access the fileset @@ -304,7 +303,7 @@ fs.mkdirs(filesetPath); Similar to Spark configurations, you need to add OSS bundle jars to the classpath according to your environment. -### Using fileset with hadoop fs command +### Accessing a fileset using the Hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. 
@@ -361,7 +360,7 @@ hadoop dfs -put /path/to/local/file gvfs://fileset/oss_catalog/schema/example ``` -### Using Gravitino virtual file system Python client +### Using Gravitino virtual file system Python client to access a fileset ```python from gravitino import gvfs diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 94cc12b3d95..b7cc30f26c5 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -19,7 +19,7 @@ $ bin/gravitino-server.sh start ## Create a Hadoop Catalog with S3 in Gravitino -### Catalog a catalog +### Catalog a Hadoop catalog with S3 Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with S3: @@ -54,7 +54,7 @@ First of all, you need to create a Hadoop catalog with S3. The following example ```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ - "name": "catalog", + "name": "test_catalog", "type": "FILESET", "comment": "comment", "provider": "hadoop", @@ -85,7 +85,7 @@ s3Properties = ImmutableMap.builder() .put("filesystem-providers", "s3") .build(); -Catalog s3Catalog = gravitinoClient.createCatalog("catalog", +Catalog s3Catalog = gravitinoClient.createCatalog("test_catalog", Type.FILESET, "hadoop", // provider, Gravitino only supports "hadoop" for now. "This is a S3 fileset catalog", @@ -106,7 +106,7 @@ s3_properties = { "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com" } -s3_catalog = gravitino_client.create_catalog(name="catalog", +s3_catalog = gravitino_client.create_catalog(name="test_catalog", type=Catalog.Type.FILESET, provider="hadoop", comment="This is a S3 fileset catalog", @@ -121,7 +121,7 @@ s3_catalog = gravitino_client.create_catalog(name="catalog", The value of location should always start with `s3a` NOT `s3` for AWS S3, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported due to the limitation of the hadoop-aws library. ::: -Then create a schema and fileset in the catalog created above. +Then create a schema and a fileset in the catalog created above. 
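+As a small guard for the `s3a` requirement called out in the note above, you could validate the location string before creating the schema and fileset. The helper below is purely illustrative and is not part of the Gravitino API:
+
+```python
+def check_s3_location(location: str) -> str:
+    # The hadoop-aws library only supports the s3a:// scheme, so reject anything else early.
+    if not location.startswith("s3a://"):
+        raise ValueError(f"Unsupported scheme in '{location}', use 's3a://bucket/path' instead")
+    return location
+
+# Passes for the documented form and fails fast for an s3:// location.
+schema_location = check_s3_location("s3a://bucket/root/schema")
+```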
Using the following code to create a schema and fileset: @@ -131,12 +131,12 @@ Using the following code to create a schema and fileset: ```shell curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ - "name": "schema", + "name": "test_schema", "comment": "comment", "properties": { "location": "s3a://bucket/root/schema" } -}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas +}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas ``` @@ -151,7 +151,7 @@ SupportsSchemas supportsSchemas = catalog.asSchemas(); Map schemaProperties = ImmutableMap.builder() .put("location", "s3a://bucket/root/schema") .build(); -Schema schema = supportsSchemas.createSchema("schema", +Schema schema = supportsSchemas.createSchema("test_schema", "This is a schema", schemaProperties ); @@ -163,8 +163,8 @@ Schema schema = supportsSchemas.createSchema("schema", ```python gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") -catalog: Catalog = gravitino_client.load_catalog(name="hive_catalog") -catalog.as_schemas().create_schema(name="schema", +catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") +catalog.as_schemas().create_schema(name="test_schema", comment="This is a schema", properties={"location": "s3a://bucket/root/schema"}) ``` @@ -185,7 +185,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "properties": { "k1": "v1" } -}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets +}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets ``` @@ -197,7 +197,7 @@ GravitinoClient gravitinoClient = GravitinoClient .withMetalake("metalake") .build(); -Catalog catalog = gravitinoClient.loadCatalog("catalog"); +Catalog catalog = gravitinoClient.loadCatalog("test_catalog"); FilesetCatalog filesetCatalog = catalog.asFilesetCatalog(); Map propertiesMap = ImmutableMap.builder() @@ -205,7 +205,7 @@ Map propertiesMap = ImmutableMap.builder() .build(); filesetCatalog.createFileset( - NameIdentifier.of("schema", "example_fileset"), + NameIdentifier.of("test_schema", "example_fileset"), "This is an example fileset", Fileset.Type.MANAGED, "s3a://bucket/root/schema/example_fileset", @@ -253,7 +253,7 @@ spark = SparkSession.builder .appName("s3_fielset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_URL}") .config("spark.hadoop.fs.gravitino.client.metalake", "test") .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) @@ -281,12 +281,12 @@ os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-v ``` - [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) is the Gravitino AWS jar with Hadoop environment and `hadoop-aws` jar. -- [`gravitino-aws-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws) is the Gravitino AWS jar without Hadoop environment and `hadoop-aws` jar. 
+- [`gravitino-aws-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws) is a condensed version of the Gravitino AWS bundle jar without Hadoop environment and `hadoop-aws` jar. Please choose the correct jar according to your environment. :::note -In some Spark version, Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work, in this case, you should add the jars to the spark classpath directly. +In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. ::: ### Using Gravitino virtual file system Java client to access the fileset @@ -310,7 +310,7 @@ fs.mkdirs(filesetPath); Similar to Spark configurations, you need to add S3 bundle jars to the classpath according to your environment. -### Using fileset with hadoop fs command +### Accessing a fileset using the Hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. @@ -366,7 +366,7 @@ hadoop dfs -ls gvfs://fileset/s3_catalog/schema/example hadoop dfs -put /path/to/local/file gvfs://fileset/s3_catalog/schema/example ``` -### Using Gravitino virtual file system Python client +### Using the Gravitino virtual file system Python client to access a fileset ```python from gravitino import gvfs From e481c8d0217cfd990b8cae75e0ab52eb2ef73524 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 6 Jan 2025 21:52:34 +0800 Subject: [PATCH 08/39] fix --- docs/hadoop-catalog-with-adls.md | 8 ++++---- docs/hadoop-catalog-with-gcs.md | 8 ++++---- docs/hadoop-catalog-with-oss.md | 8 ++++---- docs/hadoop-catalog-with-s3.md | 8 ++++---- 4 files changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 679a3d631b0..5f77153f6dc 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -17,11 +17,11 @@ at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server $ bin/gravitino-server.sh start ``` -## Create a Hadoop Catalog with ADLS in Gravitino +## Create a Hadoop Catalog with ADLS The rest of this document shows how to use the Hadoop catalog with ADLS in Gravitino with a full example. -### Catalog a Hadoop catalog with ADLS +### Catalog a ADLS Hadoop catalog Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS: @@ -337,8 +337,8 @@ The following are examples of how to use the `hadoop fs` command to access the f 2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. -Copy the corresponding jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. For ADLS, you need to copy `gravitino-azure-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. -then copy `hadoop-azure-${version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, for simple you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. 
+For ADLS, you need to copy `gravitino-azure-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory, +then copy `hadoop-azure-${version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. 3. Run the following command to access the fileset: diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 0e409bc4016..640ee5ee098 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -17,12 +17,12 @@ at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server $ bin/gravitino-server.sh start ``` -## Create a Hadoop Catalog with GCS in Gravitino +## Create a Hadoop Catalog with GCS The rest of this document shows how to use the Hadoop catalog with GCS in Gravitino with a full example. -### Catalog a Hadoop catalog with GCS +### Catalog a GCS Hadoop catalog Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS: @@ -326,8 +326,8 @@ The following are examples of how to use the `hadoop fs` command to access the f 2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. -Copy the corresponding jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. For GCS, you need to copy `gravitino-gcp-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. -then copy `hadoop-gcp-${version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, for simple you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +For GCS, you need to copy `gravitino-gcp-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +Then copy `hadoop-gcp-${version}.jar` and other possible dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. 3. Run the following command to access the fileset: diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 5b118836e95..a61c90600a3 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -17,9 +17,9 @@ at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server $ bin/gravitino-server.sh start ``` -## Create a Hadoop Catalog with OSS in Gravitino +## Create a Hadoop Catalog with OSS -### Catalog a catalog +### Catalog an OSS Hadoop catalog Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS: @@ -348,8 +348,8 @@ The following are examples of how to use the `hadoop fs` command to access the f 2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. -Copy the corresponding jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. 
For OSS, you need to copy `gravitino-aliyun-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. -then copy hadoop-aliyun-{version}.jar and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, for simple you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +For OSS, you need to copy `gravitino-aliyun-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory, +then copy hadoop-aliyun-{version}.jar and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. 3. Run the following command to access the fileset: diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index b7cc30f26c5..b1a724f1f2f 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -17,9 +17,9 @@ at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server $ bin/gravitino-server.sh start ``` -## Create a Hadoop Catalog with S3 in Gravitino +## Create a Hadoop Catalog with S3 -### Catalog a Hadoop catalog with S3 +### Catalog a S3 Hadoop catalog Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with S3: @@ -355,8 +355,8 @@ The following are examples of how to use the `hadoop fs` command to access the f 2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. -Copy the corresponding jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. For S3, you need to copy `gravitino-aws-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. -then copy hadoop-aws-{version}.jar and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, for simple you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +For S3, you need to copy `gravitino-aws-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directoryl, +then copy hadoop-aws-{version}.jar and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. 3. 
Run the following command to access the fileset: From 0a97fc7a07e5c424dabc46769ba72bea86f374a7 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 6 Jan 2025 21:59:09 +0800 Subject: [PATCH 09/39] fix --- docs/hadoop-catalog-with-adls.md | 28 ++++++++++++++-------------- docs/hadoop-catalog-with-gcs.md | 30 +++++++++++++++--------------- docs/hadoop-catalog-with-oss.md | 14 +++++++------- docs/hadoop-catalog-with-s3.md | 8 ++++---- 4 files changed, 40 insertions(+), 40 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 5f77153f6dc..3ef052da730 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -63,7 +63,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "azure-storage-account-key": "The account key of the Azure Blob Storage", "filesystem-providers": "abs" } -}' http://localhost:8090/api/metalakes/metalake/catalogs +}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs ``` @@ -71,7 +71,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ ```java GravitinoClient gravitinoClient = GravitinoClient - .builder("http://localhost:8090") + .builder("${GRAVITINO_SERVER_IP:PORT}") .withMetalake("metalake") .build(); @@ -95,7 +95,7 @@ Catalog adlsCatalog = gravitinoClient.createCatalog("example_catalog", ```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") +gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake") adls_properties = { "location": "abfss://container@account-name.dfs.core.windows.net/path", "azure_storage_account_name": "azure storage account name", @@ -128,7 +128,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "properties": { "location": "abfss://container@account-name.dfs.core.windows.net/path" } -}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas +}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs/test_catalog/schemas ``` @@ -176,7 +176,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "properties": { "k1": "v1" } -}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets +}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets ``` @@ -184,7 +184,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ ```java GravitinoClient gravitinoClient = GravitinoClient - .builder("http://localhost:8090") + .builder("${GRAVITINO_SERVER_IP:PORT}") .withMetalake("metalake") .build(); @@ -208,7 +208,7 @@ filesetCatalog.createFileset( ```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") +gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake") catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema", "example_fileset"), @@ -231,7 +231,7 @@ from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, Graviti from pyspark.sql import SparkSession import os -gravitino_url = "http://localhost:8090" +gravitino_url = "${GRAVITINO_SERVER_IP:PORT}" metalake_name = "test" catalog_name = "your_adls_catalog" @@ -286,7 +286,7 @@ In some Spark versions, a Hadoop environment is needed by the driver, adding the Configuration conf = new Configuration(); 
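+// Register the gvfs file system implementations, then point the client at the Gravitino server, metalake, and ADLS credentials.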
conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.server.uri","${GRAVITINO_SERVER_URL}"); conf.set("fs.gravitino.client.metalake","test_metalake"); conf.set("azure-storage-account-name", "account_name_of_adls"); conf.set("azure-storage-account-key", "account_key_of_adls"); @@ -317,7 +317,7 @@ The following are examples of how to use the `hadoop fs` command to access the f fs.gravitino.server.uri - http://192.168.50.188:8090 + ${GRAVITINO_SERVER_IP:PORT} @@ -359,7 +359,7 @@ options = { "azure_storage_account_name": "azure_account_name", "azure_storage_account_key": "azure_account_key" } -fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) +fs = gvfs.GravitinoVirtualFileSystem(server_uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="test_metalake", options=options) fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") ``` @@ -372,7 +372,7 @@ The following are examples of how to use the pandas library to access the ADLS f import pandas as pd storage_options = { - "server_uri": "http://localhost:8090", + "server_uri": "${GRAVITINO_SERVER_IP:PORT}", "metalake_name": "test", "options": { "azure_storage_account_name": "azure_account_name", @@ -404,7 +404,7 @@ GVFS Java client: Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.server.uri","${GRAVITINO_SERVER_IP:PORT}"); conf.set("fs.gravitino.client.metalake","test_metalake"); // No need to set azure-storage-account-name and azure-storage-account-name Path filesetPath = new Path("gvfs://fileset/adls_test_catalog/test_schema/test_fileset/new_dir"); @@ -420,7 +420,7 @@ spark = SparkSession.builder .appName("adls_fielset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_IP:PORT}") .config("spark.hadoop.fs.gravitino.client.metalake", "test") # No need to set azure-storage-account-name and azure-storage-account-name .config("spark.driver.memory", "2g") diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 640ee5ee098..9578a952a50 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -42,7 +42,7 @@ Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#filese ## Using Hadoop catalog with GCS -### Create a Hadoop catalog/schema/file set with GCS +### Create a Hadoop catalog/schema/fileset with GCS First, you need to create a Hadoop catalog with GCS. 
The following example shows how to create a Hadoop catalog with GCS: @@ -61,7 +61,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "gcs-service-account-file": "path_of_gcs_service_account_file", "filesystem-providers": "gcs" } -}' http://localhost:8090/api/metalakes/metalake/catalogs +}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs ``` @@ -69,7 +69,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ ```java GravitinoClient gravitinoClient = GravitinoClient - .builder("http://localhost:8090") + .builder("${GRAVITINO_SERVER_IP:PORT}") .withMetalake("metalake") .build(); @@ -92,7 +92,7 @@ Catalog gcsCatalog = gravitinoClient.createCatalog("test_catalog", ```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") +gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake") gcs_properties = { "location": "gs://bucket/root", "gcs-service-account-file": "path_of_gcs_service_account_file" @@ -124,7 +124,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "properties": { "location": "gs://bucket/root/schema" } -}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas +}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs/test_catalog/schemas ``` @@ -172,7 +172,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "properties": { "k1": "v1" } -}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets +}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets ``` @@ -180,7 +180,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ ```java GravitinoClient gravitinoClient = GravitinoClient - .builder("http://localhost:8090") + .builder("${GRAVITINO_SERVER_IP:PORT}") .withMetalake("metalake") .build(); @@ -204,7 +204,7 @@ filesetCatalog.createFileset( ```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") +gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake") catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema", "example_fileset"), @@ -227,7 +227,7 @@ from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, Graviti from pyspark.sql import SparkSession import os -gravitino_url = "http://localhost:8090" +gravitino_url = "${GRAVITINO_SERVER_IP:PORT}" metalake_name = "test" catalog_name = "your_gcs_catalog" @@ -280,7 +280,7 @@ In some Spark versions, a Hadoop environment is needed by the driver, adding the Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.server.uri","${GRAVITINO_SERVER_IP:PORT}"); conf.set("fs.gravitino.client.metalake","test_metalake"); conf.set("gcs-service-account-file", "/path/your-service-account-file.json"); Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); @@ -310,7 +310,7 @@ The following are examples of how to use the `hadoop fs` command to access the f fs.gravitino.server.uri - http://192.168.50.188:8090 + ${GRAVITINO_SERVER_IP:PORT} 
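+    <!-- Replace ${GRAVITINO_SERVER_IP:PORT} with the address of your Gravitino server, for example http://localhost:8090. -->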
@@ -348,7 +348,7 @@ options = { "auth_type": "simple", "gcs_service_account_file": "path_of_gcs_service_account_file.json", } -fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) +fs = gvfs.GravitinoVirtualFileSystem(server_uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="test_metalake", options=options) fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") ``` @@ -360,7 +360,7 @@ The following are examples of how to use the pandas library to access the GCS fi import pandas as pd storage_options = { - "server_uri": "http://localhost:8090", + "server_uri": "${GRAVITINO_SERVER_IP:PORT}", "metalake_name": "test", "options": { "gcs_service_account_file": "path_of_gcs_service_account_file.json", @@ -391,7 +391,7 @@ GVFS Java client: Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.server.uri","${GRAVITINO_SERVER_IP:PORT}"); conf.set("fs.gravitino.client.metalake","test_metalake"); // No need to set gcs-service-account-file Path filesetPath = new Path("gvfs://fileset/gcs_test_catalog/test_schema/test_fileset/new_dir"); @@ -407,7 +407,7 @@ spark = SparkSession.builder .appName("gcs_fileset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_IP:PORT}") .config("spark.hadoop.fs.gravitino.client.metalake", "test") # No need to set gcs-service-account-file .config("spark.driver.memory", "2g") diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index a61c90600a3..761501079d3 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -292,7 +292,7 @@ conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hado conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); conf.set("fs.gravitino.server.uri","http://localhost:8090"); conf.set("fs.gravitino.client.metalake","test_metalake"); -conf.set("oss-endpoint", "http://localhost:9000"); +conf.set("oss-endpoint", "${GRAVITINO_SERVER_IP:PORT}"); conf.set("oss-access-key-id", "minio"); conf.set("oss-secret-access-key", "minio123"); Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); @@ -322,7 +322,7 @@ The following are examples of how to use the `hadoop fs` command to access the f fs.gravitino.server.uri - http://192.168.50.188:8090 + ${GRAVITINO_SERVER_IP:PORT} @@ -368,11 +368,11 @@ options = { "cache_size": 20, "cache_expired_time": 3600, "auth_type": "simple", - "oss_endpoint": "http://localhost:9000", + "oss_endpoint": "${GRAVITINO_SERVER_IP:PORT}", "oss_access_key_id": "minio", "oss_secret_access_key": "minio123" } -fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) +fs = gvfs.GravitinoVirtualFileSystem(server_uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="test_metalake", options=options) fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") ``` @@ -386,7 +386,7 @@ The 
following are examples of how to use the pandas library to access the OSS fi import pandas as pd storage_options = { - "server_uri": "http://localhost:8090", + "server_uri": "${GRAVITINO_SERVER_IP:PORT}", "metalake_name": "test", "options": { "oss_access_key_id": "access_key", @@ -418,7 +418,7 @@ GVFS Java client: Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.server.uri","${GRAVITINO_SERVER_IP:PORT}"); conf.set("fs.gravitino.client.metalake","test_metalake"); // No need to set oss-access-key-id and oss-secret-access-key Path filesetPath = new Path("gvfs://fileset/oss_test_catalog/test_schema/test_fileset/new_dir"); @@ -434,7 +434,7 @@ spark = SparkSession.builder .appName("oss_fielset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_IP:PORT}") .config("spark.hadoop.fs.gravitino.client.metalake", "test") # No need to set oss-access-key-id and oss-secret-access-key .config("spark.driver.memory", "2g") diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index b1a724f1f2f..de3c3e7fc3d 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -295,10 +295,10 @@ In some Spark versions, a Hadoop environment is needed by the driver, adding the Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); +conf.set("fs.gravitino.server.uri","${GRAVITINO_SERVER_IP:PORT}"); conf.set("fs.gravitino.client.metalake","test_metalake"); -conf.set("s3-endpoint", "http://localhost:9000"); +conf.set("s3-endpoint", "${GRAVITINO_SERVER_IP:PORT}"); conf.set("s3-access-key-id", "minio"); conf.set("s3-secret-access-key", "minio123"); @@ -329,7 +329,7 @@ The following are examples of how to use the `hadoop fs` command to access the f fs.gravitino.server.uri - http://192.168.50.188:8090 + ${GRAVITINO_SERVER_IP:PORT} @@ -374,7 +374,7 @@ options = { "cache_size": 20, "cache_expired_time": 3600, "auth_type": "simple", - "s3_endpoint": "http://localhost:9000", + "s3_endpoint": "${GRAVITINO_SERVER_IP:PORT}", "s3_access_key_id": "minio", "s3_secret_access_key": "minio123" } From a6fbe7ba68d9d30c0a7f2e113433fc24a706992c Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 7 Jan 2025 09:58:38 +0800 Subject: [PATCH 10/39] fix --- docs/hadoop-catalog-with-adls.md | 6 ++++-- docs/hadoop-catalog-with-gcs.md | 5 +++-- docs/hadoop-catalog-with-oss.md | 5 +++-- docs/hadoop-catalog-with-s3.md | 11 ++++++----- 4 files changed, 16 insertions(+), 11 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 3ef052da730..22edc17cf4a 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -10,8 +10,8 @@ This document describes how to configure a Hadoop catalog with ADLS (Azure Blob ## Prerequisites -In order to create a Hadoop 
catalog with ADLS, you need to place [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) in Gravitino Hadoop classpath located -at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server with the following command: +In order to create a Hadoop catalog with ADLS, you need to place [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) in Gravitino Hadoop catalog classpath located +at `${GRAVITINO_HOME}/catalogs/hadoop/libs//`. After that, start Gravitino server with the following command: ```bash $ bin/gravitino-server.sh start @@ -273,6 +273,8 @@ os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gra - [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar with Hadoop environment and `hadoop-azure` jar. - [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without Hadoop environment and `hadoop-azure` jar. +- `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory. + Please choose the correct jar according to your environment. diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 9578a952a50..e93f725d0e2 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -10,8 +10,8 @@ This document describes how to configure a Hadoop catalog with GCS. ## Prerequisites -In order to create a Hadoop catalog with GCS, you need to place [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) in Gravitino Hadoop classpath located -at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server with the following command: +In order to create a Hadoop catalog with GCS, you need to place [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) in Gravitino Hadoop catalog classpath located +at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. After that, start Gravitino server with the following command: ```bash $ bin/gravitino-server.sh start @@ -267,6 +267,7 @@ os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-bundle-{gravi - [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCP jar with Hadoop environment and `gcs-connector`. - [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is a condensed version of the Gravitino GCP bundle jar without Hadoop environment and `gcs-connector`. +- `gcs-connector-hadoop3-2.2.22-shaded.jar` can be found [here](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) Please choose the correct jar according to your environment. 
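+If it helps to automate that choice, the sketch below shows one way to pick the GCS jars depending on whether a Hadoop installation is available. The helper itself is an illustrative assumption and not part of Gravitino; only the jar names come from the list above:
+
+```python
+import os
+
+def gcs_jars(gravitino_version: str) -> list:
+    # With an existing Hadoop environment, the condensed jar plus the gcs-connector is enough;
+    # without one, fall back to the self-contained bundle jar.
+    if os.environ.get("HADOOP_HOME"):
+        return [
+            f"gravitino-gcp-{gravitino_version}.jar",
+            "gcs-connector-hadoop3-2.2.22-shaded.jar",
+        ]
+    return [f"gravitino-gcp-bundle-{gravitino_version}.jar"]
+
+# Join the names with commas to build the --jars value used in the PySpark snippet above.
+print(",".join(gcs_jars("0.8.0-incubating")))
+```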
diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md
index 761501079d3..634b40ad7e2 100644
--- a/docs/hadoop-catalog-with-oss.md
+++ b/docs/hadoop-catalog-with-oss.md
@@ -10,8 +10,8 @@ This document describes how to configure a Hadoop catalog with Aliyun OSS.
 
 ## Prerequisites
 
-In order to create a Hadoop catalog with OSS, you need to place [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) in Gravitino Hadoop classpath located
-at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server with the following command:
+In order to create a Hadoop catalog with OSS, you need to place [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) in Gravitino Hadoop catalog classpath located
+at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. After that, start Gravitino server with the following command:
 
 ```bash
 $ bin/gravitino-server.sh start
@@ -277,6 +277,7 @@ os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-bundle-{gr
 
 - [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) is the Gravitino Aliyun jar with Hadoop environment and `hadoop-oss` jar.
 - [`gravitino-aliyun-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun) is a condensed version of the Gravitino Aliyun bundle jar without Hadoop environment and `hadoop-aliyun` jar.
+- `hadoop-aliyun-3.2.0.jar` and `aliyun-sdk-oss-2.8.3.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory.
 
 Please choose the correct jar according to your environment.
 
diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md
index de3c3e7fc3d..890ae0662e7 100644
--- a/docs/hadoop-catalog-with-s3.md
+++ b/docs/hadoop-catalog-with-s3.md
@@ -10,8 +10,8 @@ This document describes how to configure a Hadoop catalog with S3.
 
 ## Prerequisites
 
-In order to create a Hadoop catalog with S3, you need to place [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) in Gravitino Hadoop classpath located
-at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server with the following command:
+In order to create a Hadoop catalog with S3, you need to place [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) in Gravitino Hadoop catalog classpath located
+at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. After that, start Gravitino server with the following command:
 
 ```bash
 $ bin/gravitino-server.sh start
@@ -44,7 +44,7 @@ Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#filese
 
 The rest of this document shows how to use the Hadoop catalog with S3 in Gravitino with a full example.
 
-### Create a Hadoop catalog/schema/file set with S3
+### Create a Hadoop catalog/schema/fileset with S3
 
 First of all, you need to create a Hadoop catalog with S3.
The following example shows how to create a Hadoop catalog with S3: @@ -277,11 +277,12 @@ If your Spark **without Hadoop environment**, you can use the following code sni ```python ## Replace the following code snippet with the above code snippet with the same environment variables -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell" +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-bundle-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar --master local[1] pyspark-shell" ``` - [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) is the Gravitino AWS jar with Hadoop environment and `hadoop-aws` jar. - [`gravitino-aws-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws) is a condensed version of the Gravitino AWS bundle jar without Hadoop environment and `hadoop-aws` jar. +- `hadoop-aws-3.2.0.jar` and `aws-java-sdk-bundle-1.11.375.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory. Please choose the correct jar according to your environment. @@ -355,7 +356,7 @@ The following are examples of how to use the `hadoop fs` command to access the f 2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. -For S3, you need to copy `gravitino-aws-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directoryl, +For S3, you need to copy `gravitino-aws-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory, then copy hadoop-aws-{version}.jar and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. From 7b47a9b029228a9eb91af514037185219c4f8839 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 7 Jan 2025 10:03:07 +0800 Subject: [PATCH 11/39] fix --- docs/hadoop-catalog-with-adls.md | 4 ++-- docs/hadoop-catalog-with-gcs.md | 4 ++-- docs/hadoop-catalog-with-oss.md | 4 ++-- docs/hadoop-catalog-with-s3.md | 4 ++-- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 22edc17cf4a..28227a8a754 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -21,9 +21,9 @@ $ bin/gravitino-server.sh start The rest of this document shows how to use the Hadoop catalog with ADLS in Gravitino with a full example. 
-### Catalog a ADLS Hadoop catalog +### Create a ADLS Hadoop catalog -Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS: +Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS: | Configuration item | Description | Default value | Required | Since version | |-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|-------------------------------------------|------------------| diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index e93f725d0e2..d96b1f87869 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -22,9 +22,9 @@ $ bin/gravitino-server.sh start The rest of this document shows how to use the Hadoop catalog with GCS in Gravitino with a full example. -### Catalog a GCS Hadoop catalog +### Create a GCS Hadoop catalog -Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS: +Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS: | Configuration item | Description | Default value | Required | Since version | |-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------|------------------| diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 634b40ad7e2..4f7dd26defd 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -19,9 +19,9 @@ $ bin/gravitino-server.sh start ## Create a Hadoop Catalog with OSS -### Catalog an OSS Hadoop catalog +### Create an OSS Hadoop catalog -Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS: +Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS: | Configuration item | Description | Default value | Required | Since version | |-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------|------------------| diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 890ae0662e7..a42234ef381 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -19,9 +19,9 @@ $ bin/gravitino-server.sh start ## Create a Hadoop Catalog with S3 -### Catalog a S3 
Hadoop catalog +### Create a S3 Hadoop catalog -Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with S3: +Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with S3: | Configuration item | Description | Default value | Required | Since version | |-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|------------------| From ab0745565504922e26cddec8be8f2d8c64a4fe51 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 7 Jan 2025 10:25:09 +0800 Subject: [PATCH 12/39] Polish the doc --- docs/hadoop-catalog-with-adls.md | 8 ++++---- docs/hadoop-catalog-with-gcs.md | 6 +++--- docs/hadoop-catalog-with-oss.md | 6 +++--- docs/hadoop-catalog-with-s3.md | 10 +++++----- 4 files changed, 15 insertions(+), 15 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 28227a8a754..f0cab6da677 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -115,7 +115,7 @@ adls_properties = gravitino_client.create_catalog(name="example_catalog", Then create a schema and fileset in the catalog created above. -Using the following code to create a schema and fileset: +Using the following code to create a schema and a fileset: @@ -346,8 +346,8 @@ then copy `hadoop-azure-${version}.jar` and related dependencies to the `${HADOO 3. Run the following command to access the fileset: ```shell -hadoop dfs -ls gvfs://fileset/adls_catalog/schema/example -hadoop dfs -put /path/to/local/file gvfs://fileset/adls_catalog/schema/example +hadoop dfs -ls gvfs://fileset/adls_catalog/adls_schema/adls_fileset +hadoop dfs -put /path/to/local/file gvfs://fileset/adls_catalog/adls_schema/adls_fileset ``` ### Using the Gravitino virtual file system Python client to access a fileset @@ -362,7 +362,7 @@ options = { "azure_storage_account_key": "azure_account_key" } fs = gvfs.GravitinoVirtualFileSystem(server_uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="test_metalake", options=options) -fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") +fs.ls("gvfs://fileset/{adls_catalog}/{adls_schema}/{adls_fileset}/") ``` diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index d96b1f87869..1466cb21754 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -111,7 +111,7 @@ gcs_properties = gravitino_client.create_catalog(name="test_catalog", Then create a schema and a fileset in the catalog created above. -Using the following code to create a schema and fileset: +Using the following code to create a schema and a fileset: @@ -334,8 +334,8 @@ Then copy `hadoop-gcp-${version}.jar` and other possible dependencies to the `${ 3. 
Run the following command to access the fileset: ```shell -hadoop dfs -ls gvfs://fileset/gcs_catalog/schema/example -hadoop dfs -put /path/to/local/file gvfs://fileset/gcs_catalog/schema/example +hadoop dfs -ls gvfs://fileset/gcs_catalog/gcs_schema/gcs_example +hadoop dfs -put /path/to/local/file gvfs://fileset/gcs_catalog/gcs_schema/gcs_example ``` diff --git a/docs/hadoop-catalog-with-oss.md index 4f7dd26defd..9afdb2e9c79 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -119,7 +119,7 @@ oss_catalog = gravitino_client.create_catalog(name="test_catalog", Then create a schema and a fileset in the catalog created above. -Using the following code to create a schema and fileset: +Using the following code to create a schema and a fileset: @@ -356,8 +356,8 @@ then copy hadoop-aliyun-{version}.jar and related dependencies to the `${HADOOP_ 3. Run the following command to access the fileset: ```shell -hadoop dfs -ls gvfs://fileset/oss_catalog/schema/example -hadoop dfs -put /path/to/local/file gvfs://fileset/oss_catalog/schema/example +hadoop dfs -ls gvfs://fileset/oss_catalog/oss_schema/oss_fileset +hadoop dfs -put /path/to/local/file gvfs://fileset/oss_catalog/oss_schema/oss_fileset ``` diff --git a/docs/hadoop-catalog-with-s3.md index a42234ef381..3928f67dc55 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -123,7 +123,7 @@ The value of location should always start with `s3a` NOT `s3` for AWS S3, for in Then create a schema and a fileset in the catalog created above. -Using the following code to create a schema and fileset: +Using the following code to create a schema and a fileset: @@ -303,13 +303,13 @@ conf.set("s3-endpoint", "${GRAVITINO_SERVER_IP:PORT}"); conf.set("s3-access-key-id", "minio"); conf.set("s3-secret-access-key", "minio123"); -Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); +Path filesetPath = new Path("gvfs://fileset/s3_catalog/s3_schema/s3_fileset/new_dir"); FileSystem fs = filesetPath.getFileSystem(conf); fs.mkdirs(filesetPath); ... ``` -Similar to Spark configurations, you need to add S3 bundle jars to the classpath according to your environment. +Similar to Spark configurations, you need to add S3 (bundle) jars to the classpath according to your environment. ### Accessing a fileset using the Hadoop fs command @@ -363,8 +363,8 @@ then copy hadoop-aws-{version}.jar and related dependencies to the `${HADOOP_HOM 3. 
Run the following command to access the fileset: ```shell -hadoop dfs -ls gvfs://fileset/s3_catalog/schema/example -hadoop dfs -put /path/to/local/file gvfs://fileset/s3_catalog/schema/example +hadoop dfs -ls gvfs://fileset/s3_catalog/s3_schema/s3_fileset +hadoop dfs -put /path/to/local/file gvfs://fileset/s3_catalog/s3_schema/s3_fileset ``` ### Using the Gravitino virtual file system Python client to access a fileset From 6c1aac3578016ffd2446e35cf2207030fb2feabf Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 7 Jan 2025 12:09:13 +0800 Subject: [PATCH 13/39] Optimize the docs --- docs/hadoop-catalog-with-adls.md | 33 ++++++++++++++++++------ docs/hadoop-catalog-with-gcs.md | 22 ++++++++++------ docs/hadoop-catalog-with-oss.md | 43 ++++++++++++++++++++------------ docs/hadoop-catalog-with-s3.md | 43 ++++++++++++++++++++------------ 4 files changed, 94 insertions(+), 47 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index f0cab6da677..96a3c47b2a6 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -10,8 +10,17 @@ This document describes how to configure a Hadoop catalog with ADLS (Azure Blob ## Prerequisites -In order to create a Hadoop catalog with ADLS, you need to place [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) in Gravitino Hadoop catalog classpath located -at `${GRAVITINO_HOME}/catalogs/hadoop/libs//`. After that, start Gravitino server with the following command: +To set up a Hadoop catalog with ADLS, follow these steps: + +1. Download the [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) file. +2. Place the downloaded file into the Gravitino Hadoop catalog classpath at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. +3. Start the Gravitino server by running the following command: + +```bash +$ bin/gravitino-server.sh start +``` +Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS. + ```bash $ bin/gravitino-server.sh start @@ -21,7 +30,7 @@ $ bin/gravitino-server.sh start The rest of this document shows how to use the Hadoop catalog with ADLS in Gravitino with a full example. -### Create a ADLS Hadoop catalog +### Configuration for a ADLS Hadoop catalog Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS: @@ -32,18 +41,20 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./ | `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | | `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | -### Create a schema +### Configuration for a schema Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details. -### Create a fileset +### Configuration for a fileset Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. ## Using Hadoop catalog with ADLS -### Create a Hadoop catalog/schema/fileset with ADLS +This section demonstrates how to use the Hadoop catalog with ADLS in Gravitino, with a complete example. 
+ +### Step1: Create a Hadoop catalog with ADLS First, you need to create a Hadoop catalog with ADLS. The following example shows how to create a Hadoop catalog with ADLS: @@ -113,9 +124,9 @@ adls_properties = gravitino_client.create_catalog(name="example_catalog", -Then create a schema and fileset in the catalog created above. +### Step2: Create a schema -Using the following code to create a schema and a fileset: +Once the catalog is created, you can create a schema. The following example shows how to create a schema: @@ -163,6 +174,10 @@ catalog.as_schemas().create_schema(name="test_schema", +### Step3: Create a fileset + +After creating the schema, you can create a fileset. The following example shows how to create a fileset: + @@ -221,6 +236,8 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema +## Accessing a fileset with ADLS + ### Using Spark to access the fileset The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 1466cb21754..ce23fdd1f3b 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -21,8 +21,7 @@ $ bin/gravitino-server.sh start The rest of this document shows how to use the Hadoop catalog with GCS in Gravitino with a full example. - -### Create a GCS Hadoop catalog +### Configuration for a GCS Hadoop catalog Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS: @@ -32,17 +31,19 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./ | `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating | | `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset. | 0.7.0-incubating | -### Create a schema +### Configuration for a schema Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details. -### Create a fileset +### Configuration for a fileset Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. ## Using Hadoop catalog with GCS -### Create a Hadoop catalog/schema/fileset with GCS +This section will show you how to use the Hadoop catalog with GCS in Gravitino, including detailed examples. + +### Create a Hadoop catalog with GCS First, you need to create a Hadoop catalog with GCS. The following example shows how to create a Hadoop catalog with GCS: @@ -109,9 +110,9 @@ gcs_properties = gravitino_client.create_catalog(name="test_catalog", -Then create a schema and a fileset in the catalog created above. +### Step2: Create a schema -Using the following code to create a schema and a fileset: +Once you have created a Hadoop catalog with GCS, you can create a schema. The following example shows how to create a schema: @@ -159,6 +160,11 @@ catalog.as_schemas().create_schema(name="test_schema", + +### Step3: Create a fileset + +After creating a schema, you can create a fileset. 
The following example shows how to create a fileset: + @@ -217,6 +223,8 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema +## Accessing a fileset with GCS + ### Using Spark to access the fileset The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 9afdb2e9c79..15e481087c9 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -6,22 +6,26 @@ keyword: Hadoop catalog OSS license: "This software is licensed under the Apache License version 2." --- -This document describes how to configure a Hadoop catalog with Aliyun OSS. +This document explains how to configure a Hadoop catalog with Aliyun OSS (Object Storage Service) in Gravitino. ## Prerequisites -In order to create a Hadoop catalog with OSS, you need to place [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) in Gravitino Hadoop catalog classpath located -at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. After that, start Gravitino server with the following command: +To set up a Hadoop catalog with OSS, follow these steps: + +1. Download the [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) file. +2. Place the downloaded file into the Gravitino Hadoop catalog classpath at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. +3. Start the Gravitino server by running the following command: ```bash $ bin/gravitino-server.sh start ``` +Once the server is up and running, you can proceed to configure the Hadoop catalog with OSS. ## Create a Hadoop Catalog with OSS -### Create an OSS Hadoop catalog +### Configuration for an OSS Hadoop catalog -Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS: +In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS: | Configuration item | Description | Default value | Required | Since version | |-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------|------------------| @@ -31,22 +35,21 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./ | `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | | `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | -### Create a schema - -Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details. +### Configuration for a schema -### Create a fileset +To create a schema, refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations). -Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. 
+### Configuration for a fileset +For instructions on how to create a fileset, refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. ## Using Hadoop catalog with OSS -The rest of this document shows how to use the Hadoop catalog with OSS in Gravitino with a full example. +This section will show you how to use the Hadoop catalog with OSS in Gravitino, including detailed examples. -### Create a Hadoop catalog/schema/fileset with OSS +### Create a Hadoop catalog with OSS -First, you need to create a Hadoop catalog with OSS. The following example shows how to create a Hadoop catalog with OSS: +First, you need to create a Hadoop catalog for OSS. The following examples demonstrate how to create a Hadoop catalog with OSS: @@ -117,9 +120,9 @@ oss_catalog = gravitino_client.create_catalog(name="test_catalog", -Then create a schema and a fileset in the catalog created above. +Step 2: Create a Schema -Using the following code to create a schema and a fileset: +Once the Hadoop catalog with OSS is created, you can create a schema inside that catalog. Below are examples of how to do this: @@ -167,6 +170,12 @@ catalog.as_schemas().create_schema(name="test_schema", + +### Create a fileset + +Now that the schema is created, you can create a fileset inside it. Here’s how: + + @@ -225,6 +234,8 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema +## Accessing a fileset with OSS + ### Using Spark to access the fileset The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: @@ -432,7 +443,7 @@ Spark: ```python spark = SparkSession.builder - .appName("oss_fielset_test") + .appName("oss_fileset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") .config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_IP:PORT}") diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 3928f67dc55..c475dd9f29f 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -6,22 +6,28 @@ keyword: Hadoop catalog S3 license: "This software is licensed under the Apache License version 2." --- -This document describes how to configure a Hadoop catalog with S3. +This document explains how to configure a Hadoop catalog with S3 in Gravitino. ## Prerequisites -In order to create a Hadoop catalog with S3, you need to place [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) in Gravitino Hadoop catalog classpath located -at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. After that, start Gravitino server with the following command: +To create a Hadoop catalog with S3, follow these steps: + +1. Download the [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) file. +2. Place this file in the Gravitino Hadoop catalog classpath at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. +3. Start the Gravitino server using the following command: ```bash $ bin/gravitino-server.sh start ``` +Once the server is running, you can proceed to create the Hadoop catalog with S3. 
+ + ## Create a Hadoop Catalog with S3 -### Create a S3 Hadoop catalog +### Configuration for S3 Hadoop Catalog -Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with S3: +In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are necessary to configure a Hadoop catalog with S3: | Configuration item | Description | Default value | Required | Since version | |-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|------------------| @@ -31,20 +37,20 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./ | `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | | `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | -### Create a schema +### Configuration for a schema -Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details. +To learn how to create a schema, refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations). -### Create a fileset +### Configuration for a fileset -Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. +For more details on creating a fileset, Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations). -## Using Hadoop catalog with S3 +## Using the Hadoop catalog with S3 -The rest of this document shows how to use the Hadoop catalog with S3 in Gravitino with a full example. +This section demonstrates how to use the Hadoop catalog with S3 in Gravitino, with a complete example. -### Create a Hadoop catalog/schema/fileset with S3 +### Step1: Create a Hadoop Catalog with S3 First of all, you need to create a Hadoop catalog with S3. The following example shows how to create a Hadoop catalog with S3: @@ -118,12 +124,12 @@ s3_catalog = gravitino_client.create_catalog(name="test_catalog", :::note -The value of location should always start with `s3a` NOT `s3` for AWS S3, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported due to the limitation of the hadoop-aws library. +When using S3 with Hadoop, ensure that the location value starts with s3a:// (not s3://) for AWS S3. For example, use s3a://bucket/root, as the s3:// format is not supported by the hadoop-aws library. ::: -Then create a schema and a fileset in the catalog created above. +### Step2: Create a schema -Using the following code to create a schema and a fileset: +Once your Hadoop catalog with S3 is created, you can create a schema under the catalog. Here are examples of how to do that: @@ -172,6 +178,10 @@ catalog.as_schemas().create_schema(name="test_schema", +### Step3: Create a fileset + +After creating the schema, you can create a fileset. 
Here are examples for creating a fileset: + @@ -230,10 +240,11 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "e +## Accessing a fileset with S3 ### Using Spark to access the fileset -The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: +The following Python code demonstrates how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: ```python import logging From 44014d9ae43f72726fb203450e513a36a0077b17 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 7 Jan 2025 14:12:14 +0800 Subject: [PATCH 14/39] format code. --- docs/hadoop-catalog-with-adls.md | 20 ++++++++------------ docs/hadoop-catalog-with-gcs.md | 23 ++++++++++++----------- docs/hadoop-catalog-with-oss.md | 14 +++++++------- docs/hadoop-catalog-with-s3.md | 12 ++++++------ 4 files changed, 33 insertions(+), 36 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 96a3c47b2a6..1e14fe7d1d5 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -21,16 +21,13 @@ $ bin/gravitino-server.sh start ``` Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS. - ```bash $ bin/gravitino-server.sh start ``` -## Create a Hadoop Catalog with ADLS - -The rest of this document shows how to use the Hadoop catalog with ADLS in Gravitino with a full example. +## Configurations for creating a Hadoop catalog with ADLS -### Configuration for a ADLS Hadoop catalog +### Configuration for a ADLS Hadoop catalog Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS: @@ -41,18 +38,17 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./ | `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | | `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | -### Configuration for a schema - -Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details. +### Configurations for a schema -### Configuration for a fileset +Refer to [Schema configurations](./hadoop-catalog.md#schema-properties) for more details. -Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. +### Configurations for a fileset +Refer to [Fileset configurations](./hadoop-catalog.md#fileset-properties) for more details. -## Using Hadoop catalog with ADLS +## Example of creating Hadoop catalog with ADLS -This section demonstrates how to use the Hadoop catalog with ADLS in Gravitino, with a complete example. +This section demonstrates how to create the Hadoop catalog with ADLS in Gravitino, with a complete example. ### Step1: Create a Hadoop catalog with ADLS diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index ce23fdd1f3b..86bbb324ed9 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -9,19 +9,20 @@ license: "This software is licensed under the Apache License version 2." This document describes how to configure a Hadoop catalog with GCS. 
## Prerequisites +To set up a Hadoop catalog with GCS, follow these steps: -In order to create a Hadoop catalog with GCS, you need to place [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) in Gravitino Hadoop catalog classpath located -at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. After that, start Gravitino server with the following command: +1. Download the [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) file. +2. Place the downloaded file into the Gravitino Hadoop catalog classpath at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`. +3. Start the Gravitino server by running the following command: ```bash $ bin/gravitino-server.sh start ``` +Once the server is up and running, you can proceed to configure the Hadoop catalog with GCS. ## Configurations for creating a Hadoop catalog with GCS ### Configurations for a GCS Hadoop catalog Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS: @@ -32,15 +32,15 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./ | `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating | | `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset. | 0.7.0-incubating | -### Configuration for a schema +### Configurations for a schema -Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details. +Refer to [Schema configurations](./hadoop-catalog.md#schema-properties) for more details. -### Configuration for a fileset +### Configurations for a fileset -Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. +Refer to [Fileset configurations](./hadoop-catalog.md#fileset-properties) for more details. -## Using Hadoop catalog with GCS +## Example of creating Hadoop catalog with GCS This section will show you how to use the Hadoop catalog with GCS in Gravitino, including detailed examples. diff --git a/docs/hadoop-catalog-with-oss.md index 15e481087c9..93750bd6053 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -21,7 +21,7 @@ $ bin/gravitino-server.sh start ``` Once the server is up and running, you can proceed to configure the Hadoop catalog with OSS. -## Create a Hadoop Catalog with OSS +## Configurations for creating a Hadoop catalog with OSS ### Configuration for an OSS Hadoop catalog @@ -35,19 +35,19 @@ In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-con | `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | | `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. 
| 0.7.0-incubating | -### Configuration for a schema +### Configurations for a schema -To create a schema, refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations). +To create a schema, refer to [Schema configurations](./hadoop-catalog.md#schema-properties). -### Configuration for a fileset +### Configurations for a fileset -For instructions on how to create a fileset, refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details. +For instructions on how to create a fileset, refer to [Fileset configurations](./hadoop-catalog.md#fileset-properties) for more details. -## Using Hadoop catalog with OSS +## Example of creating Hadoop catalog/schema/fileset with OSS This section will show you how to use the Hadoop catalog with OSS in Gravitino, including detailed examples. -### Create a Hadoop catalog with OSS +### Step1: Create a Hadoop catalog with OSS First, you need to create a Hadoop catalog for OSS. The following examples demonstrate how to create a Hadoop catalog with OSS: diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index c475dd9f29f..5b83455d3ea 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -23,9 +23,9 @@ $ bin/gravitino-server.sh start Once the server is running, you can proceed to create the Hadoop catalog with S3. -## Create a Hadoop Catalog with S3 +## Configurations for creating a Hadoop catalog with S3 -### Configuration for S3 Hadoop Catalog +### Configurations for S3 Hadoop Catalog In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are necessary to configure a Hadoop catalog with S3: @@ -37,13 +37,13 @@ In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-con | `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | | `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | -### Configuration for a schema +### Configurations for a schema -To learn how to create a schema, refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations). +To learn how to create a schema, refer to [Schema configurations](./hadoop-catalog.md#schema-properties). -### Configuration for a fileset +### Configurations for a fileset -For more details on creating a fileset, Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations). +For more details on creating a fileset, Refer to [Fileset configurations](./hadoop-catalog.md#fileset-properties). ## Using the Hadoop catalog with S3 From 8563c9171609fb2677bd136b00f0705b699cf7e1 Mon Sep 17 00:00:00 2001 From: yuqi Date: Wed, 8 Jan 2025 16:17:59 +0800 Subject: [PATCH 15/39] polish document --- docs/hadoop-catalog-with-adls.md | 4 ++-- docs/hadoop-catalog-with-gcs.md | 4 ++-- docs/hadoop-catalog-with-oss.md | 4 ++-- docs/hadoop-catalog-with-s3.md | 4 ++-- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 1e14fe7d1d5..894201fea36 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -403,11 +403,11 @@ For other use cases, please refer to the [Gravitino Virtual File System](./how-t ## Fileset with credential -Since 0.8.0-incubating, Gravitino supports credential vending for ADLS fileset. 
If the catalog has been configured with credential, you can access ADLS fileset without providing authentication information like `azure-storage-account-name` and `azure-storage-account-key` in the properties. +Since 0.8.0-incubating, Gravitino supports credential vending for ADLS fileset. If the catalog has been [configured with credential](./security/credential-vending.md), you can access ADLS fileset without providing authentication information like `azure-storage-account-name` and `azure-storage-account-key` in the properties. ### How to create an ADLS Hadoop catalog with credential enabled -Apart from configuration method in [create-adls-hadoop-catalog](#catalog-a-catalog), properties needed by [adls-credential](./security/credential-vending.md#adls-credentials) should also be set to enable credential vending for ADLSfileset. +Apart from configuration method in [create-adls-hadoop-catalog](#configuration-for-a-adls-hadoop-catalog), properties needed by [adls-credential](./security/credential-vending.md#adls-credentials) should also be set to enable credential vending for ADLSfileset. ### How to access ADLS fileset with credential diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 86bbb324ed9..5ca254fd3e7 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -385,11 +385,11 @@ For other use cases, please refer to the [Gravitino Virtual File System](./how-t ## Fileset with credential -Since 0.8.0-incubating, Gravitino supports credential vending for GCS fileset. If the catalog has been configured with credential, you can access GCS fileset without providing authentication information like `gcs-service-account-file` in the properties. +Since 0.8.0-incubating, Gravitino supports credential vending for GCS fileset. If the catalog has been [configured with credential](./security/credential-vending.md), you can access GCS fileset without providing authentication information like `gcs-service-account-file` in the properties. ### How to create a GCS Hadoop catalog with credential enabled -Apart from configuration method in [create-gcs-hadoop-catalog](#catalog-a-catalog), properties needed by [gcs-credential](./security/credential-vending.md#gcs-credentials) should also be set to enable credential vending for GCS fileset. +Apart from configuration method in [create-gcs-hadoop-catalog](#configurations-for-a-gcs-hadoop-catalog), properties needed by [gcs-credential](./security/credential-vending.md#gcs-credentials) should also be set to enable credential vending for GCS fileset. ### How to access GCS fileset with credential diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 93750bd6053..58cbac6543f 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -414,11 +414,11 @@ For other use cases, please refer to the [Gravitino Virtual File System](./how-t ## Fileset with credential -Since 0.8.0-incubating, Gravitino supports credential vending for OSS fileset. If the catalog has been configured with credential, you can access OSS fileset without providing authentication information like `oss-access-key-id` and `oss-secret-access-key` in the properties. +Since 0.8.0-incubating, Gravitino supports credential vending for OSS fileset. If the catalog has been [configured with credential](./security/credential-vending.md), you can access OSS fileset without providing authentication information like `oss-access-key-id` and `oss-secret-access-key` in the properties. 
### How to create a OSS Hadoop catalog with credential enabled -Apart from configuration method in [create-oss-hadoop-catalog](#catalog-a-catalog), properties needed by [oss-credential](./security/credential-vending.md#oss-credentials) should also be set to enable credential vending for OSS fileset. +Apart from configuration method in [create-oss-hadoop-catalog](#configuration-for-an-oss-hadoop-catalog), properties needed by [oss-credential](./security/credential-vending.md#oss-credentials) should also be set to enable credential vending for OSS fileset. ### How to access OSS fileset with credential diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 5b83455d3ea..4cb7f920896 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -418,11 +418,11 @@ For other use cases, please refer to the [Gravitino Virtual File System](./how-t ## Fileset with credential -Since 0.8.0-incubating, Gravitino supports credential vending for S3 fileset. If the catalog has been configured with credential, you can access S3 fileset without providing authentication information like `s3-access-key-id` and `s3-secret-access-key` in the properties. +Since 0.8.0-incubating, Gravitino supports credential vending for S3 fileset. If the catalog has been [configured with credential](./security/credential-vending.md), you can access S3 fileset without providing authentication information like `s3-access-key-id` and `s3-secret-access-key` in the properties. ### How to create a S3 Hadoop catalog with credential enabled -Apart from configuration method in [create-s3-hadoop-catalog](#catalog-a-catalog), properties needed by [s3-credential](./security/credential-vending.md#s3-credentials) should also be set to enable credential vending for S3 fileset. +Apart from configuration method in [create-s3-hadoop-catalog](#configurations-for-s3-hadoop-catalog), properties needed by [s3-credential](./security/credential-vending.md#s3-credentials) should also be set to enable credential vending for S3 fileset. ### How to access S3 fileset with credential From 0b066a548fa881e5ae75d5adf811c6a70f4e6d45 Mon Sep 17 00:00:00 2001 From: yuqi Date: Wed, 8 Jan 2025 18:06:35 +0800 Subject: [PATCH 16/39] polish docs --- docs/hadoop-catalog-with-adls.md | 6 +-- docs/hadoop-catalog-with-gcs.md | 3 ++ docs/hadoop-catalog-with-oss.md | 6 +-- docs/hadoop-catalog-with-s3.md | 5 +-- docs/how-to-use-gvfs.md | 71 ++++++++++++++++---------------- 5 files changed, 44 insertions(+), 47 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 894201fea36..e1d719161af 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -350,11 +350,9 @@ The following are examples of how to use the `hadoop fs` command to access the f ``` -2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. - -For ADLS, you need to copy `gravitino-azure-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory, -then copy `hadoop-azure-${version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +2. Add the necessary jars to the Hadoop classpath. 
+For ADLS, you need to add `gravitino-azure-${gravitino-version}.jar` and `hadoop-azure-${hadoop-version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. 3. Run the following command to access the fileset: diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 5ca254fd3e7..0ab3aa8ec69 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -339,6 +339,9 @@ The following are examples of how to use the `hadoop fs` command to access the f For GCS, you need to copy `gravitino-gcp-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. Then copy `hadoop-gcp-${version}.jar` and other possible dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +2. Add the necessary jars to the Hadoop classpath. + +For GCS, you need to add `gravitino-gcp-${gravitino-version}.jar` and `gcs-connector-hadoop3-2.2.22-shaded.jar` can be found [here](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. 3. Run the following command to access the fileset: diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 58cbac6543f..1ed2523f31b 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -358,11 +358,9 @@ The following are examples of how to use the `hadoop fs` command to access the f ``` -2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. - -For OSS, you need to copy `gravitino-aliyun-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory, -then copy hadoop-aliyun-{version}.jar and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +2. Add the necessary jars to the Hadoop classpath. +For OSS, you need to add `gravitino-aliyun-${gravitino-version}.jar` and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. 3. Run the following command to access the fileset: diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 4cb7f920896..4f0c023c79e 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -365,10 +365,9 @@ The following are examples of how to use the `hadoop fs` command to access the f ``` -2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +2. Add the necessary jars to the Hadoop classpath. -For S3, you need to copy `gravitino-aws-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory, -then copy hadoop-aws-{version}.jar and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. 
Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. +For S3, you need to add `gravitino-aws-${gravitino-version}.jar` and `hadoop-aws-${hadoop-version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. 3. Run the following command to access the fileset: diff --git a/docs/how-to-use-gvfs.md b/docs/how-to-use-gvfs.md index d32ad3da672..9514f5457eb 100644 --- a/docs/how-to-use-gvfs.md +++ b/docs/how-to-use-gvfs.md @@ -42,8 +42,7 @@ the path mapping and convert automatically. ### Prerequisites -+ A Hadoop environment with HDFS running. GVFS has been tested against - Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2. + - GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2. x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues. @@ -71,11 +70,11 @@ Apart from the above properties, to access fileset like S3, GCS, OSS and custom #### S3 fileset -| Configuration item | Description | Default value | Required | Since version | -|------------------------|-------------------------------|---------------|--------------------------|------------------| -| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating | -| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating | -| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|------------------------|-------------------------------|---------------|----------|------------------| +| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | At the same time, you need to add the corresponding bundle jar 1. [`gravitino-aws-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/) in the classpath if no Hadoop environment is available, or @@ -84,9 +83,9 @@ At the same time, you need to add the corresponding bundle jar #### GCS fileset -| Configuration item | Description | Default value | Required | Since version | -|----------------------------|--------------------------------------------|---------------|---------------------------|------------------| -| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset.| 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|----------------------------|--------------------------------------------|---------------|----------|------------------| +| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes | 0.7.0-incubating | In the meantime, you need to add the corresponding bundle jar 1. 
[`gravitino-gcp-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp-bundle/) in the classpath if no hadoop environment is available, or @@ -95,11 +94,11 @@ In the meantime, you need to add the corresponding bundle jar #### OSS fileset -| Configuration item | Description | Default value | Required | Since version | -|-------------------------|-----------------------------------|---------------|---------------------------|------------------| -| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes if it's a OSS fileset.| 0.7.0-incubating | -| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset.| 0.7.0-incubating | -| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset.| 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|-------------------------|-----------------------------------|---------------|----------|------------------| +| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | In the meantime, you need to place the corresponding bundle jar @@ -108,10 +107,10 @@ In the meantime, you need to place the corresponding bundle jar #### Azure Blob Storage fileset -| Configuration item | Description | Default value | Required | Since version | -|------------------------------|-----------------------------------------|---------------|-------------------------------------------|------------------| -| `azure-storage-account-name` | The account name of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | -| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|------------------------------|-----------------------------------------|---------------|----------|------------------| +| `azure-storage-account-name` | The account name of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | +| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | Similar to the above, you need to place the corresponding bundle jar 1. [`gravitino-azure-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-azure-bundle/) in the classpath if no hadoop environment is available, or @@ -467,32 +466,32 @@ to recompile the native libraries like `libhdfs` and others, and completely repl The following properties are required if you want to access the S3 fileset via the GVFS python client: -| Configuration item | Description | Default value | Required | Since version | -|----------------------------|------------------------------|---------------|--------------------------|------------------| -| `s3_endpoint` | The endpoint of the AWS S3. 
| (none) | Yes if it's a S3 fileset.| 0.7.0-incubating | -| `s3_access_key_id` | The access key of the AWS S3.| (none) | Yes if it's a S3 fileset.| 0.7.0-incubating | -| `s3_secret_access_key` | The secret key of the AWS S3.| (none) | Yes if it's a S3 fileset.| 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|----------------------------|------------------------------|---------------|----------|------------------| +| `s3_endpoint` | The endpoint of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3_access_key_id` | The access key of the AWS S3.| (none) | Yes | 0.7.0-incubating | +| `s3_secret_access_key` | The secret key of the AWS S3.| (none) | Yes | 0.7.0-incubating | The following properties are required if you want to access the GCS fileset via the GVFS python client: -| Configuration item | Description | Default value | Required | Since version | -|----------------------------|-------------------------------------------|---------------|---------------------------|------------------| -| `gcs_service_account_file` | The path of GCS service account JSON file.| (none) | Yes if it's a GCS fileset.| 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|----------------------------|-------------------------------------------|---------------|----------|------------------| +| `gcs_service_account_file` | The path of GCS service account JSON file.| (none) | Yes | 0.7.0-incubating | The following properties are required if you want to access the OSS fileset via the GVFS python client: -| Configuration item | Description | Default value | Required | Since version | -|----------------------------|-----------------------------------|---------------|----------------------------|------------------| -| `oss_endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | -| `oss_access_key_id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | -| `oss_secret_access_key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|----------------------------|-----------------------------------|---------------|----------|------------------| +| `oss_endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss_access_key_id` | The access key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss_secret_access_key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | For Azure Blob Storage fileset, you need to configure the following properties: -| Configuration item | Description | Default value | Required | Since version | -|--------------------|----------------------------------------|---------------|-------------------------------------------|------------------| -| `abs_account_name` | The account name of Azure Blob Storage | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | -| `abs_account_key` | The account key of Azure Blob Storage | (none) | Yes if it's a Azure Blob Storage fileset. 
| 0.8.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|--------------------|----------------------------------------|---------------|----------|------------------| +| `abs_account_name` | The account name of Azure Blob Storage | (none) | Yes | 0.8.0-incubating | +| `abs_account_key` | The account key of Azure Blob Storage | (none) | Yes | 0.8.0-incubating | You can configure these properties when obtaining the `Gravitino Virtual FileSystem` in Python like this: From 4c6f4c8e78076f2b785e3f9791159fbfa045814a Mon Sep 17 00:00:00 2001 From: yuqi Date: Wed, 8 Jan 2025 18:09:38 +0800 Subject: [PATCH 17/39] typo --- docs/hadoop-catalog-with-adls.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index e1d719161af..77a5d2ead59 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -405,7 +405,7 @@ Since 0.8.0-incubating, Gravitino supports credential vending for ADLS fileset. ### How to create an ADLS Hadoop catalog with credential enabled -Apart from configuration method in [create-adls-hadoop-catalog](#configuration-for-a-adls-hadoop-catalog), properties needed by [adls-credential](./security/credential-vending.md#adls-credentials) should also be set to enable credential vending for ADLSfileset. +Apart from configuration method in [create-adls-hadoop-catalog](#configuration-for-a-adls-hadoop-catalog), properties needed by [adls-credential](./security/credential-vending.md#adls-credentials) should also be set to enable credential vending for ADLS fileset. ### How to access ADLS fileset with credential From 76f651e337e1090b5a4317a623079a2e729cdde1 Mon Sep 17 00:00:00 2001 From: yuqi Date: Thu, 9 Jan 2025 09:29:30 +0800 Subject: [PATCH 18/39] Polish document again. --- docs/hadoop-catalog-with-s3.md | 3 +- docs/hadoop-catalog.md | 79 +++++++--------------------------- 2 files changed, 17 insertions(+), 65 deletions(-) diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 4f0c023c79e..70ad8a17ff0 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -413,7 +413,8 @@ ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name} storage_options=storage_options) ds.head() ``` -For other use cases, please refer to the [Gravitino Virtual File System](./how-to-use-gvfs.md) document. + +For more use cases, please refer to the [Gravitino Virtual File System](./how-to-use-gvfs.md) document. ## Fileset with credential diff --git a/docs/hadoop-catalog.md b/docs/hadoop-catalog.md index d5427844ad8..244f1811a66 100644 --- a/docs/hadoop-catalog.md +++ b/docs/hadoop-catalog.md @@ -23,16 +23,19 @@ Hadoop 3. If there's any compatibility issue, please create an [issue](https://g Besides the [common catalog properties](./gravitino-server-config.md#apache-gravitino-catalog-properties-configuration), the Hadoop catalog has the following properties: -| Property Name | Description | Default Value | Required | Since Version | -|------------------------|----------------------------------------------------|---------------|----------|------------------| -| `location` | The storage location managed by Hadoop catalog. | (none) | No | 0.5.0 | -| `credential-providers` | The credential provider types, separated by comma. 
| (none) | No | 0.8.0-incubating | +| Property Name | Description | Default Value | Required | Since Version | +|-------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|-----------|------------------| +| `location` | The storage location managed by Hadoop catalog. | (none) | No | 0.5.0 | +| `credential-providers` | The credential provider types, separated by comma. | (none) | No | 0.8.0-incubating | +| `default-filesystem-provider` | The default filesystem provider of this Hadoop catalog if users do not specify the scheme in the URI. Candidate values are 'builtin-local', 'builtin-hdfs', 's3', 'gcs', 'abs' and 'oss'. Default value is `builtin-local`. For S3, if we set this value to 's3', we can omit the prefix 's3a://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `filesystem-providers` | The file system providers to add. Users needs to set this configuration to support cloud storage or custom HCFS. For instance, set it to `s3` or a comma separated string that contains `s3` like `gs,s3` to support multiple kinds of fileset including `s3`. | (none) | Yes | 0.7.0-incubating | + Please refer to [Credential vending](./security/credential-vending.md) for more details about credential vending. -Apart from the above properties, to access fileset like HDFS, S3, GCS, OSS or custom fileset, you need to configure the following extra properties. +### HDFS fileset -#### HDFS fileset +Apart from the above properties, to access fileset like HDFS fileset, you need to configure the following extra properties. | Property Name | Description | Default Value | Required | Since Version | |----------------------------------------------------|------------------------------------------------------------------------------------------------|---------------|-------------------------------------------------------------|---------------| @@ -43,65 +46,13 @@ Apart from the above properties, to access fileset like HDFS, S3, GCS, OSS or cu | `authentication.kerberos.check-interval-sec` | The check interval of Kerberos credential for Hadoop catalog. | 60 | No | 0.5.1 | | `authentication.kerberos.keytab-fetch-timeout-sec` | The fetch timeout of retrieving Kerberos keytab from `authentication.kerberos.keytab-uri`. | 60 | No | 0.5.1 | -#### S3 fileset - -| Configuration item | Description | Default value | Required | Since version | -|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `s3` if it's a S3 fileset, or a comma separated string that contains `s3` like `gs,s3` to support multiple kinds of fileset including `s3`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for S3, if we set this value, we can omit the prefix 's3a://' in the location. 
| `builtin-local` | No | 0.7.0-incubating | -| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | -| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | -| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | - -Please refer to [S3 credentials](./security/credential-vending.md#s3-credentials) for credential related configurations. - -At the same time, you need to place the corresponding bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`. - -#### GCS fileset - -| Configuration item | Description | Default value | Required | Since version | -|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `gs` if it's a GCS fileset, a comma separated string that contains `gs` like `gs,s3` to support multiple kinds of fileset including `gs`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating | -| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset. | 0.7.0-incubating | - -Please refer to [GCS credentials](./security/credential-vending.md#gcs-credentials) for credential related configurations. - -In the meantime, you need to place the corresponding bundle jar [`gravitino-gcp-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`. - -#### OSS fileset - -| Configuration item | Description | Default value | Required | Since version | -|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `oss` if it's a OSS fileset, or a comma separated string that contains `oss` like `oss,gs,s3` to support multiple kinds of fileset including `oss`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for OSS, if we set this value, we can omit the prefix 'oss://' in the location. | `builtin-local` | No | 0.7.0-incubating | -| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | -| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | -| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. 
| 0.7.0-incubating | - -Please refer to [OSS credentials](./security/credential-vending.md#oss-credentials) for credential related configurations. - -In the meantime, you need to place the corresponding bundle jar [`gravitino-aliyun-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aliyun-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`. - -#### Azure Blob Storage fileset - -| Configuration item | Description | Default value | Required | Since version | -|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|-------------------------------------------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `abs` if it's a Azure Blob Storage fileset, or a comma separated string that contains `abs` like `oss,abs,s3` to support multiple kinds of fileset including `abs`. | (none) | Yes | 0.8.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for Azure Blob Storage, if we set this value, we can omit the prefix 'abfss://' in the location. | `builtin-local` | No | 0.8.0-incubating | -| `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | -| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | - -Please refer to [ADLS credentials](./security/credential-vending.md#adls-credentials) for credential related configurations. - -Similar to the above, you need to place the corresponding bundle jar [`gravitino-azure-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-azure-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`. - -:::note -- Gravitino contains builtin file system providers for local file system(`builtin-local`) and HDFS(`builtin-hdfs`), that is to say if `filesystem-providers` is not set, Gravitino will still support local file system and HDFS. Apart from that, you can set the `filesystem-providers` to support other file systems like S3, GCS, OSS or custom file system. -- `default-filesystem-provider` is used to set the default file system provider for the Hadoop catalog. If the user does not specify the scheme in the URI, Gravitino will use the default file system provider to access the fileset. For example, if the default file system provider is set to `builtin-local`, the user can omit the prefix `file:///` in the location. -::: +### Hadoop catalog with Cloud Storage +- For S3, please refer to [Hadoop-catalog-with-s3](./hadoop-catalog-with-s3.md) for more details. +- For GCS, please refer to [Hadoop-catalog-with-gcs](./hadoop-catalog-with-gcs.md) for more details. +- For OSS, please refer to [Hadoop-catalog-with-oss](./hadoop-catalog-with-oss.md) for more details. +- For Azure Blob Storage, please refer to [Hadoop-catalog-with-adls](./hadoop-catalog-with-adls.md) for more details. -#### How to custom your own HCFS file system fileset? +### How to custom your own HCFS file system fileset? 
Developers and users can custom their own HCFS file system fileset by implementing the `FileSystemProvider` interface in the jar [gravitino-catalog-hadoop](https://repo1.maven.org/maven2/org/apache/gravitino/catalog-hadoop/). The `FileSystemProvider` interface is defined as follows: From 51446ce53b2b8fdb2b3e72943a02171a2ce911eb Mon Sep 17 00:00:00 2001 From: yuqi Date: Thu, 9 Jan 2025 11:00:55 +0800 Subject: [PATCH 19/39] fix --- docs/hadoop-catalog-with-adls.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 77a5d2ead59..7894bc175cb 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -21,10 +21,6 @@ $ bin/gravitino-server.sh start ``` Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS. -```bash -$ bin/gravitino-server.sh start -``` - ## Configurations for creating a Hadoop catalog with ADLS ### Configuration for a ADLS Hadoop catalog From 2b9c35f4df6622eb91bf4c8c8ab89ee793c8c11d Mon Sep 17 00:00:00 2001 From: yuqi Date: Thu, 9 Jan 2025 14:08:48 +0800 Subject: [PATCH 20/39] Fix error. --- docs/hadoop-catalog-with-adls.md | 25 +++++++++++++------------ docs/hadoop-catalog-with-gcs.md | 21 +++++++++++---------- docs/hadoop-catalog-with-oss.md | 25 +++++++++++++------------ docs/hadoop-catalog-with-s3.md | 25 +++++++++++++------------ 4 files changed, 50 insertions(+), 46 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 7894bc175cb..1655bda848d 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -27,12 +27,12 @@ Once the server is up and running, you can proceed to configure the Hadoop catal Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS: -| Configuration item | Description | Default value | Required | Since version | -|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|-------------------------------------------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `abs` if it's a Azure Blob Storage fileset, or a comma separated string that contains `abs` like `oss,abs,s3` to support multiple kinds of fileset including `abs`. | (none) | Yes | 0.8.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for Azure Blob Storage, if we set this value, we can omit the prefix 'abfss://' in the location. | `builtin-local` | No | 0.8.0-incubating | -| `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating | -| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. 
| 0.8.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `abs` if it's a Azure Blob Storage fileset, or a comma separated string that contains `abs` like `oss,abs,s3` to support multiple kinds of fileset including `abs`. | (none) | Yes | 0.8.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for Azure Blob Storage, if we set this value, we can omit the prefix 'abfss://' in the location. | `builtin-local` | No | 0.8.0-incubating | +| `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | +| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | ### Configurations for a schema @@ -58,7 +58,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "name": "example_catalog", "type": "FILESET", - "comment": "comment", + "comment": "This is a ADLS fileset catalog", "provider": "hadoop", "properties": { "location": "abfss://container@account-name.dfs.core.windows.net/path", @@ -101,8 +101,9 @@ Catalog adlsCatalog = gravitinoClient.createCatalog("example_catalog", gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake") adls_properties = { "location": "abfss://container@account-name.dfs.core.windows.net/path", - "azure_storage_account_name": "azure storage account name", - "azure_storage_account_key": "azure storage account key" + "azure-storage-account-name": "azure storage account name", + "azure-storage-account-key": "azure storage account key", + "filesystem-providers": "abs" } adls_properties = gravitino_client.create_catalog(name="example_catalog", @@ -127,7 +128,7 @@ Once the catalog is created, you can create a schema. The following example show curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "name": "test_schema", - "comment": "comment", + "comment": "This is a ADLS schema", "properties": { "location": "abfss://container@account-name.dfs.core.windows.net/path" } @@ -146,7 +147,7 @@ Map schemaProperties = ImmutableMap.builder() .put("location", "abfss://container@account-name.dfs.core.windows.net/path") .build(); Schema schema = supportsSchemas.createSchema("test_schema", - "This is a schema", + "This is a ADLS schema", schemaProperties ); // ... 
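In addition to the `ls` calls shown later in this document, the GVFS Python client is fsspec based, so plain file reads and writes against the virtual path work as well. The snippet below is only a sketch: it reuses the `example_catalog`, `test_schema`, and `example_fileset` names from this section, the account values are placeholders, and it assumes your deployment permits writes through the standard fsspec `open` call.

```python
from gravitino import gvfs

# Option names follow the ADLS examples in this document; the values are placeholders.
options = {
    "auth_type": "simple",
    "azure_storage_account_name": "azure_account_name",
    "azure_storage_account_key": "azure_account_key",
}
fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="metalake",
    options=options,
)

path = "gvfs://fileset/example_catalog/test_schema/example_fileset/hello.txt"

# Write a small file through the fileset's virtual path, then read it back.
with fs.open(path, "wb") as f:
    f.write(b"hello from gvfs")

with fs.open(path, "rb") as f:
    print(f.read())

# List the fileset directory to confirm the file landed where expected.
print(fs.ls("gvfs://fileset/example_catalog/test_schema/example_fileset/"))
```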
@@ -159,7 +160,7 @@ Schema schema = supportsSchemas.createSchema("test_schema", gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") catalog.as_schemas().create_schema(name="test_schema", - comment="This is a schema", + comment="This is a ADLS schema", properties={"location": "abfss://container@account-name.dfs.core.windows.net/path"}) ``` diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 0ab3aa8ec69..5b82f8bff66 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -26,11 +26,11 @@ Once the server is up and running, you can proceed to configure the Hadoop catal Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS: -| Configuration item | Description | Default value | Required | Since version | -|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `gcs` if it's a GCS fileset, a comma separated string that contains `gcs` like `gcs,s3` to support multiple kinds of fileset including `gcs`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating | -| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset. | 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `gcs` if it's a GCS fileset, a comma separated string that contains `gcs` like `gcs,s3` to support multiple kinds of fileset including `gcs`. | (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `gcs-service-account-file` | The path of GCS service account JSON file. 
| (none) | Yes | 0.7.0-incubating | ### Configurations for a schema @@ -56,7 +56,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "name": "test_catalog", "type": "FILESET", - "comment": "comment", + "comment": "This is a GCS fileset catalog", "provider": "hadoop", "properties": { "location": "gs://bucket/root", @@ -97,7 +97,8 @@ Catalog gcsCatalog = gravitinoClient.createCatalog("test_catalog", gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake") gcs_properties = { "location": "gs://bucket/root", - "gcs-service-account-file": "path_of_gcs_service_account_file" + "gcs-service-account-file": "path_of_gcs_service_account_file", + "filesystem-providers": "gcs" } gcs_properties = gravitino_client.create_catalog(name="test_catalog", @@ -122,7 +123,7 @@ Once you have created a Hadoop catalog with GCS, you can create a schema. The fo curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "name": "test_schema", - "comment": "comment", + "comment": "This is a GCS schema", "properties": { "location": "gs://bucket/root/schema" } @@ -141,7 +142,7 @@ Map schemaProperties = ImmutableMap.builder() .put("location", "gs://bucket/root/schema") .build(); Schema schema = supportsSchemas.createSchema("test_schema", - "This is a schema", + "This is a GCS schema", schemaProperties ); // ... @@ -154,7 +155,7 @@ Schema schema = supportsSchemas.createSchema("test_schema", gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") catalog.as_schemas().create_schema(name="test_schema", - comment="This is a schema", + comment="This is a GCS schema", properties={"location": "gs://bucket/root/schema"}) ``` diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 1ed2523f31b..b5de20fb11d 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -27,13 +27,13 @@ Once the server is up and running, you can proceed to configure the Hadoop catal In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS: -| Configuration item | Description | Default value | Required | Since version | -|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `oss` if it's a OSS fileset, or a comma separated string that contains `oss` like `oss,gs,s3` to support multiple kinds of fileset including `oss`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for OSS, if we set this value, we can omit the prefix 'oss://' in the location. | `builtin-local` | No | 0.7.0-incubating | -| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | -| `oss-access-key-id` | The access key of the Aliyun OSS. 
| (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | -| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `oss` if it's a OSS fileset, or a comma separated string that contains `oss` like `oss,gs,s3` to support multiple kinds of fileset including `oss`. | (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for OSS, if we set this value, we can omit the prefix 'oss://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | ### Configurations for a schema @@ -59,7 +59,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "name": "test_catalog", "type": "FILESET", - "comment": "comment", + "comment": "This is a OSS fileset catalog", "provider": "hadoop", "properties": { "location": "oss://bucket/root", @@ -106,7 +106,8 @@ oss_properties = { "location": "oss://bucket/root", "oss-access-key-id": "access_key" "oss-secret-access-key": "secret_key", - "oss-endpoint": "ossProperties" + "oss-endpoint": "ossProperties", + "filesystem-providers": "oss" } oss_catalog = gravitino_client.create_catalog(name="test_catalog", @@ -131,7 +132,7 @@ Once the Hadoop catalog with OSS is created, you can create a schema inside that curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "name": "test_schema", - "comment": "comment", + "comment": "This is a OSS schema", "properties": { "location": "oss://bucket/root/schema" } @@ -150,7 +151,7 @@ Map schemaProperties = ImmutableMap.builder() .put("location", "oss://bucket/root/schema") .build(); Schema schema = supportsSchemas.createSchema("test_schema", - "This is a schema", + "This is a OSS schema", schemaProperties ); // ... 
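The same virtual paths can also be exercised from pandas for a quick round trip against an OSS fileset. Treat this as a sketch rather than a verified recipe: the catalog, schema, and fileset names, the OSS endpoint, and the keys are placeholders, writing parquet requires `pyarrow`, and it assumes writes through GVFS are allowed in your environment, mirroring the pandas read example later in this document.

```python
import pandas as pd

# storage_options follows the structure used by the pandas examples in this document.
storage_options = {
    "server_uri": "http://localhost:8090",
    "metalake_name": "metalake",
    "options": {
        "oss_access_key_id": "access_key",
        "oss_secret_access_key": "secret_key",
        "oss_endpoint": "http://oss-cn-hangzhou.aliyuncs.com",  # placeholder OSS endpoint
    },
}

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Write a parquet file into the fileset, then read it back through the same virtual path.
target = "gvfs://fileset/test_catalog/test_schema/example_fileset/example.parquet"
df.to_parquet(target, storage_options=storage_options)
print(pd.read_parquet(target, storage_options=storage_options).head())
```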
@@ -163,7 +164,7 @@ Schema schema = supportsSchemas.createSchema("test_schema", gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") catalog.as_schemas().create_schema(name="test_schema", - comment="This is a schema", + comment="This is a OSS schema", properties={"location": "oss://bucket/root/schema"}) ``` diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 70ad8a17ff0..f214af48417 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -29,13 +29,13 @@ Once the server is running, you can proceed to create the Hadoop catalog with S3 In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are necessary to configure a Hadoop catalog with S3: -| Configuration item | Description | Default value | Required | Since version | -|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `s3` if it's a S3 fileset, or a comma separated string that contains `s3` like `gs,s3` to support multiple kinds of fileset including `s3`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for S3, if we set this value, we can omit the prefix 's3a://' in the location. | `builtin-local` | No | 0.7.0-incubating | -| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | -| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | -| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `s3` if it's a S3 fileset, or a comma separated string that contains `s3` like `gs,s3` to support multiple kinds of fileset including `s3`. | (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for S3, if we set this value, we can omit the prefix 's3a://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3-secret-access-key` | The secret key of the AWS S3. 
| (none) | Yes | 0.7.0-incubating | ### Configurations for a schema @@ -62,7 +62,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "name": "test_catalog", "type": "FILESET", - "comment": "comment", + "comment": "This is a S3 fileset catalog", "provider": "hadoop", "properties": { "location": "s3a://bucket/root", @@ -109,7 +109,8 @@ s3_properties = { "location": "s3a://bucket/root", "s3-access-key-id": "access_key" "s3-secret-access-key": "secret_key", - "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com" + "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com", + "filesystem-providers": "s3" } s3_catalog = gravitino_client.create_catalog(name="test_catalog", @@ -138,7 +139,7 @@ Once your Hadoop catalog with S3 is created, you can create a schema under the c curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ -H "Content-Type: application/json" -d '{ "name": "test_schema", - "comment": "comment", + "comment": "This is a S3 schema", "properties": { "location": "s3a://bucket/root/schema" } @@ -158,7 +159,7 @@ Map schemaProperties = ImmutableMap.builder() .put("location", "s3a://bucket/root/schema") .build(); Schema schema = supportsSchemas.createSchema("test_schema", - "This is a schema", + "This is a S3 schema", schemaProperties ); // ... @@ -171,7 +172,7 @@ Schema schema = supportsSchemas.createSchema("test_schema", gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") catalog.as_schemas().create_schema(name="test_schema", - comment="This is a schema", + comment="This is a S3 schema", properties={"location": "s3a://bucket/root/schema"}) ``` From d65b995f93560e50436b6554f8570f864a93eb51 Mon Sep 17 00:00:00 2001 From: yuqi Date: Thu, 9 Jan 2025 23:22:44 +0800 Subject: [PATCH 21/39] Fix error. --- docs/hadoop-catalog-with-adls.md | 99 +++++++++++++++++++++++++------- docs/hadoop-catalog-with-gcs.md | 96 +++++++++++++++++++++++++------ docs/hadoop-catalog-with-oss.md | 81 ++++++++++++++++++++++---- docs/hadoop-catalog-with-s3.md | 70 ++++++++++++++++++---- 4 files changed, 285 insertions(+), 61 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 1655bda848d..b4ff614386e 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -19,7 +19,8 @@ To set up a Hadoop catalog with ADLS, follow these steps: ```bash $ bin/gravitino-server.sh start ``` -Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS. +Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please +replace it with your actual server URL. 
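Before creating any catalog, it can be worth confirming that the server URL above is reachable and that the target metalake exists. The snippet below is a minimal sketch: it only uses the Python client constructor already shown in this document, assumes the metalake is named `metalake` as in the following examples, and assumes the client validates the metalake when it is constructed.

```python
from gravitino import GravitinoClient

server_uri = "http://localhost:8090"  # replace with your actual server URL

try:
    # If the client validates the metalake at construction time (as assumed here),
    # an unreachable server or a missing metalake surfaces as an exception.
    client = GravitinoClient(uri=server_uri, metalake_name="metalake")
    print(f"Connected to Gravitino at {server_uri}")
except Exception as e:
    print(f"Cannot reach Gravitino at {server_uri}: {e}")
```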
## Configurations for creating a Hadoop catalog with ADLS @@ -66,7 +67,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "azure-storage-account-key": "The account key of the Azure Blob Storage", "filesystem-providers": "abs" } -}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs +}' http://localhost:8090/api/metalakes/metalake/catalogs ``` @@ -74,7 +75,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ ```java GravitinoClient gravitinoClient = GravitinoClient - .builder("${GRAVITINO_SERVER_IP:PORT}") + .builder("http://localhost:8090") .withMetalake("metalake") .build(); @@ -98,7 +99,7 @@ Catalog adlsCatalog = gravitinoClient.createCatalog("example_catalog", ```python -gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake") +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") adls_properties = { "location": "abfss://container@account-name.dfs.core.windows.net/path", "azure-storage-account-name": "azure storage account name", @@ -132,7 +133,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "properties": { "location": "abfss://container@account-name.dfs.core.windows.net/path" } -}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs/test_catalog/schemas +}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas ``` @@ -184,7 +185,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "properties": { "k1": "v1" } -}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets +}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets ``` @@ -192,7 +193,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ ```java GravitinoClient gravitinoClient = GravitinoClient - .builder("${GRAVITINO_SERVER_IP:PORT}") + .builder("http://localhost:8090") .withMetalake("metalake") .build(); @@ -216,7 +217,7 @@ filesetCatalog.createFileset( ```python -gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake") +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema", "example_fileset"), @@ -235,13 +236,21 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: +Before running the following code, you need to install required packages: + +```bash +pip install pyspark==3.1.3 +pip install gravitino==0.8.0-incubating +``` +Then you can run the following code: + ```python import logging from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient from pyspark.sql import SparkSession import os -gravitino_url = "${GRAVITINO_SERVER_IP:PORT}" +gravitino_url = "http://localhost:8090" metalake_name = "test" catalog_name = "your_adls_catalog" @@ -253,7 +262,7 @@ spark = SparkSession.builder .appName("adls_fileset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") -.config("spark.hadoop.fs.gravitino.server.uri", 
"${GRAVITINO_SERVER_URL}") +.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") .config("spark.hadoop.fs.gravitino.client.metalake", "test") .config("spark.hadoop.azure-storage-account-name", "azure_account_name") .config("spark.hadoop.azure-storage-account-key", "azure_account_name") @@ -289,16 +298,16 @@ os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gra Please choose the correct jar according to your environment. :::note -In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. +In some Spark versions, a Hadoop environment is necessary for the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. ::: -### Using Gravitino virtual file system Java client to access the fileset +### Using the GVFS Java client to access the fileset ```java Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","${GRAVITINO_SERVER_URL}"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); conf.set("fs.gravitino.client.metalake","test_metalake"); conf.set("azure-storage-account-name", "account_name_of_adls"); conf.set("azure-storage-account-key", "account_key_of_adls"); @@ -310,6 +319,50 @@ fs.mkdirs(filesetPath); Similar to Spark configurations, you need to add ADLS bundle jars to the classpath according to your environment. +If your wants to custom your hadoop version or there is already a hadoop version in your project, you can add the following dependencies to your `pom.xml`: + +```xml + + org.apache.hadoop + hadoop-common + ${HADOOP_VERSION} + + + + org.apache.hadoop + hadoop-azure + ${HADOOP_VERSION} + + + + org.apache.gravitino + filesystem-hadoop3-runtime + 0.8.0-incubating-SNAPSHOT + + + + org.apache.gravitino + gravitino-azure + 0.8.0-incubating-SNAPSHOT + +``` + +Or use the bundle jar with Hadoop environment: + +```xml + + org.apache.gravitino + gravitino-azure-bundle + 0.8.0-incubating-SNAPSHOT + + + + org.apache.gravitino + filesystem-hadoop3-runtime + 0.8.0-incubating-SNAPSHOT + +``` + ### Accessing a fileset using the Hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. @@ -329,7 +382,7 @@ The following are examples of how to use the `hadoop fs` command to access the f fs.gravitino.server.uri - ${GRAVITINO_SERVER_IP:PORT} + http://localhost:8090 @@ -349,7 +402,7 @@ The following are examples of how to use the `hadoop fs` command to access the f 2. Add the necessary jars to the Hadoop classpath. -For ADLS, you need to add `gravitino-azure-${gravitino-version}.jar` and `hadoop-azure-${hadoop-version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. +For ADLS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-azure-${gravitino-version}.jar` and `hadoop-azure-${hadoop-version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. 3. 
Run the following command to access the fileset: @@ -358,7 +411,13 @@ hadoop dfs -ls gvfs://fileset/adls_catalog/adls_schema/adls_fileset hadoop dfs -put /path/to/local/file gvfs://fileset/adls_catalog/adls_schema/adls_fileset ``` -### Using the Gravitino virtual file system Python client to access a fileset +### Using the GVFS Python client to access a fileset + +Please install the `gravitino` package before running the following code: + +```bash +pip install gravitino==0.8.0-incubating +``` ```python from gravitino import gvfs @@ -369,7 +428,7 @@ options = { "azure_storage_account_name": "azure_account_name", "azure_storage_account_key": "azure_account_key" } -fs = gvfs.GravitinoVirtualFileSystem(server_uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="test_metalake", options=options) +fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) fs.ls("gvfs://fileset/{adls_catalog}/{adls_schema}/{adls_fileset}/") ``` @@ -382,7 +441,7 @@ The following are examples of how to use the pandas library to access the ADLS f import pandas as pd storage_options = { - "server_uri": "${GRAVITINO_SERVER_IP:PORT}", + "server_uri": "http://localhost:8090", "metalake_name": "test", "options": { "azure_storage_account_name": "azure_account_name", @@ -414,7 +473,7 @@ GVFS Java client: Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","${GRAVITINO_SERVER_IP:PORT}"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); conf.set("fs.gravitino.client.metalake","test_metalake"); // No need to set azure-storage-account-name and azure-storage-account-name Path filesetPath = new Path("gvfs://fileset/adls_test_catalog/test_schema/test_fileset/new_dir"); @@ -430,7 +489,7 @@ spark = SparkSession.builder .appName("adls_fielset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_IP:PORT}") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") .config("spark.hadoop.fs.gravitino.client.metalake", "test") # No need to set azure-storage-account-name and azure-storage-account-name .config("spark.driver.memory", "2g") diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 5b82f8bff66..e0fe31c279a 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -18,7 +18,8 @@ To set up a Hadoop catalog with OSS, follow these steps: ```bash $ bin/gravitino-server.sh start ``` -Once the server is up and running, you can proceed to configure the Hadoop catalog with GCS. +Once the server is up and running, you can proceed to configure the Hadoop catalog with GCS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please +replace it with your actual server URL. 
## Configurations for creating a Hadoop catalog with GCS @@ -63,7 +64,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "gcs-service-account-file": "path_of_gcs_service_account_file", "filesystem-providers": "gcs" } -}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs +}' http://localhost:8090/api/metalakes/metalake/catalogs ``` @@ -71,7 +72,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ ```java GravitinoClient gravitinoClient = GravitinoClient - .builder("${GRAVITINO_SERVER_IP:PORT}") + .builder("http://localhost:8090") .withMetalake("metalake") .build(); @@ -94,7 +95,7 @@ Catalog gcsCatalog = gravitinoClient.createCatalog("test_catalog", ```python -gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake") +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") gcs_properties = { "location": "gs://bucket/root", "gcs-service-account-file": "path_of_gcs_service_account_file", @@ -127,7 +128,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "properties": { "location": "gs://bucket/root/schema" } -}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs/test_catalog/schemas +}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas ``` @@ -180,7 +181,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ "properties": { "k1": "v1" } -}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets +}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets ``` @@ -188,7 +189,7 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ ```java GravitinoClient gravitinoClient = GravitinoClient - .builder("${GRAVITINO_SERVER_IP:PORT}") + .builder("http://localhost:8090") .withMetalake("metalake") .build(); @@ -212,7 +213,7 @@ filesetCatalog.createFileset( ```python -gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake") +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema", "example_fileset"), @@ -231,13 +232,21 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: +Before running the following code, you need to install required packages: + +```bash +pip install pyspark==3.1.3 +pip install gravitino==0.8.0-incubating +``` +Then you can run the following code: + ```python import logging from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient from pyspark.sql import SparkSession import os -gravitino_url = "${GRAVITINO_SERVER_IP:PORT}" +gravitino_url = "http://localhost:8090" metalake_name = "test" catalog_name = "your_gcs_catalog" @@ -249,7 +258,7 @@ spark = SparkSession.builder .appName("gcs_fielset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") -.config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_URL}") +.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") 
.config("spark.hadoop.fs.gravitino.client.metalake", "test_metalake") .config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") .config("spark.driver.memory", "2g") @@ -285,13 +294,13 @@ Please choose the correct jar according to your environment. In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. ::: -### Using Gravitino virtual file system Java client to access the fileset +### Using the GVFS Java client to access the fileset ```java Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","${GRAVITINO_SERVER_IP:PORT}"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); conf.set("fs.gravitino.client.metalake","test_metalake"); conf.set("gcs-service-account-file", "/path/your-service-account-file.json"); Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); @@ -300,6 +309,50 @@ fs.mkdirs(filesetPath); ... ``` + +If your wants to custom your hadoop version or there is already a hadoop version in your project, you can add the following dependencies to your `pom.xml`: + +```xml + + org.apache.hadoop + hadoop-common + ${HADOOP_VERSION} + + + com.google.cloud.bigdataoss + gcs-connector + ${GCS_CONNECTOR_VERSION} + + + org.apache.gravitino + filesystem-hadoop3-runtime + 0.8.0-incubating-SNAPSHOT + + + + org.apache.gravitino + gravitino-gcp + 0.8.0-incubating-SNAPSHOT + +``` + +Or use the bundle jar with Hadoop environment: + +```xml + + org.apache.gravitino + gravitino-gcp-bundle + 0.8.0-incubating-SNAPSHOT + + + + org.apache.gravitino + filesystem-hadoop3-runtime + 0.8.0-incubating-SNAPSHOT + +``` + + Similar to Spark configurations, you need to add GCS bundle jars to the classpath according to your environment. ### Accessing a fileset using the Hadoop fs command @@ -321,7 +374,7 @@ The following are examples of how to use the `hadoop fs` command to access the f fs.gravitino.server.uri - ${GRAVITINO_SERVER_IP:PORT} + http://localhost:8090 @@ -342,7 +395,7 @@ Then copy `hadoop-gcp-${version}.jar` and other possible dependencies to the `${ 2. Add the necessary jars to the Hadoop classpath. -For GCS, you need to add `gravitino-gcp-${gravitino-version}.jar` and `gcs-connector-hadoop3-2.2.22-shaded.jar` can be found [here](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. +For GCS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-gcp-${gravitino-version}.jar` and `gcs-connector-hadoop3-2.2.22-shaded.jar` can be found [here](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. 3. 
Run the following command to access the fileset: @@ -351,8 +404,13 @@ hadoop dfs -ls gvfs://fileset/gcs_catalog/gcs_schema/gcs_example hadoop dfs -put /path/to/local/file gvfs://fileset/gcs_catalog/gcs_schema/gcs_example ``` +### Using the GVFS Python client to access a fileset + +Please install the `gravitino` package before running the following code: -### Using the Gravitino virtual file system Python client to access a fileset +```bash +pip install gravitino==0.8.0-incubating +``` ```python from gravitino import gvfs @@ -362,7 +420,7 @@ options = { "auth_type": "simple", "gcs_service_account_file": "path_of_gcs_service_account_file.json", } -fs = gvfs.GravitinoVirtualFileSystem(server_uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="test_metalake", options=options) +fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") ``` @@ -374,7 +432,7 @@ The following are examples of how to use the pandas library to access the GCS fi import pandas as pd storage_options = { - "server_uri": "${GRAVITINO_SERVER_IP:PORT}", + "server_uri": "http://localhost:8090", "metalake_name": "test", "options": { "gcs_service_account_file": "path_of_gcs_service_account_file.json", @@ -405,7 +463,7 @@ GVFS Java client: Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","${GRAVITINO_SERVER_IP:PORT}"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); conf.set("fs.gravitino.client.metalake","test_metalake"); // No need to set gcs-service-account-file Path filesetPath = new Path("gvfs://fileset/gcs_test_catalog/test_schema/test_fileset/new_dir"); @@ -421,7 +479,7 @@ spark = SparkSession.builder .appName("gcs_fileset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_IP:PORT}") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") .config("spark.hadoop.fs.gravitino.client.metalake", "test") # No need to set gcs-service-account-file .config("spark.driver.memory", "2g") diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index b5de20fb11d..c14ef6c3ba3 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -19,7 +19,8 @@ To set up a Hadoop catalog with OSS, follow these steps: ```bash $ bin/gravitino-server.sh start ``` -Once the server is up and running, you can proceed to configure the Hadoop catalog with OSS. +Once the server is up and running, you can proceed to configure the Hadoop catalog with OSS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please +replace it with your actual server URL. 
## Configurations for creating a Hadoop catalog with OSS @@ -241,6 +242,14 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: +Before running the following code, you need to install required packages: + +```bash +pip install pyspark==3.1.3 +pip install gravitino==0.8.0-incubating +``` +Then you can run the following code: + ```python import logging from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient @@ -259,7 +268,7 @@ spark = SparkSession.builder .appName("oss_fielset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") -.config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_URL}") +.config("spark.hadoop.fs.gravitino.server.uri", "${_URL}") .config("spark.hadoop.fs.gravitino.client.metalake", "test") .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) .config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"]) @@ -297,7 +306,7 @@ Please choose the correct jar according to your environment. In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. ::: -### Using Gravitino virtual file system Java client to access the fileset +### Using the GVFS Java client to access the fileset ```java Configuration conf = new Configuration(); @@ -305,7 +314,7 @@ conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hado conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); conf.set("fs.gravitino.server.uri","http://localhost:8090"); conf.set("fs.gravitino.client.metalake","test_metalake"); -conf.set("oss-endpoint", "${GRAVITINO_SERVER_IP:PORT}"); +conf.set("oss-endpoint", "http://localhost:8090"); conf.set("oss-access-key-id", "minio"); conf.set("oss-secret-access-key", "minio123"); Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); @@ -316,6 +325,48 @@ fs.mkdirs(filesetPath); Similar to Spark configurations, you need to add OSS bundle jars to the classpath according to your environment. +```xml + + org.apache.hadoop + hadoop-common + ${HADOOP_VERSION} + + + + org.apache.hadoop + hadoop-aliyun + ${HADOOP_VERSION} + + + + org.apache.gravitino + filesystem-hadoop3-runtime + 0.8.0-incubating-SNAPSHOT + + + + org.apache.gravitino + gravitino-aliyun + 0.8.0-incubating-SNAPSHOT + +``` + +Or use the bundle jar with Hadoop environment: + +```xml + + org.apache.gravitino + gravitino-aliyun-bundle + 0.8.0-incubating-SNAPSHOT + + + + org.apache.gravitino + filesystem-hadoop3-runtime + 0.8.0-incubating-SNAPSHOT + +``` + ### Accessing a fileset using the Hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. @@ -335,7 +386,7 @@ The following are examples of how to use the `hadoop fs` command to access the f fs.gravitino.server.uri - ${GRAVITINO_SERVER_IP:PORT} + http://localhost:8090 @@ -361,7 +412,7 @@ The following are examples of how to use the `hadoop fs` command to access the f 2. Add the necessary jars to the Hadoop classpath. 
-For OSS, you need to add `gravitino-aliyun-${gravitino-version}.jar` and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. +For OSS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-aliyun-${gravitino-version}.jar` and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. 3. Run the following command to access the fileset: @@ -371,7 +422,13 @@ hadoop dfs -put /path/to/local/file gvfs://fileset/oss_catalog/schema/oss_filese ``` -### Using Gravitino virtual file system Python client to access a fileset +### Using the GVFS Python client to access a fileset + +Please install the `gravitino` package before running the following code: + +```bash +pip install gravitino==0.8.0-incubating +``` ```python from gravitino import gvfs @@ -379,11 +436,11 @@ options = { "cache_size": 20, "cache_expired_time": 3600, "auth_type": "simple", - "oss_endpoint": "${GRAVITINO_SERVER_IP:PORT}", + "oss_endpoint": "http://localhost:8090", "oss_access_key_id": "minio", "oss_secret_access_key": "minio123" } -fs = gvfs.GravitinoVirtualFileSystem(server_uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="test_metalake", options=options) +fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/") ``` @@ -397,7 +454,7 @@ The following are examples of how to use the pandas library to access the OSS fi import pandas as pd storage_options = { - "server_uri": "${GRAVITINO_SERVER_IP:PORT}", + "server_uri": "http://localhost:8090", "metalake_name": "test", "options": { "oss_access_key_id": "access_key", @@ -429,7 +486,7 @@ GVFS Java client: Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","${GRAVITINO_SERVER_IP:PORT}"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); conf.set("fs.gravitino.client.metalake","test_metalake"); // No need to set oss-access-key-id and oss-secret-access-key Path filesetPath = new Path("gvfs://fileset/oss_test_catalog/test_schema/test_fileset/new_dir"); @@ -445,7 +502,7 @@ spark = SparkSession.builder .appName("oss_fileset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_IP:PORT}") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") .config("spark.hadoop.fs.gravitino.client.metalake", "test") # No need to set oss-access-key-id and oss-secret-access-key .config("spark.driver.memory", "2g") diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index f214af48417..cebb6d18b94 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -20,8 +20,8 @@ To create a Hadoop catalog with S3, follow these steps: $ bin/gravitino-server.sh start ``` -Once the server is running, you can proceed to create the Hadoop 
catalog with S3. - +Once the server is up and running, you can proceed to configure the Hadoop catalog with S3. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please +replace it with your actual server URL. ## Configurations for creating a Hadoop catalog with S3 @@ -247,6 +247,14 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "e The following Python code demonstrates how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: +Before running the following code, you need to install required packages: + +```bash +pip install pyspark==3.1.3 +pip install gravitino==0.8.0-incubating +``` +Then you can run the following code: + ```python import logging from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient @@ -265,7 +273,7 @@ spark = SparkSession.builder .appName("s3_fielset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_URL}") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") .config("spark.hadoop.fs.gravitino.client.metalake", "test") .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) @@ -302,16 +310,16 @@ Please choose the correct jar according to your environment. In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. ::: -### Using Gravitino virtual file system Java client to access the fileset +### Using the GVFS Java client to access the fileset ```java Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","${GRAVITINO_SERVER_IP:PORT}"); +conf.set("fs.gravitino.server.uri","http://localhost:8090"); conf.set("fs.gravitino.client.metalake","test_metalake"); -conf.set("s3-endpoint", "${GRAVITINO_SERVER_IP:PORT}"); +conf.set("s3-endpoint", "http://localhost:8090"); conf.set("s3-access-key-id", "minio"); conf.set("s3-secret-access-key", "minio123"); @@ -323,6 +331,48 @@ fs.mkdirs(filesetPath); Similar to Spark configurations, you need to add S3 (bundle) jars to the classpath according to your environment. +```xml + + org.apache.hadoop + hadoop-common + ${HADOOP_VERSION} + + + + org.apache.hadoop + hadoop-aws + ${HADOOP_VERSION} + + + + org.apache.gravitino + filesystem-hadoop3-runtime + 0.8.0-incubating-SNAPSHOT + + + + org.apache.gravitino + gravitino-aws + 0.8.0-incubating-SNAPSHOT + +``` + +Or use the bundle jar with Hadoop environment: + +```xml + + org.apache.gravitino + gravitino-aws-bundle + 0.8.0-incubating-SNAPSHOT + + + + org.apache.gravitino + filesystem-hadoop3-runtime + 0.8.0-incubating-SNAPSHOT + +``` + ### Accessing a fileset using the Hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. 
@@ -342,7 +392,7 @@ The following are examples of how to use the `hadoop fs` command to access the f fs.gravitino.server.uri - ${GRAVITINO_SERVER_IP:PORT} + http://localhost:8090 @@ -368,7 +418,7 @@ The following are examples of how to use the `hadoop fs` command to access the f 2. Add the necessary jars to the Hadoop classpath. -For S3, you need to add `gravitino-aws-${gravitino-version}.jar` and `hadoop-aws-${hadoop-version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. +For S3, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-aws-${gravitino-version}.jar` and `hadoop-aws-${hadoop-version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. 3. Run the following command to access the fileset: @@ -378,7 +428,7 @@ hadoop dfs -ls gvfs://fileset/s3_catalog/s3_schema/s3_fileset hadoop dfs -put /path/to/local/file gvfs://fileset/s3_catalog/s3_schema/s3_fileset ``` -### Using the Gravitino virtual file system Python client to access a fileset +### Using the GVFS Python client to access a fileset ```python from gravitino import gvfs @@ -386,7 +436,7 @@ options = { "cache_size": 20, "cache_expired_time": 3600, "auth_type": "simple", - "s3_endpoint": "${GRAVITINO_SERVER_IP:PORT}", + "s3_endpoint": "http://localhost:8090", "s3_access_key_id": "minio", "s3_secret_access_key": "minio123" } From 58e3a90724bc89b6abddf53e8587fda54fc1c885 Mon Sep 17 00:00:00 2001 From: yuqi Date: Thu, 9 Jan 2025 23:49:02 +0800 Subject: [PATCH 22/39] Fix error. --- docs/hadoop-catalog-with-adls.md | 7 +++---- docs/hadoop-catalog-with-gcs.md | 17 ++++++----------- docs/hadoop-catalog-with-oss.md | 8 ++++---- docs/hadoop-catalog-with-s3.md | 7 +++---- 4 files changed, 16 insertions(+), 23 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index b4ff614386e..8b866a4d8f3 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -19,8 +19,7 @@ To set up a Hadoop catalog with ADLS, follow these steps: ```bash $ bin/gravitino-server.sh start ``` -Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please -replace it with your actual server URL. +Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please replace it with your actual server URL. 
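+
+If you want to double-check that the server URL is reachable before moving on, one quick sanity check is to query the version endpoint (this assumes the default REST port and the standard `/api/version` path of the Gravitino REST API):
+
+```shell
+# Optional sanity check: should print a small JSON payload with the running Gravitino version
+curl http://localhost:8090/api/version
+```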
## Configurations for creating a Hadoop catalog with ADLS @@ -158,7 +157,7 @@ Schema schema = supportsSchemas.createSchema("test_schema", ```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") catalog.as_schemas().create_schema(name="test_schema", comment="This is a ADLS schema", @@ -290,7 +289,7 @@ If your Spark **without Hadoop environment**, you can use the following code sni os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell" ``` -- [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar with Hadoop environment and `hadoop-azure` jar. +- [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar with Hadoop environment(3.3.1) and `hadoop-azure` jar. - [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without Hadoop environment and `hadoop-azure` jar. - `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory. diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index e0fe31c279a..6c282e3832a 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -18,8 +18,7 @@ To set up a Hadoop catalog with OSS, follow these steps: ```bash $ bin/gravitino-server.sh start ``` -Once the server is up and running, you can proceed to configure the Hadoop catalog with GCS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please -replace it with your actual server URL. +Once the server is up and running, you can proceed to configure the Hadoop catalog with GCS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please replace it with your actual server URL. ## Configurations for creating a Hadoop catalog with GCS @@ -153,7 +152,7 @@ Schema schema = supportsSchemas.createSchema("test_schema", ```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") catalog.as_schemas().create_schema(name="test_schema", comment="This is a GCS schema", @@ -284,9 +283,8 @@ If your Spark **without Hadoop environment**, you can use the following code sni os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell" ``` -- [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCP jar with Hadoop environment and `gcs-connector`. 
-- [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is a condensed version of the Gravitino GCP bundle jar without Hadoop environment and `gcs-connector`. -- `gcs-connector-hadoop3-2.2.22-shaded.jar` can be found [here](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) +- [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCP jar with Hadoop environment(3.3.1) and `gcs-connector`. +- [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is a condensed version of the Gravitino GCP bundle jar without Hadoop environment and [`gcs-connector`](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) Please choose the correct jar according to your environment. @@ -309,7 +307,7 @@ fs.mkdirs(filesetPath); ... ``` - +Similar to Spark configurations, you need to add GCS bundle jars to the classpath according to your environment. If your wants to custom your hadoop version or there is already a hadoop version in your project, you can add the following dependencies to your `pom.xml`: ```xml @@ -352,9 +350,6 @@ Or use the bundle jar with Hadoop environment: ``` - -Similar to Spark configurations, you need to add GCS bundle jars to the classpath according to your environment. - ### Accessing a fileset using the Hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. @@ -395,7 +390,7 @@ Then copy `hadoop-gcp-${version}.jar` and other possible dependencies to the `${ 2. Add the necessary jars to the Hadoop classpath. -For GCS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-gcp-${gravitino-version}.jar` and `gcs-connector-hadoop3-2.2.22-shaded.jar` can be found [here](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. +For GCS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-gcp-${gravitino-version}.jar` and [`gcs-connector-hadoop3-2.2.22-shaded.jar`](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. 3. Run the following command to access the fileset: diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index c14ef6c3ba3..501cad2194f 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -19,8 +19,7 @@ To set up a Hadoop catalog with OSS, follow these steps: ```bash $ bin/gravitino-server.sh start ``` -Once the server is up and running, you can proceed to configure the Hadoop catalog with OSS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please -replace it with your actual server URL. +Once the server is up and running, you can proceed to configure the Hadoop catalog with OSS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please replace it with your actual server URL. 
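+
+Some of the Spark examples later in this document read the OSS credentials from environment variables (`OSS_ACCESS_KEY_ID` and `OSS_SECRET_ACCESS_KEY`). A minimal sketch of exporting them, using placeholder values that you should replace with your own Aliyun OSS credentials:
+
+```shell
+# Placeholder values only, replace them with your real Aliyun OSS credentials
+export OSS_ACCESS_KEY_ID=your_oss_access_key_id
+export OSS_SECRET_ACCESS_KEY=your_oss_secret_access_key
+```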
## Configurations for creating a Hadoop catalog with OSS @@ -162,7 +161,7 @@ Schema schema = supportsSchemas.createSchema("test_schema", ```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") catalog.as_schemas().create_schema(name="test_schema", comment="This is a OSS schema", @@ -296,7 +295,7 @@ If your Spark **without Hadoop environment**, you can use the following code sni os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell" ``` -- [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) is the Gravitino Aliyun jar with Hadoop environment and `hadoop-oss` jar. +- [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) is the Gravitino Aliyun jar with Hadoop environment(3.3.1) and `hadoop-oss` jar. - [`gravitino-aliyun-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun) is a condensed version of the Gravitino Aliyun bundle jar without Hadoop environment and `hadoop-aliyun` jar. -`hadoop-aliyun-3.2.0.jar` and `aliyun-sdk-oss-2.8.3.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory. @@ -324,6 +323,7 @@ fs.mkdirs(filesetPath); ``` Similar to Spark configurations, you need to add OSS bundle jars to the classpath according to your environment. +If your wants to custom your hadoop version or there is already a hadoop version in your project, you can add the following dependencies to your `pom.xml`: ```xml diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index cebb6d18b94..ef038005cfd 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -20,8 +20,7 @@ To create a Hadoop catalog with S3, follow these steps: $ bin/gravitino-server.sh start ``` -Once the server is up and running, you can proceed to configure the Hadoop catalog with S3. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please -replace it with your actual server URL. +Once the server is up and running, you can proceed to configure the Hadoop catalog with S3. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please replace it with your actual server URL. 
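+
+The PySpark example later in this document reads the S3 credentials from the `S3_ACCESS_KEY_ID` and `S3_SECRET_ACCESS_KEY` environment variables. A minimal sketch of exporting them, with placeholder values to replace with your own credentials:
+
+```shell
+# Placeholder values only, replace them with your real AWS S3 (or S3-compatible, e.g. MinIO) credentials
+export S3_ACCESS_KEY_ID=your_s3_access_key_id
+export S3_SECRET_ACCESS_KEY=your_s3_secret_access_key
+```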
## Configurations for creating a Hadoop catalog with S3 @@ -169,7 +168,7 @@ Schema schema = supportsSchemas.createSchema("test_schema", ```python -gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake") +gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake") catalog: Catalog = gravitino_client.load_catalog(name="test_catalog") catalog.as_schemas().create_schema(name="test_schema", comment="This is a S3 schema", @@ -300,7 +299,7 @@ If your Spark **without Hadoop environment**, you can use the following code sni os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-bundle-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar --master local[1] pyspark-shell" ``` -- [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) is the Gravitino AWS jar with Hadoop environment and `hadoop-aws` jar. +- [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) is the Gravitino AWS jar with Hadoop environment(3.3.1) and `hadoop-aws` jar. - [`gravitino-aws-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws) is a condensed version of the Gravitino AWS bundle jar without Hadoop environment and `hadoop-aws` jar. - `hadoop-aws-3.2.0.jar` and `aws-java-sdk-bundle-1.11.375.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory. From 746a3ce6004655a855ece0882d73bb994fdbd2ca Mon Sep 17 00:00:00 2001 From: yuqi Date: Thu, 9 Jan 2025 23:58:25 +0800 Subject: [PATCH 23/39] fix --- docs/hadoop-catalog-with-adls.md | 2 +- docs/hadoop-catalog-with-gcs.md | 7 +------ docs/hadoop-catalog-with-oss.md | 2 +- docs/hadoop-catalog-with-s3.md | 3 +-- 4 files changed, 4 insertions(+), 10 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 8b866a4d8f3..27b70e7ea38 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -401,7 +401,7 @@ The following are examples of how to use the `hadoop fs` command to access the f 2. Add the necessary jars to the Hadoop classpath. -For ADLS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-azure-${gravitino-version}.jar` and `hadoop-azure-${hadoop-version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. +For ADLS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-azure-${gravitino-version}.jar` and `hadoop-azure-${hadoop-version}.jar` located at `${HADOOP_HOME}/share/hadoop/tools/lib/` to the Hadoop classpath. 3. Run the following command to access the fileset: diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 6c282e3832a..18645226b57 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -383,14 +383,9 @@ The following are examples of how to use the `hadoop fs` command to access the f ``` -2. Copy the necessary jars to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. - -For GCS, you need to copy `gravitino-gcp-{version}.jar` to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. 
-Then copy `hadoop-gcp-${version}.jar` and other possible dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory, you can add all the jars in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory to the `${HADOOP_HOME}/share/hadoop/common/lib` directory. - 2. Add the necessary jars to the Hadoop classpath. -For GCS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-gcp-${gravitino-version}.jar` and [`gcs-connector-hadoop3-2.2.22-shaded.jar`](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. +For GCS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-gcp-${gravitino-version}.jar` and [`gcs-connector-hadoop3-2.2.22-shaded.jar`](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) to Hadoop classpath. 3. Run the following command to access the fileset: diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 501cad2194f..f2485a22a07 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -412,7 +412,7 @@ The following are examples of how to use the `hadoop fs` command to access the f 2. Add the necessary jars to the Hadoop classpath. -For OSS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-aliyun-${gravitino-version}.jar` and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. +For OSS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-aliyun-${gravitino-version}.jar` and `hadoop-aliyun-${hadoop-version}.jar` located at `${HADOOP_HOME}/share/hadoop/tools/lib/` to Hadoop classpath. 3. Run the following command to access the fileset: diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index ef038005cfd..0d812477da4 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -417,8 +417,7 @@ The following are examples of how to use the `hadoop fs` command to access the f 2. Add the necessary jars to the Hadoop classpath. -For S3, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-aws-${gravitino-version}.jar` and `hadoop-aws-${hadoop-version}.jar` and related dependencies to the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. Those jars can be found in the `${HADOOP_HOME}/share/hadoop/tools/lib/` directory. - +For S3, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-aws-${gravitino-version}.jar` and `hadoop-aws-${hadoop-version}.jar` located at `${HADOOP_HOME}/share/hadoop/tools/lib/` to Hadoop classpath. 3. 
Run the following command to access the fileset: From 4d644f1c3eeeef750572b9141617fb0e37c92bed Mon Sep 17 00:00:00 2001 From: yuqi Date: Fri, 10 Jan 2025 21:57:18 +0800 Subject: [PATCH 24/39] Optimize document `how-to-use-gvfs.md` --- docs/hadoop-catalog-with-adls.md | 22 +++++ docs/hadoop-catalog-with-gcs.md | 20 ++++ docs/hadoop-catalog-with-oss.md | 25 ++++- docs/hadoop-catalog-with-s3.md | 32 ++++++ docs/how-to-use-gvfs.md | 162 +------------------------------ 5 files changed, 102 insertions(+), 159 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 27b70e7ea38..fa1a3fb8a6f 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -302,6 +302,17 @@ In some Spark versions, a Hadoop environment is necessary for the driver, adding ### Using the GVFS Java client to access the fileset +To access fileset with Azure Blob Storage(ADLS) using the GVFS Java client, based on the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: + +| Configuration item | Description | Default value | Required | Since version | +|------------------------------|-----------------------------------------|---------------|----------|------------------| +| `azure-storage-account-name` | The account name of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | +| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | + +:::note +If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. +::: + ```java Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); @@ -412,6 +423,17 @@ hadoop dfs -put /path/to/local/file gvfs://fileset/adls_catalog/adls_schema/adls ### Using the GVFS Python client to access a fileset +In order to access fileset with Azure Blob storage (ADLS) using the GVFS Python client, apart from [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: + +| Configuration item | Description | Default value | Required | Since version | +|--------------------|----------------------------------------|---------------|----------|------------------| +| `abs_account_name` | The account name of Azure Blob Storage | (none) | Yes | 0.8.0-incubating | +| `abs_account_key` | The account key of Azure Blob Storage | (none) | Yes | 0.8.0-incubating | + +::: +If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. 
+::: + Please install the `gravitino` package before running the following code: ```bash diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 18645226b57..99328963e0e 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -294,6 +294,16 @@ In some Spark versions, a Hadoop environment is needed by the driver, adding the ### Using the GVFS Java client to access the fileset +To access fileset with GCS using the GVFS Java client, based on the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: + +| Configuration item | Description | Default value | Required | Since version | +|----------------------------|--------------------------------------------|---------------|----------|------------------| +| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes | 0.7.0-incubating | + +::: +If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. +::: + ```java Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); @@ -396,6 +406,16 @@ hadoop dfs -put /path/to/local/file gvfs://fileset/gcs_catalog/gcs_schema/gcs_ex ### Using the GVFS Python client to access a fileset +In order to access fileset with GCS using the GVFS Python client, apart from [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: + +| Configuration item | Description | Default value | Required | Since version | +|----------------------------|-------------------------------------------|---------------|----------|------------------| +| `gcs_service_account_file` | The path of GCS service account JSON file.| (none) | Yes | 0.7.0-incubating | + +:::note +If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. +::: + Please install the `gravitino` package before running the following code: ```bash diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index f2485a22a07..76ebbb0c5d8 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -307,6 +307,18 @@ In some Spark versions, a Hadoop environment is needed by the driver, adding the ### Using the GVFS Java client to access the fileset +To access fileset with OSS using the GVFS Java client, based on the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: + +| Configuration item | Description | Default value | Required | Since version | +|-------------------------|-----------------------------------|---------------|----------|------------------| +| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | + +:::note +If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. 
+::: + ```java Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); @@ -421,9 +433,20 @@ hadoop dfs -ls gvfs://fileset/oss_catalog/oss_schema/oss_fileset hadoop dfs -put /path/to/local/file gvfs://fileset/oss_catalog/schema/oss_fileset ``` - ### Using the GVFS Python client to access a fileset +In order to access fileset with OSS using the GVFS Python client, apart from [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: + +| Configuration item | Description | Default value | Required | Since version | +|----------------------------|-----------------------------------|---------------|----------|------------------| +| `oss_endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss_access_key_id` | The access key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss_secret_access_key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | + +:::note +If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. +::: + Please install the `gravitino` package before running the following code: ```bash diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 0d812477da4..e930ead095e 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -311,6 +311,19 @@ In some Spark versions, a Hadoop environment is needed by the driver, adding the ### Using the GVFS Java client to access the fileset +To access fileset with S3 using the GVFS Java client, based on the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: + +| Configuration item | Description | Default value | Required | Since version | +|------------------------|-------------------------------|---------------|----------|------------------| +| `s3-endpoint` | The endpoint of the AWS S3. | (none) | No | 0.7.0-incubating | +| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | + +::: +- `s3-endpoint` is an optional configuration for AWS S3, however, it is required for other S3-compatible storage services like MinIO. +- If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. +::: + ```java Configuration conf = new Configuration(); conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); @@ -428,6 +441,25 @@ hadoop dfs -put /path/to/local/file gvfs://fileset/s3_catalog/s3_schema/s3_files ### Using the GVFS Python client to access a fileset +In order to access fileset with S3 using the GVFS Python client, apart from [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: + +| Configuration item | Description | Default value | Required | Since version | +|----------------------------|-------------------------------|---------------|----------|------------------| +| `s3_endpoint` | The endpoint of the AWS S3. | (none) | No | 0.7.0-incubating | +| `s3_access_key_id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3_secret_access_key` | The secret key of the AWS S3. 
| (none) | Yes | 0.7.0-incubating | + +::: +- `s3_endpoint` is an optional configuration for AWS S3, however, it is required for other S3-compatible storage services like MinIO. +- If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. +::: + +Please install the `gravitino` package before running the following code: + +```bash +pip install gravitino==0.8.0-incubating +``` + ```python from gravitino import gvfs options = { diff --git a/docs/how-to-use-gvfs.md b/docs/how-to-use-gvfs.md index 9514f5457eb..05be7bda7bc 100644 --- a/docs/how-to-use-gvfs.md +++ b/docs/how-to-use-gvfs.md @@ -66,55 +66,8 @@ the path mapping and convert automatically. | `fs.gravitino.fileset.cache.evictionMillsAfterAccess` | The value of time that the cache expires after accessing in the Gravitino Virtual File System. The value is in `milliseconds`. | `3600000` | No | 0.5.0 | | `fs.gravitino.fileset.cache.evictionMillsAfterAccess` | The value of time that the cache expires after accessing in the Gravitino Virtual File System. The value is in `milliseconds`. | `3600000` | No | 0.5.0 | -Apart from the above properties, to access fileset like S3, GCS, OSS and custom fileset, you need to configure the following extra properties. - -#### S3 fileset - -| Configuration item | Description | Default value | Required | Since version | -|------------------------|-------------------------------|---------------|----------|------------------| -| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | - -At the same time, you need to add the corresponding bundle jar -1. [`gravitino-aws-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/) in the classpath if no Hadoop environment is available, or -2. [`gravitino-aws-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws/) and `hadoop-aws-${hadoop-version}.jar` and other necessary dependencies (They are usually located at `${HADOOP_HOME}/share/hadoop/tools/lib`) in the classpath. - - -#### GCS fileset - -| Configuration item | Description | Default value | Required | Since version | -|----------------------------|--------------------------------------------|---------------|----------|------------------| -| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes | 0.7.0-incubating | - -In the meantime, you need to add the corresponding bundle jar -1. [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp-bundle/) in the classpath if no hadoop environment is available, or -2. [`gravitino-gcp-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp/) and [gcs-connector jar](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases) and other necessary dependencies in the classpath. - - -#### OSS fileset - -| Configuration item | Description | Default value | Required | Since version | -|-------------------------|-----------------------------------|---------------|----------|------------------| -| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | -| `oss-access-key-id` | The access key of the Aliyun OSS. 
| (none) | Yes | 0.7.0-incubating | -| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | - - -In the meantime, you need to place the corresponding bundle jar -1. [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aliyun-bundle/) in the classpath if no hadoop environment is available, or -2. [`gravitino-aliyun-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aliyun/) and `hadoop-aliyun-${hadoop-version}.jar` and other necessary dependencies (They are usually located at `${HADOOP_HOME}/share/hadoop/tools/lib`) in the classpath. - -#### Azure Blob Storage fileset - -| Configuration item | Description | Default value | Required | Since version | -|------------------------------|-----------------------------------------|---------------|----------|------------------| -| `azure-storage-account-name` | The account name of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | -| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | - -Similar to the above, you need to place the corresponding bundle jar -1. [`gravitino-azure-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-azure-bundle/) in the classpath if no hadoop environment is available, or -2. [`gravitino-azure-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-azure/) and `hadoop-azure-${hadoop-version}.jar` and other necessary dependencies (They are usually located at `${HADOOP_HOME}/share/hadoop/tools/lib) in the classpath. +Apart from the above properties, to access fileset like S3, GCS, OSS and custom fileset, extra properties are needed, please see +[S3 GVFS Java client configurations](./hadoop-catalog-with-s3.md#using-the-gvfs-java-client-to-access-the-fileset), [GCS GVFS Java client configurations](./hadoop-catalog-with-gcs.md#using-the-gvfs-java-client-to-access-the-fileset), [OSS GVFS Java client configurations](./hadoop-catalog-with-oss.md#using-the-gvfs-java-client-to-access-the-fileset) and [Azure Blob Storage GVFS Java client configurations](./hadoop-catalog-with-adls.md#using-the-gvfs-java-client-to-access-the-fileset) for more details. #### Custom fileset Since 0.7.0-incubating, users can define their own fileset type and configure the corresponding properties, for more, please refer to [Custom Fileset](./hadoop-catalog.md#how-to-custom-your-own-hcfs-file-system-fileset). @@ -134,20 +87,10 @@ You can configure these properties in two ways: conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); conf.set("fs.gravitino.server.uri","http://localhost:8090"); conf.set("fs.gravitino.client.metalake","test_metalake"); - - // Optional. It's only for S3 catalog. For GCS and OSS catalog, you should set the corresponding properties. - conf.set("s3-endpoint", "http://localhost:9000"); - conf.set("s3-access-key-id", "minio"); - conf.set("s3-secret-access-key", "minio123"); - Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset_1"); FileSystem fs = filesetPath.getFileSystem(conf); ``` -:::note -If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle jars in the Hadoop environment. -::: - 2. 
Configure the properties in the `core-site.xml` file of the Hadoop environment: ```xml @@ -170,20 +113,6 @@ If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart fro fs.gravitino.client.metalake test_metalake - - - - s3-endpoint - http://localhost:9000 - - - s3-access-key-id - minio - - - s3-secret-access-key - minio123 - ``` ### Usage examples @@ -219,11 +148,6 @@ cp gravitino-filesystem-hadoop3-runtime-{version}.jar ${HADOOP_HOME}/share/hadoo # You need to ensure that the Kerberos has permission on the HDFS directory. kinit -kt your_kerberos.keytab your_kerberos@xxx.com -# 4. Copy other dependencies to the Hadoop environment if you want to access the S3 fileset via GVFS -cp bundles/aws-bundle/build/libs/gravitino-aws-bundle-{version}.jar ${HADOOP_HOME}/share/hadoop/common/lib/ -cp clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-{version}-SNAPSHOT.jar ${HADOOP_HOME}/share/hadoop/common/lib/ -cp ${HADOOP_HOME}/share/hadoop/tools/lib/* ${HADOOP_HOME}/share/hadoop/common/lib/ - # 4. Try to list the fileset ./${HADOOP_HOME}/bin/hadoop dfs -ls gvfs://fileset/test_catalog/test_schema/test_fileset_1 ``` @@ -234,36 +158,6 @@ You can also perform operations on the files or directories managed by fileset t Make sure that your code is using the correct Hadoop environment, and that your environment has the `gravitino-filesystem-hadoop3-runtime-{version}.jar` dependency. -```xml - - - org.apache.gravitino - filesystem-hadoop3-runtime - {gravitino-version} - - - - - org.apache.gravitino - gravitino-aws-bundle - {gravitino-version} - - - - - org.apache.gravitino - gravitino-aws - {gravitino-version} - - - - org.apache.hadoop - hadoop-aws - {hadoop-version} - - -``` - For example: ```java @@ -461,62 +355,14 @@ to recompile the native libraries like `libhdfs` and others, and completely repl | `oauth2_path` | The auth server path for the Gravitino client when using `oauth2` auth type. Please remove the first slash `/` from the path, for example `oauth/token`. | (none) | Yes if you use `oauth2` auth type | 0.7.0-incubating | | `oauth2_scope` | The auth scope for the Gravitino client when using `oauth2` auth type with the Gravitino Virtual File System. | (none) | Yes if you use `oauth2` auth type | 0.7.0-incubating | +#### Configurations for S3, GCS, OSS and Azure Blob storage fileset -#### Extra configuration for S3, GCS, OSS fileset - -The following properties are required if you want to access the S3 fileset via the GVFS python client: - -| Configuration item | Description | Default value | Required | Since version | -|----------------------------|------------------------------|---------------|----------|------------------| -| `s3_endpoint` | The endpoint of the AWS S3. 
| (none) | Yes | 0.7.0-incubating | -| `s3_access_key_id` | The access key of the AWS S3.| (none) | Yes | 0.7.0-incubating | -| `s3_secret_access_key` | The secret key of the AWS S3.| (none) | Yes | 0.7.0-incubating | - -The following properties are required if you want to access the GCS fileset via the GVFS python client: - -| Configuration item | Description | Default value | Required | Since version | -|----------------------------|-------------------------------------------|---------------|----------|------------------| -| `gcs_service_account_file` | The path of GCS service account JSON file.| (none) | Yes | 0.7.0-incubating | - -The following properties are required if you want to access the OSS fileset via the GVFS python client: - -| Configuration item | Description | Default value | Required | Since version | -|----------------------------|-----------------------------------|---------------|----------|------------------| -| `oss_endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | -| `oss_access_key_id` | The access key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | -| `oss_secret_access_key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | - -For Azure Blob Storage fileset, you need to configure the following properties: - -| Configuration item | Description | Default value | Required | Since version | -|--------------------|----------------------------------------|---------------|----------|------------------| -| `abs_account_name` | The account name of Azure Blob Storage | (none) | Yes | 0.8.0-incubating | -| `abs_account_key` | The account key of Azure Blob Storage | (none) | Yes | 0.8.0-incubating | - - -You can configure these properties when obtaining the `Gravitino Virtual FileSystem` in Python like this: - -```python -from gravitino import gvfs -options = { - "cache_size": 20, - "cache_expired_time": 3600, - "auth_type": "simple", - # Optional, the following properties are required if you want to access the S3 fileset via GVFS python client, for GCS and OSS fileset, you should set the corresponding properties. - "s3_endpoint": "http://localhost:9000", - "s3_access_key_id": "minio", - "s3_secret_access_key": "minio123" -} -fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options) -``` - +Please see the cloud-storage-specific configurations [GCS GVFS Java client configurations](./hadoop-catalog-with-gcs.md#using-the-gvfs-python-client-to-access-a-fileset), [S3 GVFS Java client configurations](./hadoop-catalog-with-s3.md#using-the-gvfs-python-client-to-access-a-fileset), [OSS GVFS Java client configurations](./hadoop-catalog-with-oss.md#using-the-gvfs-python-client-to-access-a-fileset) and [Azure Blob Storage GVFS Java client configurations](./hadoop-catalog-with-adls.md#using-the-gvfs-python-client-to-access-a-fileset) for more details. :::note - Gravitino python client does not support [customized file systems](hadoop-catalog.md#how-to-custom-your-own-hcfs-file-system-fileset) defined by users due to the limit of `fsspec` library. ::: - ### Usage examples 1. Make sure to obtain the Gravitino library. From cfb054cce7ab5137b7adc1761d8b6bc4bd44402d Mon Sep 17 00:00:00 2001 From: yuqi Date: Fri, 10 Jan 2025 22:02:24 +0800 Subject: [PATCH 25/39] Optimize structure. 
--- docs/hadoop-catalog-with-adls.md | 138 +++++++++++++++---------------- docs/hadoop-catalog-with-gcs.md | 130 ++++++++++++++--------------- docs/hadoop-catalog-with-oss.md | 138 +++++++++++++++---------------- docs/hadoop-catalog-with-s3.md | 136 +++++++++++++++--------------- 4 files changed, 271 insertions(+), 271 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index fa1a3fb8a6f..b49b30727da 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -231,75 +231,6 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema ## Accessing a fileset with ADLS -### Using Spark to access the fileset - -The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: - -Before running the following code, you need to install required packages: - -```bash -pip install pyspark==3.1.3 -pip install gravitino==0.8.0-incubating -``` -Then you can run the following code: - -```python -import logging -from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient -from pyspark.sql import SparkSession -import os - -gravitino_url = "http://localhost:8090" -metalake_name = "test" - -catalog_name = "your_adls_catalog" -schema_name = "your_adls_schema" -fileset_name = "your_adls_fileset" - -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell" -spark = SparkSession.builder -.appName("adls_fileset_test") -.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") -.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") -.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") -.config("spark.hadoop.fs.gravitino.client.metalake", "test") -.config("spark.hadoop.azure-storage-account-name", "azure_account_name") -.config("spark.hadoop.azure-storage-account-key", "azure_account_name") -.config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") -.config("spark.driver.memory", "2g") -.config("spark.driver.port", "2048") -.getOrCreate() - -data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] -columns = ["Name", "Age"] -spark_df = spark.createDataFrame(data, schema=columns) -gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" - -spark_df.coalesce(1).write -.mode("overwrite") -.option("header", "true") -.csv(gvfs_path) -``` - -If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: - -```python -## Replace the following code snippet with the above code snippet with the same environment variables - -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell" -``` - -- [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar with Hadoop environment(3.3.1) and `hadoop-azure` jar. 
-- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without Hadoop environment and `hadoop-azure` jar. -- `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory. - - -Please choose the correct jar according to your environment. - -:::note -In some Spark versions, a Hadoop environment is necessary for the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. -::: - ### Using the GVFS Java client to access the fileset To access fileset with Azure Blob Storage(ADLS) using the GVFS Java client, based on the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: @@ -373,6 +304,75 @@ Or use the bundle jar with Hadoop environment: ``` +### Using Spark to access the fileset + +The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: + +Before running the following code, you need to install required packages: + +```bash +pip install pyspark==3.1.3 +pip install gravitino==0.8.0-incubating +``` +Then you can run the following code: + +```python +import logging +from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient +from pyspark.sql import SparkSession +import os + +gravitino_url = "http://localhost:8090" +metalake_name = "test" + +catalog_name = "your_adls_catalog" +schema_name = "your_adls_schema" +fileset_name = "your_adls_fileset" + +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell" +spark = SparkSession.builder +.appName("adls_fileset_test") +.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") +.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") +.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") +.config("spark.hadoop.fs.gravitino.client.metalake", "test") +.config("spark.hadoop.azure-storage-account-name", "azure_account_name") +.config("spark.hadoop.azure-storage-account-key", "azure_account_name") +.config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") +.config("spark.driver.memory", "2g") +.config("spark.driver.port", "2048") +.getOrCreate() + +data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] +columns = ["Name", "Age"] +spark_df = spark.createDataFrame(data, schema=columns) +gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" + +spark_df.coalesce(1).write +.mode("overwrite") +.option("header", "true") +.csv(gvfs_path) +``` + +If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: + +```python +## Replace the following code snippet with the above code snippet with the same environment variables + +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell" +``` + +- 
[`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar with Hadoop environment(3.3.1) and `hadoop-azure` jar. +- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without Hadoop environment and `hadoop-azure` jar. +- `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory. + + +Please choose the correct jar according to your environment. + +:::note +In some Spark versions, a Hadoop environment is necessary for the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. +::: + ### Accessing a fileset using the Hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 99328963e0e..ecd8091eb45 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -227,71 +227,6 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema ## Accessing a fileset with GCS -### Using Spark to access the fileset - -The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: - -Before running the following code, you need to install required packages: - -```bash -pip install pyspark==3.1.3 -pip install gravitino==0.8.0-incubating -``` -Then you can run the following code: - -```python -import logging -from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient -from pyspark.sql import SparkSession -import os - -gravitino_url = "http://localhost:8090" -metalake_name = "test" - -catalog_name = "your_gcs_catalog" -schema_name = "your_gcs_schema" -fileset_name = "your_gcs_fileset" - -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell" -spark = SparkSession.builder -.appName("gcs_fielset_test") -.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") -.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") -.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") -.config("spark.hadoop.fs.gravitino.client.metalake", "test_metalake") -.config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") -.config("spark.driver.memory", "2g") -.config("spark.driver.port", "2048") -.getOrCreate() - -data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] -columns = ["Name", "Age"] -spark_df = spark.createDataFrame(data, schema=columns) -gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" - -spark_df.coalesce(1).write -.mode("overwrite") -.option("header", "true") -.csv(gvfs_path) -``` - -If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: - -```python -## Replace the following code snippet with the above code snippet with the same environment variables - -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars 
/path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell" -``` - -- [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCP jar with Hadoop environment(3.3.1) and `gcs-connector`. -- [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is a condensed version of the Gravitino GCP bundle jar without Hadoop environment and [`gcs-connector`](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) - -Please choose the correct jar according to your environment. - -:::note -In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. -::: - ### Using the GVFS Java client to access the fileset To access fileset with GCS using the GVFS Java client, based on the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: @@ -360,6 +295,71 @@ Or use the bundle jar with Hadoop environment: ``` +### Using Spark to access the fileset + +The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: + +Before running the following code, you need to install required packages: + +```bash +pip install pyspark==3.1.3 +pip install gravitino==0.8.0-incubating +``` +Then you can run the following code: + +```python +import logging +from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient +from pyspark.sql import SparkSession +import os + +gravitino_url = "http://localhost:8090" +metalake_name = "test" + +catalog_name = "your_gcs_catalog" +schema_name = "your_gcs_schema" +fileset_name = "your_gcs_fileset" + +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell" +spark = SparkSession.builder +.appName("gcs_fielset_test") +.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") +.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") +.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") +.config("spark.hadoop.fs.gravitino.client.metalake", "test_metalake") +.config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") +.config("spark.driver.memory", "2g") +.config("spark.driver.port", "2048") +.getOrCreate() + +data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] +columns = ["Name", "Age"] +spark_df = spark.createDataFrame(data, schema=columns) +gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" + +spark_df.coalesce(1).write +.mode("overwrite") +.option("header", "true") +.csv(gvfs_path) +``` + +If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: + +```python +## Replace the following code snippet with the above code snippet with the same environment variables + +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars 
/path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell" +``` + +- [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCP jar with Hadoop environment(3.3.1) and `gcs-connector`. +- [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is a condensed version of the Gravitino GCP bundle jar without Hadoop environment and [`gcs-connector`](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) + +Please choose the correct jar according to your environment. + +:::note +In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. +::: + ### Accessing a fileset using the Hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 76ebbb0c5d8..0609018158a 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -237,74 +237,6 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema ## Accessing a fileset with OSS -### Using Spark to access the fileset - -The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: - -Before running the following code, you need to install required packages: - -```bash -pip install pyspark==3.1.3 -pip install gravitino==0.8.0-incubating -``` -Then you can run the following code: - -```python -import logging -from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient -from pyspark.sql import SparkSession -import os - -gravitino_url = "http://localhost:8090" -metalake_name = "test" - -catalog_name = "your_oss_catalog" -schema_name = "your_oss_schema" -fileset_name = "your_oss_fileset" - -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/aliyun-sdk-oss-2.8.3.jar,/path/to/hadoop-aliyun-3.2.0.jar,/path/to/jdom-1.1.jar --master local[1] pyspark-shell" -spark = SparkSession.builder -.appName("oss_fielset_test") -.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") -.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") -.config("spark.hadoop.fs.gravitino.server.uri", "${_URL}") -.config("spark.hadoop.fs.gravitino.client.metalake", "test") -.config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) -.config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"]) -.config("spark.hadoop.oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com") -.config("spark.driver.memory", "2g") -.config("spark.driver.port", "2048") -.getOrCreate() - -data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] -columns = ["Name", "Age"] -spark_df = spark.createDataFrame(data, schema=columns) -gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" - -spark_df.coalesce(1).write -.mode("overwrite") -.option("header", "true") -.csv(gvfs_path) -``` - -If your Spark 
**without Hadoop environment**, you can use the following code snippet to access the fileset: - -```python -## Replace the following code snippet with the above code snippet with the same environment variables - -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell" -``` - -- [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) is the Gravitino Aliyun jar with Hadoop environment(3.3.1) and `hadoop-oss` jar. -- [`gravitino-aliyun-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun) is a condensed version of the Gravitino Aliyun bundle jar without Hadoop environment and `hadoop-aliyun` jar. --`hadoop-aliyun-3.2.0.jar` and `aliyun-sdk-oss-2.8.3.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory. - -Please choose the correct jar according to your environment. - -:::note -In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. -::: - ### Using the GVFS Java client to access the fileset To access fileset with OSS using the GVFS Java client, based on the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: @@ -315,7 +247,7 @@ To access fileset with OSS using the GVFS Java client, based on the [basic GVFS | `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | | `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | -:::note +:::note If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. 
::: @@ -379,6 +311,74 @@ Or use the bundle jar with Hadoop environment: ``` +### Using Spark to access the fileset + +The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: + +Before running the following code, you need to install required packages: + +```bash +pip install pyspark==3.1.3 +pip install gravitino==0.8.0-incubating +``` +Then you can run the following code: + +```python +import logging +from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient +from pyspark.sql import SparkSession +import os + +gravitino_url = "http://localhost:8090" +metalake_name = "test" + +catalog_name = "your_oss_catalog" +schema_name = "your_oss_schema" +fileset_name = "your_oss_fileset" + +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/aliyun-sdk-oss-2.8.3.jar,/path/to/hadoop-aliyun-3.2.0.jar,/path/to/jdom-1.1.jar --master local[1] pyspark-shell" +spark = SparkSession.builder +.appName("oss_fielset_test") +.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") +.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") +.config("spark.hadoop.fs.gravitino.server.uri", "${_URL}") +.config("spark.hadoop.fs.gravitino.client.metalake", "test") +.config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) +.config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"]) +.config("spark.hadoop.oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com") +.config("spark.driver.memory", "2g") +.config("spark.driver.port", "2048") +.getOrCreate() + +data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] +columns = ["Name", "Age"] +spark_df = spark.createDataFrame(data, schema=columns) +gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" + +spark_df.coalesce(1).write +.mode("overwrite") +.option("header", "true") +.csv(gvfs_path) +``` + +If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: + +```python +## Replace the following code snippet with the above code snippet with the same environment variables + +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell" +``` + +- [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) is the Gravitino Aliyun jar with Hadoop environment(3.3.1) and `hadoop-oss` jar. +- [`gravitino-aliyun-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun) is a condensed version of the Gravitino Aliyun bundle jar without Hadoop environment and `hadoop-aliyun` jar. +-`hadoop-aliyun-3.2.0.jar` and `aliyun-sdk-oss-2.8.3.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory. + +Please choose the correct jar according to your environment. + +:::note +In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. 
+::: + ### Accessing a fileset using the Hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index e930ead095e..cbc0a4e01c3 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -242,73 +242,6 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "e ## Accessing a fileset with S3 -### Using Spark to access the fileset - -The following Python code demonstrates how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: - -Before running the following code, you need to install required packages: - -```bash -pip install pyspark==3.1.3 -pip install gravitino==0.8.0-incubating -``` -Then you can run the following code: - -```python -import logging -from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient -from pyspark.sql import SparkSession -import os - -gravitino_url = "http://localhost:8090" -metalake_name = "test" - -catalog_name = "your_s3_catalog" -schema_name = "your_s3_schema" -fileset_name = "your_s3_fileset" - -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell" -spark = SparkSession.builder - .appName("s3_fielset_test") - .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") - .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") - .config("spark.hadoop.fs.gravitino.client.metalake", "test") - .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) - .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) - .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") - .config("spark.driver.memory", "2g") - .config("spark.driver.port", "2048") - .getOrCreate() - -data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] -columns = ["Name", "Age"] -spark_df = spark.createDataFrame(data, schema=columns) -gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" - -spark_df.coalesce(1).write -.mode("overwrite") -.option("header", "true") -.csv(gvfs_path) -``` - -If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: - -```python -## Replace the following code snippet with the above code snippet with the same environment variables -os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-bundle-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar --master local[1] pyspark-shell" -``` - -- [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) is the Gravitino AWS jar with Hadoop environment(3.3.1) and `hadoop-aws` jar. -- [`gravitino-aws-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws) is a condensed version of the Gravitino AWS bundle jar without Hadoop environment and `hadoop-aws` jar. 
-- `hadoop-aws-3.2.0.jar` and `aws-java-sdk-bundle-1.11.375.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory. - -Please choose the correct jar according to your environment. - -:::note -In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. -::: - ### Using the GVFS Java client to access the fileset To access fileset with S3 using the GVFS Java client, based on the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: @@ -322,7 +255,7 @@ To access fileset with S3 using the GVFS Java client, based on the [basic GVFS c ::: - `s3-endpoint` is an optional configuration for AWS S3, however, it is required for other S3-compatible storage services like MinIO. - If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. -::: + ::: ```java Configuration conf = new Configuration(); @@ -385,6 +318,73 @@ Or use the bundle jar with Hadoop environment: ``` +### Using Spark to access the fileset + +The following Python code demonstrates how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset: + +Before running the following code, you need to install required packages: + +```bash +pip install pyspark==3.1.3 +pip install gravitino==0.8.0-incubating +``` +Then you can run the following code: + +```python +import logging +from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient +from pyspark.sql import SparkSession +import os + +gravitino_url = "http://localhost:8090" +metalake_name = "test" + +catalog_name = "your_s3_catalog" +schema_name = "your_s3_schema" +fileset_name = "your_s3_fileset" + +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) + .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) + .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() + +data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] +columns = ["Name", "Age"] +spark_df = spark.createDataFrame(data, schema=columns) +gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" + +spark_df.coalesce(1).write +.mode("overwrite") +.option("header", "true") +.csv(gvfs_path) +``` + +If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: + +```python +## Replace the following code snippet with the above code snippet with the same environment variables +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars 
/path/to/gravitino-aws-bundle-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar --master local[1] pyspark-shell" +``` + +- [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) is the Gravitino AWS jar with Hadoop environment(3.3.1) and `hadoop-aws` jar. +- [`gravitino-aws-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws) is a condensed version of the Gravitino AWS bundle jar without Hadoop environment and `hadoop-aws` jar. +- `hadoop-aws-3.2.0.jar` and `aws-java-sdk-bundle-1.11.375.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory. + +Please choose the correct jar according to your environment. + +:::note +In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly. +::: + ### Accessing a fileset using the Hadoop fs command The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3. From de96e7493cc39068e644adc3fbd86993485c5d5a Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 13 Jan 2025 10:18:16 +0800 Subject: [PATCH 26/39] resolve comments --- docs/hadoop-catalog-with-adls.md | 4 ++-- docs/hadoop-catalog-with-gcs.md | 4 ++-- docs/hadoop-catalog-with-oss.md | 4 ++-- docs/hadoop-catalog-with-s3.md | 4 ++-- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index b49b30727da..ab73e5b0983 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -417,8 +417,8 @@ For ADLS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-vers 3. Run the following command to access the fileset: ```shell -hadoop dfs -ls gvfs://fileset/adls_catalog/adls_schema/adls_fileset -hadoop dfs -put /path/to/local/file gvfs://fileset/adls_catalog/adls_schema/adls_fileset +./${HADOOP_HOME}/bin/hadoop dfs -ls gvfs://fileset/adls_catalog/adls_schema/adls_fileset +./${HADOOP_HOME}/bin/hadoop dfs -put /path/to/local/file gvfs://fileset/adls_catalog/adls_schema/adls_fileset ``` ### Using the GVFS Python client to access a fileset diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index ecd8091eb45..b39422c1d1b 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -400,8 +400,8 @@ For GCS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-versi 3. Run the following command to access the fileset: ```shell -hadoop dfs -ls gvfs://fileset/gcs_catalog/gcs_schema/gcs_example -hadoop dfs -put /path/to/local/file gvfs://fileset/gcs_catalog/gcs_schema/gcs_example +./${HADOOP_HOME}/bin/hadoop dfs -ls gvfs://fileset/gcs_catalog/gcs_schema/gcs_example +./${HADOOP_HOME}/bin/hadoop dfs -put /path/to/local/file gvfs://fileset/gcs_catalog/gcs_schema/gcs_example ``` ### Using the GVFS Python client to access a fileset diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 0609018158a..9da88007296 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -429,8 +429,8 @@ For OSS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-versi 3. 
Run the following command to access the fileset:
 
 ```shell
-hadoop dfs -ls gvfs://fileset/oss_catalog/oss_schema/oss_fileset
-hadoop dfs -put /path/to/local/file gvfs://fileset/oss_catalog/schema/oss_fileset
+./${HADOOP_HOME}/bin/hadoop dfs -ls gvfs://fileset/oss_catalog/oss_schema/oss_fileset
+./${HADOOP_HOME}/bin/hadoop dfs -put /path/to/local/file gvfs://fileset/oss_catalog/oss_schema/oss_fileset
 ```
 
 ### Using the GVFS Python client to access a fileset
diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md
index cbc0a4e01c3..a7ae4b29c3e 100644
--- a/docs/hadoop-catalog-with-s3.md
+++ b/docs/hadoop-catalog-with-s3.md
@@ -435,8 +435,8 @@ For S3, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-versio
 3. Run the following command to access the fileset:
 
 ```shell
-hadoop dfs -ls gvfs://fileset/s3_catalog/s3_schema/s3_fileset
-hadoop dfs -put /path/to/local/file gvfs://fileset/s3_catalog/s3_schema/s3_fileset
+./${HADOOP_HOME}/bin/hadoop dfs -ls gvfs://fileset/s3_catalog/s3_schema/s3_fileset
+./${HADOOP_HOME}/bin/hadoop dfs -put /path/to/local/file gvfs://fileset/s3_catalog/s3_schema/s3_fileset
 ```
 
 ### Using the GVFS Python client to access a fileset

From 7806b2f0edca61438389a2fad491135c09945373 Mon Sep 17 00:00:00 2001
From: yuqi 
Date: Mon, 13 Jan 2025 14:21:52 +0800
Subject: [PATCH 27/39] resolve comments

---
 docs/hadoop-catalog-index.md                  | 23 ++++++++
 ...manage-fileset-metadata-using-gravitino.md | 59 +------------------
 2 files changed, 26 insertions(+), 56 deletions(-)
 create mode 100644 docs/hadoop-catalog-index.md

diff --git a/docs/hadoop-catalog-index.md b/docs/hadoop-catalog-index.md
new file mode 100644
index 00000000000..b18fd5c2453
--- /dev/null
+++ b/docs/hadoop-catalog-index.md
@@ -0,0 +1,23 @@
+---
+title: "Hadoop catalog index"
+slug: /hadoop-catalog-index
+date: 2025-01-13
+keyword: Hadoop catalog index S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+
+The Gravitino Hadoop catalog index includes the following chapters:
+
+- [Hadoop catalog overview and features](./hadoop-catalog.md)
+- [Manage Hadoop catalog with Gravitino API](./manage-fileset-metadata-using-gravitino.md)
+- [Using Hadoop catalog with the Gravitino Virtual File System](how-to-use-gvfs.md)
+
+Apart from the above, you can also refer to the following topics to manage and access cloud storage like S3, GCS, ADLS, and OSS:
+
+- [Using Hadoop catalog to manage S3](./hadoop-catalog-with-s3.md)
+- [Using Hadoop catalog to manage GCS](./hadoop-catalog-with-gcs.md)
+- [Using Hadoop catalog to manage ADLS](./hadoop-catalog-with-adls.md)
+- [Using Hadoop catalog to manage OSS](./hadoop-catalog-with-oss.md)
+
+More storage options will be added soon. Stay tuned!
\ No newline at end of file
diff --git a/docs/manage-fileset-metadata-using-gravitino.md b/docs/manage-fileset-metadata-using-gravitino.md
index 9d96287b564..0ff84c83461 100644
--- a/docs/manage-fileset-metadata-using-gravitino.md
+++ b/docs/manage-fileset-metadata-using-gravitino.md
@@ -15,7 +15,9 @@ filesets to manage non-tabular data like training datasets and other raw data.
 
 Typically, a fileset is mapped to a directory on a file system like HDFS, S3, ADLS, GCS, etc.
 With the fileset managed by Gravitino, the non-tabular data can be managed as assets together with
-tabular data in Gravitino in a unified way.
+tabular data in Gravitino in a unified way. 
The following operations will use HDFS as an example, for other +HCFS like S3, OSS, GCS, etc, please refer to the corresponding operations [hadoop-with-s3](./hadoop-catalog-with-s3.md), [hadoop-with-oss](./hadoop-catalog-with-oss.md), [hadoop-with-gcs](./hadoop-catalog-with-gcs.md) and +[hadoop-with-adls](./hadoop-catalog-with-adls.md). After a fileset is created, users can easily access, manage the files/directories through the fileset's identifier, without needing to know the physical path of the managed dataset. Also, with @@ -53,24 +55,6 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ } }' http://localhost:8090/api/metalakes/metalake/catalogs -# create a S3 catalog -curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ --H "Content-Type: application/json" -d '{ - "name": "catalog", - "type": "FILESET", - "comment": "comment", - "provider": "hadoop", - "properties": { - "location": "s3a://bucket/root", - "s3-access-key-id": "access_key", - "s3-secret-access-key": "secret_key", - "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com", - "filesystem-providers": "s3" - } -}' http://localhost:8090/api/metalakes/metalake/catalogs - -# For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to -# The following link about the catalog properties. ``` @@ -93,25 +77,8 @@ Catalog catalog = gravitinoClient.createCatalog("catalog", "hadoop", // provider, Gravitino only supports "hadoop" for now. "This is a Hadoop fileset catalog", properties); - -// create a S3 catalog -s3Properties = ImmutableMap.builder() - .put("location", "s3a://bucket/root") - .put("s3-access-key-id", "access_key") - .put("s3-secret-access-key", "secret_key") - .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") - .put("filesystem-providers", "s3") - .build(); - -Catalog s3Catalog = gravitinoClient.createCatalog("catalog", - Type.FILESET, - "hadoop", // provider, Gravitino only supports "hadoop" for now. - "This is a S3 fileset catalog", - s3Properties); // ... -// For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to -// The following link about the catalog properties. ``` @@ -124,23 +91,6 @@ catalog = gravitino_client.create_catalog(name="catalog", provider="hadoop", comment="This is a Hadoop fileset catalog", properties={"location": "/tmp/test1"}) - -# create a S3 catalog -s3_properties = { - "location": "s3a://bucket/root", - "s3-access-key-id": "access_key" - "s3-secret-access-key": "secret_key", - "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com" -} - -s3_catalog = gravitino_client.create_catalog(name="catalog", - type=Catalog.Type.FILESET, - provider="hadoop", - comment="This is a S3 fileset catalog", - properties=s3_properties) - -# For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to -# The following link about the catalog properties. ``` @@ -371,11 +321,8 @@ The `storageLocation` is the physical location of the fileset. Users can specify when creating a fileset, or follow the rules of the catalog/schema location if not specified. The value of `storageLocation` depends on the configuration settings of the catalog: -- If this is a S3 fileset catalog, the `storageLocation` should be in the format of `s3a://bucket-name/path/to/fileset`. -- If this is an OSS fileset catalog, the `storageLocation` should be in the format of `oss://bucket-name/path/to/fileset`. 
- If this is a local fileset catalog, the `storageLocation` should be in the format of `file:///path/to/fileset`. - If this is a HDFS fileset catalog, the `storageLocation` should be in the format of `hdfs://namenode:port/path/to/fileset`. -- If this is a GCS fileset catalog, the `storageLocation` should be in the format of `gs://bucket-name/path/to/fileset`. For a `MANAGED` fileset, the storage location is: From 71586f3334acbea1a40eac031f23f1364681c77a Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 13 Jan 2025 19:52:00 +0800 Subject: [PATCH 28/39] Polish documents --- docs/hadoop-catalog-with-adls.md | 47 +++++++++--------- docs/hadoop-catalog-with-gcs.md | 44 ++++++++--------- docs/hadoop-catalog-with-oss.md | 62 ++++++++++++------------ docs/hadoop-catalog-with-s3.md | 83 ++++++++++++++++---------------- 4 files changed, 118 insertions(+), 118 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index ab73e5b0983..9469946a47b 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -19,6 +19,7 @@ To set up a Hadoop catalog with ADLS, follow these steps: ```bash $ bin/gravitino-server.sh start ``` + Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please replace it with your actual server URL. ## Configurations for creating a Hadoop catalog with ADLS @@ -78,7 +79,7 @@ GravitinoClient gravitinoClient = GravitinoClient .withMetalake("metalake") .build(); -adlsProperties = ImmutableMap.builder() +Map adlsProperties = ImmutableMap.builder() .put("location", "abfss://container@account-name.dfs.core.windows.net/path") .put("azure-storage-account-name", "azure storage account name") .put("azure-storage-account-key", "azure storage account key") @@ -278,13 +279,13 @@ If your wants to custom your hadoop version or there is already a hadoop version org.apache.gravitino filesystem-hadoop3-runtime - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} org.apache.gravitino gravitino-azure - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} ``` @@ -294,13 +295,13 @@ Or use the bundle jar with Hadoop environment: org.apache.gravitino gravitino-azure-bundle - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} org.apache.gravitino filesystem-hadoop3-runtime - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} ``` @@ -312,7 +313,7 @@ Before running the following code, you need to install required packages: ```bash pip install pyspark==3.1.3 -pip install gravitino==0.8.0-incubating +pip install apache-gravitino==${GRAVITINO_VERSION} ``` Then you can run the following code: @@ -331,17 +332,17 @@ fileset_name = "your_adls_fileset" os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell" spark = SparkSession.builder -.appName("adls_fileset_test") -.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") -.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") -.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") -.config("spark.hadoop.fs.gravitino.client.metalake", "test") -.config("spark.hadoop.azure-storage-account-name", "azure_account_name") 
-.config("spark.hadoop.azure-storage-account-key", "azure_account_name") -.config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") -.config("spark.driver.memory", "2g") -.config("spark.driver.port", "2048") -.getOrCreate() + .appName("adls_fileset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.azure-storage-account-name", "azure_account_name") + .config("spark.hadoop.azure-storage-account-key", "azure_account_name") + .config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] columns = ["Name", "Age"] @@ -349,9 +350,9 @@ spark_df = spark.createDataFrame(data, schema=columns) gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" spark_df.coalesce(1).write -.mode("overwrite") -.option("header", "true") -.csv(gvfs_path) + .mode("overwrite") + .option("header", "true") + .csv(gvfs_path) ``` If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: @@ -430,14 +431,14 @@ In order to access fileset with Azure Blob storage (ADLS) using the GVFS Python | `abs_account_name` | The account name of Azure Blob Storage | (none) | Yes | 0.8.0-incubating | | `abs_account_key` | The account key of Azure Blob Storage | (none) | Yes | 0.8.0-incubating | -::: +:::note If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. ::: Please install the `gravitino` package before running the following code: ```bash -pip install gravitino==0.8.0-incubating +pip install apache-gravitino==${GRAVITINO_VERSION} ``` ```python @@ -476,7 +477,7 @@ ds.head() For other use cases, please refer to the [Gravitino Virtual File System](./how-to-use-gvfs.md) document. -## Fileset with credential +## Fileset with credential vending Since 0.8.0-incubating, Gravitino supports credential vending for ADLS fileset. If the catalog has been [configured with credential](./security/credential-vending.md), you can access ADLS fileset without providing authentication information like `azure-storage-account-name` and `azure-storage-account-key` in the properties. diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index b39422c1d1b..4079dce1843 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -75,13 +75,13 @@ GravitinoClient gravitinoClient = GravitinoClient .withMetalake("metalake") .build(); -gcsProperties = ImmutableMap.builder() +Map gcsProperties = ImmutableMap.builder() .put("location", "gs://bucket/root") .put("gcs-service-account-file", "path_of_gcs_service_account_file") .put("filesystem-providers", "gcs") .build(); -Catalog gcsCatalog = gravitinoClient.createCatalog("test_catalog", +Catalog gcsCatalog = gravitinoClient.createCatalog("test_catalog", Type.FILESET, "hadoop", // provider, Gravitino only supports "hadoop" for now. 
"This is a GCS fileset catalog", @@ -235,7 +235,7 @@ To access fileset with GCS using the GVFS Java client, based on the [basic GVFS |----------------------------|--------------------------------------------|---------------|----------|------------------| | `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes | 0.7.0-incubating | -::: +:::note If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. ::: @@ -269,13 +269,13 @@ If your wants to custom your hadoop version or there is already a hadoop version org.apache.gravitino filesystem-hadoop3-runtime - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} org.apache.gravitino gravitino-gcp - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} ``` @@ -285,13 +285,13 @@ Or use the bundle jar with Hadoop environment: org.apache.gravitino gravitino-gcp-bundle - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} org.apache.gravitino filesystem-hadoop3-runtime - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} ``` @@ -303,7 +303,7 @@ Before running the following code, you need to install required packages: ```bash pip install pyspark==3.1.3 -pip install gravitino==0.8.0-incubating +pip install apache-gravitino==${GRAVITINO_VERSION} ``` Then you can run the following code: @@ -322,15 +322,15 @@ fileset_name = "your_gcs_fileset" os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell" spark = SparkSession.builder -.appName("gcs_fielset_test") -.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") -.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") -.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") -.config("spark.hadoop.fs.gravitino.client.metalake", "test_metalake") -.config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") -.config("spark.driver.memory", "2g") -.config("spark.driver.port", "2048") -.getOrCreate() + .appName("gcs_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test_metalake") + .config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] columns = ["Name", "Age"] @@ -338,9 +338,9 @@ spark_df = spark.createDataFrame(data, schema=columns) gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" spark_df.coalesce(1).write -.mode("overwrite") -.option("header", "true") -.csv(gvfs_path) + .mode("overwrite") + .option("header", "true") + .csv(gvfs_path) ``` If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: @@ -419,7 +419,7 @@ If the catalog has enabled [credential vending](security/credential-vending.md), Please install the `gravitino` package before running the following code: ```bash -pip install gravitino==0.8.0-incubating +pip install 
apache-gravitino==${GRAVITINO_VERSION} ``` ```python @@ -455,7 +455,7 @@ ds.head() For other use cases, please refer to the [Gravitino Virtual File System](./how-to-use-gvfs.md) document. -## Fileset with credential +## Fileset with credential vending Since 0.8.0-incubating, Gravitino supports credential vending for GCS fileset. If the catalog has been [configured with credential](./security/credential-vending.md), you can access GCS fileset without providing authentication information like `gcs-service-account-file` in the properties. diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 9da88007296..89d01092a4c 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -80,7 +80,7 @@ GravitinoClient gravitinoClient = GravitinoClient .withMetalake("metalake") .build(); -ossProperties = ImmutableMap.builder() +Map ossProperties = ImmutableMap.builder() .put("location", "oss://bucket/root") .put("oss-access-key-id", "access_key") .put("oss-secret-access-key", "secret_key") @@ -266,7 +266,7 @@ fs.mkdirs(filesetPath); ... ``` -Similar to Spark configurations, you need to add OSS bundle jars to the classpath according to your environment. +Similar to Spark configurations, you need to add OSS (bundle) jars to the classpath according to your environment. If your wants to custom your hadoop version or there is already a hadoop version in your project, you can add the following dependencies to your `pom.xml`: ```xml @@ -285,13 +285,13 @@ If your wants to custom your hadoop version or there is already a hadoop version org.apache.gravitino filesystem-hadoop3-runtime - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} org.apache.gravitino gravitino-aliyun - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} ``` @@ -301,13 +301,13 @@ Or use the bundle jar with Hadoop environment: org.apache.gravitino gravitino-aliyun-bundle - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} org.apache.gravitino filesystem-hadoop3-runtime - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} ``` @@ -319,7 +319,7 @@ Before running the following code, you need to install required packages: ```bash pip install pyspark==3.1.3 -pip install gravitino==0.8.0-incubating +pip install apache-gravitino==${GRAVITINO_VERSION} ``` Then you can run the following code: @@ -338,17 +338,17 @@ fileset_name = "your_oss_fileset" os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/aliyun-sdk-oss-2.8.3.jar,/path/to/hadoop-aliyun-3.2.0.jar,/path/to/jdom-1.1.jar --master local[1] pyspark-shell" spark = SparkSession.builder -.appName("oss_fielset_test") -.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") -.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") -.config("spark.hadoop.fs.gravitino.server.uri", "${_URL}") -.config("spark.hadoop.fs.gravitino.client.metalake", "test") -.config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) -.config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"]) -.config("spark.hadoop.oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com") -.config("spark.driver.memory", "2g") -.config("spark.driver.port", "2048") -.getOrCreate() + .appName("oss_fileset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", 
"org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "${_URL}") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) + .config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"]) + .config("spark.hadoop.oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] columns = ["Name", "Age"] @@ -356,9 +356,9 @@ spark_df = spark.createDataFrame(data, schema=columns) gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" spark_df.coalesce(1).write -.mode("overwrite") -.option("header", "true") -.csv(gvfs_path) + .mode("overwrite") + .option("header", "true") + .csv(gvfs_path) ``` If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: @@ -450,7 +450,7 @@ If the catalog has enabled [credential vending](security/credential-vending.md), Please install the `gravitino` package before running the following code: ```bash -pip install gravitino==0.8.0-incubating +pip install apache-gravitino==${GRAVITINO_VERSION} ``` ```python @@ -491,7 +491,7 @@ ds.head() ``` For other use cases, please refer to the [Gravitino Virtual File System](./how-to-use-gvfs.md) document. -## Fileset with credential +## Fileset with credential vending Since 0.8.0-incubating, Gravitino supports credential vending for OSS fileset. If the catalog has been [configured with credential](./security/credential-vending.md), you can access OSS fileset without providing authentication information like `oss-access-key-id` and `oss-secret-access-key` in the properties. @@ -522,15 +522,15 @@ Spark: ```python spark = SparkSession.builder - .appName("oss_fileset_test") - .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") - .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") - .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .appName("oss_fileset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") # No need to set oss-access-key-id and oss-secret-access-key - .config("spark.driver.memory", "2g") - .config("spark.driver.port", "2048") - .getOrCreate() + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() ``` Python client and Hadoop command are similar to the above examples. 
diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index a7ae4b29c3e..2cf1a9f68b5 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -82,7 +82,7 @@ GravitinoClient gravitinoClient = GravitinoClient .withMetalake("metalake") .build(); -s3Properties = ImmutableMap.builder() +Map s3Properties = ImmutableMap.builder() .put("location", "s3a://bucket/root") .put("s3-access-key-id", "access_key") .put("s3-secret-access-key", "secret_key") @@ -149,7 +149,6 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ ```java -// Assuming you have just created a Hive catalog named `hive_catalog` Catalog catalog = gravitinoClient.loadCatalog("hive_catalog"); SupportsSchemas supportsSchemas = catalog.asSchemas(); @@ -211,15 +210,15 @@ Catalog catalog = gravitinoClient.loadCatalog("test_catalog"); FilesetCatalog filesetCatalog = catalog.asFilesetCatalog(); Map propertiesMap = ImmutableMap.builder() - .put("k1", "v1") - .build(); + .put("k1", "v1") + .build(); filesetCatalog.createFileset( - NameIdentifier.of("test_schema", "example_fileset"), - "This is an example fileset", - Fileset.Type.MANAGED, - "s3a://bucket/root/schema/example_fileset", - propertiesMap, + NameIdentifier.of("test_schema", "example_fileset"), + "This is an example fileset", + Fileset.Type.MANAGED, + "s3a://bucket/root/schema/example_fileset", + propertiesMap, ); ``` @@ -252,10 +251,10 @@ To access fileset with S3 using the GVFS Java client, based on the [basic GVFS c | `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | | `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | -::: +:::note - `s3-endpoint` is an optional configuration for AWS S3, however, it is required for other S3-compatible storage services like MinIO. - If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. 
- ::: +::: ```java Configuration conf = new Configuration(); @@ -292,13 +291,13 @@ Similar to Spark configurations, you need to add S3 (bundle) jars to the classpa org.apache.gravitino filesystem-hadoop3-runtime - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} org.apache.gravitino gravitino-aws - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} ``` @@ -308,13 +307,13 @@ Or use the bundle jar with Hadoop environment: org.apache.gravitino gravitino-aws-bundle - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} org.apache.gravitino filesystem-hadoop3-runtime - 0.8.0-incubating-SNAPSHOT + ${GRAVITINO_VERSION} ``` @@ -326,7 +325,7 @@ Before running the following code, you need to install required packages: ```bash pip install pyspark==3.1.3 -pip install gravitino==0.8.0-incubating +pip install apache-gravitino==${GRAVITINO_VERSION} ``` Then you can run the following code: @@ -345,17 +344,17 @@ fileset_name = "your_s3_fileset" os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell" spark = SparkSession.builder - .appName("s3_fielset_test") - .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") - .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") - .config("spark.hadoop.fs.gravitino.client.metalake", "test") - .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) - .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) - .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") - .config("spark.driver.memory", "2g") - .config("spark.driver.port", "2048") - .getOrCreate() + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) + .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) + .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] columns = ["Name", "Age"] @@ -363,9 +362,9 @@ spark_df = spark.createDataFrame(data, schema=columns) gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" spark_df.coalesce(1).write -.mode("overwrite") -.option("header", "true") -.csv(gvfs_path) + .mode("overwrite") + .option("header", "true") + .csv(gvfs_path) ``` If your Spark **without Hadoop environment**, you can use the following code snippet to access the fileset: @@ -449,7 +448,7 @@ In order to access fileset with S3 using the GVFS Python client, apart from [bas | `s3_access_key_id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | | `s3_secret_access_key` | The secret key of the AWS S3. 
| (none) | Yes | 0.7.0-incubating | -::: +:::note - `s3_endpoint` is an optional configuration for AWS S3, however, it is required for other S3-compatible storage services like MinIO. - If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. ::: @@ -457,7 +456,7 @@ In order to access fileset with S3 using the GVFS Python client, apart from [bas Please install the `gravitino` package before running the following code: ```bash -pip install gravitino==0.8.0-incubating +pip install apache-gravitino==${GRAVITINO_VERSION} ``` ```python @@ -497,7 +496,7 @@ ds.head() For more use cases, please refer to the [Gravitino Virtual File System](./how-to-use-gvfs.md) document. -## Fileset with credential +## Fileset with credential vending Since 0.8.0-incubating, Gravitino supports credential vending for S3 fileset. If the catalog has been [configured with credential](./security/credential-vending.md), you can access S3 fileset without providing authentication information like `s3-access-key-id` and `s3-secret-access-key` in the properties. @@ -528,15 +527,15 @@ Spark: ```python spark = SparkSession.builder - .appName("s3_fielset_test") - .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") - .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") - .config("spark.hadoop.fs.gravitino.client.metalake", "test") - # No need to set s3-access-key-id and s3-secret-access-key - .config("spark.driver.memory", "2g") - .config("spark.driver.port", "2048") - .getOrCreate() + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + # No need to set s3-access-key-id and s3-secret-access-key + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() ``` Python client and Hadoop command are similar to the above examples. 
From 7b8ad3152d0eee0c40389a1b5d0f889e14fe7760 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 13 Jan 2025 19:59:18 +0800 Subject: [PATCH 29/39] fix --- docs/hadoop-catalog-with-adls.md | 16 ++++++++-------- docs/hadoop-catalog-with-gcs.md | 9 +++++---- docs/hadoop-catalog-with-oss.md | 9 +++++---- docs/hadoop-catalog-with-s3.md | 8 ++++---- 4 files changed, 22 insertions(+), 20 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 9469946a47b..8335fcb3bc5 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -426,10 +426,10 @@ For ADLS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-vers In order to access fileset with Azure Blob storage (ADLS) using the GVFS Python client, apart from [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: -| Configuration item | Description | Default value | Required | Since version | -|--------------------|----------------------------------------|---------------|----------|------------------| -| `abs_account_name` | The account name of Azure Blob Storage | (none) | Yes | 0.8.0-incubating | -| `abs_account_key` | The account key of Azure Blob Storage | (none) | Yes | 0.8.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|------------------------------|----------------------------------------|---------------|----------|------------------| +| `azure_storage_account_name` | The account name of Azure Blob Storage | (none) | Yes | 0.8.0-incubating | +| `azure_storage_account_key` | The account key of Azure Blob Storage | (none) | Yes | 0.8.0-incubating | :::note If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. @@ -493,10 +493,10 @@ GVFS Java client: ```java Configuration conf = new Configuration(); -conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); -conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); -conf.set("fs.gravitino.client.metalake","test_metalake"); +conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri", "http://localhost:8090"); +conf.set("fs.gravitino.client.metalake", "test_metalake"); // No need to set azure-storage-account-name and azure-storage-account-name Path filesetPath = new Path("gvfs://fileset/adls_test_catalog/test_schema/test_fileset/new_dir"); FileSystem fs = filesetPath.getFileSystem(conf); diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 4079dce1843..125bab09698 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -18,6 +18,7 @@ To set up a Hadoop catalog with OSS, follow these steps: ```bash $ bin/gravitino-server.sh start ``` + Once the server is up and running, you can proceed to configure the Hadoop catalog with GCS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please replace it with your actual server URL. 
## Configurations for creating a Hadoop catalog with GCS @@ -471,10 +472,10 @@ GVFS Java client: ```java Configuration conf = new Configuration(); -conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); -conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); -conf.set("fs.gravitino.client.metalake","test_metalake"); +conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri", "http://localhost:8090"); +conf.set("fs.gravitino.client.metalake", "test_metalake"); // No need to set gcs-service-account-file Path filesetPath = new Path("gvfs://fileset/gcs_test_catalog/test_schema/test_fileset/new_dir"); FileSystem fs = filesetPath.getFileSystem(conf); diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 89d01092a4c..27cfc023065 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -19,6 +19,7 @@ To set up a Hadoop catalog with OSS, follow these steps: ```bash $ bin/gravitino-server.sh start ``` + Once the server is up and running, you can proceed to configure the Hadoop catalog with OSS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please replace it with your actual server URL. ## Configurations for creating a Hadoop catalog with OSS @@ -507,10 +508,10 @@ GVFS Java client: ```java Configuration conf = new Configuration(); -conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); -conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); -conf.set("fs.gravitino.client.metalake","test_metalake"); +conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri", "http://localhost:8090"); +conf.set("fs.gravitino.client.metalake", "test_metalake"); // No need to set oss-access-key-id and oss-secret-access-key Path filesetPath = new Path("gvfs://fileset/oss_test_catalog/test_schema/test_fileset/new_dir"); FileSystem fs = filesetPath.getFileSystem(conf); diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 2cf1a9f68b5..284740ce3e4 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -512,10 +512,10 @@ GVFS Java client: ```java Configuration conf = new Configuration(); -conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); -conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); -conf.set("fs.gravitino.client.metalake","test_metalake"); +conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri", "http://localhost:8090"); +conf.set("fs.gravitino.client.metalake", "test_metalake"); // No need to set s3-access-key-id and s3-secret-access-key Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); FileSystem fs = 
filesetPath.getFileSystem(conf); From c9eca7321e967af5bdd76acfe45235c7096a8e84 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 13 Jan 2025 20:11:20 +0800 Subject: [PATCH 30/39] fix --- docs/hadoop-catalog-with-adls.md | 28 +++++++++++++------------- docs/hadoop-catalog-with-gcs.md | 28 +++++++++++++------------- docs/hadoop-catalog-with-oss.md | 34 ++++++++++++++++---------------- 3 files changed, 45 insertions(+), 45 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 8335fcb3bc5..13206957151 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -205,11 +205,11 @@ Map propertiesMap = ImmutableMap.builder() .build(); filesetCatalog.createFileset( - NameIdentifier.of("test_schema", "example_fileset"), - "This is an example fileset", - Fileset.Type.MANAGED, - "abfss://container@account-name.dfs.core.windows.net/path/example_fileset", - propertiesMap, + NameIdentifier.of("test_schema", "example_fileset"), + "This is an example fileset", + Fileset.Type.MANAGED, + "abfss://container@account-name.dfs.core.windows.net/path/example_fileset", + propertiesMap, ); ``` @@ -508,15 +508,15 @@ Spark: ```python spark = SparkSession.builder - .appName("adls_fielset_test") - .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") - .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") - .config("spark.hadoop.fs.gravitino.client.metalake", "test") - # No need to set azure-storage-account-name and azure-storage-account-name - .config("spark.driver.memory", "2g") - .config("spark.driver.port", "2048") - .getOrCreate() + .appName("adls_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + # No need to set azure-storage-account-name and azure-storage-account-name + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() ``` Python client and Hadoop command are similar to the above examples. 
diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 125bab09698..cf1d74aef95 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -201,11 +201,11 @@ Map propertiesMap = ImmutableMap.builder() .build(); filesetCatalog.createFileset( - NameIdentifier.of("test_schema", "example_fileset"), - "This is an example fileset", - Fileset.Type.MANAGED, - "gs://bucket/root/schema/example_fileset", - propertiesMap, + NameIdentifier.of("test_schema", "example_fileset"), + "This is an example fileset", + Fileset.Type.MANAGED, + "gs://bucket/root/schema/example_fileset", + propertiesMap, ); ``` @@ -487,15 +487,15 @@ Spark: ```python spark = SparkSession.builder - .appName("gcs_fileset_test") - .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") - .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") - .config("spark.hadoop.fs.gravitino.client.metalake", "test") - # No need to set gcs-service-account-file - .config("spark.driver.memory", "2g") - .config("spark.driver.port", "2048") - .getOrCreate() + .appName("gcs_fileset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + # No need to set gcs-service-account-file + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() ``` Python client and Hadoop command are similar to the above examples. 
diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 27cfc023065..79759d52942 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -211,11 +211,11 @@ Map propertiesMap = ImmutableMap.builder() .build(); filesetCatalog.createFileset( - NameIdentifier.of("test_schema", "example_fileset"), - "This is an example fileset", - Fileset.Type.MANAGED, - "oss://bucket/root/schema/example_fileset", - propertiesMap, + NameIdentifier.of("test_schema", "example_fileset"), + "This is an example fileset", + Fileset.Type.MANAGED, + "oss://bucket/root/schema/example_fileset", + propertiesMap, ); ``` @@ -339,17 +339,17 @@ fileset_name = "your_oss_fileset" os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/aliyun-sdk-oss-2.8.3.jar,/path/to/hadoop-aliyun-3.2.0.jar,/path/to/jdom-1.1.jar --master local[1] pyspark-shell" spark = SparkSession.builder - .appName("oss_fileset_test") - .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") - .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") - .config("spark.hadoop.fs.gravitino.server.uri", "${_URL}") - .config("spark.hadoop.fs.gravitino.client.metalake", "test") - .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) - .config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"]) - .config("spark.hadoop.oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com") - .config("spark.driver.memory", "2g") - .config("spark.driver.port", "2048") - .getOrCreate() + .appName("oss_fileset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "${_URL}") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) + .config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"]) + .config("spark.hadoop.oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] columns = ["Name", "Age"] @@ -528,7 +528,7 @@ spark = SparkSession.builder .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") .config("spark.hadoop.fs.gravitino.client.metalake", "test") - # No need to set oss-access-key-id and oss-secret-access-key + # No need to set oss-access-key-id and oss-secret-access-key .config("spark.driver.memory", "2g") .config("spark.driver.port", "2048") .getOrCreate() From aacd58f151c97fa8f23e6de434fd97836f975279 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 13 Jan 2025 20:54:06 +0800 Subject: [PATCH 31/39] fix --- docs/hadoop-catalog-with-adls.md | 18 +++++++++--------- docs/hadoop-catalog-with-gcs.md | 18 +++++++++--------- docs/hadoop-catalog-with-oss.md | 16 ++++++++-------- docs/hadoop-catalog-with-s3.md | 18 +++++++++--------- 4 files changed, 35 insertions(+), 35 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 13206957151..c38def9f27b 100644 
--- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -17,7 +17,7 @@ To set up a Hadoop catalog with ADLS, follow these steps: 3. Start the Gravitino server by running the following command: ```bash -$ bin/gravitino-server.sh start +$ ${GRAVITINO_HOME}/bin/gravitino-server.sh start ``` Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please replace it with your actual server URL. @@ -242,15 +242,15 @@ To access fileset with Azure Blob Storage(ADLS) using the GVFS Java client, base | `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | :::note -If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. +If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. More details can be found in [Fileset with credential vending](#fileset-with-credential-vending). ::: ```java Configuration conf = new Configuration(); -conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); -conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); -conf.set("fs.gravitino.client.metalake","test_metalake"); +conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri", "http://localhost:8090"); +conf.set("fs.gravitino.client.metalake", "test_metalake"); conf.set("azure-storage-account-name", "account_name_of_adls"); conf.set("azure-storage-account-key", "account_key_of_adls"); Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); @@ -259,7 +259,7 @@ fs.mkdirs(filesetPath); ... ``` -Similar to Spark configurations, you need to add ADLS bundle jars to the classpath according to your environment. +Similar to Spark configurations, you need to add ADLS (bundle) jars to the classpath according to your environment. If your wants to custom your hadoop version or there is already a hadoop version in your project, you can add the following dependencies to your `pom.xml`: @@ -289,7 +289,7 @@ If your wants to custom your hadoop version or there is already a hadoop version ``` -Or use the bundle jar with Hadoop environment: +Or use the bundle jar with Hadoop environment if there is no Hadoop environment: ```xml @@ -487,7 +487,7 @@ Apart from configuration method in [create-adls-hadoop-catalog](#configuration-f ### How to access ADLS fileset with credential -If the catalog has been configured with credential, you can access ADLS fileset without providing authentication information via GVFS. Let's see how to access ADLS fileset with credential: +If the catalog has been configured with credential, you can access ADLS fileset without providing authentication information via GVFS Java/Python client and Spark. Let's see how to access ADLS fileset with credential: GVFS Java client: diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index cf1d74aef95..36bf9131454 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -16,7 +16,7 @@ To set up a Hadoop catalog with OSS, follow these steps: 3. 
Start the Gravitino server by running the following command: ```bash -$ bin/gravitino-server.sh start +$ ${GRAVITINO_HOME}/bin/gravitino-server.sh start ``` Once the server is up and running, you can proceed to configure the Hadoop catalog with GCS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please replace it with your actual server URL. @@ -237,15 +237,15 @@ To access fileset with GCS using the GVFS Java client, based on the [basic GVFS | `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes | 0.7.0-incubating | :::note -If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. +If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. More details can be found in [Fileset with credential vending](#fileset-with-credential-vending). ::: ```java Configuration conf = new Configuration(); -conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); -conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); -conf.set("fs.gravitino.client.metalake","test_metalake"); +conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri", "http://localhost:8090"); +conf.set("fs.gravitino.client.metalake", "test_metalake"); conf.set("gcs-service-account-file", "/path/your-service-account-file.json"); Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir"); FileSystem fs = filesetPath.getFileSystem(conf); @@ -253,7 +253,7 @@ fs.mkdirs(filesetPath); ... ``` -Similar to Spark configurations, you need to add GCS bundle jars to the classpath according to your environment. +Similar to Spark configurations, you need to add GCS (bundle) jars to the classpath according to your environment. If your wants to custom your hadoop version or there is already a hadoop version in your project, you can add the following dependencies to your `pom.xml`: ```xml @@ -280,7 +280,7 @@ If your wants to custom your hadoop version or there is already a hadoop version ``` -Or use the bundle jar with Hadoop environment: +Or use the bundle jar with Hadoop environment if there is no Hadoop environment: ```xml @@ -466,7 +466,7 @@ Apart from configuration method in [create-gcs-hadoop-catalog](#configurations-f ### How to access GCS fileset with credential -If the catalog has been configured with credential, you can access GCS fileset without providing authentication information via GVFS. Let's see how to access GCS fileset with credential: +If the catalog has been configured with credential, you can access GCS fileset without providing authentication information via GVFS Java/Python client and Spark. Let's see how to access GCS fileset with credential: GVFS Java client: diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 79759d52942..70d47af1fb5 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -17,7 +17,7 @@ To set up a Hadoop catalog with OSS, follow these steps: 3. 
Start the Gravitino server by running the following command: ```bash -$ bin/gravitino-server.sh start +$ ${GRAVITINO_HOME}/bin/gravitino-server.sh start ``` Once the server is up and running, you can proceed to configure the Hadoop catalog with OSS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please replace it with your actual server URL. @@ -249,15 +249,15 @@ To access fileset with OSS using the GVFS Java client, based on the [basic GVFS | `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | :::note -If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. +If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. More details can be found in [Fileset with credential vending](#fileset-with-credential-vending). ::: ```java Configuration conf = new Configuration(); -conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); -conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); -conf.set("fs.gravitino.client.metalake","test_metalake"); +conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri", "http://localhost:8090"); +conf.set("fs.gravitino.client.metalake", "test_metalake"); conf.set("oss-endpoint", "http://localhost:8090"); conf.set("oss-access-key-id", "minio"); conf.set("oss-secret-access-key", "minio123"); @@ -296,7 +296,7 @@ If your wants to custom your hadoop version or there is already a hadoop version ``` -Or use the bundle jar with Hadoop environment: +Or use the bundle jar with Hadoop environment if there is no Hadoop environment: ```xml @@ -502,7 +502,7 @@ Apart from configuration method in [create-oss-hadoop-catalog](#configuration-fo ### How to access OSS fileset with credential -If the catalog has been configured with credential, you can access OSS fileset without providing authentication information via GVFS. Let's see how to access OSS fileset with credential: +If the catalog has been configured with credential, you can access OSS fileset without providing authentication information via GVFS Java/Python client and Spark. Let's see how to access OSS fileset with credential: GVFS Java client: diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 284740ce3e4..8a6982af183 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -17,7 +17,7 @@ To create a Hadoop catalog with S3, follow these steps: 3. Start the Gravitino server using the following command: ```bash -$ bin/gravitino-server.sh start +$ ${GRAVITINO_HOME}/bin/gravitino-server.sh start ``` Once the server is up and running, you can proceed to configure the Hadoop catalog with S3. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please replace it with your actual server URL. @@ -253,16 +253,15 @@ To access fileset with S3 using the GVFS Java client, based on the [basic GVFS c :::note - `s3-endpoint` is an optional configuration for AWS S3, however, it is required for other S3-compatible storage services like MinIO. 
-- If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. +- If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. More details can be found in [Fileset with credential vending](#fileset-with-credential-vending). ::: ```java Configuration conf = new Configuration(); -conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs"); -conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); -conf.set("fs.gravitino.server.uri","http://localhost:8090"); -conf.set("fs.gravitino.client.metalake","test_metalake"); - +conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs"); +conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem"); +conf.set("fs.gravitino.server.uri", "http://localhost:8090"); +conf.set("fs.gravitino.client.metalake", "test_metalake"); conf.set("s3-endpoint", "http://localhost:8090"); conf.set("s3-access-key-id", "minio"); conf.set("s3-secret-access-key", "minio123"); @@ -301,7 +300,8 @@ Similar to Spark configurations, you need to add S3 (bundle) jars to the classpa ``` -Or use the bundle jar with Hadoop environment: +Or use the bundle jar with Hadoop environment if there is no Hadoop environment: + ```xml @@ -506,7 +506,7 @@ Apart from configuration method in [create-s3-hadoop-catalog](#configurations-fo ### How to access S3 fileset with credential -If the catalog has been configured with credential, you can access S3 fileset without providing authentication information via GVFS. Let's see how to access S3 fileset with credential: +If the catalog has been configured with credential, you can access S3 fileset without providing authentication information via GVFS Java/Python client and Spark. Let's see how to access S3 fileset with credential: GVFS Java client: From d3a898625022cbb5924946107ec942423e913420 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 14 Jan 2025 11:48:01 +0800 Subject: [PATCH 32/39] fix --- docs/hadoop-catalog-index.md | 17 ++++++++++------- docs/hadoop-catalog-with-adls.md | 14 ++++++++------ docs/hadoop-catalog-with-gcs.md | 12 +++++++----- docs/hadoop-catalog-with-oss.md | 16 +++++++++------- docs/hadoop-catalog-with-s3.md | 15 ++++++++------- 5 files changed, 42 insertions(+), 32 deletions(-) diff --git a/docs/hadoop-catalog-index.md b/docs/hadoop-catalog-index.md index b18fd5c2453..f5c06607f02 100644 --- a/docs/hadoop-catalog-index.md +++ b/docs/hadoop-catalog-index.md @@ -6,18 +6,21 @@ keyword: Hadoop catalog index S3 GCS ADLS OSS license: "This software is licensed under the Apache License version 2." --- +### Hadoop catalog overall Gravitino Hadoop catalog index includes the following chapters: -- [Hadoop catalog overview and features](./hadoop-catalog.md) -- [Manage Hadoop catalog with Gravitino API](./manage-fileset-metadata-using-gravitino.md) -- [Using Hadoop catalog with Gravitino virtual System](how-to-use-gvfs.md) +- [Hadoop catalog overview and features](./hadoop-catalog.md): This chapter provides an overview of the Hadoop catalog, its features, capabilities and related configurations. +- [Manage Hadoop catalog with Gravitino API](./manage-fileset-metadata-using-gravitino.md): This chapter explains how to manage fileset metadata using Gravitino API and provides detailed examples. 
+- [Using Hadoop catalog with Gravitino virtual System](how-to-use-gvfs.md): This chapter explains how to use Hadoop catalog with Gravitino virtual System and provides detailed examples. + +### Hadoop catalog with cloud storage Apart from the above, you can also refer to the following topics to manage and access cloud storage like S3, GCS, ADLS, and OSS: -- [Using Hadoop catalog to manage S3](./hadoop-catalog-with-s3.md) -- [Using Hadoop catalog to manage GCS](./hadoop-catalog-with-gcs.md) -- [Using Hadoop catalog to manage ADLS](./hadoop-catalog-with-adls.md) -- [Using Hadoop catalog to manage OSS](./hadoop-catalog-with-oss.md) +- [Using Hadoop catalog to manage S3](./hadoop-catalog-with-s3.md). +- [Using Hadoop catalog to manage GCS](./hadoop-catalog-with-gcs.md). +- [Using Hadoop catalog to manage ADLS](./hadoop-catalog-with-adls.md). +- [Using Hadoop catalog to manage OSS](./hadoop-catalog-with-oss.md). More storage options will be added soon. Stay tuned! \ No newline at end of file diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index c38def9f27b..b0e903946ab 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -28,12 +28,14 @@ Once the server is up and running, you can proceed to configure the Hadoop catal Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS: -| Configuration item | Description | Default value | Required | Since version | -|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `abs` if it's a Azure Blob Storage fileset, or a comma separated string that contains `abs` like `oss,abs,s3` to support multiple kinds of fileset including `abs`. | (none) | Yes | 0.8.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for Azure Blob Storage, if we set this value, we can omit the prefix 'abfss://' in the location. | `builtin-local` | No | 0.8.0-incubating | -| `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | -| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. 
Set it to `abs` if it's a Azure Blob Storage fileset, or a comma separated string that contains `abs` like `oss,abs,s3` to support multiple kinds of fileset including `abs`. | (none) | Yes | 0.8.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for Azure Blob Storage, if we set this value, we can omit the prefix 'abfss://' in the location. | `builtin-local` | No | 0.8.0-incubating | +| `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | +| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | +| `credential-providers` | The credential provider types, separated by comma, possible value can be `adls-token`, `azure-account-key`. As the default authentication type is using account name and account key as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like account_name/account_key to access ADLS by GVFS. Once it's set, more configuration items are needed to make it works, please see [adls-credential-vending](security/credential-vending.md) | (none) | No | 0.8.0-incubating | + ### Configurations for a schema diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index 36bf9131454..db556594aa5 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -27,11 +27,13 @@ Once the server is up and running, you can proceed to configure the Hadoop catal Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS: -| Configuration item | Description | Default value | Required | Since version | -|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `gcs` if it's a GCS fileset, a comma separated string that contains `gcs` like `gcs,s3` to support multiple kinds of fileset including `gcs`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating | -| `gcs-service-account-file` | The path of GCS service account JSON file. 
| (none) | Yes | 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `gcs` if it's a GCS fileset, a comma separated string that contains `gcs` like `gcs,s3` to support multiple kinds of fileset including `gcs`. | (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes | 0.7.0-incubating | +| `credential-providers` | The credential provider types, separated by comma, possible value can be `gcs-token`. As the default authentication type is using service account as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like service account to access GCS by GVFS. Once it's set, more configuration items are needed to make it works, please see [gcs-credential-vending](security/credential-vending.md) | (none) | No | 0.8.0-incubating | + ### Configurations for a schema diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 70d47af1fb5..2d542fc63e6 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -28,13 +28,15 @@ Once the server is up and running, you can proceed to configure the Hadoop catal In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS: -| Configuration item | Description | Default value | Required | Since version | -|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `oss` if it's a OSS fileset, or a comma separated string that contains `oss` like `oss,gs,s3` to support multiple kinds of fileset including `oss`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for OSS, if we set this value, we can omit the prefix 'oss://' in the location. | `builtin-local` | No | 0.7.0-incubating | -| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | -| `oss-access-key-id` | The access key of the Aliyun OSS. 
| (none) | Yes | 0.7.0-incubating | -| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `oss` if it's a OSS fileset, or a comma separated string that contains `oss` like `oss,gs,s3` to support multiple kinds of fileset including `oss`. | (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for OSS, if we set this value, we can omit the prefix 'oss://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `credential-providers` | The credential provider types, separated by comma, possible value can be `oss-token`, `oss-secret-key`. As the default authentication type is using AKSK as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like AKSK to access OSS by GVFS. Once it's set, more configuration items are needed to make it works, please see [oss-credential-vending](security/credential-vending.md) | (none) | No | 0.8.0-incubating | + ### Configurations for a schema diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 8a6982af183..4cf3db5d8ac 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -28,13 +28,14 @@ Once the server is up and running, you can proceed to configure the Hadoop catal In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are necessary to configure a Hadoop catalog with S3: -| Configuration item | Description | Default value | Required | Since version | -|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `s3` if it's a S3 fileset, or a comma separated string that contains `s3` like `gs,s3` to support multiple kinds of fileset including `s3`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. 
Default value is `builtin-local`, for S3, if we set this value, we can omit the prefix 's3a://' in the location. | `builtin-local` | No | 0.7.0-incubating | -| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|--------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `s3` if it's a S3 fileset, or a comma separated string that contains `s3` like `gs,s3` to support multiple kinds of fileset including `s3`. | (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for S3, if we set this value, we can omit the prefix 's3a://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `credential-providers` | The credential provider types, separated by comma, possible value can be `s3-token`, `s3-secret-key`. As the default authentication type is using AKSK as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like AKSK to access S3 by GVFS. 
Once it's set, more configuration items are needed to make it works, please see [s3-credential-vending](security/credential-vending.md) | (none) | No | 0.8.0-incubating | ### Configurations for a schema From 54536d967542751a9006404654308ebe08e403f7 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 14 Jan 2025 12:04:45 +0800 Subject: [PATCH 33/39] fix --- docs/hadoop-catalog-with-adls.md | 14 +++++++------- docs/hadoop-catalog-with-gcs.md | 12 ++++++------ docs/hadoop-catalog-with-oss.md | 16 ++++++++-------- docs/hadoop-catalog-with-s3.md | 16 ++++++++-------- 4 files changed, 29 insertions(+), 29 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index b0e903946ab..4e1469bb378 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -28,13 +28,13 @@ Once the server is up and running, you can proceed to configure the Hadoop catal Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS: -| Configuration item | Description | Default value | Required | Since version | -|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `abs` if it's a Azure Blob Storage fileset, or a comma separated string that contains `abs` like `oss,abs,s3` to support multiple kinds of fileset including `abs`. | (none) | Yes | 0.8.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for Azure Blob Storage, if we set this value, we can omit the prefix 'abfss://' in the location. | `builtin-local` | No | 0.8.0-incubating | -| `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | -| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | -| `credential-providers` | The credential provider types, separated by comma, possible value can be `adls-token`, `azure-account-key`. As the default authentication type is using account name and account key as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like account_name/account_key to access ADLS by GVFS. 
Once it's set, more configuration items are needed to make it works, please see [adls-credential-vending](security/credential-vending.md) | (none) | No | 0.8.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `abs` if it's a Azure Blob Storage fileset, or a comma separated string that contains `abs` like `oss,abs,s3` to support multiple kinds of fileset including `abs`. | (none) | Yes | 0.8.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for Azure Blob Storage, if we set this value, we can omit the prefix 'abfss://' in the location. | `builtin-local` | No | 0.8.0-incubating | +| `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | +| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating | +| `credential-providers` | The credential provider types, separated by comma, possible value can be `adls-token`, `azure-account-key`. As the default authentication type is using account name and account key as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like account_name/account_key to access ADLS by GVFS. Once it's set, more configuration items are needed to make it works, please see [adls-credential-vending](security/credential-vending.md#adls-credentials) | (none) | No | 0.8.0-incubating | ### Configurations for a schema diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index db556594aa5..ada5e648a4e 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -27,12 +27,12 @@ Once the server is up and running, you can proceed to configure the Hadoop catal Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS: -| Configuration item | Description | Default value | Required | Since version | -|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| -| `filesystem-providers` | The file system providers to add. 
Set it to `gcs` if it's a GCS fileset, a comma separated string that contains `gcs` like `gcs,s3` to support multiple kinds of fileset including `gcs`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating | -| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes | 0.7.0-incubating | -| `credential-providers` | The credential provider types, separated by comma, possible value can be `gcs-token`. As the default authentication type is using service account as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like service account to access GCS by GVFS. Once it's set, more configuration items are needed to make it works, please see [gcs-credential-vending](security/credential-vending.md) | (none) | No | 0.8.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `gcs` if it's a GCS fileset, a comma separated string that contains `gcs` like `gcs,s3` to support multiple kinds of fileset including `gcs`. | (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes | 0.7.0-incubating | +| `credential-providers` | The credential provider types, separated by comma, possible value can be `gcs-token`. As the default authentication type is using service account as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like service account to access GCS by GVFS. 
Once it's set, more configuration items are needed to make it works, please see [gcs-credential-vending](security/credential-vending.md#gcs-credentials) | (none) | No | 0.8.0-incubating | ### Configurations for a schema diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 2d542fc63e6..3ce4406e8e3 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -28,14 +28,14 @@ Once the server is up and running, you can proceed to configure the Hadoop catal In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with OSS: -| Configuration item | Description | Default value | Required | Since version | -|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `oss` if it's a OSS fileset, or a comma separated string that contains `oss` like `oss,gs,s3` to support multiple kinds of fileset including `oss`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for OSS, if we set this value, we can omit the prefix 'oss://' in the location. | `builtin-local` | No | 0.7.0-incubating | -| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | -| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | -| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | -| `credential-providers` | The credential provider types, separated by comma, possible value can be `oss-token`, `oss-secret-key`. As the default authentication type is using AKSK as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like AKSK to access OSS by GVFS. Once it's set, more configuration items are needed to make it works, please see [oss-credential-vending](security/credential-vending.md) | (none) | No | 0.8.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|--------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. 
Set it to `oss` if it's a OSS fileset, or a comma separated string that contains `oss` like `oss,gs,s3` to support multiple kinds of fileset including `oss`. | (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for OSS, if we set this value, we can omit the prefix 'oss://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `credential-providers` | The credential provider types, separated by comma, possible value can be `oss-token`, `oss-secret-key`. As the default authentication type is using AKSK as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like AKSK to access OSS by GVFS. Once it's set, more configuration items are needed to make it works, please see [oss-credential-vending](security/credential-vending.md#oss-credentials) | (none) | No | 0.8.0-incubating | ### Configurations for a schema diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 4cf3db5d8ac..4546ff4fb56 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -28,14 +28,14 @@ Once the server is up and running, you can proceed to configure the Hadoop catal In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are necessary to configure a Hadoop catalog with S3: -| Configuration item | Description | Default value | Required | Since version | -|--------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `s3` if it's a S3 fileset, or a comma separated string that contains `s3` like `gs,s3` to support multiple kinds of fileset including `s3`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for S3, if we set this value, we can omit the prefix 's3a://' in the location. | `builtin-local` | No | 0.7.0-incubating | -| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `credential-providers` | The credential provider types, separated by comma, possible value can be `s3-token`, `s3-secret-key`. 
As the default authentication type is using AKSK as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like AKSK to access S3 by GVFS. Once it's set, more configuration items are needed to make it works, please see [s3-credential-vending](security/credential-vending.md) | (none) | No | 0.8.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `s3` if it's a S3 fileset, or a comma separated string that contains `s3` like `gs,s3` to support multiple kinds of fileset including `s3`. | (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for S3, if we set this value, we can omit the prefix 's3a://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `credential-providers` | The credential provider types, separated by comma, possible value can be `s3-token`, `s3-secret-key`. As the default authentication type is using AKSK as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like AKSK to access S3 by GVFS. Once it's set, more configuration items are needed to make it works, please see [s3-credential-vending](security/credential-vending.md#s3-credentials) | (none) | No | 0.8.0-incubating | ### Configurations for a schema From 1971ba17ee2728039d1d7f9806a01f5f270e107b Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 14 Jan 2025 14:31:46 +0800 Subject: [PATCH 34/39] Resolve python code indent and fix table format problem. 
--- docs/hadoop-catalog-with-adls.md | 11 ++++------- docs/hadoop-catalog-with-gcs.md | 11 ++++------- docs/hadoop-catalog-with-oss.md | 21 +++++++++------------ docs/hadoop-catalog-with-s3.md | 13 +++++-------- 4 files changed, 22 insertions(+), 34 deletions(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 4e1469bb378..2f390d7faee 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -110,11 +110,10 @@ adls_properties = { } adls_properties = gravitino_client.create_catalog(name="example_catalog", - type=Catalog.Type.FILESET, - provider="hadoop", - comment="This is a ADLS fileset catalog", - properties=adls_properties) - + type=Catalog.Type.FILESET, + provider="hadoop", + comment="This is a ADLS fileset catalog", + properties=adls_properties) ``` @@ -320,8 +319,6 @@ pip install apache-gravitino==${GRAVITINO_VERSION} Then you can run the following code: ```python -import logging -from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient from pyspark.sql import SparkSession import os diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md index ada5e648a4e..a3eb034b4fe 100644 --- a/docs/hadoop-catalog-with-gcs.md +++ b/docs/hadoop-catalog-with-gcs.md @@ -105,11 +105,10 @@ gcs_properties = { } gcs_properties = gravitino_client.create_catalog(name="test_catalog", - type=Catalog.Type.FILESET, - provider="hadoop", - comment="This is a GCS fileset catalog", - properties=gcs_properties) - + type=Catalog.Type.FILESET, + provider="hadoop", + comment="This is a GCS fileset catalog", + properties=gcs_properties) ``` @@ -311,8 +310,6 @@ pip install apache-gravitino==${GRAVITINO_VERSION} Then you can run the following code: ```python -import logging -from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient from pyspark.sql import SparkSession import os diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md index 3ce4406e8e3..e63935c720a 100644 --- a/docs/hadoop-catalog-with-oss.md +++ b/docs/hadoop-catalog-with-oss.md @@ -114,11 +114,10 @@ oss_properties = { } oss_catalog = gravitino_client.create_catalog(name="test_catalog", - type=Catalog.Type.FILESET, - provider="hadoop", - comment="This is a OSS fileset catalog", - properties=oss_properties) - + type=Catalog.Type.FILESET, + provider="hadoop", + comment="This is a OSS fileset catalog", + properties=oss_properties) ``` @@ -327,8 +326,6 @@ pip install apache-gravitino==${GRAVITINO_VERSION} Then you can run the following code: ```python -import logging -from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient from pyspark.sql import SparkSession import os @@ -440,11 +437,11 @@ For OSS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-versi In order to access fileset with OSS using the GVFS Python client, apart from [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: -| Configuration item | Description | Default value | Required | Since version | -|----------------------------|-----------------------------------|---------------|----------|------------------| -| `oss_endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | -| `oss_access_key_id` | The access key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | -| `oss_secret_access_key` | The secret key of the Aliyun OSS. 
| (none) | Yes | 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|-------------------------|-----------------------------------|---------------|----------|------------------| +| `oss_endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss_access_key_id` | The access key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | +| `oss_secret_access_key` | The secret key of the Aliyun OSS. | (none) | Yes | 0.7.0-incubating | :::note If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 4546ff4fb56..466bdfdb7fc 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -118,7 +118,6 @@ s3_catalog = gravitino_client.create_catalog(name="test_catalog", provider="hadoop", comment="This is a S3 fileset catalog", properties=s3_properties) - ``` @@ -331,8 +330,6 @@ pip install apache-gravitino==${GRAVITINO_VERSION} Then you can run the following code: ```python -import logging -from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient from pyspark.sql import SparkSession import os @@ -443,11 +440,11 @@ For S3, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-versio In order to access fileset with S3 using the GVFS Python client, apart from [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: -| Configuration item | Description | Default value | Required | Since version | -|----------------------------|-------------------------------|---------------|----------|------------------| -| `s3_endpoint` | The endpoint of the AWS S3. | (none) | No | 0.7.0-incubating | -| `s3_access_key_id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `s3_secret_access_key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|------------------------|-------------------------------|---------------|----------|------------------| +| `s3_endpoint` | The endpoint of the AWS S3. | (none) | No | 0.7.0-incubating | +| `s3_access_key_id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3_secret_access_key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | :::note - `s3_endpoint` is an optional configuration for AWS S3, however, it is required for other S3-compatible storage services like MinIO. 
From 30f4271b896c18372259c36e01443e70fabf482d Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 14 Jan 2025 14:46:56 +0800 Subject: [PATCH 35/39] Fix incompleted description about endpoint for S3 --- docs/hadoop-catalog-with-s3.md | 38 +++++++++++++++++----------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md index 466bdfdb7fc..7d56f2b9ab8 100644 --- a/docs/hadoop-catalog-with-s3.md +++ b/docs/hadoop-catalog-with-s3.md @@ -28,14 +28,14 @@ Once the server is up and running, you can proceed to configure the Hadoop catal In addition to the basic configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are necessary to configure a Hadoop catalog with S3: -| Configuration item | Description | Default value | Required | Since version | -|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| -| `filesystem-providers` | The file system providers to add. Set it to `s3` if it's a S3 fileset, or a comma separated string that contains `s3` like `gs,s3` to support multiple kinds of fileset including `s3`. | (none) | Yes | 0.7.0-incubating | -| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for S3, if we set this value, we can omit the prefix 's3a://' in the location. | `builtin-local` | No | 0.7.0-incubating | -| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `credential-providers` | The credential provider types, separated by comma, possible value can be `s3-token`, `s3-secret-key`. As the default authentication type is using AKSK as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like AKSK to access S3 by GVFS. 
Once it's set, more configuration items are needed to make it works, please see [s3-credential-vending](security/credential-vending.md#s3-credentials) | (none) | No | 0.8.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------| +| `filesystem-providers` | The file system providers to add. Set it to `s3` if it's a S3 fileset, or a comma separated string that contains `s3` like `gs,s3` to support multiple kinds of fileset including `s3`. | (none) | Yes | 0.7.0-incubating | +| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for S3, if we set this value, we can omit the prefix 's3a://' in the location. | `builtin-local` | No | 0.7.0-incubating | +| `s3-endpoint` | The endpoint of the AWS S3. This configuration is optional for S3 service, but required for other S3-compatible storage services like MinIO. | (none) | No | 0.7.0-incubating | +| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `credential-providers` | The credential provider types, separated by comma, possible value can be `s3-token`, `s3-secret-key`. As the default authentication type is using AKSK as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like AKSK to access S3 by GVFS. Once it's set, more configuration items are needed to make it works, please see [s3-credential-vending](security/credential-vending.md#s3-credentials) | (none) | No | 0.8.0-incubating | ### Configurations for a schema @@ -245,11 +245,11 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "e To access fileset with S3 using the GVFS Java client, based on the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: -| Configuration item | Description | Default value | Required | Since version | -|------------------------|-------------------------------|---------------|----------|------------------| -| `s3-endpoint` | The endpoint of the AWS S3. | (none) | No | 0.7.0-incubating | -| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|---------------|----------|------------------| +| `s3-endpoint` | The endpoint of the AWS S3. 
This configuration is optional for S3 service, but required for other S3-compatible storage services like MinIO. | (none) | No | 0.7.0-incubating | +| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | :::note - `s3-endpoint` is an optional configuration for AWS S3, however, it is required for other S3-compatible storage services like MinIO. @@ -440,11 +440,11 @@ For S3, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-versio In order to access fileset with S3 using the GVFS Python client, apart from [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations: -| Configuration item | Description | Default value | Required | Since version | -|------------------------|-------------------------------|---------------|----------|------------------| -| `s3_endpoint` | The endpoint of the AWS S3. | (none) | No | 0.7.0-incubating | -| `s3_access_key_id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | -| `s3_secret_access_key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| Configuration item | Description | Default value | Required | Since version | +|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|---------------|----------|------------------| +| `s3_endpoint` | The endpoint of the AWS S3. This configuration is optional for S3 service, but required for other S3-compatible storage services like MinIO. | (none) | No | 0.7.0-incubating | +| `s3_access_key_id` | The access key of the AWS S3. | (none) | Yes | 0.7.0-incubating | +| `s3_secret_access_key` | The secret key of the AWS S3. | (none) | Yes | 0.7.0-incubating | :::note - `s3_endpoint` is an optional configuration for AWS S3, however, it is required for other S3-compatible storage services like MinIO. @@ -525,7 +525,7 @@ Spark: ```python spark = SparkSession.builder - .appName("s3_fielset_test") + .appName("s3_fileset_test") .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") From 65c171c64d519454554324232862ec77ec559423 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 14 Jan 2025 14:52:33 +0800 Subject: [PATCH 36/39] Optimize ADLS descriptions --- docs/hadoop-catalog-with-adls.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md index 2f390d7faee..96126c6fab9 100644 --- a/docs/hadoop-catalog-with-adls.md +++ b/docs/hadoop-catalog-with-adls.md @@ -6,7 +6,7 @@ keyword: Hadoop catalog ADLS license: "This software is licensed under the Apache License version 2." --- -This document describes how to configure a Hadoop catalog with ADLS (Azure Blob Storage). +This document describes how to configure a Hadoop catalog with ADLS (aka. Azure Blob Storage (ABS), or Azure Data Lake Storage (v2)). ## Prerequisites From 1e155d42cdc706562a27c67f04bb698ede6be4fe Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 14 Jan 2025 16:08:23 +0800 Subject: [PATCH 37/39] Fix the problem in #5737 that does not change azure account-name and account key in python client accordingly. 
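With the renamed keys below, GVFS Python client options are expected to use `azure_storage_account_name` and `azure_storage_account_key`. A minimal, hypothetical usage sketch (the server URI, metalake, fileset path, and account values are placeholders):

```python
from gravitino import gvfs

# Placeholder account values -- the key names match the renamed constants in this change.
options = {
    "auth_type": "simple",
    "azure_storage_account_name": "my_storage_account",
    "azure_storage_account_key": "my_storage_account_key",
}

fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="metalake",
    options=options,
    skip_instance_cache=True,
)

fs.ls("gvfs://fileset/adls_catalog/test_schema/example_fileset/")
```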
--- clients/client-python/gravitino/filesystem/gvfs_config.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/clients/client-python/gravitino/filesystem/gvfs_config.py b/clients/client-python/gravitino/filesystem/gvfs_config.py index 6fbd8a99d18..34db72adee0 100644 --- a/clients/client-python/gravitino/filesystem/gvfs_config.py +++ b/clients/client-python/gravitino/filesystem/gvfs_config.py @@ -42,8 +42,8 @@ class GVFSConfig: GVFS_FILESYSTEM_OSS_SECRET_KEY = "oss_secret_access_key" GVFS_FILESYSTEM_OSS_ENDPOINT = "oss_endpoint" - GVFS_FILESYSTEM_AZURE_ACCOUNT_NAME = "abs_account_name" - GVFS_FILESYSTEM_AZURE_ACCOUNT_KEY = "abs_account_key" + GVFS_FILESYSTEM_AZURE_ACCOUNT_NAME = "azure_storage_account_name" + GVFS_FILESYSTEM_AZURE_ACCOUNT_KEY = "azure_storage_account_key" # This configuration marks the expired time of the credential. For instance, if the credential # fetched from Gravitino server has expired time of 3600 seconds, and the credential_expired_time_ration is 0.5 From d2d2de3d41ed2dbcd56b8a576238980b18b8fd97 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 14 Jan 2025 17:05:53 +0800 Subject: [PATCH 38/39] fix --- docs/hadoop-catalog-index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hadoop-catalog-index.md b/docs/hadoop-catalog-index.md index f5c06607f02..96266af5bba 100644 --- a/docs/hadoop-catalog-index.md +++ b/docs/hadoop-catalog-index.md @@ -12,7 +12,7 @@ Gravitino Hadoop catalog index includes the following chapters: - [Hadoop catalog overview and features](./hadoop-catalog.md): This chapter provides an overview of the Hadoop catalog, its features, capabilities and related configurations. - [Manage Hadoop catalog with Gravitino API](./manage-fileset-metadata-using-gravitino.md): This chapter explains how to manage fileset metadata using Gravitino API and provides detailed examples. -- [Using Hadoop catalog with Gravitino virtual System](how-to-use-gvfs.md): This chapter explains how to use Hadoop catalog with Gravitino virtual System and provides detailed examples. +- [Using Hadoop catalog with Gravitino virtual System](how-to-use-gvfs.md): This chapter explains how to use Hadoop catalog with the Gravitino virtual file system and provides detailed examples. ### Hadoop catalog with cloud storage From e01b201caf39947d62837af3cb68778aea3de07e Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 14 Jan 2025 17:29:52 +0800 Subject: [PATCH 39/39] fix again --- docs/hadoop-catalog-index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hadoop-catalog-index.md b/docs/hadoop-catalog-index.md index 96266af5bba..dfa7a187175 100644 --- a/docs/hadoop-catalog-index.md +++ b/docs/hadoop-catalog-index.md @@ -12,7 +12,7 @@ Gravitino Hadoop catalog index includes the following chapters: - [Hadoop catalog overview and features](./hadoop-catalog.md): This chapter provides an overview of the Hadoop catalog, its features, capabilities and related configurations. - [Manage Hadoop catalog with Gravitino API](./manage-fileset-metadata-using-gravitino.md): This chapter explains how to manage fileset metadata using Gravitino API and provides detailed examples. -- [Using Hadoop catalog with Gravitino virtual System](how-to-use-gvfs.md): This chapter explains how to use Hadoop catalog with the Gravitino virtual file system and provides detailed examples. 
+- [Using Hadoop catalog with Gravitino virtual file system](how-to-use-gvfs.md): This chapter explains how to use the Hadoop catalog with the Gravitino virtual file system and provides detailed examples.

### Hadoop catalog with cloud storage
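For reference, a hedged PySpark sketch built around the GVFS settings shown in the PATCH 35 context (including the corrected `s3_fileset_test` app name). The jar paths, metalake name, and credential values are placeholders, and the `spark.hadoop.fs.gravitino.client.metalake` / `spark.hadoop.s3-*` keys are assumed from the full snippet in `hadoop-catalog-with-s3.md`, not part of these patches:

```python
import os

from pyspark.sql import SparkSession

# Placeholder jar locations -- adjust to your Gravitino version and local paths.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /path/to/gravitino-aws-bundle.jar,"
    "/path/to/gravitino-filesystem-hadoop3-runtime.jar pyspark-shell"
)

spark = (
    SparkSession.builder
    .appName("s3_fileset_test")
    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090")
    .config("spark.hadoop.fs.gravitino.client.metalake", "metalake")
    .config("spark.hadoop.s3-access-key-id", "access_key")
    .config("spark.hadoop.s3-secret-access-key", "secret_key")
    .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
    .getOrCreate()
)

# Read a text file from the fileset's virtual location (placeholder path).
df = spark.read.text("gvfs://fileset/test_catalog/test_schema/example_fileset/example.txt")
df.show()
```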