
[#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. #6059

Merged — 44 commits merged into apache:main on Jan 14, 2025

Conversation

@yuqi1129 (Contributor) commented Jan 2, 2025

What changes were proposed in this pull request?

  1. Add a full example of how to use cloud storage filesets such as S3, GCS, OSS and ADLS.
  2. Polish how-to-use-gvfs.md and hadoop-catalog.md.
  3. Add documentation on how filesets use credentials.

Why are the changes needed?

For better user experience.

Fix: #5472

Does this PR introduce any user-facing change?

N/A.

How was this patch tested?

N/A

@yuqi1129 yuqi1129 self-assigned this Jan 2, 2025
@yuqi1129 yuqi1129 requested a review from jerryshao January 2, 2025 11:42
docs/cloud-storage-fileset-example.md (8 resolved review threads)
@jerqi jerqi changed the title [#5472] improvement(docs): Add example to use cloud stroage fileset and polish hadoop-catalog document. [#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. Jan 3, 2025
docs/cloud-storage-fileset-example.md (1 resolved review thread)
docs/hadoop-catalog-with-gcs.md (9 review threads, mostly resolved)
@yuqi1129 yuqi1129 requested review from tengqm and FANNG1 and removed request for tengqm January 6, 2025 03:02
docs/hadoop-catalog-with-gcs.md (3 more resolved review threads)
## Prerequisites

In order to create a Hadoop catalog with S3, you need to place [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) in the Gravitino Hadoop classpath located
at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start the Gravitino server with the following command:
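The startup command itself was elided in this excerpt. A minimal sketch of the two steps under discussion, assuming a standard Gravitino binary distribution (the `gravitino.sh` script name, the `${GRAVITINO_HOME}` layout, and the jar path are assumptions to verify against your installation — note the reviewers below question the correct jar directory):

```shell
# Hypothetical sketch: replace <gravitino-version> with your release.
# The destination directory follows the excerpt above; the review thread
# below suggests ${GRAVITINO_HOME}/catalogs/hadoop/libs instead.
cp gravitino-aws-bundle-<gravitino-version>.jar \
   ${HADOOP_HOME}/share/hadoop/common/lib/

# Start the Gravitino server (script name assumed from the standard layout).
${GRAVITINO_HOME}/bin/gravitino.sh start
```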
Contributor

"in Gravitino Hadoop classpath located at ${HADOOP_HOME}/share/hadoop/common/lib/" — should this use the Hadoop catalog classpath rather than the Hadoop classpath?

Contributor Author

done

Contributor

The user should place the jar in ${GRAVITINO_HOME}/catalogs/hadoop/libs, not ${HADOOP_HOME}/share/hadoop/common/lib/, yes?

Contributor Author

fixed

Contributor

seems not fixed

Contributor Author

I have confirmed that this has been fixed; could you please refresh the web page and check whether it has been resolved?

docs/hadoop-catalog-with-s3.md (7 review threads, mostly resolved)
docs/how-to-use-gvfs.md (3 resolved review threads)
@FANNG1 (Contributor) commented Jan 13, 2025

IMO, credential vending is the critical part of the fileset examples for cloud storage, but right now it sits in an inconspicuous place :(

Do you have any suggestions? The credential part already has a Header 2. Do you mean I should move this part forward?

Instead of placing it in a separate section at the end, which makes it seem optional and unimportant, place it in the catalog properties section (like `### Configurations for S3 Hadoop Catalog`), and have most examples use credential vending by default.
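To make the suggestion concrete, the catalog-creation properties could carry the credential configuration inline. This is a hypothetical sketch only: the property names (`filesystem-providers`, `s3-endpoint`, `s3-access-key-id`, `s3-secret-access-key`, `credential-providers`) and values are assumptions recalled from Gravitino's Hadoop-catalog and credential-vending docs, and should be verified there:

```json
{
  "name": "s3_catalog",
  "type": "FILESET",
  "provider": "hadoop",
  "properties": {
    "location": "s3a://my-bucket/catalog",
    "filesystem-providers": "s3",
    "s3-endpoint": "https://s3.ap-northeast-1.amazonaws.com",
    "s3-access-key-id": "<access-key>",
    "s3-secret-access-key": "<secret-key>",
    "credential-providers": "s3-token"
  }
}
```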

@yuqi1129 (Contributor Author)

> IMO, credential vending is the critical part for the fileset use example for cloud storage, but now it lays in some inconspicuous places :(
>
> Do you have any suggestions? The credential part is already with Header 2. do you mean I should move forward this part?
>
> Instead of placing it in a separate part at the last which seems optional and not important, place it in the catalog properties part like ### Configurations for S3 Hadoop Catalog, for most examples use credential vending by default.

  • I don't believe putting them last means they are not important.
  • The configuration items for the Hadoop catalog are in hadoop-catalog.md, and those for GVFS are in credential-vending.md. Do I need to make a copy here?

@yuqi1129 (Contributor Author), quoting the same exchange again, added:

Other points:

  • Credential vending may be quite an advanced feature, and its usage requires caution. I'm afraid this will make the example difficult for new users.
  • I'm not very sure whether we need to put it in the basic example, just like the S3 example in hadoop-catalog.md, which is intended for HDFS.

If you have any further thoughts on this, please let me know.

@mchades (Contributor) commented Jan 13, 2025

> Credential vending may be quite an advanced feature

Strongly agree that credential vending is an advanced feature. We can distinguish it from the simple examples, using different sections or even separate documents; otherwise we will make the simple examples not simple.

license: "This software is licensed under the Apache License version 2."
---

This document describes how to configure a Hadoop catalog with ADLS (Azure Blob Storage).
Contributor

ADLS is based on Azure Blob Storage, but it is not Azure Blob Storage.

Contributor Author

Here, we support Azure Blob Storage, ADLS, and ADLS Gen2. I used "Azure Blob Storage" for all of them, but there is no common abbreviation for "Azure Blob Storage", so I used ADLS to stand for the storage services provided by Azure.

Is there a good term to describe them all?

Contributor

seems ADLS is enough

Contributor Author

I used this sentence to replace it: ADLS (aka Azure Blob Storage (ABS), or Azure Data Lake Storage (v2)).

Contributor

ADLS can't represent ABS; IMO, the Azure Hadoop connector only supports ADLS.

@FANNG1 (Contributor) commented Jan 14, 2025

LGTM except minor comments

@FANNG1 (Contributor) commented Jan 14, 2025

@jerryshao @mchades any other comments?

@jerryshao (Contributor)

I don't have further comments; @mchades can also take a review.

FANNG1 previously approved these changes Jan 14, 2025

- [Hadoop catalog overview and features](./hadoop-catalog.md): This chapter provides an overview of the Hadoop catalog, its features, capabilities and related configurations.
- [Manage Hadoop catalog with Gravitino API](./manage-fileset-metadata-using-gravitino.md): This chapter explains how to manage fileset metadata using Gravitino API and provides detailed examples.
- [Using Hadoop catalog with Gravitino virtual System](how-to-use-gvfs.md): This chapter explains how to use Hadoop catalog with Gravitino virtual System and provides detailed examples.
Contributor

Gravitino virtual System -> Gravitino virtual file system

@FANNG1 FANNG1 merged commit 5caa9de into apache:main Jan 14, 2025
25 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 14, 2025
…nd polish hadoop-catalog document. (#6059)

Labels
branch-0.8 Automatically cherry-pick commit to branch-0.8
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Improvement] Improve document about how to use S3, OSS, GCS and ADLS (ABS) filesets
5 participants