[#596] feat(hive): Hive catalog supports to impersonate users to execute operations in simple mode. #1450

Merged
merged 33 commits into from
Jan 24, 2024

@qqqttt123 qqqttt123 commented Jan 11, 2024

What changes were proposed in this pull request?

The Hive catalog now supports impersonating users when executing operations in simple mode.
For Kerberos mode, I have created a new issue and will finish it in a later pull request.
We use a Hive client cache pool modeled on the Iceberg cache pool, with the user name as the cache key.
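
A minimal sketch of the idea of a client pool cache keyed by user name, assuming nothing about the actual Gravitino or Iceberg classes. `PerUserPoolCache` and `poolFor` are hypothetical names, and the time-based eviction discussed later in this conversation is left out for brevity:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical sketch: caches one client pool per user name. */
final class PerUserPoolCache<P> {
  private final ConcurrentMap<String, P> pools = new ConcurrentHashMap<>();
  private final Function<String, P> poolFactory;

  PerUserPoolCache(Function<String, P> poolFactory) {
    this.poolFactory = poolFactory;
  }

  /** Returns the pool for the given user, creating it on first use. */
  P poolFor(String userName) {
    return pools.computeIfAbsent(userName, poolFactory);
  }

  /** Number of distinct users that currently have a pool. */
  int size() {
    return pools.size();
  }
}
```

With this shape, two requests from the same user share one pool, while a different user gets a separate pool and therefore separate connections opened under that user's identity.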

Why are the changes needed?

Fix: #596

Does this PR introduce any user-facing change?

Yes, we will add a new document.

How was this patch tested?

Add a new IT

@qqqttt123 qqqttt123 marked this pull request as draft January 11, 2024 07:01
@qqqttt123 qqqttt123 changed the title [#596] feat(hive): Hive catalog supports to impersonate to execute operations in simple mode. [#596] feat(hive): Hive catalog supports to impersonate users to execute operations in simple mode. Jan 11, 2024
@BeforeAll
public static void startIntegrationTest() throws Exception {
  originHadoopUser = System.getenv(HADOOP_USER_NAME);
  setEnv(HADOOP_USER_NAME, null);
Contributor Author

We should unset HADOOP_USER_NAME; otherwise it will still be used as the user when we run in unsecured mode.

Contributor

What is the unsecured mode and does the user know about it?

Contributor Author

What is the unsecured mode and does the user know about it?

It's a Hadoop concept. Unsecured mode is the same as simple mode.

Contributor Author

This is actually hard-coded behavior in the HiveClient.
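
To illustrate why the test unsets the variable, here is a simplified model of how the effective user could be resolved in simple mode. It assumes the environment variable is consulted first, then a system property of the same name, then the OS user; this is a sketch of that lookup order, not Hadoop's or the HiveClient's actual code, and `SimpleModeUserResolver` is a hypothetical name:

```java
import java.util.Map;

/** Hypothetical model of user resolution in simple (unsecured) mode. */
final class SimpleModeUserResolver {
  static final String HADOOP_USER_NAME = "HADOOP_USER_NAME";

  /**
   * Returns the user name to act as. The maps stand in for the process
   * environment and the JVM system properties so the logic is easy to test.
   */
  static String resolve(
      Map<String, String> env, Map<String, String> sysProps, String osUser) {
    String fromEnv = env.get(HADOOP_USER_NAME);
    if (fromEnv != null) {
      return fromEnv; // assumed to take precedence over everything else
    }
    String fromProp = sysProps.get(HADOOP_USER_NAME);
    if (fromProp != null) {
      return fromProp;
    }
    return osUser; // fall back to the user running the process
  }
}
```

Under this model, a leftover HADOOP_USER_NAME pins every call to one identity, so an impersonation test could not observe the per-request user.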

@qqqttt123 qqqttt123 marked this pull request as ready for review January 11, 2024 13:36
@qqqttt123 qqqttt123 requested review from jerryshao, yuqi1129, xunliu, FANNG1, diqiu50, mchades and Clearvive and removed request for jerryshao January 11, 2024 13:37
@qqqttt123
Contributor Author

@mchades @jerryshao @yuqi1129 Could you help me review this PR?

@qqqttt123
Contributor Author

@FANNG1 @jerryshao @yuqi1129 @diqiu50 @Clearvive Could you help me review this PR?

*
* <p>A ClientPool that caches the underlying HiveClientPool instances.
*/
public class CachedClientPool implements ClientPool<IMetaStoreClient, TException> {
Contributor

What's the difference between CachedClientPool and HiveClientPool?

Contributor Author

See the class comment: "A ClientPool that caches the underlying HiveClientPool instances."

"client.pool.cache.eviction-interval-ms";

public static final long DEFAULT_CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS =
TimeUnit.MINUTES.toMillis(5);;
Contributor

Extra semicolon here.

Contributor Author

Removed.

static Key extractKey() {
  List<Object> elements = Lists.newArrayList();
  try {
    elements.add(UserGroupInformation.getCurrentUser().getUserName());
Contributor

I'm not sure why the Key contains multiple user names. Shouldn't each user have their own client pool?

Contributor Author

@qqqttt123 qqqttt123 Jan 17, 2024

The elements list contains only one user name. We use a list because Iceberg does: its key also holds other config options. For Gravitino, the cache key is only the user name, but I kept the list so that it is easy to add other key components later.
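
A minimal sketch of such an extensible key, assuming a plain value class rather than the generated immutable used in the PR. `PoolKey` and its `of` factory are hypothetical names:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical cache key: a user name plus optional extra components. */
final class PoolKey {
  private final List<Object> elements;

  private PoolKey(List<Object> elements) {
    // Defensive copy so the key cannot change after it is used in a map.
    this.elements = Collections.unmodifiableList(new ArrayList<>(elements));
  }

  /** Builds a key from the user name and any additional key components. */
  static PoolKey of(String userName, Object... extraElements) {
    List<Object> elements = new ArrayList<>();
    elements.add(userName);
    Collections.addAll(elements, extraElements);
    return new PoolKey(elements);
  }

  List<Object> elements() {
    return elements;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof PoolKey && elements.equals(((PoolKey) o).elements);
  }

  @Override
  public int hashCode() {
    return elements.hashCode();
  }
}
```

Today every key would hold a single element, so each user still maps to exactly one pool; adding a component later only changes what is passed to `of`.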

import java.lang.reflect.Proxy;
import java.util.Collections;

public class CatalogOperationsProxy implements InvocationHandler {
Contributor

You should include Javadoc explaining how you use the proxy mechanism here.

Contributor Author

Fixed.
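
For readers unfamiliar with the mechanism, here is a rough sketch of a JDK dynamic proxy that routes every call through a `doAs`-style plugin. It is not the PR's `CatalogOperationsProxy`: the interfaces `DemoOperations` and `DemoProxyPlugin` are hypothetical, and the user name is passed in explicitly instead of being read from the current principal:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.concurrent.Callable;

/** Hypothetical stand-in for the catalog operations interface. */
interface DemoOperations {
  String createSchema(String name);
}

/** Hypothetical plugin that runs an action on behalf of a user. */
interface DemoProxyPlugin {
  Object doAs(String userName, Callable<Object> action) throws Exception;
}

/** Invocation handler that wraps every interface call in plugin.doAs. */
final class DemoOperationsProxy implements InvocationHandler {
  private final Object target;
  private final DemoProxyPlugin plugin;
  private final String userName;

  private DemoOperationsProxy(Object target, DemoProxyPlugin plugin, String userName) {
    this.target = target;
    this.plugin = plugin;
    this.userName = userName;
  }

  @Override
  public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
    // Delegate to the real object, but inside the plugin's doAs scope.
    return plugin.doAs(userName, () -> method.invoke(target, args));
  }

  /** Creates a proxy that implements iface and forwards to target. */
  static <T> T createProxy(Class<T> iface, T target, DemoProxyPlugin plugin, String userName) {
    Object proxy =
        Proxy.newProxyInstance(
            iface.getClassLoader(),
            new Class<?>[] {iface},
            new DemoOperationsProxy(target, plugin, userName));
    return iface.cast(proxy);
  }
}
```

Callers keep using the interface type and never see the handler. Note that exceptions thrown by the target surface wrapped in `InvocationTargetException` in this sketch and would need unwrapping in real code.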

import java.security.Principal;
import java.util.Map;

public interface CatalogProxyPlugin {
Contributor

Please add class-level and method-level Javadoc.

Contributor Author

Fixed.

public interface CatalogProxyPlugin {
  Object doAs(
      Principal principal, Executable<Object, Exception> action, Map<String, String> properties)
      throws Throwable;
Contributor

Can we use a more detailed exception class?

Contributor Author

No. We can't.

| `metastore.uris` | The Hive metastore service URIs, separate multiple addresses with commas. Such as `thrift://127.0.0.1:9083` | (none) | Yes | 0.2.0 |
| `client.pool-size` | The maximum number of Hive metastore clients in the pool for Gravitino. | 1 | No | 0.2.0 |
| `gravitino.bypass.` | Property name with this prefix passed down to the underlying HMS client for use. Such as `gravitino.bypass.hive.metastore.failure.retries = 3` indicate 3 times of retries upon failure of Thrift metastore calls | (none) | No | 0.2.0 |
| `client.pool.cache.eviction-interval-ms` | The cache pool eviction interval. | | No | 0.4.0 |
Contributor

The default value of client.pool.cache.eviction-interval-ms is empty.

Contributor Author

Fixed.
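
As an illustration of how the documented default could be applied, here is a small sketch that reads the property and falls back to the five-minute default shown in the reviewed snippet. The class and method names are hypothetical, and the value that finally shipped may differ:

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for reading the cache eviction interval. */
final class ClientPoolCacheConfig {
  static final String CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS =
      "client.pool.cache.eviction-interval-ms";

  // Default taken from the snippet under review: 5 minutes, in milliseconds.
  static final long DEFAULT_CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS =
      TimeUnit.MINUTES.toMillis(5);

  /** Returns the configured interval, or the default when the key is absent. */
  static long evictionIntervalMs(Map<String, String> properties) {
    String raw = properties.get(CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS);
    if (raw == null) {
      return DEFAULT_CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS;
    }
    return Long.parseLong(raw.trim());
  }
}
```

Under that assumption, the default column of the table would read 300000.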

} else if (innerException instanceof InvocationTargetException) {
  throw innerException.getCause();
} else {
  throw e.getCause();
Contributor

throw innerException

Contributor Author

Fixed.
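
The general pattern behind this exchange is peeling off reflective wrapper exceptions so callers see the original failure. A generic sketch, not the code that was merged, with `ReflectiveExceptions.unwrap` as a hypothetical helper:

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.UndeclaredThrowableException;

/** Hypothetical helper that strips reflective wrapper exceptions. */
final class ReflectiveExceptions {
  /**
   * Follows the cause chain while the current throwable is only a wrapper
   * added by reflection or by a dynamic proxy, and returns the first
   * throwable that is not such a wrapper.
   */
  static Throwable unwrap(Throwable t) {
    Throwable current = t;
    while ((current instanceof InvocationTargetException
            || current instanceof UndeclaredThrowableException)
        && current.getCause() != null) {
      current = current.getCause();
    }
    return current;
  }
}
```

An invocation handler could then `throw ReflectiveExceptions.unwrap(e)` from its catch block. A throwable that is not a wrapper is returned unchanged, which matches the reviewer's point about throwing the inner exception itself rather than its cause.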

return CatalogOperationsProxy.getProxy(ops, plugin);
}

protected CatalogProxyPlugin newProxyPlugin(Map<String, String> config) {
Contributor

Can we make CatalogProxyPlugin invisible to concrete catalogs? Making it a method of the catalog seems too invasive.

Contributor Author

Actually, every catalog should be proxied, so I chose to move CatalogProxyPlugin here.

@qqqttt123 qqqttt123 reopened this Jan 23, 2024
@@ -17,6 +17,9 @@ dependencies {
compileOnly(libs.lombok)
annotationProcessor(libs.lombok)

compileOnly(libs.immutables.value)
annotationProcessor(libs.immutables.value)
Contributor

Maybe you should also update license.bin?

Contributor Author

Fixed.

@@ -87,7 +87,13 @@ public CatalogOperations ops() {
if (ops == null) {
Preconditions.checkArgument(
entity != null && conf != null, "entity and conf must be set before calling ops()");
ops = newOps(conf);
CatalogOperations newOps = newOps(conf);
Contributor

If the partition API is merged, how do you handle that case?

Contributor Author

Maybe we will have TableOperations. If so, we will create a proxy object for it.

Contributor

I think you have to figure out a solution with @mchades ASAP, before the interface settles down.

Contributor Author

On second thought, if the returned value is a Table, I return a proxy object for it, too.

  public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
    Object object = plugin.doAs(
        PrincipalUtils.getCurrentPrincipal(),
        () -> method.invoke(ops, args),
        Collections.emptyMap());
    if (object instanceof Table) {
      return createProxy(object, plugin);
    }
    return object;
  }

Contributor Author

After discussing with @mchades, I reverted this code, since we may have a TableOperation.

Contributor Author

I have other pull requests that rely on this PR. Maybe I can change this part to adopt @mchades's solution in a later pull request.

}

protected ProxyPlugin newProxyPlugin(Map<String, String> config) {
return null;
Contributor

It would be better to use Optional to avoid the null check.

Contributor Author

Fixed.
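
A sketch of what the Optional-based shape might look like, assuming a base catalog that has no plugin by default and a subclass that supplies one. All class names here are hypothetical, and `ops` returns a string only to keep the example observable:

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical plugin type used only for this illustration. */
interface DemoPlugin {
  String describe();
}

/** Hypothetical base catalog: no proxy plugin unless a subclass provides one. */
class DemoBaseCatalog {
  protected Optional<DemoPlugin> newProxyPlugin(Map<String, String> config) {
    return Optional.empty();
  }

  /** Wraps the operations only when a plugin is present; no null check needed. */
  String ops(Map<String, String> config) {
    return newProxyPlugin(config)
        .map(plugin -> "ops proxied by " + plugin.describe())
        .orElse("plain ops");
  }
}

/** Hypothetical Hive-like catalog that opts in to proxying. */
final class DemoHiveCatalog extends DemoBaseCatalog {
  @Override
  protected Optional<DemoPlugin> newProxyPlugin(Map<String, String> config) {
    DemoPlugin plugin = () -> "hive plugin";
    return Optional.of(plugin);
  }
}
```

The caller's branching collapses into `map(...).orElse(...)`, and a catalog that does not need impersonation simply inherits the empty default.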

this.ops = ops;
}

public static <T> T getProxy(T ops, ProxyPlugin plugin) {
Contributor

Is it better to rename to "createProxy"?

Contributor Author

Fixed.

.withDataType(Types.StringType.get())
.withComment("col_3_comment")
.build();
return new ColumnDTO[] {col1, col2, col3};
Contributor

Change all the ColumnDTO to use ColumnImpl API.

Contributor Author

Fixed.

@qqqttt123 qqqttt123 closed this Jan 23, 2024
@qqqttt123 qqqttt123 reopened this Jan 23, 2024
@qqqttt123 qqqttt123 requested a review from jerryshao January 24, 2024 04:13
private static <T> T createProxyInternal(T ops, ProxyPlugin plugin, Class<?>[] interfaces) {
  return (T)
      Proxy.newProxyInstance(
          ops.getClass().getClassLoader(), interfaces, new OperationsProxy(plugin, ops));
Contributor

Have you tested this feature in a deployed Gravitino? I'm not sure whether there's a classloader issue.

Contributor Author

I have written an integration test, ProxyCatalogHiveIT, to test this.

Contributor

Testing with mini Gravitino is not enough; you should also verify it manually in your local environment.

Contributor Author

Verified with the playground.

anotherSchemaName.toLowerCase()));
Assertions.assertThrows(
RuntimeException.class,
() -> anotherCatalog.asSchemas().createSchema(anotherIdent, comment, properties));
Contributor

Can you please verify the exception message?

Contributor Author

Fixed.

Database db = hiveClientPool.run(client -> client.getDatabase(schemaName));
Assertions.assertEquals(EXPECT_USER, db.getOwnerName());
Assertions.assertEquals(
    EXPECT_USER, hdfs.getFileStatus(new Path(db.getLocationUri())).getOwner());
Contributor

What is the default user for HDFS?

Contributor Author

@qqqttt123 qqqttt123 Jan 24, 2024

root is the default user for HDFS.

columns,
comment,
ImmutableMap.of(),
Partitioning.EMPTY_PARTITIONING));
Contributor

Also here.

Contributor

Please also add the necessary whitespace and comments to the code blocks. You put everything together without any whitespace or comments, which makes it hard to review.

Contributor Author

Fixed.

@qqqttt123 qqqttt123 requested a review from jerryshao January 24, 2024 09:16
@qqqttt123 qqqttt123 closed this Jan 24, 2024
@qqqttt123 qqqttt123 reopened this Jan 24, 2024
@qqqttt123 qqqttt123 added this to the Gravitino 0.4.0 milestone Jan 24, 2024
clientPool.close();
clientPool = null;
}
// We used a cached client pool, the pool will close clients after expiration.
Contributor

We still need to close all the dangling clients immediately when the catalog is closed; otherwise they will be leaked.

Contributor

I just left one comment; please think about it, @qqqttt123.

Contributor Author

I got it.

Contributor Author

Fixed.
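
One way to address the leak, sketched under the assumption that the cache owns its pools and that pools are `AutoCloseable`. `ClosingPoolCache` is a hypothetical name and this is not the code that was merged:

```java
import java.util.ArrayList;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical per-user pool cache that closes every pool on close(). */
final class ClosingPoolCache<P extends AutoCloseable> implements AutoCloseable {
  private final ConcurrentMap<String, P> pools = new ConcurrentHashMap<>();
  private final Function<String, P> factory;

  ClosingPoolCache(Function<String, P> factory) {
    this.factory = factory;
  }

  P poolFor(String userName) {
    return pools.computeIfAbsent(userName, factory);
  }

  /** Closes all cached pools right away instead of waiting for expiration. */
  @Override
  public void close() {
    RuntimeException failure = null;
    for (String user : new ArrayList<>(pools.keySet())) {
      P pool = pools.remove(user);
      if (pool == null) {
        continue; // already removed by a concurrent caller
      }
      try {
        pool.close();
      } catch (Exception e) {
        // Keep closing the remaining pools; report all failures at the end.
        if (failure == null) {
          failure = new RuntimeException("Failed to close some client pools");
        }
        failure.addSuppressed(e);
      }
    }
    if (failure != null) {
      throw failure;
    }
  }
}
```

Calling `close()` from the catalog's own close path releases every user's clients immediately, and expiration-based eviction remains useful only for idle users while the catalog is alive.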

jerryshao
jerryshao previously approved these changes Jan 24, 2024
@jerryshao jerryshao merged commit c045c31 into apache:main Jan 24, 2024
5 checks passed
Development

Successfully merging this pull request may close these issues.

[Improvement] Cannot specify Hadoop username for catalog operation
4 participants