Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify Delta Lake tests' container handling with Hive and Iceberg #11915

Merged

Conversation

mdesmet
Copy link
Contributor

@mdesmet mdesmet commented Apr 12, 2022

Description

The MinioClient provides additional features on top of the standard S3 api that may be interesting to keep in our toolbox. So it has been integrated the HiveMinioDataLake container setup:

  • Uploading of files based on resources (similar to below missing HadoopHive feature)
  • Event notifications on when files are created or updated similar to AWS lambda (not currently used though)

BaseTestContainer has been extended for executing commands in a container and retrieving the result and loading resources.

Is this change a fix, improvement, new feature, refactoring, or other?

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Change only impacting tests for Hive, Iceberg and Delta Lake and common test libraries.

How would you describe this change to a non-technical end user or system administrator?

N/A

Related issues, pull requests, and links

Documentation

(x) No documentation is needed.

Release notes

(x) No release notes entries required.

@mdesmet mdesmet force-pushed the feature/use_hive_connector_containers_in_tests branch 5 times, most recently from 1975b27 to 9d7256b Compare April 14, 2022 21:43
@mdesmet mdesmet force-pushed the feature/use_hive_connector_containers_in_tests branch from 9d7256b to e76e463 Compare April 21, 2022 19:47
@mdesmet mdesmet requested a review from findinpath June 22, 2022 17:32
@findinpath
Copy link
Contributor

There would be a few follow-ups (very likely a different PR to avoid this one exploding):

  • Extract the HiveMinioDataLake and the other classes to trino-testing-containers.
  • Adapt trino-faulttolerant-tests to drop the dependency and use the infrastructure previously moved to trino-testing-containers instead

<dependency>
<groupId>io.trino</groupId>
<artifactId>trino-delta-lake</artifactId>
<type>test-jar</type>
<scope>test</scope>
</dependency>

@findinpath
Copy link
Contributor

Please rebase the PR.

@findinpath findinpath requested a review from aczajkowski June 23, 2022 03:10

import static java.util.Objects.requireNonNull;
import static org.testcontainers.containers.output.OutputFrame.OutputType.END;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logging functionality is rather useful.
Let's not drop it.

FYI #12947

@@ -355,6 +355,29 @@
<scope>test</scope>
</dependency>

<dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minio has been bumped to 7.x in the meantime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for that. But I assume it's a separate PR to bump this version.

@@ -89,6 +93,7 @@ public void start()
.build();
s3Client.createBucket(this.bucketName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have both s3client and minioclient?
I think that the s3client is now superfluos.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for that. One or another. My initial idea was to use S3Client as this would allow to have some base class in future which would handle real S3 tests and Minio based test.
For now it seems we should stick to one selected client.

@@ -118,11 +117,26 @@ abstract QueryRunner createDeltaLakeQueryRunner(Map<String, String> connectorPro

abstract List<String> listCheckpointFiles(String transactionLogDirectory);

protected void copyResources(String resourcePath, String target)
{
hiveMinioDataLake.getMinioClient().copyResourcePath(bucketName, resourcePath, target);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hiveMinioDataLake.getMinioClient().copyResourcePath keeps appearing multiple times in the PR.

Would it make sense to expose in HiveMinioDataLake (same as it was done in DockerizedMinioDataLake the methods :

  • copyResources
  • writeFile
  • listFiles

?

ImmutableMap.of("io/trino/plugin/deltalake/core-site.xml", "/etc/hadoop/conf/core-site.xml"),
ImmutableMap.of());
return dockerizedMinioDataLake;
hiveMinioDataLake = new HiveMinioDataLake(bucketName, ImmutableMap.of(), "ghcr.io/trinodb/testing/hdp2.6-hive");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls avoid using magic strings for the docker image - Let's introduce a constant for this purpose.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I remember there is a class where we store docker image version.
io.trino.plugin.deltalake.util.DockerImages
Maybe let's extract it to test containers and keep those values there

HiveMinioDataLake hiveMinioDataLake = new HiveMinioDataLake(
bucketName,
ImmutableMap.of(hadoopCoreSiteXmlTempFile.normalize().toAbsolutePath().toString(), "/etc/hadoop/conf/core-site.xml"),
"ghcr.io/trinodb/testing/hdp3.1-hive");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls avoid using magic strings for the docker image - Let's introduce a constant for this purpose.

container = Failsafe.with(retryPolicy).get(() -> startHadoopContainer(image, ports, resourcesToMount, filesToMount, extraHosts, network));
closer.register(container::stop);

// HACK: even though Hadoop is accessed by proxy, Hadoop still tries to resolve hadoop-master
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@losipiuk I just want to make sure that this logic is not needed with the transition towards HiveHadoop where this setting is not made anymore.

Copy link
Member

@aczajkowski aczajkowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice % few nit comments

}
}
catch (IOException | InterruptedException e) {
throw new RuntimeException("Exception while running command: " + printableCommand(commandAndArgs), e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Just use String.join

String.join(" ", commandAndArgs)

public void runCommandInContainer(String... commandAndArgs)
{
try {
Container.ExecResult execResult = container.execInContainer(commandAndArgs);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: call public method from this class to not have redundant code.

Container.ExecResult execResult = executeInContainer(commandAndArgs)

@@ -76,7 +83,8 @@ protected void setupContainer()
for (int port : this.ports) {
container.addExposedPort(port);
}
filesToMount.forEach(this::copyFileToContainer);
resourcesToMount.forEach(this::copyFileToContainer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: If we made that change so let's maybe rename copyFileToContainer to copyResourceToContainer

HiveHadoop.Builder hiveHadoopBuilder = HiveHadoop.builder()
.withImage(hiveHadoopImage)
.withNetwork(network);
if (hiveHadoopFilesToMount.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure here but shouldn't we always copy resources "hive_minio_datalake/hive-core-site.xml"
and if there are any files to mount then also files like

HiveHadoop.Builder hiveHadoopBuilder = HiveHadoop.builder()
                .withImage(hiveHadoopImage)
                .withNetwork(network)
                .withResourcesToMount(ImmutableMap.of("hive_minio_datalake/hive-core-site.xml", "/etc/hadoop/conf/core-site.xml"));
if (!hiveHadoopFilesToMount.isEmpty()) {
    hiveHadoopBuilder.withFilesToMount(hiveHadoopFilesToMount);
}

We could potentially just blindly allways add withFilesToMount. It it's empty forEach will not iterate and do not mount them

@@ -89,6 +93,7 @@ public void start()
.build();
s3Client.createBucket(this.bucketName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for that. One or another. My initial idea was to use S3Client as this would allow to have some base class in future which would handle real S3 tests and Minio based test.
For now it seems we should stick to one selected client.

@@ -355,6 +355,29 @@
<scope>test</scope>
</dependency>

<dependency>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for that. But I assume it's a separate PR to bump this version.

@@ -45,7 +45,7 @@
import static java.util.Objects.requireNonNull;
import static java.util.regex.Matcher.quoteReplacement;

class MinioClient
public class MinioClient
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: When we use MinioClient now directly in container. I'm wondering if we could maybe move client creation to MinIo container and fetch it from them anywhere it's needed ?

@@ -24,4 +24,15 @@
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>

<!-- Hive impersonation -->
<property>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need them now. I can't see changes that might require it ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I found it in next commit :D

ImmutableMap.of("io/trino/plugin/deltalake/core-site.xml", "/etc/hadoop/conf/core-site.xml"),
ImmutableMap.of());
return dockerizedMinioDataLake;
hiveMinioDataLake = new HiveMinioDataLake(bucketName, ImmutableMap.of(), "ghcr.io/trinodb/testing/hdp2.6-hive");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I remember there is a class where we store docker image version.
io.trino.plugin.deltalake.util.DockerImages
Maybe let's extract it to test containers and keep those values there

}

@AfterClass(alwaysRun = true)
public void removeTestData()
{
if (adlsDirectory != null && dockerizedDataLake.getTestingHadoop() != null) {
dockerizedDataLake.getTestingHadoop().runCommandInContainer("hadoop", "fs", "-rm", "-f", "-r", adlsDirectory);
if (adlsDirectory != null && hiveMinioDataLake.getHiveHadoop() != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think hiveMinioDataLake.getHiveHadoop() != null is not needed. It will never be null when using HibeMinioDataLake

@mdesmet mdesmet force-pushed the feature/use_hive_connector_containers_in_tests branch 3 times, most recently from 3f4549f to 2309e5f Compare June 27, 2022 11:10
Copy link
Member

@aczajkowski aczajkowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mdesmet mdesmet requested a review from ebyhr June 30, 2022 17:16

private final Map<String, Long> hiveMetastoreInvocationCounts = new ConcurrentHashMap<>();

@AfterClass(alwaysRun = true)
public final void close()
throws Exception
{
if (dockerizedMinioDataLake != null) {
dockerizedMinioDataLake.close();
if (hiveMinioDataLake != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls create a preparatory commit and remove this command because the dockerized data lake is already torn down in tearDown method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gentle reminder for handling this comment.

deltaLakeProperties.put("hive.s3.aws-secret-key", MINIO_SECRET_KEY);
deltaLakeProperties.put("hive.s3.endpoint", dockerizedMinioDataLake.getMinioAddress());
deltaLakeProperties.put("hive.metastore.uri", "thrift://" + hiveMinioDataLake.getHiveHadoop().getHiveMetastoreEndpoint());
deltaLakeProperties.put("hive.s3.aws-access-key", ACCESS_KEY);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls create an extra commit (probably after the present one) to rename ACCESS_KEY to MINIO_ACCESS_KEY and SECRET_KEY to MINIO_SECRET_KEY.

assertQuery("SHOW TABLES FROM hive." + schema, "VALUES 'hive_table', 'hive_view', 'delta_table'");

assertThatThrownBy(() -> query("SHOW CREATE TABLE delta." + schema + ".hive_table"))
assertThatThrownBy(() -> query("SHOW CREATE TABLE delta_lake." + schema + ".hive_table"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: Using io.trino.plugin.deltalake.DeltaLakeQueryRunner#DELTA_CATALOG would be also on option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could even omit delta_lake but thought this way is a bit more explicit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to use just delta as product tests (other connector use same catalog name between normal tests and product tests). No need to change in this PR though.

@@ -1802,7 +1802,7 @@ public static TrinoS3FileSystemStats getFileSystemStats()
return STATS;
}

private static String getMd5AsBase64(byte[] data, int offset, int length)
public static String getMd5AsBase64(byte[] data, int offset, int length)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The visibility of this method doesn't need to be changed in case you'd be using minioClient in HiveMinioDataLake.

@@ -44,6 +59,11 @@
private State state = State.INITIAL;
private AmazonS3 s3Client;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that, at the moment, given that this class is called HiveMinioDataLake and deals (as far as I see) only with MinIO, we can safely drop the AmazonS3 s3Client and use the minioClient instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for that. I would stick to one or another. Let's not maintain both.

public AmazonS3 getS3Client()
{
checkState(state == State.STARTED, "Can't provide client when MinIO state is: %s", state);
return s3Client;
}

public void stop()
throws Exception
public void copyResources(String resourcePath, String target)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use minioClient to perform the utility operations that you've reimplemented based on AmazonS3 client. The needed logic exists already in MinioClient class.

{
return forClasspathResource(resourcePath)
// Container fails to mount jar:file:/<host_path>!<resource_path> resources
// This assures that JAR resources are being copied out to tmp locations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assures -> ensures

@mdesmet mdesmet force-pushed the feature/use_hive_connector_containers_in_tests branch 2 times, most recently from 012af24 to 04b6af0 Compare July 4, 2022 19:56
@mdesmet mdesmet force-pushed the feature/use_hive_connector_containers_in_tests branch from 04b6af0 to 1c9ba74 Compare July 5, 2022 09:45
Copy link
Contributor

@findinpath findinpath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % comments

@ebyhr CPTAL?

HiveHadoop.Builder hiveHadoopBuilder = HiveHadoop.builder()
.withImage(hiveHadoopImage)
.withNetwork(network);
Map<String, String> defaultHiveHadoopResourcesToMount = new HashMap<>(Map.of(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Map<String, String> defaultHiveHadoopResourcesToMount = new HashMap<>(Map.of(
Map<String, String> defaultHiveHadoopResourcesToMount = ImmutableMap.<String, String>builder()
.put("/etc/hadoop/conf/core-site.xml", getPathFromClassPathResource("hive_minio_datalake/hive-core-site.xml"))
.putAll(hiveHadoopFilesToMount)
.buildOrThrow();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had also tried that, however Guava's ImmutableMap doesn't allow duplicate keys to be inserted. Here we sometimes want to override the core-site.xml with a new one. Hence the standard HashMap.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose you could move the default mount mapping to another constructor. e.g.

    public HiveMinioDataLake(String bucketName)
    {
        this(bucketName, ImmutableMap.of("/etc/hadoop/conf/core-site.xml", getPathFromClassPathResource("hive_minio_datalake/hive-core-site.xml")), HiveHadoop.DEFAULT_IMAGE);
    }
...


private final Map<String, Long> hiveMetastoreInvocationCounts = new ConcurrentHashMap<>();

@AfterClass(alwaysRun = true)
public final void close()
throws Exception
{
if (dockerizedMinioDataLake != null) {
dockerizedMinioDataLake.close();
if (hiveMinioDataLake != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gentle reminder for handling this comment.

@mdesmet mdesmet force-pushed the feature/use_hive_connector_containers_in_tests branch from 1c9ba74 to 040d015 Compare July 5, 2022 09:46
@findinpath
Copy link
Contributor

@mdesmet

Remaned ACCESS_KEY and SECRET_KEY to MINIO_ACCESS_KEY and MINIO_SECRET_KEY for readability

Use a shorter commit message header and explain the detail in the commit message content.

e.g. Renamed MinIO credentials constants

@mdesmet mdesmet force-pushed the feature/use_hive_connector_containers_in_tests branch from 040d015 to 5b2ba76 Compare July 5, 2022 10:10
assertQuery("SHOW TABLES FROM hive." + schema, "VALUES 'hive_table', 'hive_view', 'delta_table'");

assertThatThrownBy(() -> query("SHOW CREATE TABLE delta." + schema + ".hive_table"))
assertThatThrownBy(() -> query("SHOW CREATE TABLE delta_lake." + schema + ".hive_table"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to use just delta as product tests (other connector use same catalog name between normal tests and product tests). No need to change in this PR though.

HiveHadoop.Builder hiveHadoopBuilder = HiveHadoop.builder()
.withImage(hiveHadoopImage)
.withNetwork(network);
Map<String, String> defaultHiveHadoopResourcesToMount = new HashMap<>(Map.of(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose you could move the default mount mapping to another constructor. e.g.

    public HiveMinioDataLake(String bucketName)
    {
        this(bucketName, ImmutableMap.of("/etc/hadoop/conf/core-site.xml", getPathFromClassPathResource("hive_minio_datalake/hive-core-site.xml")), HiveHadoop.DEFAULT_IMAGE);
    }
...

@mdesmet mdesmet force-pushed the feature/use_hive_connector_containers_in_tests branch from 5b2ba76 to c34049d Compare July 10, 2022 21:14
@mdesmet
Copy link
Contributor Author

mdesmet commented Jul 10, 2022

I have extracted the Builder in a separate commit. I also added catalogName in the DeltaLakeQueryRunner.Builder class as a follow-up on the delta vs delta_lake name discussion. That way the signature of the DeltaLakeQueryRunner factory methods doesn't change too much.

@mdesmet mdesmet force-pushed the feature/use_hive_connector_containers_in_tests branch from c34049d to 86728fc Compare July 11, 2022 00:48
.put("hive.s3.streaming.part-size", "5MB") //must be at least 5MB according to annotations on io.trino.plugin.hive.s3.HiveS3Config.getS3StreamingPartSize
.putAll(connectorProperties)
.buildOrThrow())
.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Variable assignment is not needed.


queryRunner.installPlugin(new TestingDeltaLakePlugin());
Map<String, String> deltaProperties = new HashMap<>(this.deltaProperties.buildOrThrow());
if (!deltaProperties.containsKey("hive.metastore.uri")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having business logic in setting up default values for properties is not a good idea.
In case we'll want to handle Glue here, we'll further complicate the logic.
I'd advocate that the client classes of DeltaQueryRunner should specify these properties themselves.

See for reference:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree with you. Seems like the IcebergQueryRunner also does this though. I would still consider putting it in the QueryRunner class itself as duplicating this logic everywhere seems a bit redundant. WDYT of adding a builder method withFileMetastore? Seems more explicit and hides the complexity for the clients of the QueryRunner classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say, let's keep this out of this PR for now.

Copy link
Contributor

@findinpath findinpath Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with withFileMetastore because it abstracts the logic of setting the file hive metastore.

let's keep this out of this PR for now.

I would argue that this change is related to the current PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up PR is acceptable to me. Going to merger as-is.

@findinpath
Copy link
Contributor

I find it really great that we have a unified way of dealing with minio data lakes on tests. There's only a bit more left to land this awesome test infrastructure change.

Great job!!

@mdesmet mdesmet force-pushed the feature/use_hive_connector_containers_in_tests branch from 86728fc to a9c9dd6 Compare July 11, 2022 09:46
@ebyhr ebyhr merged commit cb703eb into trinodb:master Jul 13, 2022
@ebyhr
Copy link
Member

ebyhr commented Jul 13, 2022

Merged, thanks!

@github-actions github-actions bot added this to the 390 milestone Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Unify Delta Lake tests' container handling with Hive and Iceberg
4 participants