Unify Delta Lake tests' container handling with Hive and Iceberg #11915
Conversation
Force-pushed 1975b27 to 9d7256b
Force-pushed 9d7256b to e76e463
There would be a few follow-ups (very likely a different PR to avoid this one exploding):
trino/testing/trino-faulttolerant-tests/pom.xml Lines 148 to 153 in 6b70752
Please rebase the PR.
import static java.util.Objects.requireNonNull;
import static org.testcontainers.containers.output.OutputFrame.OutputType.END;
This logging functionality is rather useful.
Let's not drop it.
FYI #12947
@@ -355,6 +355,29 @@
    <scope>test</scope>
</dependency>

<dependency>
minio has been bumped to 7.x in the meantime.
👍 for that. But I assume it's a separate PR to bump this version.
@@ -89,6 +93,7 @@ public void start()
        .build();
s3Client.createBucket(this.bucketName);
Why do we have both s3Client and minioClient?
I think the s3Client is now superfluous.
👍 for that. One or the other. My initial idea was to use S3Client, as this would allow having a base class in the future which could handle both real S3 tests and MinIO-based tests.
For now it seems we should stick to one selected client.
@@ -118,11 +117,26 @@ abstract QueryRunner createDeltaLakeQueryRunner(Map<String, String> connectorPro

abstract List<String> listCheckpointFiles(String transactionLogDirectory);

protected void copyResources(String resourcePath, String target)
{
    hiveMinioDataLake.getMinioClient().copyResourcePath(bucketName, resourcePath, target);
hiveMinioDataLake.getMinioClient().copyResourcePath keeps appearing multiple times in the PR. Would it make sense to expose the following methods in HiveMinioDataLake (same as it was done in DockerizedMinioDataLake)?
- copyResources
- writeFile
- listFiles
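The delegation the reviewer suggests could look like the sketch below. MinioClient here is a minimal stand-in interface for illustration, not the real class; the method names mirror the list above.

```java
import java.util.List;

// Stand-in for the real MinioClient, reduced to the operations we delegate to
interface MinioClient
{
    void copyResourcePath(String bucket, String resourcePath, String target);

    void putObject(String bucket, byte[] contents, String target);

    List<String> listObjects(String bucket, String prefix);
}

// Sketch: HiveMinioDataLake exposes bucket-scoped helpers so call sites no longer
// repeat getMinioClient().copyResourcePath(bucketName, ...) everywhere
class HiveMinioDataLake
{
    private final String bucketName;
    private final MinioClient minioClient;

    HiveMinioDataLake(String bucketName, MinioClient minioClient)
    {
        this.bucketName = bucketName;
        this.minioClient = minioClient;
    }

    public void copyResources(String resourcePath, String target)
    {
        minioClient.copyResourcePath(bucketName, resourcePath, target);
    }

    public void writeFile(byte[] contents, String target)
    {
        minioClient.putObject(bucketName, contents, target);
    }

    public List<String> listFiles(String targetDirectory)
    {
        return minioClient.listObjects(bucketName, targetDirectory);
    }
}
```

Call sites then shrink to `hiveMinioDataLake.copyResources("deltalake/foo", "foo")`, with the bucket name fixed once in the constructor.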
ImmutableMap.of("io/trino/plugin/deltalake/core-site.xml", "/etc/hadoop/conf/core-site.xml"),
ImmutableMap.of());
return dockerizedMinioDataLake;
hiveMinioDataLake = new HiveMinioDataLake(bucketName, ImmutableMap.of(), "ghcr.io/trinodb/testing/hdp2.6-hive");
Pls avoid using magic strings for the docker image - Let's introduce a constant for this purpose.
NIT: I remember there is a class where we store docker image versions: io.trino.plugin.deltalake.util.DockerImages.
Maybe let's extract it to test containers and keep those values there.
HiveMinioDataLake hiveMinioDataLake = new HiveMinioDataLake(
        bucketName,
        ImmutableMap.of(hadoopCoreSiteXmlTempFile.normalize().toAbsolutePath().toString(), "/etc/hadoop/conf/core-site.xml"),
        "ghcr.io/trinodb/testing/hdp3.1-hive");
Pls avoid using magic strings for the docker image - Let's introduce a constant for this purpose.
container = Failsafe.with(retryPolicy).get(() -> startHadoopContainer(image, ports, resourcesToMount, filesToMount, extraHosts, network));
closer.register(container::stop);

// HACK: even though Hadoop is accessed by proxy, Hadoop still tries to resolve hadoop-master
@losipiuk I just want to make sure that this logic is not needed with the transition towards HiveHadoop, where this setting is not made anymore.
Nice % few nit comments
    }
}
catch (IOException | InterruptedException e) {
    throw new RuntimeException("Exception while running command: " + printableCommand(commandAndArgs), e);
NIT: Just use String.join
String.join(" ", commandAndArgs)
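A minimal JDK illustration of the nit above: String.join produces the printable command line directly, so a custom printableCommand helper reduces to a single call (the helper name is taken from the diff; its body here is the suggested one-liner).

```java
// Sketch: String.join replaces a hand-rolled command formatter
class CommandLine
{
    static String printableCommand(String... commandAndArgs)
    {
        // Joins all arguments with a single space, exactly what an error message needs
        return String.join(" ", commandAndArgs);
    }

    public static void main(String[] args)
    {
        // prints: Exception while running command: hadoop fs -rm -r /tmp/data
        System.out.println("Exception while running command: " + printableCommand("hadoop", "fs", "-rm", "-r", "/tmp/data"));
    }
}
```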
public void runCommandInContainer(String... commandAndArgs)
{
    try {
        Container.ExecResult execResult = container.execInContainer(commandAndArgs);
NIT: call the public method from this class to avoid redundant code:
Container.ExecResult execResult = executeInContainer(commandAndArgs)
@@ -76,7 +83,8 @@ protected void setupContainer()
for (int port : this.ports) {
    container.addExposedPort(port);
}
filesToMount.forEach(this::copyFileToContainer);
resourcesToMount.forEach(this::copyFileToContainer);
NIT: If we made that change, then maybe let's rename copyFileToContainer to copyResourceToContainer.
HiveHadoop.Builder hiveHadoopBuilder = HiveHadoop.builder()
        .withImage(hiveHadoopImage)
        .withNetwork(network);
if (hiveHadoopFilesToMount.isEmpty()) {
Not sure here, but shouldn't we always copy the "hive_minio_datalake/hive-core-site.xml" resource, and mount files only when there are any? e.g.:
HiveHadoop.Builder hiveHadoopBuilder = HiveHadoop.builder()
        .withImage(hiveHadoopImage)
        .withNetwork(network)
        .withResourcesToMount(ImmutableMap.of("hive_minio_datalake/hive-core-site.xml", "/etc/hadoop/conf/core-site.xml"));
if (!hiveHadoopFilesToMount.isEmpty()) {
    hiveHadoopBuilder.withFilesToMount(hiveHadoopFilesToMount);
}
We could potentially just always add withFilesToMount unconditionally. If the map is empty, forEach will not iterate and nothing will be mounted.
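The "just always call withFilesToMount" point can be shown concretely: forEach over an empty map is a no-op, so the isEmpty() guard adds nothing. HadoopBuilder below is a minimal stand-in for illustration, not the real HiveHadoop.Builder.

```java
import java.util.Map;

// Stand-in builder demonstrating that an unconditional withFilesToMount is safe
class HadoopBuilder
{
    private int mountedFiles;

    public HadoopBuilder withFilesToMount(Map<String, String> filesToMount)
    {
        // Each entry would be mounted into the container; an empty map mounts nothing
        filesToMount.forEach((hostPath, containerPath) -> mountedFiles++);
        return this;
    }

    public int mountedFiles()
    {
        return mountedFiles;
    }
}
```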
@@ -45,7 +45,7 @@
import static java.util.Objects.requireNonNull;
import static java.util.regex.Matcher.quoteReplacement;

class MinioClient
public class MinioClient
NIT: Now that we use MinioClient directly in the container, I'm wondering if we could move client creation into the Minio container and fetch it from there anywhere it's needed?
@@ -24,4 +24,15 @@
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>

<!-- Hive impersonation -->
<property>
Why do we need these now? I can't see changes that might require them.
OK, I found it in the next commit :D
}

@AfterClass(alwaysRun = true)
public void removeTestData()
{
    if (adlsDirectory != null && dockerizedDataLake.getTestingHadoop() != null) {
        dockerizedDataLake.getTestingHadoop().runCommandInContainer("hadoop", "fs", "-rm", "-f", "-r", adlsDirectory);
    if (adlsDirectory != null && hiveMinioDataLake.getHiveHadoop() != null) {
I think hiveMinioDataLake.getHiveHadoop() != null is not needed. It will never be null when using HiveMinioDataLake.
Force-pushed 3f4549f to 2309e5f
LGTM
private final Map<String, Long> hiveMetastoreInvocationCounts = new ConcurrentHashMap<>();

@AfterClass(alwaysRun = true)
public final void close()
        throws Exception
{
    if (dockerizedMinioDataLake != null) {
        dockerizedMinioDataLake.close();
    if (hiveMinioDataLake != null) {
Pls create a preparatory commit and remove this code, because the dockerized data lake is already torn down in the tearDown method.
Gentle reminder for handling this comment.
deltaLakeProperties.put("hive.s3.aws-secret-key", MINIO_SECRET_KEY);
deltaLakeProperties.put("hive.s3.endpoint", dockerizedMinioDataLake.getMinioAddress());
deltaLakeProperties.put("hive.metastore.uri", "thrift://" + hiveMinioDataLake.getHiveHadoop().getHiveMetastoreEndpoint());
deltaLakeProperties.put("hive.s3.aws-access-key", ACCESS_KEY);
Pls create an extra commit (probably after the present one) to rename ACCESS_KEY to MINIO_ACCESS_KEY and SECRET_KEY to MINIO_SECRET_KEY.
assertQuery("SHOW TABLES FROM hive." + schema, "VALUES 'hive_table', 'hive_view', 'delta_table'");

assertThatThrownBy(() -> query("SHOW CREATE TABLE delta." + schema + ".hive_table"))
assertThatThrownBy(() -> query("SHOW CREATE TABLE delta_lake." + schema + ".hive_table"))
Optional: using io.trino.plugin.deltalake.DeltaLakeQueryRunner#DELTA_CATALOG would also be an option.
We could even omit delta_lake, but thought this way is a bit more explicit.
I think it would be better to use just delta, as in product tests (other connectors use the same catalog name between normal tests and product tests). No need to change in this PR though.
@@ -1802,7 +1802,7 @@ public static TrinoS3FileSystemStats getFileSystemStats()
    return STATS;
}

private static String getMd5AsBase64(byte[] data, int offset, int length)
public static String getMd5AsBase64(byte[] data, int offset, int length)
The visibility of this method doesn't need to be changed in case you'd be using minioClient in HiveMinioDataLake.
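For context on what the method under discussion computes: an MD5 digest of a byte range, Base64-encoded, as typically used for S3 Content-MD5 headers. This is a self-contained JDK sketch of such a helper (the name is taken from the diff; the body here is an assumption, not the Trino source).

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Sketch of a getMd5AsBase64-style helper using only the JDK
class Md5Base64
{
    static String getMd5AsBase64(byte[] data, int offset, int length)
    {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(data, offset, length);
            // Base64 of the raw 16-byte digest, the format S3 expects in Content-MD5
            return Base64.getEncoder().encodeToString(md5.digest());
        }
        catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is required of every JDK
        }
    }

    public static void main(String[] args)
    {
        byte[] data = "abc".getBytes();
        // prints: kAFQmDzST7DWlj99KOF/cg==
        System.out.println(getMd5AsBase64(data, 0, data.length));
    }
}
```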
@@ -44,6 +59,11 @@
private State state = State.INITIAL;
private AmazonS3 s3Client;
I would argue that, at the moment, given that this class is called HiveMinioDataLake and deals (as far as I see) only with MinIO, we can safely drop the AmazonS3 s3Client and use the minioClient instead.
+1 for that. I would stick to one or another. Let's not maintain both.
public AmazonS3 getS3Client()
{
    checkState(state == State.STARTED, "Can't provide client when MinIO state is: %s", state);
    return s3Client;
}

public void stop()
        throws Exception
public void copyResources(String resourcePath, String target)
Use minioClient to perform the utility operations that you've reimplemented based on the AmazonS3 client. The needed logic already exists in the MinioClient class.
{
    return forClasspathResource(resourcePath)
            // Container fails to mount jar:file:/<host_path>!<resource_path> resources
            // This assures that JAR resources are being copied out to tmp locations
assures → ensures
Force-pushed 012af24 to 04b6af0
Force-pushed 04b6af0 to 1c9ba74
LGTM % comments
@ebyhr CPTAL?
HiveHadoop.Builder hiveHadoopBuilder = HiveHadoop.builder()
        .withImage(hiveHadoopImage)
        .withNetwork(network);
Map<String, String> defaultHiveHadoopResourcesToMount = new HashMap<>(Map.of(
Suggested change:
Map<String, String> defaultHiveHadoopResourcesToMount = new HashMap<>(Map.of(
Map<String, String> defaultHiveHadoopResourcesToMount = ImmutableMap.<String, String>builder()
        .put("/etc/hadoop/conf/core-site.xml", getPathFromClassPathResource("hive_minio_datalake/hive-core-site.xml"))
        .putAll(hiveHadoopFilesToMount)
        .buildOrThrow();
I had also tried that; however, Guava's ImmutableMap doesn't allow duplicate keys to be inserted. Here we sometimes want to override the core-site.xml with a new one, hence the standard HashMap.
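The duplicate-key behavior behind this choice can be shown with JDK-only code: Map.of rejects duplicate keys just like Guava's ImmutableMap buildOrThrow does, while a plain HashMap lets a later put silently override a default entry. (The mount paths below are illustrative values, not the actual ones.)

```java
import java.util.HashMap;
import java.util.Map;

// Why HashMap over an immutable builder here: defaults must be overridable
class DuplicateKeys
{
    static Map<String, String> withOverride(Map<String, String> defaults, Map<String, String> overrides)
    {
        Map<String, String> merged = new HashMap<>(defaults);
        merged.putAll(overrides); // later entries silently replace the defaults
        return merged;
    }

    public static void main(String[] args)
    {
        Map<String, String> defaults = Map.of("/etc/hadoop/conf/core-site.xml", "default-core-site.xml");
        Map<String, String> overrides = Map.of("/etc/hadoop/conf/core-site.xml", "custom-core-site.xml");
        // prints the map with the custom value winning
        System.out.println(withOverride(defaults, overrides));

        // Map.of throws on duplicate keys, mirroring Guava's buildOrThrow
        try {
            Map.of("k", "v1", "k", "v2");
        }
        catch (IllegalArgumentException e) {
            System.out.println("duplicate key rejected");
        }
    }
}
```

(Guava 31+ also offers ImmutableMap.Builder#buildKeepingLast, which keeps the last value for a duplicate key; that could be an alternative to the HashMap if the Guava version allows.)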
I suppose you could move the default mount mapping to another constructor. e.g.
public HiveMinioDataLake(String bucketName)
{
this(bucketName, ImmutableMap.of("/etc/hadoop/conf/core-site.xml", getPathFromClassPathResource("hive_minio_datalake/hive-core-site.xml")), HiveHadoop.DEFAULT_IMAGE);
}
...
Force-pushed 1c9ba74 to 040d015
Use a shorter commit message header and explain the details in the commit message body, e.g. "Renamed MinIO credentials constants".
Force-pushed 040d015 to 5b2ba76
Force-pushed 5b2ba76 to c34049d
I have extracted the Builder in a separate commit. I also added …
Force-pushed c34049d to 86728fc
.put("hive.s3.streaming.part-size", "5MB") // must be at least 5MB according to annotations on io.trino.plugin.hive.s3.HiveS3Config.getS3StreamingPartSize
.putAll(connectorProperties)
.buildOrThrow())
.build();
nit: Variable assignment is not needed.
queryRunner.installPlugin(new TestingDeltaLakePlugin());
Map<String, String> deltaProperties = new HashMap<>(this.deltaProperties.buildOrThrow());
if (!deltaProperties.containsKey("hive.metastore.uri")) {
Having business logic for setting up default property values is not a good idea. In case we want to handle Glue here, we'll further complicate the logic.
I'd advocate that the client classes of DeltaQueryRunner should specify these properties themselves.
See for reference:
I tend to agree with you. It seems like the IcebergQueryRunner also does this, though. I would still consider putting it in the QueryRunner class itself, as duplicating this logic everywhere seems a bit redundant. WDYT of adding a builder method withFileMetastore? It seems more explicit and hides the complexity from the clients of the QueryRunner classes.
I would say, let's keep this out of this PR for now.
I'm fine with withFileMetastore because it abstracts the logic of setting the file hive metastore.
As for "let's keep this out of this PR for now": I would argue that this change is related to the current PR.
Follow-up PR is acceptable to me. Going to merge as-is.
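The withFileMetastore idea discussed in this thread could look like the sketch below. The builder shape, method name, and property keys are assumptions for illustration; they are not the actual DeltaLakeQueryRunner API.

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical builder: clients opt into a file metastore explicitly instead of
// the runner defaulting one in when "hive.metastore.uri" is absent
class DeltaRunnerBuilder
{
    private final Map<String, String> deltaProperties = new HashMap<>();

    public DeltaRunnerBuilder addDeltaProperty(String key, String value)
    {
        deltaProperties.put(key, value);
        return this;
    }

    // Hides the file-metastore wiring from every call site
    public DeltaRunnerBuilder withFileMetastore(Path metastoreDirectory)
    {
        return addDeltaProperty("hive.metastore", "file")
                .addDeltaProperty("hive.metastore.catalog.dir", metastoreDirectory.toUri().toString());
    }

    public Map<String, String> deltaProperties()
    {
        return Map.copyOf(deltaProperties);
    }
}
```

A client would then write `builder.withFileMetastore(tempDir)` rather than relying on a containsKey branch inside build(), which keeps the defaulting logic out of the runner as the reviewer asks.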
I find it really great that we have a unified way of dealing with minio data lakes in tests. There's only a bit more left to land this awesome test infrastructure change. Great job!!
Renamed `ACCESS_KEY` and `SECRET_KEY` to `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` for readability
Force-pushed 86728fc to a9c9dd6
Merged, thanks!
Description
The MinioClient provides additional features on top of the standard S3 API that may be interesting to keep in our toolbox, so it has been integrated into the HiveMinioDataLake container setup. BaseTestContainer has been extended to execute commands in a container, retrieve the result, and load resources.
The change only impacts tests for Hive, Iceberg, and Delta Lake and common test libraries.
N/A
Related issues, pull requests, and links
Documentation
(x) No documentation is needed.
Release notes
(x) No release notes entries required.