[Feature]: Do we have any plans to support a Spark cluster in Testcontainers? #7658
Rembrant777 started this conversation in Ideas
-
Hi, thanks for raising this request. Testcontainers modules are a best-effort implementation. Currently, the team doesn't have the knowledge to create or maintain a new Spark module. However, we would encourage you to create the module and make it available in the Testcontainers Module Catalog by opening a PR here. Feel free to ask questions about Testcontainers or the release process to Maven Central; we will be really happy to support you.
-
Module
None
Problem
Hello everyone,
I'm a newcomer to Testcontainers. Due to my job requirements, I frequently need to develop Spark applications. For me, the most time-consuming part of working on a Spark application, from design and development to debugging and deployment, is having to recompile the code and generate a JAR file every time I change the code logic, then submit it to the cluster and wait for the results.
While Spark provides relatively robust JUnit test support, setting the master to `local` doesn't truly replicate the issues that arise only in a distributed environment, for example data skew when consuming from a Kafka cluster with more than 3 partitions, or the problems you hit when developing Spark Shuffle components further; the existing JUnit test cases simply don't cover them. So when I first added Testcontainers to my JUnit environment and executed it, I was very excited about the convenience it provides and how it really let me focus on the inner logic of the code.
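For context, below is a minimal sketch of the kind of local-master JUnit setup I mean; everything runs in a single JVM, so distributed-only issues never surface (the test name, app name, and assertion are placeholders for illustration only):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;

class LocalMasterExampleTest {

    @Test
    void countsRowsWithLocalMaster() {
        // Driver and executors share one JVM, so data skew across Kafka
        // partitions, real shuffle behaviour, serialization issues, etc.
        // can never be observed here.
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("local-master-example")
                .getOrCreate();
        try {
            Dataset<Row> df = spark.range(100).toDF("id");
            assertEquals(100L, df.count());
        } finally {
            spark.stop();
        }
    }
}
```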
That's why I wonder whether the Testcontainers team plans to support Spark in the future, for example based on YARN, Mesos, or even Kubernetes?
While I am aware that AWS and Azure provide robust Spark solutions through Databricks and related serverless API services, I still believe there is a pressing need to make heavy-duty computing frameworks like Spark and Flink approachable for beginners and application developers. I also think I'm certainly not the only one whose development efficiency has suffered because of environment issues.
If the solution is feasible, I will actively participate in building this feature. I'm looking forward to your reply and to your plans in this direction.
Solution
After reading the implementation of `KafkaContainerCluster`, I suppose its container construction approach is quite similar to how a Spark cluster is deployed. In `KafkaContainerCluster`, a cluster is built from one Zookeeper container and three Kafka containers. Following the same approach for the Spark Standalone deployment mode, we can take the `docker-compose.yml` and `Dockerfile`(s) from that folder as a reference and deploy the different components into separate containers to achieve a cluster deployment. (Perhaps this solution is not very mature and requires more in-depth discussion about the details.)
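To make the idea concrete, here is a minimal sketch of what such a module might look like, mirroring the `KafkaContainerCluster` approach with generic containers. This is not an existing Testcontainers API; the class name, the `bitnami/spark` image, and its `SPARK_MODE`/`SPARK_MASTER_URL` environment variables are assumptions for illustration only:

```java
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.Network;
import org.testcontainers.utility.DockerImageName;

// Hypothetical Spark Standalone "cluster": one master and one worker
// container on a shared network, analogous to KafkaContainerCluster.
public class SparkStandaloneCluster implements AutoCloseable {

    private static final DockerImageName SPARK_IMAGE =
            DockerImageName.parse("bitnami/spark:3.5");

    private final Network network = Network.newNetwork();
    private final GenericContainer<?> master;
    private final GenericContainer<?> worker;

    public SparkStandaloneCluster() {
        master = new GenericContainer<>(SPARK_IMAGE)
                .withNetwork(network)
                .withNetworkAliases("spark-master")
                .withEnv("SPARK_MODE", "master")
                .withExposedPorts(7077, 8080);

        worker = new GenericContainer<>(SPARK_IMAGE)
                .withNetwork(network)
                .withEnv("SPARK_MODE", "worker")
                .withEnv("SPARK_MASTER_URL", "spark://spark-master:7077")
                .dependsOn(master);
    }

    public void start() {
        master.start();
        worker.start();
    }

    public GenericContainer<?> getMaster() {
        return master;
    }

    public String getMasterUrl() {
        // Master URL as seen from the host; whether an external driver can
        // actually attach this way needs more discussion (ports, callbacks).
        return "spark://" + master.getHost() + ":" + master.getMappedPort(7077);
    }

    @Override
    public void close() {
        worker.stop();
        master.stop();
        network.close();
    }
}
```

A real module would likely add wait strategies, configurable worker counts, and resource limits; the sketch only shows the overall shape.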
Benefit

For Spark Beginners:
For Spark application developers:
- Require minimal resources and datasets to validate the data processing logic of their Spark applications, enabling debugging and optimization in a local environment.
- Reduce the time spent recompiling the Spark application JAR after each code change, and minimize the time spent submitting the application to a remote or local cluster for execution.
- Enable quick regression testing of Spark application functionality through JUnit test cases (a minimal test sketch follows after this list), reducing additional DevOps maintenance costs for operations personnel, since the setup is already established at the JUnit test level.
For Spark Infra developers:
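To illustrate the JUnit-level regression testing mentioned above, here is one possible test shape building on the hypothetical `SparkStandaloneCluster` sketch from the Solution section; every class and method name here is an assumption, not an existing API:

```java
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertTrue;

class SparkStandaloneClusterTest {

    static final SparkStandaloneCluster cluster = new SparkStandaloneCluster();

    @BeforeAll
    static void startCluster() {
        cluster.start();
    }

    @Test
    void workerRegistersWithMaster() throws InterruptedException {
        // The standalone master logs each successful worker registration,
        // so polling the master logs gives a simple smoke-level check.
        boolean registered = false;
        for (int i = 0; i < 30 && !registered; i++) {
            registered = cluster.getMaster().getLogs().contains("Registering worker");
            Thread.sleep(1000);
        }
        assertTrue(registered, "worker did not register with the master in time");
    }

    @AfterAll
    static void stopCluster() {
        cluster.close();
    }
}
```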
Alternatives
Nope, since it is a new module. As far as I know, it will not affect other modules.
Would you like to help contribute this feature?
Yes