[Feature]: Do we have any plans to support a Spark cluster in Testcontainers? #7658
Rembrant777 started this conversation in Ideas
-
Hi, thanks for raising this request. Testcontainers modules are a best-effort implementation. Currently, the team doesn't have the knowledge to create or maintain a new Spark module. However, we would encourage you to create the module and make it available in the Testcontainers Module Catalog by opening a PR here. Feel free to ask questions about Testcontainers or the release process to Maven Central; we will be really happy to support you.
-
Module
None
Problem
Hello everyone,
I'm a newcomer to Testcontainers. Due to my job requirements, I frequently need to develop Spark applications. For me, the most time-consuming part of working on a Spark application, from design and development to debugging and deployment, is having to recompile the code and generate a JAR file every time I change the code logic, then submit it to the cluster and wait for the results.
While Spark provides relatively robust JUnit test support, setting the master to `local` doesn't truly replicate the issues that arise only in a distributed environment, for example data skew when consuming from a Kafka cluster with more than 3 partitions, or the problems you hit when developing Spark Shuffle components further; the existing JUnit test cases simply don't cover them. So when I first added Testcontainers to my JUnit environment and executed it, I was very excited about the convenience it provides and how it really let me focus on the inner logic of the code.
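For context, below is a minimal sketch of the kind of local-master JUnit setup I mean; everything runs in a single JVM, so distributed-only issues never surface (the test name, app name, and assertion are placeholders for illustration only):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;

class LocalMasterExampleTest {

    @Test
    void countsRowsWithLocalMaster() {
        // Driver and executors share one JVM, so data skew across Kafka
        // partitions, real shuffle behaviour, serialization issues, etc.
        // can never be observed here.
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("local-master-example")
                .getOrCreate();
        try {
            Dataset<Row> df = spark.range(100).toDF("id");
            assertEquals(100L, df.count());
        } finally {
            spark.stop();
        }
    }
}
```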
That's why I wonder whether the Testcontainers team plans to support Spark in the future, for example based on YARN, Mesos, or even Kubernetes?
While I am aware that AWS and Azure provide robust Spark solutions through Databricks and related serverless API services, I still believe there is a pressing need to make heavy-duty computing frameworks like Spark and Flink approachable for beginners and application developers. I also think I'm certainly not the only one whose development efficiency has suffered because of environment issues.
If the solution is feasible, I will actively participate in building this feature. I'm looking forward to your reply and to your plans in this direction.
Solution
After reading the implementation of `KafkaContainerCluster`, I suppose its container construction approach is quite similar to how a Spark cluster is deployed. In `KafkaContainerCluster`, a cluster is built from one Zookeeper container and three Kafka containers. Following the same approach for the Spark Standalone deployment mode, we can take the `docker-compose.yml` and `Dockerfile`(s) from that folder as a reference and deploy the different components into separate containers to achieve a cluster deployment. (Perhaps this solution is not very mature and requires more in-depth discussion about the details.)
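To make the idea concrete, here is a minimal sketch of what such a module might look like, mirroring the `KafkaContainerCluster` approach with generic containers. This is not an existing Testcontainers API; the class name, the `bitnami/spark` image, and its `SPARK_MODE`/`SPARK_MASTER_URL` environment variables are assumptions for illustration only:

```java
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.Network;
import org.testcontainers.utility.DockerImageName;

// Hypothetical Spark Standalone "cluster": one master and one worker
// container on a shared network, analogous to KafkaContainerCluster.
public class SparkStandaloneCluster implements AutoCloseable {

    private static final DockerImageName SPARK_IMAGE =
            DockerImageName.parse("bitnami/spark:3.5");

    private final Network network = Network.newNetwork();
    private final GenericContainer<?> master;
    private final GenericContainer<?> worker;

    public SparkStandaloneCluster() {
        master = new GenericContainer<>(SPARK_IMAGE)
                .withNetwork(network)
                .withNetworkAliases("spark-master")
                .withEnv("SPARK_MODE", "master")
                .withExposedPorts(7077, 8080);

        worker = new GenericContainer<>(SPARK_IMAGE)
                .withNetwork(network)
                .withEnv("SPARK_MODE", "worker")
                .withEnv("SPARK_MASTER_URL", "spark://spark-master:7077")
                .dependsOn(master);
    }

    public void start() {
        master.start();
        worker.start();
    }

    public GenericContainer<?> getMaster() {
        return master;
    }

    public String getMasterUrl() {
        // Master URL as seen from the host; whether an external driver can
        // actually attach this way needs more discussion (ports, callbacks).
        return "spark://" + master.getHost() + ":" + master.getMappedPort(7077);
    }

    @Override
    public void close() {
        worker.stop();
        master.stop();
        network.close();
    }
}
```

A real module would likely add wait strategies, configurable worker counts, and resource limits; the sketch only shows the overall shape.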
Benefit

For Spark Beginners:
For Spark application developers:
- Require minimal resources and datasets to validate the data processing logic of their Spark applications, enabling debugging and optimization in a local environment.
- Reduce the time spent recompiling the Spark application JAR after each code change, and minimize the time spent submitting the application to a remote or local cluster for execution.
- Enable quick regression testing of Spark application functionality through JUnit test cases (a minimal test sketch follows after this list), reducing additional DevOps maintenance costs for operations personnel, since the setup is already established at the JUnit test level.
For Spark Infra developers:
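To illustrate the JUnit-level regression testing mentioned above, here is one possible test shape building on the hypothetical `SparkStandaloneCluster` sketch from the Solution section; every class and method name here is an assumption, not an existing API:

```java
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertTrue;

class SparkStandaloneClusterTest {

    static final SparkStandaloneCluster cluster = new SparkStandaloneCluster();

    @BeforeAll
    static void startCluster() {
        cluster.start();
    }

    @Test
    void workerRegistersWithMaster() throws InterruptedException {
        // The standalone master logs each successful worker registration,
        // so polling the master logs gives a simple smoke-level check.
        boolean registered = false;
        for (int i = 0; i < 30 && !registered; i++) {
            registered = cluster.getMaster().getLogs().contains("Registering worker");
            Thread.sleep(1000);
        }
        assertTrue(registered, "worker did not register with the master in time");
    }

    @AfterAll
    static void stopCluster() {
        cluster.close();
    }
}
```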
Alternatives
Nope, since it is a new module. As far as I know, it will not affect other modules.
Would you like to help contribute this feature?
Yes