[ZEPPELIN-6091] Drop support for Spark 3.2 #4834

Merged 6 commits on Sep 19, 2024
8 changes: 0 additions & 8 deletions .github/workflows/core.yml
@@ -397,14 +397,6 @@ jobs:
 - name: Make IRkernel available to Jupyter
 run: |
 R -e "IRkernel::installspec()"
-- name: run spark-3.2 tests with scala-2.12 and python-${{ matrix.python }}
-run: |
-rm -rf spark/interpreter/metastore_db
-./mvnw verify -pl spark-submit,spark/interpreter -am -Dtest=org/apache/zeppelin/spark/* -Pspark-3.2 -Pspark-scala-2.12 -Phadoop3 -Pintegration -DfailIfNoTests=false ${MAVEN_ARGS}
-- name: run spark-3.2 tests with scala-2.13 and python-${{ matrix.python }}
-run: |
-rm -rf spark/interpreter/metastore_db
-./mvnw verify -pl spark-submit,spark/interpreter -am -Dtest=org/apache/zeppelin/spark/* -Pspark-3.2 -Pspark-scala-2.13 -Phadoop3 -Pintegration -DfailIfNoTests=false ${MAVEN_ARGS}
 - name: run spark-3.3 tests with scala-2.12 and python-${{ matrix.python }}
 run: |
 rm -rf spark/interpreter/metastore_db
4 changes: 2 additions & 2 deletions Dockerfile
@@ -21,9 +21,9 @@ ENV MAVEN_OPTS="-Xms1024M -Xmx2048M -XX:MaxMetaspaceSize=1024m -XX:-UseGCOverhea
# Allow npm and bower to run with root privileges
RUN echo "unsafe-perm=true" > ~/.npmrc && \
echo '{ "allow_root": true }' > ~/.bowerrc && \
-./mvnw -B package -DskipTests -Pbuild-distr -Pspark-3.3 -Pinclude-hadoop -Phadoop3 -Pspark-scala-2.12 -Pweb-classic -Pweb-dist && \
+./mvnw -B package -DskipTests -Pbuild-distr -Pspark-3.4 -Pinclude-hadoop -Phadoop3 -Pspark-scala-2.12 -Pweb-classic -Pweb-dist && \
Review comment from @pan3793 (Member, Author), Sep 17, 2024: Currently, Spark 3.4 is the default version, so it should be consistent everywhere.

# Example with doesn't compile all interpreters
-# ./mvnw -B package -DskipTests -Pbuild-distr -Pspark-3.2 -Pinclude-hadoop -Phadoop3 -Pspark-scala-2.12 -Pweb-classic -Pweb-dist -pl '!groovy,!livy,!hbase,!file,!flink' && \
+# ./mvnw -B package -DskipTests -Pbuild-distr -Pspark-3.4 -Pinclude-hadoop -Phadoop3 -Pspark-scala-2.12 -Pweb-classic -Pweb-dist -pl '!groovy,!livy,!hbase,!file,!flink' && \
mv /workspace/zeppelin/zeppelin-distribution/target/zeppelin-*-bin/zeppelin-*-bin /opt/zeppelin/ && \
# Removing stuff saves time, because docker creates a temporary layer
rm -rf ~/.m2 && \
2 changes: 1 addition & 1 deletion conf/zeppelin-env.cmd.template
@@ -64,7 +64,7 @@ REM however, it is not encouraged when you can define SPARK_HOME
REM
REM Options read in YARN client mode
REM set HADOOP_CONF_DIR REM yarn-site.xml is located in configuration directory in HADOOP_CONF_DIR.
-REM Pyspark (supported with Spark 1.2.1 and above)
+REM Pyspark (supported with Spark 3.3 and above)
REM To configure pyspark, you need to set spark distribution's path to 'spark.home' property in Interpreter setting screen in Zeppelin GUI
REM set PYSPARK_PYTHON REM path to the python command. must be the same path on the driver(Zeppelin) and all workers.
REM set PYTHONPATH
2 changes: 1 addition & 1 deletion conf/zeppelin-env.sh.template
@@ -87,7 +87,7 @@
##
# Options read in YARN client mode
# export HADOOP_CONF_DIR # yarn-site.xml is located in configuration directory in HADOOP_CONF_DIR.
-# Pyspark (supported with Spark 1.2.1 and above)
+# Pyspark (supported with Spark 3.3 and above)
# To configure pyspark, you need to set spark distribution's path to 'spark.home' property in Interpreter setting screen in Zeppelin GUI
# export PYSPARK_PYTHON # path to the python command. must be the same path on the driver(Zeppelin) and all workers.
# export PYTHONPATH
2 changes: 1 addition & 1 deletion docs/interpreter/spark.md
@@ -385,7 +385,7 @@ You can also choose `scoped` mode. For `scoped` per note mode, Zeppelin creates

## SparkContext, SQLContext, SparkSession, ZeppelinContext

-SparkContext, SQLContext, SparkSession (for spark 2.x, 3.x) and ZeppelinContext are automatically created and exposed as variable names `sc`, `sqlContext`, `spark` and `z` respectively, in Scala, Python and R environments.
+SparkContext, SparkSession and ZeppelinContext are automatically created and exposed as variable names `sc`, `spark` and `z` respectively, in Scala, Python and R environments.


> Note that Scala/Python/R environment shares the same SparkContext, SQLContext, SparkSession and ZeppelinContext instance.
3 changes: 1 addition & 2 deletions docs/setup/basics/how_to_build.md
@@ -83,7 +83,7 @@ You can directly start Zeppelin by running the following command after successfu

To be noticed, the spark profiles here only affect the unit test (no need to specify `SPARK_HOME`) of spark interpreter.
Zeppelin doesn't require you to build with different spark to make different versions of spark work in Zeppelin.
-You can run different versions of Spark in Zeppelin as long as you specify `SPARK_HOME`. Actually Zeppelin supports all the versions of Spark from 3.2 to 3.5.
+You can run different versions of Spark in Zeppelin as long as you specify `SPARK_HOME`. Actually Zeppelin supports all the versions of Spark from 3.3 to 3.5.

To build with a specific Spark version or scala versions, define one or more of the following profiles and options:

@@ -97,7 +97,6 @@ Available profiles are
 -Pspark-3.5
 -Pspark-3.4
 -Pspark-3.3
--Pspark-3.2
```

minor version can be adjusted by `-Dspark.version=x.x.x`
63 changes: 25 additions & 38 deletions docs/setup/deployment/flink_and_spark_cluster.md
@@ -42,8 +42,8 @@ Assuming the minimal install, there are several programs that we will need to in

 - git
 - openssh-server
-- OpenJDK 7
-- Maven 3.1+
+- OpenJDK 11
+- Maven

For git, openssh-server, and OpenJDK 7 we will be using the apt package manager.

@@ -60,17 +60,10 @@ sudo apt-get install git
sudo apt-get install openssh-server
```

-##### OpenJDK 7
+##### OpenJDK 11

```bash
-sudo apt-get install openjdk-7-jdk openjdk-7-jre-lib
-```
-*A note for those using Ubuntu 16.04*: To install `openjdk-7` on Ubuntu 16.04, one must add a repository. [Source](http://askubuntu.com/questions/761127/ubuntu-16-04-and-openjdk-7)
-
-```bash
-sudo add-apt-repository ppa:openjdk-r/ppa
-sudo apt-get update
-sudo apt-get install openjdk-7-jdk openjdk-7-jre-lib
+sudo apt-get install openjdk-11-jdk
```

### Installing Zeppelin
@@ -92,26 +85,23 @@ cd zeppelin
Package Zeppelin.

```bash
-./mvnw clean package -DskipTests -Pspark-3.2 -Dflink.version=1.1.3 -Pscala-2.11
+./mvnw clean package -DskipTests -Pspark-3.5 -Pflink-1.17
```

`-DskipTests` skips build tests- you're not developing (yet), so you don't need to do tests, the clone version *should* build.

-`-Pspark-3.2` tells maven to build a Zeppelin with Spark 3.2. This is important because Zeppelin has its own Spark interpreter and the versions must be the same.
+`-Pspark-3.5` tells maven to build a Zeppelin with Spark 3.5. This is important because Zeppelin has its own Spark interpreter and the versions must be the same.

-`-Dflink.version=1.1.3` tells maven specifically to build Zeppelin with Flink version 1.1.3.
+`-Pflink-1.17` tells maven to build a Zeppelin with Flink 1.17.

-`-Pscala-2.11` tells maven to build with Scala v2.11.


-**Note:** You can build against any version of Spark that has a Zeppelin build profile available. The key is to make sure you check out the matching version of Spark to build. At the time of this writing, Spark 3.2 was the most recent Spark version available.
+**Note:** You can build against any version of Spark that has a Zeppelin build profile available. The key is to make sure you check out the matching version of Spark to build. At the time of this writing, Spark 3.5 was the most recent Spark version available.

**Note:** On build failures. Having installed Zeppelin close to 30 times now, I will tell you that sometimes the build fails for seemingly no reason.
As long as you didn't edit any code, it is unlikely the build is failing because of something you did. What does tend to happen, is some dependency that maven is trying to download is unreachable. If your build fails on this step here are some tips:

 - Don't get discouraged.
 - Scroll up and read through the logs. There will be clues there.
-- Retry (that is, run the `./mvnw clean package -DskipTests -Pspark-3.2` again)
+- Retry (that is, run the `./mvnw clean package -DskipTests -Pspark-3.5` again)
 - If there were clues that a dependency couldn't be downloaded wait a few hours or even days and retry again. Open source software when compiling is trying to download all of the dependencies it needs, if a server is off-line there is nothing you can do but wait for it to come back.
 - Make sure you followed all of the steps carefully.
 - Ask the community to help you. Go [here](http://zeppelin.apache.org/community.html) and join the user mailing list. People are there to help you. Make sure to copy and paste the build output (everything that happened in the console) and include that in your message.
@@ -225,16 +215,16 @@ Building from source is recommended where possible, for simplicity in this tuto
To download the Flink Binary use `wget`

```bash
-wget "http://mirror.cogentco.com/pub/apache/flink/flink-1.16.2/flink-1.16.2-bin-scala_2.12.tgz"
-tar -xzvf flink-1.16.2-bin-scala_2.12.tgz
+wget "https://archive.apache.org/dist/flink/flink-1.17.1/flink-1.17.1-bin-scala_2.12.tgz"
+tar -xzvf flink-1.17.1-bin-scala_2.12.tgz
```

-This will download Flink 1.16.2.
+This will download Flink 1.17.1.

Start the Flink Cluster.

```bash
-flink-1.16.2/bin/start-cluster.sh
+flink-1.17.1/bin/start-cluster.sh
```

###### Building From source
@@ -243,13 +233,13 @@ If you wish to build Flink from source, the following will be instructive. Note

See the [Flink Installation guide](https://github.com/apache/flink/blob/master/README.md) for more detailed instructions.

-Return to the directory where you have been downloading, this tutorial assumes that is `$HOME`. Clone Flink, check out release-1.1.3-rc2, and build.
+Return to the directory where you have been downloading, this tutorial assumes that is `$HOME`. Clone Flink, check out release-1.17.1, and build.

```bash
cd $HOME
git clone https://github.com/apache/flink.git
cd flink
-git checkout release-1.1.3-rc2
+git checkout release-1.17.1
mvn clean install -DskipTests
```

@@ -271,8 +261,8 @@ If no task managers are present, restart the Flink cluster with the following co
(if binaries)

```bash
-flink-1.1.3/bin/stop-cluster.sh
-flink-1.1.3/bin/start-cluster.sh
+flink-1.17.1/bin/stop-cluster.sh
+flink-1.17.1/bin/start-cluster.sh
```


@@ -284,7 +274,7 @@ build-target/bin/start-cluster.sh
```


-##### Spark 1.6 Cluster
+##### Spark Cluster

###### Download Binaries

@@ -295,34 +285,31 @@ Using binaries is also
To download the Spark Binary use `wget`

```bash
-wget "https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz"
-tar -xzvf spark-3.4.1-bin-hadoop3.tgz
-mv spark-3.4.1-bin-hadoop3 spark
+wget "https://archive.apache.org/dist/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz"
+tar -xzvf spark-3.5.2-bin-hadoop3.tgz
+mv spark-3.5.2-bin-hadoop3 spark
```

-This will download Spark 3.4.1, compatible with Hadoop 3. You do not have to install Hadoop for this binary to work, but if you are using Hadoop, please change `3` to your appropriate version.
+This will download Spark 3.5.2, compatible with Hadoop 3. You do not have to install Hadoop for this binary to work, but if you are using Hadoop, please change `3` to your appropriate version.

###### Building From source

Spark is an extraordinarily large project, which takes considerable time to download and build. It is also prone to build failures for similar reasons listed in the Flink section. If the user wishes to attempt to build from source, this section will provide some reference. If errors are encountered, please contact the Apache Spark community.

See the [Spark Installation](https://github.com/apache/spark/blob/master/README.md) guide for more detailed instructions.

-Return to the directory where you have been downloading, this tutorial assumes that is $HOME. Clone Spark, check out branch-1.6, and build.
-**Note:** Recall, we're only checking out 1.6 because it is the most recent Spark for which a Zeppelin profile exists at
-the time of writing. You are free to check out other version, just make sure you build Zeppelin against the correct version of Spark. However if you use Spark 2.0, the word count example will need to be changed as Spark 2.0 is not compatible with the following examples.
-
+Return to the directory where you have been downloading, this tutorial assumes that is $HOME. Clone Spark, check out branch-3.5, and build.

```bash
cd $HOME
```

-Clone, check out, and build Spark version 1.6.x.
+Clone, check out, and build Spark version 3.5.x.

```bash
git clone https://github.com/apache/spark.git
cd spark
-git checkout branch-1.6
+git checkout branch-3.5
mvn clean package -DskipTests
```

35 changes: 0 additions & 35 deletions spark/interpreter/pom.xml
@@ -40,10 +40,6 @@
<maven.aeither.provider.version>3.0.3</maven.aeither.provider.version>
<wagon.version>2.7</wagon.version>

-<datanucleus.rdbms.version>4.1.19</datanucleus.rdbms.version>
-<datanucleus.apijdo.version>4.2.4</datanucleus.apijdo.version>
-<datanucleus.core.version>4.1.17</datanucleus.core.version>

<!-- spark versions -->
<spark.version>3.4.1</spark.version>
<protobuf.version>3.21.12</protobuf.version>
@@ -222,27 +218,6 @@
</dependency>

<!--test libraries-->
-<dependency>
-<groupId>org.datanucleus</groupId>
-<artifactId>datanucleus-core</artifactId>
-<version>${datanucleus.core.version}</version>
Review comment (Member, Author): We don't need to manage datanucleus explicitly, because spark-hive pulls it in automatically. All supported Spark versions (3.3 to 3.5) depend on Hive 2.3.9 and therefore on the same datanucleus versions.

-<scope>test</scope>
-</dependency>
-
-<dependency>
-<groupId>org.datanucleus</groupId>
-<artifactId>datanucleus-api-jdo</artifactId>
-<version>${datanucleus.apijdo.version}</version>
-<scope>test</scope>
-</dependency>
-
-<dependency>
-<groupId>org.datanucleus</groupId>
-<artifactId>datanucleus-rdbms</artifactId>
-<version>${datanucleus.rdbms.version}</version>
-<scope>test</scope>
-</dependency>

<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
@@ -589,16 +564,6 @@
<py4j.version>0.10.9.5</py4j.version>
</properties>
</profile>

-<profile>
-<id>spark-3.2</id>
-<properties>
-<spark.version>3.2.4</spark.version>
-<protobuf.version>2.5.0</protobuf.version>
-<py4j.version>0.10.9.5</py4j.version>
-</properties>
-</profile>

</profiles>

</project>
@@ -159,7 +159,7 @@
"envName": null,
"propertyName": "zeppelin.spark.deprecatedMsg.show",
"defaultValue": true,
-"description": "Whether show the spark deprecated message, spark 2.2 and before are deprecated. Zeppelin will display warning message by default",
+"description": "Whether show the spark deprecated message, prior Spark 3.3 are deprecated. Zeppelin will display warning message by default",
"type": "checkbox"
}
},
@@ -97,7 +97,7 @@ public void setUp() {
when(mockContext.getIntpEventClient()).thenReturn(mockIntpEventClient);

try {
-sparkShims = SparkShims.getInstance(SparkVersion.SPARK_3_2_0.toString(), new Properties(), null);
+sparkShims = SparkShims.getInstance(SparkVersion.SPARK_3_3_0.toString(), new Properties(), null);
} catch (Throwable e1) {
throw new RuntimeException("All SparkShims are tried, but no one can be created.");
}
@@ -292,14 +292,7 @@ void testDDL() throws InterpreterException, IOException {
InterpreterContext context = getInterpreterContext();
InterpreterResult ret = sqlInterpreter.interpret("create table t1(id int, name string)", context);
assertEquals(InterpreterResult.Code.SUCCESS, ret.code(), context.out.toString());
-// spark 1.x will still return DataFrame with non-empty columns.
-// org.apache.spark.sql.DataFrame = [result: string]
-if (!sparkInterpreter.getSparkContext().version().startsWith("1.")) {
-assertTrue(ret.message().isEmpty());
-} else {
-assertEquals(Type.TABLE, ret.message().get(0).getType());
-assertEquals("result\n", ret.message().get(0).getData());
-}
+assertTrue(ret.message().isEmpty());

// create the same table again
ret = sqlInterpreter.interpret("create table t1(id int, name string)", context);
@@ -48,14 +48,14 @@ void testSparkVersion() {
assertEquals(SparkVersion.SPARK_3_5_0, SparkVersion.fromVersionString("3.5.0.2.5.0.0-1245"));

// test newer than
-assertTrue(SparkVersion.SPARK_3_5_0.newerThan(SparkVersion.SPARK_3_2_0));
+assertTrue(SparkVersion.SPARK_3_5_0.newerThan(SparkVersion.SPARK_3_3_0));
assertTrue(SparkVersion.SPARK_3_5_0.newerThanEquals(SparkVersion.SPARK_3_5_0));
-assertFalse(SparkVersion.SPARK_3_2_0.newerThan(SparkVersion.SPARK_3_5_0));
+assertFalse(SparkVersion.SPARK_3_3_0.newerThan(SparkVersion.SPARK_3_5_0));

// test older than
-assertTrue(SparkVersion.SPARK_3_2_0.olderThan(SparkVersion.SPARK_3_5_0));
-assertTrue(SparkVersion.SPARK_3_2_0.olderThanEquals(SparkVersion.SPARK_3_2_0));
-assertFalse(SparkVersion.SPARK_3_5_0.olderThan(SparkVersion.SPARK_3_2_0));
+assertTrue(SparkVersion.SPARK_3_3_0.olderThan(SparkVersion.SPARK_3_5_0));
+assertTrue(SparkVersion.SPARK_3_5_0.olderThanEquals(SparkVersion.SPARK_3_5_0));
+assertFalse(SparkVersion.SPARK_3_5_0.olderThan(SparkVersion.SPARK_3_3_0));

// test newerThanEqualsPatchVersion
assertTrue(SparkVersion.fromVersionString("2.3.1")
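
The assertions in this test rely on two behaviours: a vendor-suffixed string such as `3.5.0.2.5.0.0-1245` resolves to the same version as `3.5.0`, and versions order numerically. The following is a minimal sketch of one way to get both, assuming the version is packed into a single integer built from the first three numeric components. The class `VersionSketch` and its encoding are hypothetical illustrations, not the actual `SparkVersion` implementation.

```java
// Hypothetical stand-in for illustration only; not the Zeppelin SparkVersion class.
final class VersionSketch {
    private final int encoded;

    private VersionSketch(int encoded) {
        this.encoded = encoded;
    }

    // Keep only major.minor.patch so vendor suffixes such as ".2.5.0.0-1245" are ignored.
    static VersionSketch fromVersionString(String s) {
        String[] parts = s.split("\\.");
        if (parts.length < 3) {
            throw new IllegalArgumentException("expected major.minor.patch, got: " + s);
        }
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        // The patch component may carry a qualifier such as "0-SNAPSHOT"; keep the leading digits.
        int patch = Integer.parseInt(parts[2].split("[^0-9]")[0]);
        return new VersionSketch(major * 10000 + minor * 100 + patch);
    }

    boolean newerThan(VersionSketch other) { return encoded > other.encoded; }
    boolean newerThanEquals(VersionSketch other) { return encoded >= other.encoded; }
    boolean olderThan(VersionSketch other) { return encoded < other.encoded; }
    boolean olderThanEquals(VersionSketch other) { return encoded <= other.encoded; }

    @Override
    public boolean equals(Object o) {
        return o instanceof VersionSketch && ((VersionSketch) o).encoded == encoded;
    }

    @Override
    public int hashCode() {
        return encoded;
    }

    public static void main(String[] args) {
        VersionSketch v33 = fromVersionString("3.3.0");
        VersionSketch v35 = fromVersionString("3.5.0");
        System.out.println(v35.newerThan(v33));                                   // true
        System.out.println(fromVersionString("3.5.0.2.5.0.0-1245").equals(v35)); // true
    }
}
```

Packing the components into one integer keeps every comparison a single integer test, at the cost of assuming minor and patch numbers stay below 100.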
5 changes: 0 additions & 5 deletions spark/pom.xml
@@ -33,11 +33,6 @@
<description>Zeppelin Spark Support</description>

<properties>
-<datanucleus.rdbms.version>3.2.9</datanucleus.rdbms.version>
-<datanucleus.apijdo.version>3.2.6</datanucleus.apijdo.version>
-<datanucleus.core.version>3.2.10</datanucleus.core.version>
-
-<!-- spark versions -->
<spark.version>3.4.1</spark.version>
<protobuf.version>2.5.0</protobuf.version>
<py4j.version>0.10.9.7</py4j.version>
@@ -25,15 +25,13 @@
public class SparkVersion {
private static final Logger LOGGER = LoggerFactory.getLogger(SparkVersion.class);

-public static final SparkVersion SPARK_3_2_0 = SparkVersion.fromVersionString("3.2.0");

public static final SparkVersion SPARK_3_3_0 = SparkVersion.fromVersionString("3.3.0");

public static final SparkVersion SPARK_3_5_0 = SparkVersion.fromVersionString("3.5.0");

public static final SparkVersion SPARK_4_0_0 = SparkVersion.fromVersionString("4.0.0");

-public static final SparkVersion MIN_SUPPORTED_VERSION = SPARK_3_2_0;
+public static final SparkVersion MIN_SUPPORTED_VERSION = SPARK_3_3_0;
public static final SparkVersion UNSUPPORTED_FUTURE_VERSION = SPARK_4_0_0;

private int version;
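
With this change `MIN_SUPPORTED_VERSION` moves to 3.3.0 while `UNSUPPORTED_FUTURE_VERSION` stays at 4.0.0. How the class consumes these two constants is not visible in this hunk. The sketch below assumes the lower bound is inclusive and the upper bound exclusive, which would match the documented 3.3 to 3.5 range. All names in it are hypothetical.

```java
// Hypothetical illustration; the names and bound semantics are assumptions, not Zeppelin code.
final class SupportedRangeSketch {
    // Pack major.minor.patch into one comparable integer, e.g. 3.3.0 -> 30300.
    static int encode(int major, int minor, int patch) {
        return major * 10000 + minor * 100 + patch;
    }

    static final int MIN_SUPPORTED = encode(3, 3, 0);      // inclusive lower bound (assumed)
    static final int UNSUPPORTED_FUTURE = encode(4, 0, 0); // exclusive upper bound (assumed)

    static boolean isSupported(int major, int minor, int patch) {
        int v = encode(major, minor, patch);
        return v >= MIN_SUPPORTED && v < UNSUPPORTED_FUTURE;
    }

    public static void main(String[] args) {
        System.out.println(isSupported(3, 2, 4)); // false: below the new minimum
        System.out.println(isSupported(3, 3, 0)); // true
        System.out.println(isSupported(3, 5, 2)); // true
        System.out.println(isSupported(4, 0, 0)); // false: at the unsupported bound
    }
}
```

Under these assumptions, raising the minimum is a one-constant change, which is consistent with the small size of this hunk.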