[SPARK-31400][ML]The catalogString doesn't distinguish Vectors in ml and mllib #28346

Conversation

TJX2014
Contributor

@TJX2014 TJX2014 commented Apr 26, 2020

#27872

What changes were proposed in this pull request?

1. Add class info to the output of org.apache.spark.ml.util.SchemaUtils#checkColumnType to distinguish Vectors in ml and mllib.
2. Add a unit test.

Why are the changes needed?

The catalogString doesn't distinguish Vectors in ml and mllib when an mllib Vector is misused in ml.
https://issues.apache.org/jira/browse/SPARK-31400
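
For illustration only, a minimal sketch of the idea (not the exact patch): both `ml.linalg.VectorUDT` and `mllib.linalg.VectorUDT` share the same catalogString, so including the UDT class name in the error message disambiguates them. Names below mirror `SchemaUtils#checkColumnType` but are simplified.

```scala
import org.apache.spark.sql.types.{DataType, StructType}

object SchemaUtilsSketch {
  // Sketch: report the class name alongside catalogString so that
  // ml.linalg.VectorUDT and mllib.linalg.VectorUDT are distinguishable.
  def checkColumnType(schema: StructType, colName: String, expected: DataType): Unit = {
    val actual = schema(colName).dataType
    require(actual.equals(expected),
      s"Column $colName must be of type ${expected.getClass.getName}:${expected.catalogString} " +
        s"but was actually ${actual.getClass.getName}:${actual.catalogString}.")
  }
}
```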

Does this PR introduce any user-facing change?

No

How was this patch tested?

A unit test is added.

HeartSaVioR and others added 30 commits September 14, 2019 14:35
…w receiver has been initialized, but with hard timeout

### What changes were proposed in this pull request?

This patch fixes the flaky test failure in StreamingContextSuite "stop slow receiver gracefully" by adding a flag that marks when the slow receiver has finished initializing, and waiting for that flag to become true. Since the receiver is submitted via a job and initialized on an executor, 500ms might not be enough to cover all cases.
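
A minimal sketch of the waiting pattern described above, assuming a ScalaTest-style test; the flag and names are illustrative, not the exact patch:

```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

object SlowReceiverInitWait {
  // Hypothetical flag, flipped by the receiver once its slow initialization completes.
  @volatile var initialized = false

  // The test waits for the flag instead of sleeping a fixed 500ms.
  def waitForReceiverInit(): Unit = {
    eventually(timeout(30.seconds), interval(100.millis)) {
      assert(initialized, "slow receiver not initialized yet")
    }
  }
}
```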

### Why are the changes needed?

We got some reports of failures for this test. Please refer to [SPARK-24663](https://issues.apache.org/jira/browse/SPARK-24663).

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Modified UT. I artificially delayed job submission handling by adding the code below to `DAGScheduler.submitJob`:

```scala
if (rdd != null && rdd.name != null && rdd.name.startsWith("Receiver")) {
  println(s"Receiver Job! rdd name: ${rdd.name}")
  Thread.sleep(1000)
}
```

and the test "stop slow receiver gracefully" failed on current master and passed on the patch.

Closes apache#25725 from HeartSaVioR/SPARK-24663.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
…icsSuite

### What changes were proposed in this pull request?

In method `SQLMetricsTestUtils.testMetricsDynamicPartition()`, there is a CREATE TABLE statement without a `withTable` block. This causes test failures if the same table name is used in other unit tests.

### Why are the changes needed?
To avoid "table already exists" errors in tests.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes apache#25752 from LantaoJin/SPARK-29045.

Authored-by: LantaoJin <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
This PR enables GitHub Action on PRs.

So far, we detect JDK11 compilation errors only after merging.
This PR aims to prevent JDK11 compilation errors at the PR stage.

No.

Manual. See the GitHub Action on this PR.

Closes apache#25786 from dongjoon-hyun/SPARK-29079.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
…e fully processed before checking recorded values

### What changes were proposed in this pull request?

This patch ensures that recorded values in the listener are only accessed after the listeners have fully processed all events. To ensure this, this patch adds a new class that hides these values and exposes them through methods which enforce the above condition. Without this guard, two threads run concurrently (1. the listener processing thread and 2. the test main thread) and a race condition can occur.

That's also why we see something very odd: an error message saying the condition is met, yet the test fails:
```
- Barrier task failures from the same stage attempt don't trigger multiple stage retries *** FAILED ***
  ArrayBuffer(0) did not equal List(0) (DAGSchedulerSuite.scala:2656)
```
which means verification failed, but the condition was met just before the error message was constructed.

The guard is properly placed in many spots, but it was missed in some places. This patch enforces that it can't be missed.
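
A minimal sketch of the guard described above; `waitUntilEmpty` stands in for the suite's call that drains the listener bus (e.g. `sc.listenerBus.waitUntilEmpty(timeoutMs)`), and the names are illustrative:

```scala
// Recorded values are hidden behind an accessor that first drains the listener bus,
// so the test's main thread never races with the listener processing thread.
class GuardedRecords[T](waitUntilEmpty: () => Unit) {
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[T]

  // invoked from the listener's onXXX callbacks
  def record(value: T): Unit = buffer.synchronized { buffer += value }

  // invoked from the test thread; draining the bus first removes the race
  def values: Seq[T] = {
    waitUntilEmpty()
    buffer.synchronized { buffer.toList }
  }
}
```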

### Why are the changes needed?

The UT fails intermittently and this patch addresses the flakiness.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Modified UT.

I also made the flaky tests fail artificially by applying a 50ms sleep to each onXXX method.

![Screen Shot 2019-09-07 at 7 44 15 AM](https://user-images.githubusercontent.com/1317309/64465178-1747ad00-d146-11e9-92f6-f4ed4a1f4b08.png)

I found 3 methods failing. (They're marked as X. Just ignore the ones marked with !, as they failed while waiting for the listener within the given timeout and these tests don't deal with the recorded values; they use a different timeout of 1000ms rather than 10000ms for this listener, so they are affected only as a side effect.)

When I applied the same change as in this patch, all tests marked as X passed.

Closes apache#25794 from HeartSaVioR/SPARK-26989-branch-2.4.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… in org.apache.spark.ui.WebUI

### What changes were proposed in this pull request?

This is a backport of apache#24088 to avoid CCE.

### Why are the changes needed?

When we run YarnSchedulerBackendSuite, the classpath seems to be built from the classes folder (`resource-managers/yarn/target/scala-2.12/classes`) instead of the jar (`resource-managers/yarn/target/spark-yarn_2.12-*-SNAPSHOT.jar`). `ui.getHandlers` is in spark-core and is loaded from spark-core.jar, which is shaded and hence refers to `org.spark_project.jetty.servlet.ServletContextHandler`.

In org.apache.spark.scheduler.cluster.YarnSchedulerBackend, which is not shaded, it expects org.eclipse.jetty.servlet.ServletContextHandler.

Refer to the discussion at https://issues.apache.org/jira/browse/SPARK-27122?focusedCommentId=16792318&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16792318

Hence, as a fix, org.apache.spark.ui.WebUI must only return a wrapper class instance or references, so that Jetty classes can be avoided in getters which are accessed outside spark-core.
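
A hedged sketch of the wrapper idea (the name follows the backported `DelegatingServletContextHandler`, but the body is illustrative): the getter exposes only plain values, so callers outside spark-core never reference a possibly-shaded Jetty type.

```scala
import org.eclipse.jetty.servlet.ServletContextHandler

// Thin delegate returned by WebUI getters instead of the Jetty handler itself.
class DelegatingServletContextHandler(handler: ServletContextHandler) {
  def getContextPath(): String = handler.getContextPath
  def filterCount(): Int = handler.getServletHandler.getFilters.length
}
```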

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes apache#25793 from dongjoon-hyun/SPARK-27122.

Authored-by: Ajith <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… avoid CCE

### What changes were proposed in this pull request?

[SPARK-27122](apache#24088) fixes a `ClassCastException` in the `yarn` module by introducing `DelegatingServletContextHandler`. Initially, this was discovered with JDK9+, but the classpath issue affects the JDK8 environment, too. After [SPARK-28709](apache#25439), I also hit a similar issue in the `streaming` module.

This PR aims to fix `streaming` module by adding `getContextPath` to `DelegatingServletContextHandler` and using it.

### Why are the changes needed?

Currently, when we test the `streaming` module independently, it fails as follows.
```
$ build/mvn test -pl streaming
...
UISeleniumSuite:
- attaching and detaching a Streaming tab *** FAILED ***
  java.lang.ClassCastException: org.sparkproject.jetty.servlet.ServletContextHandler cannot be cast to org.eclipse.jetty.servlet.ServletContextHandler
...
Tests: succeeded 337, failed 1, canceled 0, ignored 1, pending 0
*** 1 TEST FAILED ***
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the modified tests. And do the following manually.
Since you can observe this only when you run the `streaming` module tests alone (instead of running all tests), you need to install the changed `core` module and use it.

```
$ java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.222-b10, mixed mode)
$ build/mvn install -DskipTests
$ build/mvn test -pl streaming
```

Closes apache#25791 from dongjoon-hyun/SPARK-29087.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 729b318)
Signed-off-by: Dongjoon Hyun <[email protected]>
…r static metrics

## What changes were proposed in this pull request?

YARN applicationMaster metrics registration, introduced in SPARK-24594, causes further registration of static metrics (CodeGenerator and HiveExternalCatalog) and of JVM metrics, which I believe do not belong in this context.
This looks like an unintended side effect of using the start method of [[MetricsSystem]].
A possible solution, proposed here, is to introduce startNoRegisterSources to avoid these additional registrations of static sources and of JVM sources in the case of YARN applicationMaster metrics (this could be useful for other metrics that may be added in the future).
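
A toy sketch of the proposed split, under the assumption that the static/JVM source registration currently piggybacks on start(); class and method names are illustrative, not the real MetricsSystem API:

```scala
// Sketch: keep the sink startup but make the static/JVM source registration optional,
// so the YARN applicationMaster metrics system registers only its own source.
class MetricsSystemSketch {
  private val sources = scala.collection.mutable.ArrayBuffer.empty[String]

  def registerSource(name: String): Unit = sources += name

  def start(): Unit = {                 // existing behavior
    registerStaticAndJvmSources()
    startSinks()
  }

  def startNoRegisterSources(): Unit =  // proposed behavior
    startSinks()

  private def registerStaticAndJvmSources(): Unit =
    Seq("CodeGenerator", "HiveExternalCatalog", "JVM").foreach(registerSource)

  private def startSinks(): Unit = ()   // omitted in this sketch
}
```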

## How was this patch tested?

Manually tested on a YARN cluster.

Closes apache#22279 from LucaCanali/YarnMetricsRemoveExtraSourceRegistration.

Lead-authored-by: Luca Canali <[email protected]>
Co-authored-by: LucaCanali <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
…xt is stopping

### What changes were proposed in this pull request?

This patch fixes the bug regarding an NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null because the SparkContext is stopping. The logic doesn't seem to consider that the active SparkContext could be in the process of stopping.

Note that it can't be encountered easily, as SparkContext.stop() blocks the main thread, but there are many cases where SQLConf.get is accessed concurrently while SparkContext.stop() is executing: users run other threads, or a listener accesses SQLConf.get after dagScheduler is set to null (this is the case I encountered).
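
A hedged sketch of the guard pattern described above; `schedulerAlive` stands in for the internal null check on `SparkContext._dagScheduler`, and the fallback conf is illustrative:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.internal.SQLConf

object ConfGuardSketch {
  // Only walk from the active SparkContext to its session-local SQLConf while the
  // scheduler is still alive; otherwise fall back to a default conf instead of an NPE.
  def resolveConf(
      active: Option[SparkContext],
      schedulerAlive: SparkContext => Boolean,
      sessionConf: SparkContext => SQLConf): SQLConf =
    active.filter(schedulerAlive) match {
      case Some(sc) => sessionConf(sc)
      case None => new SQLConf // SparkContext missing or in the middle of stop()
    }
}
```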

### Why are the changes needed?

The bug causes an NPE.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

The previous patch apache#25753 was tested with a new UT, but due to disruption of other tests in concurrent test runs, that test is excluded from this patch.

Closes apache#25798 from HeartSaVioR/SPARK-29046-branch-2.4.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ask is finished

### What changes were proposed in this pull request?
Manually release the stdin writer and stderr reader threads when the task is finished. This is the backport of apache#23638 including apache#25049.

### Why are the changes needed?
This is a bug fix. PipedRDD's IO threads may hang even after the corresponding task has already finished. Without this fix, it would leak resources (memory especially).
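
A minimal sketch of the release pattern, under the assumption that the cleanup is attached via a task completion listener (the real change lives inside `PipedRDD.compute`; names are illustrative):

```scala
import org.apache.spark.TaskContext

object PipedThreadCleanup {
  // Tie the helper threads' lifetime to the task: interrupt and join them when the
  // task completes, so they cannot outlive the task and leak memory.
  def bindToCurrentTask(stdinWriter: Thread, stderrReader: Thread): Unit = {
    TaskContext.get().addTaskCompletionListener[Unit] { _ =>
      stdinWriter.interrupt()
      stderrReader.interrupt()
      stdinWriter.join()
      stderrReader.join()
    }
  }
}
```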

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Add new test

Closes apache#25825 from advancedxy/SPARK-26713_for_2.4.

Authored-by: Xianjin YE <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…heck thread termination

### What changes were proposed in this pull request?

`PipedRDD` invokes `stdinWriterThread.interrupt()` at task completion, and `obj.wait` gets an `InterruptedException`. However, there is a possibility that the thread termination gets delayed because the thread resumes from `obj.wait()` with that exception. To prevent test flakiness, we need to use `eventually`. Also, this PR fixes a typo in a code comment and a variable name.
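
A minimal sketch of the `eventually`-based check, assuming ScalaTest's `Eventually`; the thread-name filter mirrors the failing assertion below but is illustrative:

```scala
import scala.collection.JavaConverters._
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

object EventuallySketch {
  // Re-check the assertion until it passes or the timeout expires, instead of
  // asserting once right after interrupt() and racing the thread's termination.
  def assertNoStdinWriterThread(): Unit = {
    eventually(timeout(10.seconds), interval(100.millis)) {
      val leftover = Thread.getAllStackTraces.keySet.asScala
        .find(_.getName.startsWith("stdin writer"))
      assert(leftover.isEmpty, s"found leftover thread: $leftover")
    }
  }
}
```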

### Why are the changes needed?

```
- stdin writer thread should be exited when task is finished *** FAILED ***
  Some(Thread[stdin writer for List(cat),5,]) was not empty (PipedRDDSuite.scala:107)
```

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6867/testReport/junit/org.apache.spark.rdd/PipedRDDSuite/stdin_writer_thread_should_be_exited_when_task_is_finished/

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual.

We can reproduce the same failure as Jenkins if we catch `InterruptedException` and sleep longer than the `eventually` timeout inside the test code. The following is an example that reproduces it.
```scala
val nums = sc.makeRDD(Array(1, 2, 3, 4), 1).map { x =>
  try {
    obj.synchronized {
      obj.wait() // make the thread wait here.
    }
  } catch {
    case ie: InterruptedException =>
      Thread.sleep(15000)
      throw ie
  }
  x
}
```

Closes apache#25808 from dongjoon-hyun/SPARK-29104.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 34915b2)
Signed-off-by: Dongjoon Hyun <[email protected]>
…f `bytesHash(data)`

This PR replaces the `bytesHash(data)` API invocation with the underlying `bytesHash(data, arraySeed)` invocation.
```scala
def bytesHash(data: Array[Byte]): Int = bytesHash(data, arraySeed)
```

The original API was changed between Scala versions by the following commit. From Scala 2.12.9, the semantics of the function changed. If we use the underlying form, we are safe during Scala version migration.
- scala/scala@846ee2b#diff-ac889f851e109fc4387cd738d52ce177

No.

This is a kind of refactoring.

Pass the Jenkins with the existing tests.

Closes apache#25821 from dongjoon-hyun/SPARK-SCALA-HASH.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 3ece8ee)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

The previous comment was true for Apache Spark 2.3.0. The 2.4.0 release brought the multiple watermark policy, and therefore stating that the 'min' is always chosen is misleading.

This PR updates the comments about the multiple watermark policy. They aren't true anymore since, in the case of multiple watermarks, we can configure which one will be applied to the query. This change was brought with the Apache Spark 2.4.0 release.
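
For reference, since Spark 2.4 the combining policy is selectable via a session conf (default `min`); a minimal spark-shell example, assuming the usual `spark` session:

```scala
// Choose how multiple watermarks are combined for one query: "min" (default) or "max".
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")
```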

### Why are the changes needed?

The outdated comment introduces some confusion about the real execution of the commented code.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Tests weren't added because the change is only at the documentation level. I affirm that the contribution is my original work and that I license the work to the project under the project's open source license.

Closes apache#25832 from bartosz25/fix_comments_multiple_watermark_policy.

Authored-by: bartosz25 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b4b2e95)
Signed-off-by: Dongjoon Hyun <[email protected]>
…ut should be INDETERMINATE


### What changes were proposed in this pull request?

We have already found and fixed the correctness issue for when RDD output is INDETERMINATE. One missing part is sampling-based RDDs. This kind of RDD is order-sensitive to its input. A sampling-based RDD with unordered input should be INDETERMINATE.

Note that this is backport of original PR to branch-2.4.

### Why are the changes needed?

A sampling-based RDD with unordered input is just like a MapPartitionsRDD with the isOrderSensitive parameter set to true. The RDD output can be different after a rerun.

It is a problem in ML applications.

In ML, sample is used to prepare training data. The ML algorithm fits the model based on the sampled data. If re-run sample tasks produce different output during model fitting, ML results will be unreliable and potentially buggy.

Each sample is random output, but once sampled, the output should be determinate.
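
A hypothetical, simplified sketch (not the actual patch) of how a custom RDD can declare its output non-deterministic so the scheduler retries the whole stage on rerun; the real change applies this only when the parent's output ordering is not deterministic:

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.{DeterministicLevel, RDD}

// Toy sampling RDD: the per-partition RNG makes the output order sensitive, so it
// conservatively reports INDETERMINATE output.
class ToySampledRDD[T: ClassTag](parent: RDD[T], fraction: Double) extends RDD[T](parent) {

  override protected def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val rng = new java.util.Random(split.index) // seeded per partition, order sensitive
    firstParent[T].iterator(split, context).filter(_ => rng.nextDouble() < fraction)
  }

  override protected def getOutputDeterministicLevel: DeterministicLevel.Value =
    DeterministicLevel.INDETERMINATE
}
```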

### Does this PR introduce any user-facing change?

Previously, a sampling-based RDD could possibly produce different output after a rerun.
After this patch, a sampling-based RDD is INDETERMINATE. For an INDETERMINATE map stage, the Spark scheduler currently retries all the tasks of the failed stage.

### How was this patch tested?

Added test.

Closes apache#25826 from viirya/sample-order-sensitive-2.4.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
…d strip trailing dots

## What changes were proposed in this pull request?

This PR aims to improve the `merge-spark-pr` script in the following two ways.
1. `[WIP]` is useful when we show that a PR is not ready for merge. Apache Spark allows merging `WIP` PRs. However, sometimes we accidentally forget to clean up the title for completed PRs. We had better warn once more during the merging stage and get a confirmation from the committers.
2. We have two kinds of PR titles in terms of the ending period. This PR aims to remove the trailing `dot`, since shorter is better in the commit title. Also, PR titles without the trailing `dot` are dominant in the Apache Spark commit logs.
```
$ git log --oneline | grep '[.]$' | wc -l
    4090
$ git log --oneline | grep '[^.]$' | wc -l
   20747
```

## How was this patch tested?

Manual.
```
$ dev/merge_spark_pr.py
git rev-parse --abbrev-ref HEAD
Which pull request would you like to merge? (e.g. 34): 25157

The PR title has `[WIP]`:
[WIP][SPARK-28396][SQL] Add PathCatalog for data source V2
Continue? (y/n):
```

```
$ dev/merge_spark_pr.py
git rev-parse --abbrev-ref HEAD
Which pull request would you like to merge? (e.g. 34): 25304
I've re-written the title as follows to match the standard format:
Original: [SPARK-28570][CORE][SHUFFLE] Make UnsafeShuffleWriter use the new API.
Modified: [SPARK-28570][CORE][SHUFFLE] Make UnsafeShuffleWriter use the new API
Would you like to use the modified title? (y/n):
```

Closes apache#25356 from dongjoon-hyun/SPARK-28616.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit ae08387)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to clean up the commit logs by removing the comments of our PR template.

### Why are the changes needed?

Apache Spark's PR template has comments. Sometimes we forget to clean them up because GitHub hides them nicely. It would be great if we cleaned them up; otherwise, they make the commit logs too verbose. (There are a few such commits already.)

### Does this PR introduce any user-facing change?

No. (only for committers)

### How was this patch tested?

Manually with Python2/Python3.

Closes apache#25564 from dongjoon-hyun/SPARK-28857.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 1fd7f29)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR backports apache#25404 to branch-2.4.

### Why are the changes needed?
Backport from master.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests

Closes apache#25839 from wangyum/SPARK-28683-branch-2.4.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…er upgrading genjavadoc to 0.14

### What changes were proposed in this pull request?
This PR backports apache#24443 to fix javadoc generation after upgrading genjavadoc to 0.14.

### Why are the changes needed?
Fix javadoc generation issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?
```
./build/sbt  -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos clean test:package streaming-kinesis-asl-assembly/assembly
./build/sbt  -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos unidoc
```

Closes apache#25841 from wangyum/genjavadoc-0.13-branch-2.4.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
This PR aims to increase the JVM CodeCacheSize from 0.5G to 1G.

After upgrading to `Scala 2.12.10`, the following is observed during building.
```
2019-09-18T20:49:23.5030586Z OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
2019-09-18T20:49:23.5032920Z OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
2019-09-18T20:49:23.5034959Z CodeCache: size=524288Kb used=521399Kb max_used=521423Kb free=2888Kb
2019-09-18T20:49:23.5035472Z  bounds [0x00007fa62c000000, 0x00007fa64c000000, 0x00007fa64c000000]
2019-09-18T20:49:23.5035781Z  total_blobs=156549 nmethods=155863 adapters=592
2019-09-18T20:49:23.5036090Z  compilation: disabled (not enough contiguous free space left)
```
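
For reference, this is the flag being raised (illustrative; where it is passed depends on the build script):

```
-XX:ReservedCodeCacheSize=1g
```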

No.

Manually check the Jenkins or GitHub Action build log (which should not have the above).

Closes apache#25836 from dongjoon-hyun/SPARK-CODE-CACHE-1G.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 3bf43fb)
Signed-off-by: Dongjoon Hyun <[email protected]>
… in case of compile error on generated code in UT

### What changes were proposed in this pull request?

This patch proposes to change the log level used when logging generated code in case a compile error occurs in a UT. This would make it easier to investigate compilation issues in generated code; currently we get an exception message with a line number, but the generated code is not actually logged (since in most UT cases the log level threshold is at least WARN).

### Why are the changes needed?

This helps with investigating compilation errors in generated code in UTs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes apache#25835 from HeartSaVioR/MINOR-always-log-generated-code-on-fail-to-compile-in-unit-testing.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit eee2e02)
Signed-off-by: Dongjoon Hyun <[email protected]>
… mode is selected

### What changes were proposed in this pull request?
Dataset (`fruit.csv`):
fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx

This PR aims to fix the issue below:
```
scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
res1: Long = 4
```

This is caused by the issue [SPARK-24645](https://issues.apache.org/jira/browse/SPARK-24645).
The SPARK-24645 issue can also be solved by [SPARK-25387](https://issues.apache.org/jira/browse/SPARK-25387).

### Why are the changes needed?

SPARK-24645 caused this regression, so the code is reverted, as the issue can also be solved by SPARK-25387.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added a UT, and also tested the SPARK-24645 bug.

**SPARK-24645 regression**
![image](https://user-images.githubusercontent.com/35216143/65067957-4c08ff00-d9a5-11e9-8d43-a4a23a61e8b8.png)

Closes apache#25843 from sandeep-katta/SPARK-29101_branch2.4.

Authored-by: sandeep katta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
This addresses 4 miscellaneous warnings that appear in the current build.

Back-port of some of apache#25852

Closes apache#25857 from srowen/BuildWarnings2.4.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… performance

This PR is to cherry-pick apache#23002 to Spark 2.4

---

## What changes were proposed in this pull request?

In `SQLAppStatusListener.aggregateMetrics`, we use `metricIds` only to filter the relevant metrics, and it is a sorted Seq. When there are many metrics involved, this can be pretty inefficient. The PR proposes to use a Set for it.
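
A tiny illustrative sketch of why the Set helps (names and values are made up): membership checks on a Set are effectively O(1) versus a linear scan over a Seq.

```scala
object MetricFilterSketch {
  val metricIds: Set[Long] = Set(1L, 5L, 9L)                      // was: a sorted Seq[Long]
  val updates: Seq[(Long, Long)] = Seq((1L, 10L), (2L, 20L), (9L, 90L))
  // contains on a Set is a hash lookup, not a scan over all metric ids
  val relevant: Seq[(Long, Long)] = updates.filter { case (id, _) => metricIds.contains(id) }
  // relevant == Seq((1L, 10L), (9L, 90L))
}
```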

## How was this patch tested?

NA

Closes apache#23002 from mgaido91/SPARK-26003.

Closes apache#25860 from gatorsmile/cherrypickSPARK-26003.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…rked JVMs for higher parallelism

## What changes were proposed in this pull request?

This is a backport of apache#24373 , apache#24404 and apache#24434

This patch modifies SparkBuild so that the largest / slowest test suites (or collections of suites) can run in their own forked JVMs, allowing them to be run in parallel with each other. This opt-in / whitelisting approach allows us to increase parallelism without having to fix a long-tail of flakiness / brittleness issues in tests which aren't performance bottlenecks.

See the comments in SparkBuild.scala for details, including a summary of why we sometimes opt to run entire groups of tests in a single forked JVM.

The time of a full new pull request test run in Jenkins is reduced by around 53%:
before changes: 4hr 40min
after changes: 2hr 13min

## How was this patch tested?

Unit test

Closes apache#25861 from dongjoon-hyun/SPARK-27460.

Lead-authored-by: Gengliang Wang <[email protected]>
Co-authored-by: gatorsmile <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
This PR uses `java-version` instead of `version` for GitHub Actions. More details:
actions/setup-java@204b974
actions/setup-java@ac25aee

The `version` property will not be supported after October 1, 2019.

No

N/A

Closes apache#25866 from wangyum/java-version.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 9e234a5)
Signed-off-by: Dongjoon Hyun <[email protected]>
…itHub Action

This PR aims to add linters and license/dependency checkers to GitHub Action. This excludes `lint-r` intentionally because https://github.com/actions/setup-r is not ready. We can add that later when it becomes available.

This will help the PR reviews.

No.

See the GitHub Action result on this PR.

Closes apache#25879 from dongjoon-hyun/SPARK-29199.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… stopped

### What changes were proposed in this pull request?

TransportClientFactory.createClient() is called by tasks and TransportClientFactory.close() is called by the executor.
When the executor is stopped, close() sets workerGroup = null, and an NPE occurs in createClient, which generates many exceptions in the log.
For exceptions that occur after close(), treat them as expected exceptions and transform them into InterruptedException, which can be processed by Executor.

### Why are the changes needed?

The change can reduce exception stack traces in the log file, and users won't be confused by these expected exceptions.

### Does this PR introduce any user-facing change?

N/A

### How was this patch tested?

New tests are added in TransportClientFactorySuite and ExecutorSuite

Closes apache#25759 from colinmjj/spark-19147.

Authored-by: colinma <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 076186e)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

Correct a word in a log message.

### Why are the changes needed?

The log message will be clearer.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Test is not needed.

Closes apache#25880 from mdianjun/fix-a-word.

Authored-by: madianjun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit e2c4787)
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?

This PR aims to add `Hadoop 2.6` combination to `branch-2.4` GitHub Action.

### Why are the changes needed?

This will help `branch-2.4` maintenance to prevent Hadoop-2.6 related issue at PR stages.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This PR's GitHub Action should show both Hadoop 2.6/2.7.

Closes apache#25886 from dongjoon-hyun/SPARK-29201.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?
Handle the task result even when the task exceeds the configured maxResultSize. More details are in the JIRA description https://issues.apache.org/jira/browse/SPARK-29177.

### Why are the changes needed?
Without this patch, zombie tasks will prevent YARN from recycling the containers running these tasks, which will affect other applications.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test, and a production test with a very large `SELECT` in the Spark Thrift Server.

Closes apache#25850 from adrian-wang/zombie.

Authored-by: Daoyuan Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c08bc37)
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Set a custom sort key for the duration and execution time columns.

### Why are the changes needed?
Sorting on the duration and execution time columns treats the time as a string after converting it into a readable form, which is the reason for the wrong sort results mentioned in [SPARK-29053](https://issues.apache.org/jira/browse/SPARK-29053).

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Tested manually.

Back-port of commit 93ac4e1

Closes apache#25882 from amanomer/BP29053.

Authored-by: aman_omer <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
Smeb and others added 22 commits March 31, 2020 15:17
…exclusive upper bound

### What changes were proposed in this pull request?
A small documentation change to clarify that the `rand()` function produces values in `[0.0, 1.0)`.

### Why are the changes needed?
`rand()` uses `Rand()`, which generates values in [0, 1) ([documented here](https://github.com/apache/spark/blob/a1dbcd13a3eeaee50cc1a46e909f9478d6d55177/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala#L71)). The existing documentation suggests that 1.0 is a possible value returned by rand (i.e., for a distribution written as `X ~ U(a, b)`, x can be a or b, so `U[0.0, 1.0]` suggests the returned value could include 1.0).
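
A quick spark-shell illustration of the documented range (assuming the usual `spark` session):

```scala
import org.apache.spark.sql.functions.rand

// every generated value r satisfies 0.0 <= r < 1.0; 1.0 itself is never returned
spark.range(1000).select(rand(42).as("r"))
  .selectExpr("min(r) >= 0.0 AND max(r) < 1.0 AS in_range")
  .show()
```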

### Does this PR introduce any user-facing change?
Only documentation changes.

### How was this patch tested?
Documentation changes only.

Closes apache#28071 from Smeb/master.

Authored-by: Ben Ryves <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…HiveFunctionWrapper

### What changes were proposed in this pull request?

This patch proposes to cache the Class instance for the UDF instance in HiveFunctionWrapper to fix the case where a Hive simple UDF is somehow transformed (the expression is copied) and evaluated later with another classloader (for the case where the current thread context classloader has somehow changed). In this case, Spark currently throws a ClassNotFoundException.

This only occurs for Hive simple UDFs, as HiveFunctionWrapper caches the UDF instance, whereas it doesn't for the `UDF` type. The comment says Spark has to create an instance every time for UDF, so we cannot simply do the same. This patch caches the Class instance instead, and switches the current thread context classloader to the one which loads the Class instance.

This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regressions for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) to the boundary of the tests.

Credit to cloud-fan as he discovered the problem and proposed the solution.

### Why are the changes needed?

Above section describes why it's a bug and how it's fixed.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UTs added.

Closes apache#28086 from HeartSaVioR/SPARK-31312-branch-2.4.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

This PR unsets `setuptools` version in CI. This was fixed in the 0.46.1.2+ `setuptools` - pypa/setuptools#2046. `setuptools` 0.46.1.0 and 0.46.1.1 still have this problem.

### Why are the changes needed?

To test the latest setuptools out to see if users can actually install and use it.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Jenkins will test.

Closes apache#28111 from HyukjinKwon/SPARK-31231-revert.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 9f58f03)
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?

This is a backport of apache#22932 .

Currently, Spark writes Spark version number into Hive Table properties with `spark.sql.create.version`.
```
parameters:{
  spark.sql.sources.schema.part.0={
    "type":"struct",
    "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761,
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
```

This PR aims to write Spark versions to ORC/Parquet file metadata with `org.apache.spark.sql.create.version`, because we already use the `org.apache.` prefix in Parquet metadata. It's different from the Hive table property key `spark.sql.create.version`, but it seems that we cannot change the Hive table property for backward compatibility.

After this PR, ORC and Parquet file generated by Spark will have the following metadata.

**ORC (`native` and `hive` implementation)**
```
$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
  org.apache.spark.sql.create.version=3.0.0
```

**PARQUET**
```
$ parquet-tools meta /tmp/p
...
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       org.apache.spark.sql.create.version = 3.0.0
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```

### Why are the changes needed?

This backport helps us handle these files differently in Apache Spark 3.0.0.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with newly added test cases.

Closes apache#28142 from dongjoon-hyun/SPARK-25102-2.4.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Backport apache@6b1ca88, similar to apache#28142

### What changes were proposed in this pull request?

Write Spark version into Avro file metadata

### Why are the changes needed?

The version info is very useful for backward compatibility. This is also done in parquet/orc.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

new test

Closes apache#28150 from cloud-fan/pick.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… and pip installation mistake

### What changes were proposed in this pull request?

This PR proposes to show a better error message when a user mistakenly installs `pyspark` from pip but the default `python` does not correspond to the `pip` used for the installation. See https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560 as an example.

It can be reproduced as below:

I have two Python executables. `python` is Python 3.7, `pip` is bound to Python 3.7, and `python2.7` is Python 2.7.

```bash
pip install pyspark
```

```bash
pyspark
```

```
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 09:23:15)
SparkSession available as 'spark'.
...
```

```bash
PYSPARK_PYTHON=python2.7 pyspark
```

```
Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin']
/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 24: /bin/load-spark-env.sh: No such file or directory
/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 77: /bin/spark-submit: No such file or directory
/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 77: exec: /bin/spark-submit: cannot execute: No such file or directory
```

### Why are the changes needed?

There are multiple questions out there about this error, and their authors have no idea what's going on. See:

- https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560
- https://stackoverflow.com/questions/45991888/path-issue-could-not-find-valid-spark-home-while-searching
- https://stackoverflow.com/questions/49707239/pyspark-could-not-find-valid-spark-home
- https://stackoverflow.com/questions/55569985/pyspark-could-not-find-valid-spark-home
- https://stackoverflow.com/questions/48296474/error-could-not-find-valid-spark-home-while-searching-pycharm-in-windows
- ContinuumIO/anaconda-issues#8076

The answer is usually to set `SPARK_HOME`; however, this isn't completely correct.

It works if you set `SPARK_HOME` because the `pyspark` executable script directly imports the library by using `SPARK_HOME` (see https://github.com/apache/spark/blob/master/bin/pyspark#L52-L53) instead of the default package location specified via the `python` executable. So, this way you use a package installed in a different Python, which isn't ideal.

### Does this PR introduce any user-facing change?

Yes, it improves the error message.

**Before:**

```
Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin']
...
```

**After:**

```
Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin']

Did you install PySpark via a package manager such as pip or Conda? If so,
PySpark was not found in your Python environment. It is possible your
Python environment does not properly bind with your package manager.

Please check your default 'python' and if you set PYSPARK_PYTHON and/or
PYSPARK_DRIVER_PYTHON environment variables, and see if you can import
PySpark, for example, 'python -c 'import pyspark'.

If you cannot import, you can install by using the Python executable directly,
for example, 'python -m pip install pyspark [--user]'. Otherwise, you can also
explicitly set the Python executable, that has PySpark installed, to
PYSPARK_PYTHON or PYSPARK_DRIVER_PYTHON environment variables, for example,
'PYSPARK_PYTHON=python3 pyspark'.
...
```

### How was this patch tested?

Manually tested as described above.

Closes apache#28152 from HyukjinKwon/SPARK-31382.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 0248b32)
Signed-off-by: HyukjinKwon <[email protected]>
…ckManagerMaster stops

### What changes were proposed in this pull request?

This PR (SPARK-31422) aims to return an empty result in order to avoid the `NullPointerException` at `getStorageStatus` and `getMemoryStatus` which happens after `BlockManagerMaster` stops. The empty result is consistent with the current status of `SparkContext`, because `BlockManager` and `BlockManagerMaster` are already stopped.

### Why are the changes needed?

In `SparkEnv.stop`, the following stop sequence is used and `metricsSystem.stop` invokes `sink.stop`.
```
blockManager.master.stop()
metricsSystem.stop() --> sinks.foreach(_.stop)
```

However, some sinks can invoke `BlockManagerSource` and end up with a `NullPointerException`, because `BlockManagerMaster` is already stopped and `driverEndpoint` has become `null`.
```
java.lang.NullPointerException
at org.apache.spark.storage.BlockManagerMaster.getStorageStatus(BlockManagerMaster.scala:170)
at org.apache.spark.storage.BlockManagerSource$$anonfun$10.apply(BlockManagerSource.scala:63)
at org.apache.spark.storage.BlockManagerSource$$anonfun$10.apply(BlockManagerSource.scala:63)
at org.apache.spark.storage.BlockManagerSource$$anon$1.getValue(BlockManagerSource.scala:31)
at org.apache.spark.storage.BlockManagerSource$$anon$1.getValue(BlockManagerSource.scala:30)
```

Since `SparkContext` registers `BlockManagerSource` and never deregisters it, we had better preventively avoid the `NullPointerException` inside `BlockManagerMaster`.
```scala
_env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
```
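
A hedged sketch of the guard (field and method names are illustrative; the real change is inside `BlockManagerMaster`): once the master is stopped and its driver endpoint is nulled out, report an empty result instead of throwing.

```scala
class BlockManagerMasterSketch {
  @volatile private var driverEndpoint: AnyRef = new Object

  def stop(): Unit = { driverEndpoint = null }

  // Consistent with a stopped SparkContext: no block managers to report.
  def getMemoryStatus: Map[String, (Long, Long)] = {
    if (driverEndpoint == null) return Map.empty
    Map("driver" -> (1024L, 512L)) // placeholder for the real ask to the driver endpoint
  }
}
```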

### Does this PR introduce any user-facing change?

Yes. This removes the NPE for users who use `BlockManagerSource`.

### How was this patch tested?

Pass the Jenkins with the newly added test cases.

Closes apache#28187 from dongjoon-hyun/SPARK-31422.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit a6e6fbf)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR backports apache#28192 to branch-2.4.

### Why are the changes needed?

SPARK-31420 affects branch-2.4 too.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Same as apache#28192

Closes apache#28214 from sarutak/SPARK-31420-branch-2.4.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
…e column names

### What changes were proposed in this pull request?

When the `toPandas` API operates on duplicate column names produced by operators like join, we see an error like:

```
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```

This patch fixes the error in the `toPandas` API.

This is the backport of the original patch to branch-2.4.

### Why are the changes needed?

To make `toPandas` work on dataframes with duplicate column names.

### Does this PR introduce any user-facing change?

Yes. Previously, calling the `toPandas` API on a dataframe with duplicate column names would fail. After this patch, it produces the correct result.

### How was this patch tested?

Unit test.

Closes apache#28219 from viirya/SPARK-31186-2.4.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…toPandas with arrow execution

### What changes were proposed in this pull request?

This is to backport apache#28210.

This PR adds support for duplicated column names in `toPandas` with Arrow execution.

### Why are the changes needed?

When we execute `toPandas()` with Arrow execution, it fails if the column names have duplicates.

```py
>>> spark.sql("select 1 v, 1 v").toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas
    pdf = table.to_pandas()
  File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index
    columns = _flatten_single_level_multiindex(columns)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex
    raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
```

### Does this PR introduce any user-facing change?

Yes, previously we will face an error above, but after this PR, we will see the result:

```py
>>> spark.sql("select 1 v, 1 v").toPandas()
   v  v
0  1  1
```

### How was this patch tested?

Added and modified related tests.

Closes apache#28221 from ueshin/issues/SPARK-31441/2.4/to_pandas.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>

### What changes were proposed in this pull request?
This change explicitly sets the locale of the timeline view to 'en' so the appearance is the same as before upgrading vis-timeline.

### Why are the changes needed?
We upgraded vis-timeline in apache#28192, and the upgraded version differs from the one we used before in the notation of dates.
The notation seems to depend on the locale. The following is the appearance in my Japanese environment.
<img width="557" alt="locale-changed" src="https://user-images.githubusercontent.com/4736016/79265314-de886700-7ed0-11ea-8641-fa76b993c0d9.png">

Although the notation is in Japanese, the default format is a little bit unnatural (e.g. 4月9日 05:39 is natural rather than 9 四月 05:39).

I found we can get the same appearance as before by explicitly setting the locale to 'en'.

### Does this PR introduce any user-facing change?
No.
No.

### How was this patch tested?
I visited JobsPage, JobPage and StagePage and confirmed that the timeline view shows dates with the 'en' locale.
<img width="735" alt="fix-date-appearance" src="https://user-images.githubusercontent.com/4736016/79267107-8bfc7a00-7ed3-11ea-8a25-f6681d04a83c.png">

NOTE: apache#28192 will be backported to branch-2.4 and branch-3.0, so this PR should follow apache#28214 and apache#28213.

Closes apache#28218 from sarutak/fix-locale-issue.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>
(cherry picked from commit df27350)
Signed-off-by: Kousuke Saruta <[email protected]>
…figuration

### What changes were proposed in this pull request?
This PR is to backport the fix of apache#28003, add a migration guide, update the PR description and add an end-to-end test case.

Before this PR, the SQL `RESET` command resets the values of static SQL configurations to their defaults and removes the cached values of Spark context configurations in the current session. This PR fixes the bugs. After this PR, the `RESET` command follows its definition and only resets the runtime SQL configuration values to their defaults.
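
An illustrative end-to-end check of the intended behavior (assuming the usual `spark` session in spark-shell; the static conf shown is `spark.sql.warehouse.dir`):

```scala
spark.conf.set("spark.sql.shuffle.partitions", "10")            // runtime SQL conf
val warehouseBefore = spark.conf.get("spark.sql.warehouse.dir") // static SQL conf

spark.sql("RESET")

spark.conf.get("spark.sql.shuffle.partitions")                  // back to the default ("200")
assert(spark.conf.get("spark.sql.warehouse.dir") == warehouseBefore) // static conf untouched
```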

### Why are the changes needed?
When we introduced the feature of static SQL configuration, we did not update the implementation of the SQL `RESET` command.

The static SQL configuration should not be changed by any command at runtime. However, the `RESET` command resets its values to the defaults. We should fix this.

### Does this PR introduce any user-facing change?
Before Spark 2.4.6, the `RESET` command resets both the runtime and static SQL configuration values to their defaults. It also removes the cached values of Spark context configurations in the current session, although these configuration values are for displaying/querying only.

### How was this patch tested?
Added an end-to-end test and a unit test

Closes apache#28262 from gatorsmile/spark-31234followup2.4.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…olumns

For example, for the following `df`:
```
val schema = new StructType()
  .add("c1", new StructType()
    .add("c1-1", StringType)
    .add("c1-2", StringType))
val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```
In Spark 2.4.4,
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```
In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored.
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```

This seems like a regression.

Now, the nested column can be specified:
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```

Also, if `*` is specified as a column, it will throw an `AnalysisException` that `*` cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect.

Updated existing tests.

Closes apache#28266 from imback82/SPARK-31256.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d7499ae)
Signed-off-by: Wenchen Fan <[email protected]>
backport apache#28281 to 2.4

This backport has one difference: there is no `EXTRACT(... FROM ...)` SQL syntax in 2.4, so this PR just uses the common function call syntax.

Closes apache#28299 from cloud-fan/pick.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…eckpoint

### What changes were proposed in this pull request?

This is a backport of apache#26827.

This PR aims to recover `spark.ui.port` and `spark.blockManager.port` from checkpoint like `spark.driver.port`.

### Why are the changes needed?

When the user configures these values, we can respect them.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the newly added test cases.

Closes apache#28320 from dongjoon-hyun/SPARK-30199-2.4.

Authored-by: Aaruna <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… the existing active or default SparkSession

### What changes were proposed in this pull request?

SparkSessionBuilder should not propagate static SQL configurations to the existing active/default SparkSession.
This seems to be a long-standing bug.

```scala
scala> spark.sql("set spark.sql.warehouse.dir").show
+--------------------+--------------------+
|                 key|               value|
+--------------------+--------------------+
|spark.sql.warehou...|file:/Users/kenty...|
+--------------------+--------------------+

scala> spark.sql("set spark.sql.warehouse.dir=2");
org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.sql.warehouse.dir;
  at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154)
  at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
  at org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100)
  at org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
  ... 47 elided

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get
getClass   getOrCreate

scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").getOrCreate
20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
res7: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession6403d574

scala> spark.sql("set spark.sql.warehouse.dir").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.warehou...|  xyz|
+--------------------+-----+

scala>
```

### Why are the changes needed?
A bug fix, as shown in the previous section.

### Does this PR introduce any user-facing change?

Yes, static SQL configurations with SparkSession.builder.config do not propagate to any existing or new SparkSession instances.

### How was this patch tested?

New UT.

Closes apache#28316 from yaooqinn/SPARK-31532.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit 8424f55)
Signed-off-by: Takeshi Yamamuro <[email protected]>
…BAL_TEMP_DATABASE config in SparkSessionBuilderSuite

### What changes were proposed in this pull request?

This PR intends to fix the test code to use lowercase values for the `GLOBAL_TEMP_DATABASE` config in `SparkSessionBuilderSuite`. The handling of the config is different between branch-3.0+ and branch-2.4. In branch-3.0+, Spark always lowercases the value of the config, so I think we had better always use lowercase values for it in the test.

This comes from dongjoon-hyun's comment: apache#28316 (comment)

### Why are the changes needed?

To fix the test failure in branch-2.4.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Fixed the test.

Closes apache#28339 from maropu/SPARK-31532.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…st's internal types

In the PR, I propose to fix the `InSet.sql` method for the cases when the input collection contains values of Catalyst's internal types, for instance `UTF8String`. Elements of the input set `hset` are converted to Scala types and wrapped in `Literal` to properly form the SQL view of the input collection.

The changes fix a bug in `InSet.sql` that made a wrong assumption about the types of the collection elements. See more details in SPARK-31563.
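
An illustrative spark-shell snippet of the rendering idea (not the exact patch): Catalyst values such as `UTF8String` are turned into `Literal`s so that `.sql` produces proper SQL literals.

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.unsafe.types.UTF8String

val hset = Seq(UTF8String.fromString("a"), UTF8String.fromString("b"))
// wrapping in Literal yields ('a', 'b') instead of the raw objects' toString
val rendered = hset.map(v => Literal(v.toString).sql).mkString("(", ", ", ")")
```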

Highly likely, not.

Added a test to `ColumnExpressionSuite`

Closes apache#28343 from MaxGekk/fix-InSet-sql.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 7d8216a)
Signed-off-by: Dongjoon Hyun <[email protected]>
@AmplabJenkins

Can one of the admins verify this patch?

@TJX2014 TJX2014 closed this Apr 26, 2020