[SPARK-5068] [SQL] Fix bug query data when path doesn't exist for HiveContext #4356
Conversation
Test build #26729 has started for PR 4356 at commit.
Test build #26729 has finished for PR 4356 at commit.
Test PASSed.
      filteredFiles.mkString(",")
    case None => path.toString

  private def applyFilterIfNeeded(path: Path, filterOpt: Option[PathFilter]): Option[String] = {
    if (fs.exists(path)) {
I think we'd better get `fs` from the path, because with Hadoop NameNode federation we may hit problems like a `Wrong FS` exception if we use `FileSystem.get(sc.hiveconf)` to get `fs`.
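The point above can be illustrated without a Hadoop cluster. This is a hedged sketch, not the PR's code: under NameNode federation, partitions of one table can live under different authorities, and `FileSystem.get(conf)` always returns the *default* filesystem, so handing it a path from another namenode raises the `Wrong FS` error, while `path.getFileSystem(conf)` resolves by the path's own scheme and authority. The helper below (an illustrative name) just extracts that scheme/authority pair:

```scala
import java.net.URI

// Sketch: each path carries the scheme and authority that determine which
// filesystem instance it must be resolved against. Two partitions of one
// table can point at different namenodes under federation.
def fsAuthority(path: String): String = {
  val uri = new URI(path)
  s"${uri.getScheme}://${uri.getAuthority}"
}
```

Because the two authorities below differ, a single filesystem obtained from the default configuration cannot serve both paths.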
Test build #26747 has started for PR 4356 at commit.
Test build #26747 has finished for PR 4356 at commit.
Test PASSed.
    case None => path.toString

  private def applyFilterIfNeeded(path: Path, filterOpt: Option[PathFilter]): Option[String] = {
    val fs = path.getFileSystem(sc.hiveconf)
    if (fs.exists(path)) {
My concern is similar to what @marmbrus mentioned in #3981. It's pretty expensive to check each path serially for tables with lots of partitions, especially when the data reside on S3. Can we use `listStatus` or `globStatus` to retrieve all `FileStatus` objects under some path(s), and then do the filtering locally?
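A minimal sketch of the "list once, filter locally" idea suggested here, emulated with `java.nio` so it runs without Hadoop (function and variable names are illustrative): one bulk listing call stands in for a single `fs.listStatus`/`globStatus` round trip, replacing the N per-partition `exists()` RPCs that get expensive on S3.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Sketch: fetch the directory contents once, then filter candidate
// partition paths against that in-memory set with no further I/O.
def existingPartitions(tableDir: Path, candidates: Seq[Path]): Seq[Path] = {
  // One bulk listing (analogous to one listStatus/globStatus RPC)...
  val listed: Set[Path] = Files.list(tableDir).iterator().asScala.toSet
  // ...then purely local filtering per candidate.
  candidates.filter(listed.contains)
}
```

The design point is that the cost becomes one remote call plus local set lookups, instead of one remote call per partition.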
Can you reconcile this with #5059 and, if that looks good, close this issue? If we decide to go with the other one, it would be good to include your test cases if you think they are valuable.
Sorry for the delay, I am closing it.
…ontext

This PR follows up on PRs #3907, #3891 & #4356. According to marmbrus' and liancheng's comments, I try to use fs.globStatus to retrieve all FileStatus objects under the path(s), and then do the filtering locally.

[1]. Get the pathPattern from the path and put it into pathPatternSet. (hdfs://cluster/user/demo/2016/08/12 -> hdfs://cluster/user/demo/*/*/*)
[2]. Retrieve all FileStatus objects and cache them by updating existPathSet.
[3]. Do the filtering locally.
[4]. If we encounter a new pathPattern, do steps 1 and 2 again. (An external table may have more than one partition pathPattern.)

chenghao-intel jeanlyn

Author: lazymam500 <[email protected]>
Author: lazyman <[email protected]>

Closes #5059 from lazyman500/SPARK-5068 and squashes the following commits:

5bfcbfd [lazyman] move spark.sql.hive.verifyPartitionPath to SQLConf, fix scala style
e1d6386 [lazymam500] fix scala style
f23133f [lazymam500] bug fix
47e0023 [lazymam500] fix scala style, add config flag, break the chaining
04c443c [lazyman] SPARK-5068: fix bug when partition path doesn't exists #2
41f60ce [lazymam500] Merge pull request #1 from apache/master
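Step [1] above can be sketched as a pure string transformation, hedged and not taken from the merged patch: a concrete partition path becomes a glob pattern by wildcarding its trailing partition components. The `depth` parameter (how many components to replace) is an assumption for illustration.

```scala
// Sketch of step [1]: derive a pathPattern from a partition path by
// replacing the last `depth` path components with "*". For example, a
// year/month/day layout has depth 3.
def toPathPattern(path: String, depth: Int): String = {
  val parts = path.stripSuffix("/").split("/")
  val (prefix, _) = parts.splitAt(parts.length - depth)
  (prefix ++ Seq.fill(depth)("*")).mkString("/")
}
```

With this pattern in hand, one `globStatus(pattern)` call can populate the cached set of existing paths (step [2]) against which individual partitions are checked locally (step [3]).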
This is a follow-up for #3907 & #3891.

Hive actually supports non-existent paths (either table or partition paths) by yielding an empty result, but Spark SQL throws an exception. Ideally, we would check path existence during partition processing; however, the `InputFormat` always computes the file splits before that, so an exception is raised if the specified path doesn't exist. This PR goes back to the solution of #3891 and checks the partition/table paths' existence during Spark plan generation. Of course, we can move that logic into `HadoopRDD` if it supports non-existent paths in the future.

@jeanlyn, @marmbrus, @srowen