
[FEA] Access test files in resources from Spark source code when running Spark UT #10875

Closed · thirtiseven opened this issue May 23, 2024 · 2 comments
Labels: test (Only impacts tests)

@thirtiseven (Collaborator)

Is your feature request related to a problem? Please describe.
Some Spark unit tests read files from the resources folder in Spark's source tree. So when we introduce Spark UTs into the plugin, we cannot read those files directly.

For example, "SPARK-31716: inferring should handle malformed input" in RapidsJsonSuite fails with the following error when it is included:

```
- SPARK-31716: inferring should handle malformed input *** FAILED ***
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/home/haoyangl/spark-rapids/tests/src/test/resources/test-data/malformed_utf8.json
  at org.apache.spark.sql.errors.QueryCompilationErrors$.dataPathNotExistError(QueryCompilationErrors.scala:1011)
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:785)
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:782)
  at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)
  at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
  at scala.util.Success.$anonfun$map$1(Try.scala:255)
  at scala.util.Success.map(Try.scala:213)
  at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
  at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
  at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
  ...
```

I hope we can find a way to read the test files in Spark's resources so that we can actually exercise the related Spark UT cases.

Describe the solution you'd like

Gluten overrides testFile:

```scala
  /** Returns full path to the given file in the resource folder */
  override protected def testFile(fileName: String): String = {
    getWorkspaceFilePath("sql", "core", "src", "test", "resources").toString + "/" + fileName
  }
```

where getWorkspaceFilePath is defined in Spark as:

```scala
/**
 * Get a Path relative to the root project. It is assumed that a spark home is set.
 */
protected final def getWorkspaceFilePath(first: String, more: String*): Path = {
  if (!(sys.props.contains("spark.test.home") || sys.env.contains("SPARK_HOME"))) {
    fail("spark.test.home or SPARK_HOME is not set.")
  }
  val sparkHome = sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME"))
  java.nio.file.Paths.get(sparkHome, first +: more: _*)
}
```

Gluten leverages the system property "spark.test.home" (check here). To run Spark UTs for shim 3.x.y, Gluten prepares a Docker container with a source folder containing the 3.x.y code.

To do the same, we would need to set up a Spark source folder before running the Spark UT CI, and document how to set up the environment when running Spark UTs locally.

Describe alternatives you've considered
Those resource files are also packaged in jars, so it should be possible to read them from the jars if we know where the files are located.
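As a rough sketch of this alternative (the helper name and structure here are hypothetical, not the actual fix), a suite could locate a resource on the test classpath and copy it out to a temporary file, so that Spark can read it through a regular file: path:

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Hypothetical helper (illustration only): copy a classpath resource,
// e.g. one packaged in Spark's test jar, out to a temporary file so that
// Spark can read it via a plain file: path.
object ResourceExtractor {
  def extractToTempFile(resourcePath: String): File = {
    val in = getClass.getClassLoader.getResourceAsStream(resourcePath)
    require(in != null, s"Resource not found on classpath: $resourcePath")
    try {
      // Keep the original extension so format inference still works
      val dot = resourcePath.lastIndexOf('.')
      val suffix = if (dot >= 0) resourcePath.substring(dot) else ".tmp"
      val tmp = File.createTempFile("spark-ut-resource", suffix)
      tmp.deleteOnExit()
      Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
      tmp
    } finally in.close()
  }
}
```

A suite could then read the file with something like spark.read.json(ResourceExtractor.extractToTempFile("test-data/malformed_utf8.json").getAbsolutePath), assuming the resource is on the test classpath.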

Additional context
Copy files to plugin PR: #10864
Spark UT RapidsJsonSuite issue: #10773

@thirtiseven added the feature request (New feature or request) and ? - Needs Triage (Need team to review and classify) labels on May 23, 2024
@mattahrens added the test (Only impacts tests) label on May 28, 2024
@mattahrens (Collaborator)

Preference is to include a test jar artifact rather than depending on the full Spark source.
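For reference, Spark publishes test-jar artifacts with the "tests" classifier, so pulling one in might look roughly like this in the tests module's pom (the version and Scala suffix below are illustrative):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.3.0</version>
  <classifier>tests</classifier>
  <scope>test</scope>
</dependency>
```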

@mattahrens removed the ? - Needs Triage (Need team to review and classify) label on May 28, 2024
@sameerz removed the feature request (New feature or request) label on May 28, 2024
@gerashegalov self-assigned this on May 29, 2024
gerashegalov added a commit to gerashegalov/spark-rapids that referenced this issue May 29, 2024
Closes NVIDIA#10875
Contributes to NVIDIA#10773

Spark UTs need to be able to spark.read data

Signed-off-by: Gera Shegalov <[email protected]>
gerashegalov added a commit that referenced this issue May 30, 2024
Closes #10875
Contributes to #10773
    
Unjar, cache, and share the test jar content among all test suites from the same jar

Test:
```bash
mvn package -Dbuildver=330 -pl tests -am -Dsuffixes='.*\.RapidsJsonSuite'
```

Signed-off-by: Gera Shegalov <[email protected]>
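The "unjar, cache, and share the test jar content among all test suites" approach from the commit above could be sketched roughly like this (the object name and structure are illustrative, not the actual implementation; note this sketch has no zip-slip guard):

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import java.util.jar.JarFile
import scala.collection.concurrent.TrieMap

// Illustrative sketch: each test jar is extracted at most once into a
// temp directory, and all suites reading from the same jar share it.
object TestJarCache {
  private val cache = TrieMap.empty[String, Path]

  def extractedDir(jarPath: String): Path = cache.getOrElseUpdate(jarPath, {
    val destRoot = Files.createTempDirectory("spark-rapids-test-jar-")
    val jar = new JarFile(jarPath)
    try {
      val entries = jar.entries()
      while (entries.hasMoreElements) {
        val entry = entries.nextElement()
        val target = destRoot.resolve(entry.getName)
        if (entry.isDirectory) {
          Files.createDirectories(target)
        } else {
          Files.createDirectories(target.getParent)
          val in = jar.getInputStream(entry)
          try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
          finally in.close()
        }
      }
    } finally jar.close()
    destRoot
  })
}
```

A suite would then resolve test files under TestJarCache.extractedDir(pathToTestJar), and repeated calls for the same jar reuse the already extracted directory.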
@gerashegalov (Collaborator)

Closed by #10946

SurajAralihalli pushed a commit to SurajAralihalli/spark-rapids that referenced this issue Jul 12, 2024