Fix the failure to load JuiceFS text files in Notebook #73

Closed
chncaesar opened this issue Feb 12, 2022 · 1 comment

chncaesar commented Feb 12, 2022

Issue Description

When running byzer-lang with JuiceFS, loading a JuiceFS text file in Notebook with LOAD text.`jfs://test/access.log` AS nginx_raw_access_log; fails with an exception:

2022-02-12 20:37:51,607 INFO job.DefaultMLSQLJobProgressListener: [owner] [admin] [groupId] [6908c4e6-17fb-4c89-b256-5d692a25ed82] __MMMMMM__ Total jobs: 1 current job:1 job script:LOAD text.`jfs://test/access.log` AS nginx_raw_access_log
org.apache.spark.sql.AnalysisException: Path does not exist: file:/mlsql/admin/jfs:/test/access.log
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:803)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:800)
    at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1067)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1703)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:172)

Cause analysis

The exception shows that the resolved path is file:/mlsql/admin/jfs:/test/access.log. /mlsql/admin is user admin's home directory, and jfs://test is the JuiceFS scheme and volume name, which is configured in core-site.xml:

<property>
  <name>juicefs.test.meta</name>
  <value>mysql://zjc:zjc@(localhost:13306)/jfs</value>
</property>

The realPath logic is in

DslAdaptor.scala
def withPathPrefix(prefix: String, path: String): String = {

    val newPath = cleanStr(path)
    if (prefix.isEmpty) return newPath

    if (path.contains("..")) {
      throw new RuntimeException("path should not contains ..")
    }
    if (path.startsWith("/")) {
      return prefix + path.substring(1, path.length)
    }
    return prefix + newPath
  }

This works for paths starting with "/", but breaks if the path starts with a scheme such as jfs://, hdfs://, or wasb://, because the user home prefix is prepended to the whole URI.
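
The failure can be reproduced outside Byzer-lang with a minimal sketch. The function body is the one quoted above; the cleanStr stand-in and the prefix value "/mlsql/admin/" are assumptions, since neither is shown in this issue.

```scala
object WithPathPrefixRepro {
  // Assumption: the real cleanStr in DslAdaptor.scala is not shown in this issue.
  // This stand-in only strips one pair of surrounding backticks.
  def cleanStr(str: String): String =
    if (str.length >= 2 && str.startsWith("`") && str.endsWith("`"))
      str.substring(1, str.length - 1)
    else str

  // Same logic as the snippet quoted in the issue.
  def withPathPrefix(prefix: String, path: String): String = {
    val newPath = cleanStr(path)
    if (prefix.isEmpty) return newPath
    if (path.contains("..")) {
      throw new RuntimeException("path should not contains ..")
    }
    if (path.startsWith("/")) {
      return prefix + path.substring(1, path.length)
    }
    prefix + newPath
  }

  def main(args: Array[String]): Unit = {
    // Assumed home prefix for user admin, with a trailing slash.
    val home = "/mlsql/admin/"
    println(withPathPrefix(home, "/access.log"))           // /mlsql/admin/access.log
    println(withPathPrefix(home, "jfs://test/access.log")) // /mlsql/admin/jfs://test/access.log
  }
}
```

The second result has no scheme of its own, so Spark presumably treats it as a local path, which matches the file:/mlsql/admin/jfs:/test/access.log in the exception.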

Proposed Solutions

Code Change

Since Byzer-lang uses HDFS-compatible APIs to access third-party storage, the real path format should be <storage_type>://<scheme>/<user_home_path>/<original_path>. For JuiceFS, the real path should be jfs://test/mlsql/admin/access.log.

For the local file system, the real path is /<user_home_path>/<original_path>.

So the new logic should be:

  1. If the original path does not start with "/", generate a real path like <storage_type>:///<user_home_path>/<original_path>.
  2. If the original path starts with "/", it is on the local file system. Generate a path like /<user_home_path>/<original_path>.

Personally, I prefer this solution.
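
The code-change proposal could look roughly like the sketch below. It is not the change that was merged: the object and function names are hypothetical, and it detects a scheme with java.net.URI rather than checking for a leading "/", which is an assumption about how the rule would be implemented.

```scala
import java.net.URI

object SchemeAwarePrefixSketch {
  // Hypothetical helper: inserts the user home between the URI authority
  // and the original path, or prefixes it directly for scheme-less paths.
  // Note: new URI(...) throws URISyntaxException on characters such as spaces.
  def withSchemeAwarePrefix(prefix: String, path: String): String = {
    require(!path.contains(".."), "path should not contain ..")
    val home = prefix.stripPrefix("/").stripSuffix("/")
    val homePart = if (home.isEmpty) "" else home + "/"
    val uri = new URI(path)
    if (uri.getScheme == null) {
      // No scheme: treat as a plain path under the user home.
      "/" + homePart + path.stripPrefix("/")
    } else {
      // Scheme present: keep scheme and authority, then add the user home.
      val authority = Option(uri.getRawAuthority).getOrElse("")
      val rest = Option(uri.getRawPath).getOrElse("").stripPrefix("/")
      s"${uri.getScheme}://$authority/$homePart$rest"
    }
  }

  def main(args: Array[String]): Unit = {
    println(withSchemeAwarePrefix("/mlsql/admin/", "jfs://test/access.log")) // jfs://test/mlsql/admin/access.log
    println(withSchemeAwarePrefix("/mlsql/admin/", "/access.log"))           // /mlsql/admin/access.log
  }
}
```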

Configuration Change

Add the following config to core-site.xml and change the script to LOAD text.`/access.log`;

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>juicefs://test/</value>
    </property>
</configuration>
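
To illustrate why this works: once fs.defaultFS is set, a scheme-less path is qualified against it. The sketch below approximates that step with java.net.URI.resolve; it is an analogy for the Hadoop behaviour, not the Hadoop API itself, and the object and function names are hypothetical.

```scala
import java.net.URI

object DefaultFsSketch {
  // Hypothetical helper: qualifies a path against a default file system URI.
  // Paths that already carry a scheme are returned unchanged by URI.resolve.
  def qualify(defaultFs: String, path: String): String =
    new URI(defaultFs).resolve(path).toString

  def main(args: Array[String]): Unit = {
    // /mlsql/admin/access.log is what withPathPrefix yields for /access.log,
    // assuming the home prefix /mlsql/admin/.
    println(qualify("juicefs://test/", "/mlsql/admin/access.log")) // juicefs://test/mlsql/admin/access.log
  }
}
```

With this approach the user script never mentions the storage scheme, and withPathPrefix only ever sees paths that start with "/".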

ZhengshuaiPENG transferred this issue from byzer-org/byzer-lang Feb 13, 2022
ZhengshuaiPENG added the bug label Feb 13, 2022
ZhengshuaiPENG added this to the Sprint-02/25 milestone Feb 13, 2022

chncaesar commented

Conclusion:
The storage scheme should not be exposed to users. The second solution (Configuration Change) is used.

Lindsaylin changed the title from "Failed to load JuiceFS text file in Notebook" to "Fix the failure to load JuiceFS text files in Notebook" Mar 8, 2022