Getting jep.JepException: java.util.concurrent.TimeoutException: Futures timed out after [100 seconds] when starting training from an ORCA PyTorch estimator with the BigDL backend #5560
Comments
@qiuxin2012 could you help take a look at this issue?
Same issue as #4800?
@amardeepjaiman Since the notebook is written for Colab, you must have made a lot of changes. Could you tell us the detailed steps?
I am using the Azure Databricks environment to execute it in a notebook. I am trying to run the BigDL code in a Databricks notebook, not from the command line. Please let me know if you need more information.
@PatrickkZ Please try to reproduce the error, or find the right way to run the notebook.
@amardeepjaiman, hi, we have already reproduced the same error on Databricks. We are looking for a way to solve this problem and will let you know when we have more information.
I am seeing the same error.
How long will it take?
Layer info: TorchModel[5d5e341e]
@amardeepjaiman @xbinglzh I failed to get the jep-backend PyTorch estimator working, but I ran the pyspark-backend PyTorch estimator successfully. See this example: https://github.com/intel-analytics/BigDL/blob/v2.0.0/python/orca/example/learn/pytorch/fashion_mnist/fashion_mnist.py
Databricks Spark conf: spark.executor.cores and spark.cores.max should match your cluster; mine is a single executor with 4 cores.
You need to delete the argument parser in the notebook and hard-code the arguments instead; a minimal sketch of one way to do that follows below.
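Purely as an illustration, the hard-coded arguments might look like the following; the attribute names are assumptions modeled on the fashion_mnist example's command-line options, not the notebook's exact names:

```python
# Hypothetical stand-in for the deleted argparse block; the field names
# are assumptions modeled on the fashion_mnist example's options.
class Args:
    cluster_mode = "spark-submit"  # reuse the existing Databricks Spark session
    backend = "spark"              # pyspark backend instead of the jep-based bigdl one
    batch_size = 4
    epochs = 2

args = Args()
```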
You can use the code below in your notebook directly:
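As a hedged reconstruction based on the linked fashion_mnist example, a pyspark-backend version of the notebook code might look like this; every name, hyperparameter, and the `cluster_mode` value is an assumption rather than the exact code from the comment:

```python
# Illustrative sketch only: the exact notebook code is not reproduced
# here, so treat every name and value as an assumption.
import torch
import torch.nn as nn
from bigdl.orca import init_orca_context
from bigdl.orca.learn.pytorch import Estimator
from bigdl.orca.learn.metrics import Accuracy

# On Databricks the cluster already exists, so attach to the running
# Spark session rather than spawning a new one.
init_orca_context(cluster_mode="spark-submit")

def model_creator(config):
    # Small CNN for 28x28 grayscale Fashion-MNIST images.
    return nn.Sequential(
        nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
        nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
        nn.Flatten(),
        nn.Linear(16 * 4 * 4, 120), nn.ReLU(),
        nn.Linear(120, 84), nn.ReLU(),
        nn.Linear(84, 10),
    )

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(),
                           lr=config.get("lr", 1e-3), momentum=0.9)

orca_estimator = Estimator.from_torch(model=model_creator,
                                      optimizer=optimizer_creator,
                                      loss=nn.CrossEntropyLoss(),
                                      metrics=[Accuracy()],
                                      model_dir=None,
                                      use_tqdm=True,
                                      backend="spark")
```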
OK, let me try and get back to you.
Hi @qiuxin2012, I tried to use the init script you shared, but I am getting an init script failure while starting the Databricks cluster. Which Databricks runtime version are you using? Please check the attached error snapshot and cluster configuration.
The cluster is up with the init script. When I run the given source code with the Spark backend, training seems to start, but I get the following error about the model save directory in the save_pkl function:

```
java.io.FileNotFoundException: /databricks/driver/state.pkl

Py4JJavaError Traceback (most recent call last)
 in main()
/databricks/python/lib/python3.8/site-packages/bigdl/orca/learn/pytorch/pytorch_pyspark_estimator.py in fit(self, data, epochs, batch_size, profile, reduce_results, info, feature_cols, label_cols, callbacks)
/databricks/python/lib/python3.8/site-packages/bigdl/orca/learn/pytorch/pytorch_pyspark_estimator.py in _get_state_dict_from_remote(remote_dir)
/databricks/python/lib/python3.8/site-packages/bigdl/dllib/utils/file_utils.py in get_remote_file_to_local(remote_path, local_path, over_write)
/databricks/python/lib/python3.8/site-packages/bigdl/dllib/utils/file_utils.py in callZooFunc(bigdl_type, name, *args)
/databricks/python/lib/python3.8/site-packages/bigdl/dllib/utils/file_utils.py in callZooFunc(bigdl_type, name, *args)
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
Py4JJavaError: An error occurred while calling o372.getRemoteFileToLocal.
```

It seems the model_dir path we pass in the params is not being picked up on Databricks. I have also tried giving a shared DBFS location as '/dbfs/FileStore' and 'dbfs:/FileStore', but neither is accepted.
Thanks,
Hi @qiuxin2012, I was able to solve the above issue using the latest nightly build of bigdl-spark3 from the BigDL repos. Now the training runs with the above configuration where I have a minimum of 1 worker (with 4 cores) assigned, and training runs on a single worker node. However, I am now getting:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(5, 1) finished unsuccessfully.

Py4JJavaError Traceback (most recent call last)
 in
/local_disk0/spark-535f5fcb-a5b3-4f65-978d-b2904ffacebc/userFiles-3b1eb76a-8e9c-40ee-85cc-cd7442a8bd6b/addedFile9081478138653500889bigdl_spark_3_1_2_2_1_0_SNAPSHOT_python_api-57d27.egg/bigdl/orca/learn/pytorch/pytorch_pyspark_estimator.py in fit(self, data, epochs, batch_size, profile, reduce_results, info, feature_cols, label_cols, validation_data, callbacks)
/databricks/spark/python/pyspark/rdd.py in collect(self)
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
```
@amardeepjaiman Sorry for the late response. We have reproduced your new error; I will inform you when we find a solution.
@amardeepjaiman, hi, you can fix this by adding an environment variable; this works for me when I have 2 workers. Here is my init script:

```bash
# use the latest version of orca
/databricks/python/bin/pip install --pre --upgrade bigdl-orca-spark3
/databricks/python/bin/pip install tqdm
/databricks/python/bin/pip install torch==1.11.0+cpu torchvision==0.12.0+cpu tensorboard -f https://download.pytorch.org/whl/torch_stable.html
/databricks/python/bin/pip install cloudpickle
cp /databricks/python/lib/python3.8/site-packages/bigdl/share/*/lib/*.jar /databricks/jars
```

As for the `elif backend in ["ray", "spark"]:` branch, create the estimator like this:
```python
orca_estimator = Estimator.from_torch(model=model_creator,
                                      optimizer=optimizer_creator,
                                      loss=criterion,
                                      metrics=[Accuracy()],
                                      model_dir=None,
                                      use_tqdm=True,
                                      backend=backend)
```
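With the "spark" backend the estimator takes creator functions rather than live objects, since each worker rebuilds the model, optimizer, and data loader locally. A hedged sketch of a matching data creator and fit call (the creator name and data path are assumptions):

```python
def train_loader_creator(config, batch_size):
    # Called on each worker, so the DataLoader is built locally instead
    # of being pickled and shipped from the driver.
    import torch
    import torchvision
    import torchvision.transforms as transforms
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])
    trainset = torchvision.datasets.FashionMNIST(
        root=config.get("data_dir", "/tmp/fashion_mnist"),  # assumed path
        train=True, download=True, transform=transform)
    return torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                       shuffle=True)

stats = orca_estimator.fit(data=train_loader_creator, epochs=2, batch_size=32)
```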
Hi,
I am trying to run the Fashion-MNIST sample code from the BigDL repo on an Azure Databricks Spark cluster. The sample code link is here:
https://github.com/intel-analytics/BigDL/blob/main/python/orca/colab-notebook/examples/fashion_mnist_bigdl.ipynb
Cluster Configuration:
I have 1 Azure D4_V5 based driver node and 2 Azure standard D4_V5 based worker nodes set up in my Spark cluster.
Azure Databricks Runtime : 9.1 LTS ML (Scala 2.12, Spark 3.1.2)
The Spark configuration is below:

```
spark.executorEnv.PYTHONHOME /databricks/python3/lib/python3.8
spark.serializer org.apache.spark.serializer.JavaSerializer
spark.executorEnv.KMP_BLOCKTIME 0
spark.databricks.delta.preview.enabled true
spark.rpc.message.maxSize 2047
spark.executor.cores 3
spark.executor.memory 8g
spark.files.fetchTimeout 100000s
spark.network.timeout 100000s
spark.databricks.conda.condaMagic.enabled true
spark.driver.memory 8g
spark.scheduler.minRegisteredResourcesRatio 1.0
spark.scheduler.maxRegisteredResourcesWaitingTime 60s
spark.executor.heartbeatInterval 1000000
spark.cores.max 6
spark.default.parallelism 1000
spark.executorEnv.OMP_NUM_THREADS 1
spark.driver.cores 3
```
I create the estimator using:

```python
orca_estimator = Estimator.from_torch(model=net, optimizer=optimizer, loss=criterion, metrics=[Accuracy()], backend="bigdl")
```

and I am getting an exception on the following line:

```python
from bigdl.orca.learn.trigger import EveryEpoch
orca_estimator.fit(data=trainloader, epochs=epochs, validation_data=testloader, checkpoint_trigger=EveryEpoch())
```

Please find below the full stack trace of the error I am getting:
```
jep.JepException: java.util.concurrent.TimeoutException: Futures timed out after [100 seconds]
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.createInterpreter(PythonInterpreter.scala:82)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.init(PythonInterpreter.scala:63)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.check(PythonInterpreter.scala:56)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:104)
at com.intel.analytics.bigdl.orca.net.PythonFeatureSet$.$anonfun$loadPythonSet$1(PythonFeatureSet.scala:90)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:868)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:868)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1657)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [100 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at scala.concurrent.Await$.$anonfun$result$1(package.scala:220)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:57)
at scala.concurrent.Await$.result(package.scala:146)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.$anonfun$threadExecute$2(PythonInterpreter.scala:91)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:90)
... 28 more
org.apache.spark.rdd.RDD.count(RDD.scala:1263)
com.intel.analytics.bigdl.orca.net.PythonFeatureSet$.loadPythonSet(PythonFeatureSet.scala:86)
com.intel.analytics.bigdl.orca.net.PythonFeatureSet.(PythonFeatureSet.scala:168)
com.intel.analytics.bigdl.orca.net.PythonFeatureSet$.python(PythonFeatureSet.scala:61)
com.intel.analytics.bigdl.orca.net.python.PythonZooNet.createFeatureSetFromPyTorch(PythonZooNet.scala:283)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
py4j.Gateway.invoke(Gateway.java:295)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:251)
java.lang.Thread.run(Thread.java:748)
```
Please let me know if anyone has faced this issue in the past.
I am also requesting support from the official BigDL team on this issue, as I want to use the BigDL library for distributed deep learning training on a Spark cluster.
Thanks in advance