Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve logs for spark task failure #106

Closed
jdries opened this issue Jan 10, 2023 · 6 comments
Closed

improve logs for spark task failure #106

jdries opened this issue Jan 10, 2023 · 6 comments
Assignees

Comments

@jdries
Copy link
Contributor

jdries commented Jan 10, 2023

openEO now returns logs like (j-dc9f81221c3f4b9cb3bb46bc97832213):
{
"id": "[1673357256568, 6457]",
"time": "2023-01-10T13:27:36.568Z",
"level": "error",
"message": "Exception in task 0.0 in stage 45.0 (TID 503)"
},

This is a task failure, not a stage failure, and the final job was succesfull.
The real error was:
Error while reading GridBounds(0,0,1023,1023) from: https://services.terrascope.be/download/WORLDCOVER/ESA_WORLDCOVER_10M_2021_V200/MAP/ESA_WorldCover_10m_2021_v200_N39W006_Map/ESA_WorldCover_10m_2021_v200_N39W006_Map.tif

Can we improve this to log the real message, and perhaps reduce task failures to 'warning' level? Users tend to think that an error is always a hard failure.

Note that the useful message is in ES, under exc_info, maybe we can somehow append that to the message, or the first line?

@jdries
Copy link
Contributor Author

jdries commented Jan 10, 2023

Real stage failures are equally useless in the logs:

`  {
    "id": "[1673362908789, 647909]",
    "time": "2023-01-10T15:01:48.789Z",
    "level": "error",
    "message": "Error processing batch job: Py4JJavaError('An error occurred while calling z:org.openeo.geotrellis.geotiff.package.saveRDD.\\n', JavaObject id=o4863)"
  },
  {
    "id": "[1673362908873, 685069]",
    "time": "2023-01-10T15:01:48.873Z",
    "level": "error",
    "message": "Error processing batch job: Py4JJavaError('An error occurred while calling z:org.openeo.geotrellis.geotiff.package.saveRDD.\\n', JavaObject id=o4863)"
  },
  {
    "id": "[1673362908958, 698722]",
    "time": "2023-01-10T15:01:48.958Z",
    "level": "error",
    "message": "Unhandled Py4JJavaError exception: Py4JJavaError('An error occurred while calling z:org.openeo.geotrellis.geotiff.package.saveRDD.\\n', JavaObject id=o4863)"
  }`

The part about org.openeo.geotrellis.geotiff.package.saveRDD is not really necessary. The real error is hidden deeper in the stack:

raise Py4JJavaError(\npy4j.protocol.Py4JJavaError: An error occurred while calling z:org.openeo.geotrellis.geotiff.package.saveRDD.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 46.0 failed 4 times, most recent failure: Lost task 0.3 in stage 46.0 (TID 869) (10.42.67.152 executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):\n File \"/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py\", line 619, in main\n process()\n File \"/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py\", line 611, in process\n serializer.dump_stream(out_iter, outfile)\n File \"/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py\", line 132, in dump_stream\n for obj in iterator:\n File \"/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py\", line 74, in wrapper\n return f(*args, **kwargs)\n File \"/opt/openeo/lib/python3.8/site-packages/epsel.py\", line 44, in wrapper\n return _FUNCTION_POINTERS[key](*args, **kwargs)\n File \"/opt/openeo/lib/python3.8/site-packages/epsel.py\", line 37, in first_time\n return f(*args, **kwargs)\n File \"/opt/openeo/lib/python3.8/site-packages/openeo/util.py\", line 362, in wrapper\n return f(*args, **kwargs)\n File \"/opt/openeo/lib/python3.8/site-packages/openeogeotrellis/collections/s1backscatter_orfeo.py\", line 787, in process_product\n dem_dir_context = S1BackscatterOrfeo._get_dem_dir_context(\n File \"/opt/openeo/lib64/python3.8/site-packages/openeogeotrellis/collections/s1backscatter_orfeo.py\", line 273, in _get_dem_dir_context\n raise FeatureUnsupportedException(\nopeneo_driver.errors.FeatureUnsupportedException: Unsupported elevation model 'COPERNICUS_30'\

A spark listener could also be an option? Or can we walk down the exception stack and print the error of the root cause?

jdries added a commit to Open-EO/openeo-geopyspark-driver that referenced this issue Jan 10, 2023
@bossie
Copy link
Collaborator

bossie commented Jan 17, 2023

FYI I removed raise SystemExit() because its exit status 0 tends to mark a batch job as successful when it wasn't:

openeo.rest.OpenEoClientException: No assets in result.

@jdries jdries added this to the sap05-usability milestone Feb 8, 2023
@bossie
Copy link
Collaborator

bossie commented Feb 13, 2023

The repr_truncate(..., width=200) is too narrow:

Your openEO batch job failed during Spark execution: 'Job aborted due to stage failure: Task 5 in stage 25.0 failed 4 times, most recent failure: Lost task 5.3 in stage 25.0 (TID 404) (epod182.vgt.vito.be executor 77): java.lang.IllegalArgumentExcep...'

@jdries
Copy link
Contributor Author

jdries commented Mar 7, 2023

@EmileSonneveld This is the method that improves exceptions for synchronous calls:
openeogeotrellis.backend.GeoPySparkBackendImplementation.summarize_exception

I suggest reusing it for batch jobs, and maybe extending if necessary.

@jdries
Copy link
Contributor Author

jdries commented Mar 7, 2023

Here's another example that we could deal with better:
Open-EO/openeo-geopyspark-driver#363

@jdries
Copy link
Contributor Author

jdries commented May 9, 2023

done

@jdries jdries closed this as completed May 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants