Fix SparkKubernetesOperator when using initContainers #38119

Merged: 3 commits, May 1, 2024

@ShelRoman ShelRoman commented Mar 13, 2024

I found this error while trying to use initContainers with a Spark job after upgrading apache-airflow-providers-cncf-kubernetes to 8.0.1 (also reproducible on 8.0.0).

[2024-03-13, 11:36:49 UTC] {custom_object_launcher.py:312} ERROR - Exception when attempting to create spark job
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py", line 305, in start_spark_job
    self.check_pod_start_failure()
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py", line 348, in check_pod_start_failure
    raise AirflowException(f"Spark Job Failed. Status: {waiting_reason}, Error: {waiting_message}")
airflow.exceptions.AirflowException: Spark Job Failed. Status: PodInitializing, Error: None
[2024-03-13, 11:36:49 UTC] {taskinstance.py:2728} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py", line 265, in execute
    self.pod = self.get_or_create_spark_crd(self.launcher, context)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py", line 223, in get_or_create_spark_crd
    driver_pod, spark_obj_spec = launcher.start_spark_job(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py", line 313, in start_spark_job
    raise e
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py", line 305, in start_spark_job
    self.check_pod_start_failure()
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py", line 348, in check_pod_start_failure
    raise AirflowException(f"Spark Job Failed. Status: {waiting_reason}, Error: {waiting_message}")
airflow.exceptions.AirflowException: Spark Job Failed. Status: PodInitializing, Error: None

Using a patched operator with these changes resolved the issue.
There might be other statuses that should be included in this check.
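
For illustration, the kind of check involved can be sketched as below. This is a minimal, self-contained sketch and not the actual diff: `TRANSIENT_WAITING_REASONS`, `SparkJobStartError`, and `check_waiting_state` are hypothetical names, and the sketch assumes the fix amounts to treating `PodInitializing` (the waiting reason reported while init containers run, as seen in the traceback above) as a normal startup state alongside `ContainerCreating`.

```python
# Hypothetical sketch; it does not import Airflow or the Kubernetes client.

# Waiting reasons assumed to mean "the pod is still starting", not "the pod failed".
TRANSIENT_WAITING_REASONS = frozenset({"ContainerCreating", "PodInitializing"})


class SparkJobStartError(Exception):
    """Stand-in for AirflowException in this sketch."""


def check_waiting_state(waiting_reason, waiting_message=None):
    """Raise only when the driver container waits for a non-transient reason."""
    if waiting_reason is None or waiting_reason in TRANSIENT_WAITING_REASONS:
        # No waiting state, or the pod is still starting: not a failure.
        return
    raise SparkJobStartError(
        f"Spark Job Failed. Status: {waiting_reason}, Error: {waiting_message}"
    )


if __name__ == "__main__":
    check_waiting_state("PodInitializing")  # no exception with this sketch
    try:
        check_waiting_state("ImagePullBackOff", "pull failed")
    except SparkJobStartError as err:
        print(err)
```

Keeping the accepted reasons in one collection makes it straightforward to add further statuses later if more turn out to be harmless during startup.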



eladkal commented Mar 13, 2024

Can you add unit test to avoid regression?

ShelRoman (author) commented

> Can you add unit test to avoid regression?

Sure, will do

@amoghrajesh amoghrajesh left a comment

Change looks good, add a few tests

eladkal commented Apr 3, 2024

@ShelRoman kind reminder, can you please add tests to cover this change?

ShelRoman (author) commented

> @ShelRoman kind reminder, can you please add tests to cover this change?

done

@amoghrajesh amoghrajesh left a comment

Looks good to me +1
@eladkal @hussein-awala WDYT?

@eladkal eladkal merged commit 97871a0 into apache:main May 1, 2024
49 checks passed
@ShelRoman ShelRoman deleted the patch-1 branch May 1, 2024 13:03
Labels
area:providers provider:cncf-kubernetes Kubernetes provider related issues