Description
What happened?
Currently, when the Spark driver pod fails to pull its container image and enters the ErrImagePull/ImagePullBackOff state, the SparkApplication remains stuck in the "submitted" state indefinitely. The operator provides no mechanism to detect or recover from this situation.
Is there currently any built-in mechanism in Spark Operator to handle driver image pull failures?
If not, are there any recommended workarounds to address this issue?
Would the maintainers be open to implementing proper failure detection for image pull errors?
✋ I also found the same issue, #1737, which was closed without any answer or solution.
Reproduction Code
Submit a SparkApplication whose image reference is invalid, for example one with a nonexistent tag.
Expected behavior
The SparkApplication should transition to the "failed" state with an appropriate error message when the driver pod's image pull fails.
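To illustrate the kind of check this would need, here is a minimal sketch (not the operator's code, and not a proposed patch) that inspects a driver Pod object and reports whether a container is stuck on an image pull. It assumes the Pod is available as a plain dict, for example parsed from `kubectl get pod <driver-pod> -o json`. The function name `find_image_pull_failure` is hypothetical. The fields it reads (`status.containerStatuses[].state.waiting.reason`) are standard Kubernetes Pod status fields.

```python
# Waiting reasons the kubelet reports when a container image cannot be pulled.
IMAGE_PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}


def find_image_pull_failure(pod):
    """Return (container_name, reason, message) for the first container that is
    waiting on a failed image pull, or None if there is no such container.

    `pod` is a Kubernetes Pod object as a plain dict (hypothetical helper,
    written for illustration only).
    """
    status = pod.get("status") or {}
    # Check init containers first, then regular containers.
    container_statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    for cs in container_statuses:
        waiting = (cs.get("state") or {}).get("waiting") or {}
        reason = waiting.get("reason")
        if reason in IMAGE_PULL_FAILURE_REASONS:
            return cs.get("name"), reason, waiting.get("message", "")
    return None
```

A controller or an external watchdog could call something like this on the driver pod and, after some grace period, mark the application as failed using the returned reason and message. The grace period matters because ImagePullBackOff can be transient, for example during a short registry outage.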
Actual behavior
The SparkApplication hangs in the "submitted" state when the driver pod cannot pull its image.
Environment & Versions
- Kubernetes Version: 1.25.11
- Spark Operator Version: 2.3.0
- Apache Spark Version: >= 3.5.0
Additional context
No response
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.