SparkApplication not resubmitted after spec update when previous run failed #2618

@stepanen

Description

What question do you want to ask?

  • ✋ I have searched the open/closed issues and my issue is not listed.

When a SparkApplication is in a FAILED state (e.g., due to driver/executor pod failures), updating the spec (such as changing annotations for the driver or executor) does not trigger a re-submission of the application.

According to the documentation, the operator should detect spec changes and re-submit the application, even terminating the running one if necessary.

My question is:

  1. Are there any gotchas in this behaviour?
  2. What kind of spec changes trigger re-submission?
  3. Is there a configuration that would prevent this?
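For context, here is roughly how I inspect the failed state and the manual workaround I would expect to work (a sketch; the generated run name is a placeholder, and the resource names come from the spark-operator CRDs):

# Inspect the parent object and the runs the operator has generated from it
kubectl -n default get scheduledsparkapplication my-spark-job -o yaml
kubectl -n default get sparkapplications

# Check why the last generated run ended up FAILED
kubectl -n default describe sparkapplication <generated-run-name>

# Manual workaround I would expect to work: delete the failed run so the next
# scheduled submission is built from the updated template
kubectl -n default delete sparkapplication <generated-run-name>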

Additional context

My ScheduledSparkApplication was not operating correctly due to a driver/executor pod error.

  • I updated the application spec by modifying driver and executor annotations and redeployed my Helm chart.
  • The spec was updated successfully, but no new driver or executor pods were created with this updated spec.
  • I expected the operator to re-submit the application based on the updated spec (the checks I ran are sketched after this list).
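The checks behind those observations look roughly like this (a sketch; the Helm release and chart names, and the pod label selector, are assumptions from my setup):

# Roll out the annotation change
helm upgrade my-release ./my-chart -n default

# The new driver/executor annotations show up in the stored spec...
kubectl -n default get scheduledsparkapplication my-spark-job \
  -o jsonpath='{.spec.template.driver.annotations}'

# ...but no new driver or executor pods are created from it
kubectl -n default get pods -l sparkoperator.k8s.io/app-name=my-spark-job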

My ScheduledSparkApplication manifest for reference:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: "my-spark-job"
  namespace: default
  annotations:
    app: custom-spark-job
    prometheus.io/scrape: "true"
    prometheus.io/port: "8090"
  labels:
    ...
spec:
  concurrencyPolicy: Forbid
  schedule: "@every 5m"
  successfulRunHistoryLimit: 1
  failedRunHistoryLimit: 3
  template:
    image: "my-image"
    imagePullPolicy: IfNotPresent
    mode: cluster
    restartPolicy:
      type: Always
      onFailureRetryInterval: 10
      onSubmissionFailureRetryInterval: 10
    sparkVersion: "3.5.3"
    mainApplicationFile: "local:///opt/spark/src/job.py"
    type: "Python"
    pythonVersion: "3"
    sparkConf:
      spark.delta.logStore.class: org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
      spark.kubernetes.container.image.pullSecrets: "regcred"
      spark.kubernetes.memoryOverheadFactor: "0.25"
      spark.sql.broadcastTimeout: "3600"
      spark.sql.streaming.metricsEnabled: "true"
      spark.metrics.appStatusSource.enabled: "true"
      spark.dynamicAllocation.schedulerBacklogTimeout: "5s"
      spark.dynamicAllocation.sustainedSchedulerBacklogTimeout: "5s"
      spark.dynamicAllocation.executorIdleTimeout: "120s"
      spark.dynamicAllocation.cachedExecutorIdleTimeout: "240s"
      spark.kubernetes.driver.label.sidecar.istio.io/inject: "false"
      spark.kubernetes.executor.label.sidecar.istio.io/inject: "false"
      spark.driver.extraClassPath: "/opt/docker/lib/*"
      spark.executor.extraClassPath: "/opt/docker/lib/*"
    driver:
      annotations:
        vault.hashicorp.com/agent-init-first: "true"
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/agent-inject-status: "update"
        vault.hashicorp.com/agent-pre-populate-only: "true"
        vault.hashicorp.com/role: "some-vault-role"
        vault.hashicorp.com/template-static-secret-render-interval: "15s"
        vault.hashicorp.com/agent-inject-secret-client-secrets.sh: "foo"
        vault.hashicorp.com/agent-inject-template-client-secrets.sh: |
          {{- with secret "some/secret/name" -}}
          ... secrets
          {{- end }}
      cores: 1
      coreLimit: "1000m"
      memory: "500m"
      podSecurityContext:
        runAsGroup: 185
        runAsNonRoot: true
        runAsUser: 185
        seccompProfile:
          type: RuntimeDefault
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
          - ALL
        readOnlyRootFilesystem: true
      labels:
        ...
      serviceAccount: default
      envVars:
        ...
      javaOptions: "-Dlogback.configurationFile=/opt/app/resources/logback.xml"
      volumeMounts:
        - mountPath: /tmp
          name: tmp
    executor:
      annotations:
        vault.hashicorp.com/agent-init-first: "true"
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/agent-inject-status: "update"
        vault.hashicorp.com/agent-pre-populate-only: "true"
        vault.hashicorp.com/role: "some-vault-role"
        vault.hashicorp.com/template-static-secret-render-interval: "15s"
        vault.hashicorp.com/agent-inject-secret-client-secrets.sh: "foo"
        vault.hashicorp.com/agent-inject-template-client-secrets.sh: |
          {{- with secret "some/secret/name" -}}
          ... secrets
          {{- end }}
      coreLimit: "1000m"
      cores: 1
      instances: 2
      memory: "1g"
      podSecurityContext:
        runAsGroup: 185
        runAsNonRoot: true
        runAsUser: 185
        seccompProfile:
          type: RuntimeDefault
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
          - ALL
        readOnlyRootFilesystem: true
      labels:
        ...
      envVars:
        ...
      javaOptions: "-Dlogback.configurationFile=/opt/app/resources/logback.xml"
      volumeMounts:
        - mountPath: /tmp
          name: tmp
    monitoring:
      exposeDriverMetrics: true
      exposeExecutorMetrics: true
      metricsPropertiesFile: "/etc/metrics/conf/metrics.properties"
      prometheus:
        configFile: "/etc/metrics/conf/prometheus.yaml"
        jmxExporterJar: "/prometheus/jmx_prometheus_javaagent.jar"
        port: 8090
    volumes:
      - emptyDir: {}
        name: tmp

Thanks in advance for the clarification and for all the work on this great project!

Have the same question?

Give it a 👍. We prioritize the questions with the most 👍.
