SparkApplication not resubmitted after spec update when previous run failed #2618

@stepanen

Description

What question do you want to ask?

  • ✋ I have searched the open/closed issues and my issue is not listed.

When a SparkApplication is in a FAILED state (e.g., due to driver/executor pod failures), updating the spec (such as changing annotations for the driver or executor) does not trigger a re-submission of the application.

According to the documentation, the operator should detect spec changes and re-submit the application, even terminating the running one if necessary.

My question is:

  1. Are there any gotchas in this behaviour?
  2. What kind of spec changes trigger re-submission?
  3. Is there a configuration that would prevent this?
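For context, here is roughly how I inspect the failed state and the manual workaround I would expect to work (a sketch; the generated run name is a placeholder, and the resource names come from the spark-operator CRDs):

# Inspect the parent object and the runs the operator has generated from it
kubectl -n default get scheduledsparkapplication my-spark-job -o yaml
kubectl -n default get sparkapplications

# Check why the last generated run ended up FAILED
kubectl -n default describe sparkapplication <generated-run-name>

# Manual workaround I would expect to work: delete the failed run so the next
# scheduled submission is built from the updated template
kubectl -n default delete sparkapplication <generated-run-name>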

Additional context

My ScheduledSparkApplication was not operating correctly due to a driver/executor pod error.

  • I updated the application spec by modifying driver and executor annotations and redeployed my Helm chart.
  • The spec was updated successfully, but no new driver or executor pods were created with this updated spec.
  • I expected the operator to re-submit the application based on the updated spec (the checks I ran are sketched after this list).
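The checks behind those observations look roughly like this (a sketch; the Helm release and chart names, and the pod label selector, are assumptions from my setup):

# Roll out the annotation change
helm upgrade my-release ./my-chart -n default

# The new driver/executor annotations show up in the stored spec...
kubectl -n default get scheduledsparkapplication my-spark-job \
  -o jsonpath='{.spec.template.driver.annotations}'

# ...but no new driver or executor pods are created from it
kubectl -n default get pods -l sparkoperator.k8s.io/app-name=my-spark-job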

My ScheduledSparkApplication manifest for reference:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: "my-spark-job"
  namespace: default
  annotations:
    app: custom-spark-job
    prometheus.io/scrape: "true"
    prometheus.io/port: "8090"
  labels:
    ...
spec:
  concurrencyPolicy: Forbid
  schedule: "@every 5m"
  successfulRunHistoryLimit: 1
  failedRunHistoryLimit: 3
  template:
    image: "my-image"
    imagePullPolicy: IfNotPresent
    mode: cluster
    restartPolicy:
      type: Always
      onFailureRetryInterval: 10
      onSubmissionFailureRetryInterval: 10
    sparkVersion: "3.5.3"
    mainApplicationFile: "local:///opt/spark/src/job.py"
    type: "Python"
    pythonVersion: "3"
    sparkConf:
      spark.delta.logStore.class: org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
      spark.kubernetes.container.image.pullSecrets: "regcred"
      spark.kubernetes.memoryOverheadFactor: "0.25"
      spark.sql.broadcastTimeout: "3600"
      spark.sql.streaming.metricsEnabled: "true"
      spark.metrics.appStatusSource.enabled: "true"
      spark.dynamicAllocation.schedulerBacklogTimeout: "5s"
      spark.dynamicAllocation.sustainedSchedulerBacklogTimeout: "5s"
      spark.dynamicAllocation.executorIdleTimeout: "120s"
      spark.dynamicAllocation.cachedExecutorIdleTimeout: "240s"
      spark.kubernetes.driver.label.sidecar.istio.io/inject: "false"
      spark.kubernetes.executor.label.sidecar.istio.io/inject: "false"
      spark.driver.extraClassPath: "/opt/docker/lib/*"
      spark.executor.extraClassPath: "/opt/docker/lib/*"
    driver:
      annotations:
        vault.hashicorp.com/agent-init-first: "true"
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/agent-inject-status: "update"
        vault.hashicorp.com/agent-pre-populate-only: "true"
        vault.hashicorp.com/role: "some-vault-role"
        vault.hashicorp.com/template-static-secret-render-interval: "15s"
        vault.hashicorp.com/agent-inject-secret-client-secrets.sh: "foo"
        vault.hashicorp.com/agent-inject-template-client-secrets.sh: |
          {{- with secret "some/secret/name" -}}
          ... secrets
          {{- end }}
      cores: 1
      coreLimit: "1000m"
      memory: "500m"
      podSecurityContext:
        runAsGroup: 185
        runAsNonRoot: true
        runAsUser: 185
        seccompProfile:
          type: RuntimeDefault
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
          - ALL
        readOnlyRootFilesystem: true
      labels:
        ...
      serviceAccount: default
      envVars:
        ...
      javaOptions: "-Dlogback.configurationFile=/opt/app/resources/logback.xml"
      volumeMounts:
        - mountPath: /tmp
          name: tmp
    executor:
      annotations:
        vault.hashicorp.com/agent-init-first: "true"
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/agent-inject-status: "update"
        vault.hashicorp.com/agent-pre-populate-only: "true"
        vault.hashicorp.com/role: "some-vault-role"
        vault.hashicorp.com/template-static-secret-render-interval: "15s"
        vault.hashicorp.com/agent-inject-secret-client-secrets.sh: "foo"
        vault.hashicorp.com/agent-inject-template-client-secrets.sh: |
          {{- with secret "some/secret/name" -}}
          ... secrets
          {{- end }}
      coreLimit: "1000m"
      cores: 1
      instances: 2
      memory: "1g"
      podSecurityContext:
        runAsGroup: 185
        runAsNonRoot: true
        runAsUser: 185
        seccompProfile:
          type: RuntimeDefault
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
          - ALL
        readOnlyRootFilesystem: true
      labels:
        ...
      envVars:
        ...
      javaOptions: "-Dlogback.configurationFile=/opt/app/resources/logback.xml"
      volumeMounts:
        - mountPath: /tmp
          name: tmp
    monitoring:
      exposeDriverMetrics: true
      exposeExecutorMetrics: true
      metricsPropertiesFile: "/etc/metrics/conf/metrics.properties"
      prometheus:
        configFile: "/etc/metrics/conf/prometheus.yaml"
        jmxExporterJar: "/prometheus/jmx_prometheus_javaagent.jar"
        port: 8090
    volumes:
      - emptyDir: {}
        name: tmp

Thanks in advance for the clarification and for all the work on this great project!

Have the same question?

Give it a 👍. We prioritize the questions with the most 👍.
