What question do you want to ask?
- ✋ I have searched the open/closed issues and my issue is not listed.
When a SparkApplication is in a FAILED state (e.g., due to driver/executor pod failures), updating the spec (such as changing annotations for the driver or executor) does not trigger a re-submission of the application.
According to the documentation, the operator should detect spec changes and re-submit the application, even terminating the running one if necessary.
My question is:
- Are there any gotchas in this behaviour?
- What kind of spec changes trigger re-submission?
- Is there a configuration that would prevent this?
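For context on what "detecting a spec change" can mean in practice, here is a minimal, hypothetical sketch (not the operator's actual code) of the general controller pattern: hash the stored spec, hash the current spec, and treat a mismatch as a change that warrants re-submission. The specs below are illustrative fragments, not a full SparkApplication.

```python
import hashlib
import json

def spec_hash(spec: dict) -> str:
    """Hash a spec deterministically (sorted keys) so any field change is visible."""
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()

# The same spec before and after updating a driver annotation.
old_spec = {"driver": {"annotations": {"vault.hashicorp.com/role": "some-vault-role"}}}
new_spec = {"driver": {"annotations": {"vault.hashicorp.com/role": "new-vault-role"}}}

# A controller following this pattern would see the hash mismatch as
# "spec changed" and re-submit the application.
print(spec_hash(old_spec) != spec_hash(new_spec))  # True
```

Under this pattern, any field of the spec (annotations included) would count as a change; whether the operator applies it to an already-FAILED application is exactly the question above.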
Additional context
My ScheduledSparkApplication was not operating correctly due to a driver/executor pod error.
- I updated the application spec by modifying driver and executor annotations and redeployed my Helm chart.
- The spec was updated successfully, but no new driver or executor pods were created with this updated spec.
- I expected the operator to re-submit the application based on the updated spec.
My SparkApplication manifest for reference:
```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: "my-spark-job"
  namespace: default
  annotations:
    app: custom-spark-job
    prometheus.io/scrape: "true"
    prometheus.io/port: "8090"
  labels:
    ...
spec:
  concurrencyPolicy: Forbid
  schedule: "@every 5m"
  successfulRunHistoryLimit: 1
  failedRunHistoryLimit: 3
  template:
    image: "my-image"
    imagePullPolicy: IfNotPresent
    mode: cluster
    restartPolicy:
      type: Always
      onFailureRetryInterval: 10
      onSubmissionFailureRetryInterval: 10
    sparkVersion: "3.5.3"
    mainApplicationFile: "local:///opt/spark/src/job.py"
    type: "Python"
    pythonVersion: "3"
    sparkConf:
      spark.delta.logStore.class: org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
      spark.kubernetes.container.image.pullSecrets: "regcred"
      spark.kubernetes.memoryOverheadFactor: "0.25"
      spark.sql.broadcastTimeout: "3600"
      spark.sql.streaming.metricsEnabled: "true"
      spark.metrics.appStatusSource.enabled: "true"
      spark.dynamicAllocation.schedulerBacklogTimeout: "5s"
      spark.dynamicAllocation.sustainedSchedulerBacklogTimeout: "5s"
      spark.dynamicAllocation.executorIdleTimeout: "120s"
      spark.dynamicAllocation.cachedExecutorIdleTimeout: "240s"
      spark.kubernetes.driver.label.sidecar.istio.io/inject: "false"
      spark.kubernetes.executor.label.sidecar.istio.io/inject: "false"
      spark.driver.extraClassPath: "/opt/docker/lib/*"
      spark.executor.extraClassPath: "/opt/docker/lib/*"
    driver:
      annotations:
        vault.hashicorp.com/agent-init-first: "true"
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/agent-inject-status: "update"
        vault.hashicorp.com/agent-pre-populate-only: "true"
        vault.hashicorp.com/role: "some-vault-role"
        vault.hashicorp.com/template-static-secret-render-interval: 15s
        vault.hashicorp.com/agent-inject-secret-client-secrets.sh: "foo"
        vault.hashicorp.com/agent-inject-template-client-secrets.sh: |
          {{- with secret "some/secret/name" -}}
          ... secrets
          {{- end }}
      cores: 1
      coreLimit: "1000m"
      memory: "500m"
      podSecurityContext:
        runAsGroup: 185
        runAsNonRoot: true
        runAsUser: 185
        seccompProfile:
          type: RuntimeDefault
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        readOnlyRootFilesystem: true
      labels:
        ...
      serviceAccount: default
      envVars:
        ...
      javaOptions: "-Dlogback.configurationFile=/opt/app/resources/logback.xml"
      volumeMounts:
        - mountPath: /tmp
          name: tmp
    executor:
      annotations:
        vault.hashicorp.com/agent-init-first: "true"
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/agent-inject-status: "update"
        vault.hashicorp.com/agent-pre-populate-only: "true"
        vault.hashicorp.com/role: "some-vault-role"
        vault.hashicorp.com/template-static-secret-render-interval: 15s
        vault.hashicorp.com/agent-inject-secret-client-secrets.sh: "foo"
        vault.hashicorp.com/agent-inject-template-client-secrets.sh: |
          {{- with secret "some/secret/name" -}}
          ... secrets
          {{- end }}
      coreLimit: "1000m"
      cores: 1
      instances: 2
      memory: "1g"
      podSecurityContext:
        runAsGroup: 185
        runAsNonRoot: true
        runAsUser: 185
        seccompProfile:
          type: RuntimeDefault
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        readOnlyRootFilesystem: true
      labels:
        ...
      envVars:
        ...
      javaOptions: "-Dlogback.configurationFile=/opt/app/resources/logback.xml"
      volumeMounts:
        - mountPath: /tmp
          name: tmp
    monitoring:
      exposeDriverMetrics: true
      exposeExecutorMetrics: true
      metricsPropertiesFile: "/etc/metrics/conf/metrics.properties"
      prometheus:
        configFile: "/etc/metrics/conf/prometheus.yaml"
        jmxExporterJar: "/prometheus/jmx_prometheus_javaagent.jar"
        port: 8090
    volumes:
      - emptyDir: {}
        name: tmp
```
Thanks in advance for the clarification and for all the work on this great project!