
[🐛 Bug]: No resilient option to override variable SE_DRAIN_AFTER_SESSION_COUNT when using FluxCD HelmReleases #2901

@lukaszmatura-sa

Description


What happened?

Hi,

I've been running into issues with the way variables like SE_DRAIN_AFTER_SESSION_COUNT are handled in the selenium-grid chart versus the way we manage version upgrades in our Kubernetes clusters.
The problem affects our automation: each time we push a new version of "selenium-grid" to a Kubernetes cluster, the corresponding HelmRelease fails on the "patch" operation because it cannot match the environment variable (in our case) "SE_DRAIN_AFTER_SESSION_COUNT".

It looks like this parameter defaults to "0"; we want to override it with the value "30", and this is how we do it in our configuration:

[screenshot: extraEnvironmentVariables overriding SE_DRAIN_AFTER_SESSION_COUNT in our HelmRelease values]

The problem is with this setting:

- name: SE_DRAIN_AFTER_SESSION_COUNT
  value: "30"

Although this should work, it doesn't. During HelmRelease reconciliation, when a new version is detected, Flux tries to patch the existing resources, but the patch cannot match that environment variable because the rendered manifest contains two definitions of it (with the values "0" and "30" in our case), and the upgrade fails:

[screenshot: Helm upgrade "cannot patch" error; full text in the log output section below]
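The root cause is that Kubernetes merges `env` lists by the `name` key (strategic merge patch), so rendering the same name twice produces an unresolvable patch. A minimal illustration of the conflicting list the chart effectively renders for the chrome node (values taken from our case):

```yaml
# Illustration only: two env entries sharing the same merge key ("name")
env:
  - name: SE_DRAIN_AFTER_SESSION_COUNT   # injected by the chart's _helpers.tpl
    value: "0"
  - name: SE_DRAIN_AFTER_SESSION_COUNT   # from our extraEnvironmentVariables
    value: "30"
```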

Therefore, each time we upgrade selenium-grid, we have to manually delete all the underlying Selenium Deployments in all our clusters and then resume the HelmRelease so that it recreates all the resources from scratch instead of patching the existing Deployments.
This is cumbersome and causes downtime.
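For reference, our manual recovery currently looks roughly like this (release and namespace names are from our setup; adjust to yours):

```shell
# Suspend reconciliation so Flux does not fight the manual cleanup
flux suspend helmrelease comp-tests-selenium -n comp-tests-selenium

# Delete the node Deployments that carry the conflicting env var
kubectl delete deployment -n comp-tests-selenium \
  -l app.kubernetes.io/instance=comp-tests-selenium

# Resume the HelmRelease so the chart recreates the resources from scratch
flux resume helmrelease comp-tests-selenium -n comp-tests-selenium
```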

We have been investigating the logic of:
https://github.com/SeleniumHQ/docker-selenium/blob/selenium-grid-0.45.1/charts/selenium-grid/templates/_helpers.tpl#L381

and it turns out there is no way to disable it (so that we could safely use our own definition with the value "30"); the default is tied to another setting, nodeMaxSessions:

- name: SE_DRAIN_AFTER_SESSION_COUNT
  value: {{ and (eq (include "seleniumGrid.useKEDA" $) "true") (eq .Values.autoscaling.scalingType "job") | ternary $nodeMaxSessions 0 | quote }}

Since we don't want to change `nodeMaxSessions` just to enforce the value "30", I'm wondering whether this behavior could be fixed by exposing an option to set "SE_DRAIN_AFTER_SESSION_COUNT" directly, so we don't need to redefine it the way we do now.
Another option would be a setting that disables this variable entirely, so it is not set by default at all; we could then define it on our side without causing conflicts with the Helm "patch" operation.
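As a sketch of the first option (`nodeDrainAfterSessionCount` is a hypothetical new values key, not an existing chart option), the helper could let an explicit value take precedence and fall back to the current computed default otherwise:

```yaml
# Sketch only: "nodeDrainAfterSessionCount" is a hypothetical values key
- name: SE_DRAIN_AFTER_SESSION_COUNT
  value: {{ .Values.nodeDrainAfterSessionCount | default (ternary $nodeMaxSessions 0 (and (eq (include "seleniumGrid.useKEDA" $) "true") (eq .Values.autoscaling.scalingType "job"))) | quote }}
```

With Sprig's `default`, the piped value wins when it is non-empty, so existing installations that don't set the key would keep today's behavior.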

Command used to start Selenium Grid with Docker (or Kubernetes)

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: comp-tests-selenium
spec:
  releaseName: comp-tests-selenium
  chart:
    spec:
      chart: selenium-grid
      sourceRef:
        kind: HelmRepository
        name: selenium-grid
      version: "0.45.1"
  interval: 10m
  timeout: 9m30s
  install:
    remediation:
      retries: 3
  # https://github.com/SeleniumHQ/docker-selenium/blob/trunk/charts/selenium-grid/values.yaml
  values:
    global:
      seleniumGrid:
        imagePullSecret: artifactory
        kubectlImage: docker.company.com/bitnami/kubectl:1.31
        imageRegistry: docker.company.com/selenium
    isolateComponents: false
    chromeNode:
      scaledObjectOptions:
        scaleTargetRef:
          name: selenium-chrome-node
      securityContext:
        allowPrivilegeEscalation: false
        runAsNonRoot: true
        capabilities:
          drop: [ "ALL" ]
        seccompProfile:
          type: RuntimeDefault
      imageName: node-chrome
      dshmVolumeSizeLimit: 1.5Gi
      replicas: 2
      resources:
        limits:
          cpu: 2 # chart default is 1
          memory: 1.5Gi
        requests:
          memory: 1Gi
          cpu: 1
      startupProbe:
        httpGet:
          path: /status
          port: 5555
        failureThreshold: 120
        periodSeconds: 5
      terminationGracePeriodSeconds: 90
      # Allow the pod to shut down correctly
      deregisterLifecycle:
        preStop:
          exec:
            command: [ "bash", "-c", "/opt/bin/nodePreStop.sh" ]
      extraEnvironmentVariables: # Custom environment variables for chromeNode
        - name: SCREEN_WIDTH
          value: "1920"
        - name: SCREEN_HEIGHT
          value: "1080"
        - name: SCREEN_DEPTH
          value: "24"
        - name: SCREEN_DPI
          value: "74"
        - name: SE_DRAIN_AFTER_SESSION_COUNT
          value: "30"
        - name: SE_NODE_SESSION_TIMEOUT # The Node will automatically kill a session that has not had any activity in the last X seconds. This will release the slot for other tests
          value: "60"
        - name: SE_NODE_GRID_URL
          value: "http://comp-tests-selenium-selenium-hub.comp-tests-selenium${namespace_suffix}.svc:4444" #hrName-selenium-hub.namespace
        - name: SE_EVENT_BUS_HOST
          value: "comp-tests-selenium-selenium-hub.comp-tests-selenium${namespace_suffix}" #hrName-selenium-hub.namespace
      nodeSelector:
        qa: "true"
      tolerations:
        - key: qa
          value: "true"
          effect: NoSchedule
    firefoxNode:
      enabled: false
    edgeNode:
      enabled: false
    hub:
      securityContext:
        allowPrivilegeEscalation: false
        runAsNonRoot: true
        capabilities:
          drop: [ "ALL" ]
        seccompProfile:
          type: RuntimeDefault
      # affinity: consider podAntiAffinity between hub and nodes; newer chart versions provide this option
      imageName: hub
      serviceType: ClusterIP
      resources:
        limits:
          memory: 2Gi
        requests:
          memory: 1Gi
          cpu: 0.2
      annotations:
        karpenter.sh/do-not-disrupt: "true"
      extraEnvironmentVariables:  # Custom environment variables for hub
        - name: SCREEN_WIDTH
          value: "1920"
        - name: SCREEN_HEIGHT
          value: "1080"
        - name: SCREEN_DEPTH
          value: "24"
        - name: SCREEN_DPI
          value: "74"
        - name: SE_SESSION_REQUEST_TIMEOUT # A new incoming session request is added to the queue. Requests sitting in the queue for longer than the configured time will timeout.
          value: "180"
      nodeSelector:
        qa: "true"
      tolerations:
        - key: qa
          value: "true"
          effect: NoSchedule
    ingress:
      className: private-nginx
      annotations:
        nginx.ingress.kubernetes.io/service-upstream: "true"
        nginx.ingress.kubernetes.io/backend-protocol: HTTP
        external-dns.alpha.kubernetes.io/private: "true"
        cert-manager.io/cluster-issuer: letsencrypt
      hostname: "comp-tests-selenium${namespace_suffix}.tools.${cluster_region}.${cluster_domain}"
      tls:
        - secretName: comp-tests-selenium-private-ingress-tls-selenium
          hosts:
            - "comp-tests-selenium${namespace_suffix}.tools.${cluster_region}.${cluster_domain}"
    autoscaling:
      patchObjectFinalizers:
        enabled: true  #https://github.com/SeleniumHQ/docker-selenium/issues/2196
      enabled: false
      enableWithExistingKEDA: true
      scalingType: deployment
      scaledOptions:
        minReplicaCount: 0
        maxReplicaCount: 5
        pollingInterval: 10
      scaledObjectOptions:
        #        triggers: #consider this section when connection to hub is not properly set
        advanced:
          horizontalPodAutoscalerConfig:
            behavior:
              scaleUp:
                stabilizationWindowSeconds: 30
                policies:
                  - type: Pods
                    value: 4
                    periodSeconds: 10
              scaleDown:
                stabilizationWindowSeconds: 360
                policies:
                  - type: Pods
                    value: 1
                    periodSeconds: 150

Relevant log output

Status:
  Conditions:
    Last Transition Time:  2025-07-17T08:39:26Z
    Message:               Failed to upgrade after 1 attempt(s)
    Observed Generation:   54
    Reason:                RetriesExceeded
    Status:                True
    Type:                  Stalled
    Last Transition Time:  2025-07-17T08:07:07Z
    Message:               Helm upgrade failed for release comp-tests-selenium/comp-tests-selenium with chart selenium-grid@0.45.1: cannot patch "comp-tests-selenium-selenium-node-chrome" with kind Deployment: The order in patch list:
[map[name:SE_NODE_STEREOTYPE_EXTRA value:] map[name:SE_DRAIN_AFTER_SESSION_COUNT value:0] map[name:SE_DRAIN_AFTER_SESSION_COUNT value:30] map[name:SE_NODE_BROWSER_VERSION value:] map[name:SE_NODE_PLATFORM_NAME value:] map[name:SE_OTEL_RESOURCE_ATTRIBUTES value:app.kubernetes.io/component=selenium-grid-4.34.0-20250707,app.kubernetes.io/instance=comp-tests-selenium,app.kubernetes.io/managed-by=helm,app.kubernetes.io/version=4.34.0-20250707,helm.sh/chart=selenium-grid-0.45.1]]
 doesn't match $setElementOrder list:
[map[name:KUBERNETES_NODE_HOST_IP] map[name:SE_NODE_MAX_SESSIONS] map[name:SE_NODE_ENABLE_MANAGED_DOWNLOADS] map[name:SE_NODE_STEREOTYPE_EXTRA] map[name:SE_DRAIN_AFTER_SESSION_COUNT] map[name:SE_NODE_BROWSER_NAME] map[name:SE_NODE_BROWSER_VERSION] map[name:SE_NODE_PLATFORM_NAME] map[name:SE_NODE_CONTAINER_NAME] map[name:SE_OTEL_SERVICE_NAME] map[name:SE_OTEL_RESOURCE_ATTRIBUTES] map[name:SE_NODE_HOST] map[name:SE_NODE_PORT] map[name:SE_NODE_REGISTER_PERIOD] map[name:SE_NODE_REGISTER_CYCLE] map[name:SCREEN_WIDTH] map[name:SCREEN_HEIGHT] map[name:SCREEN_DEPTH] map[name:SCREEN_DPI] map[name:SE_DRAIN_AFTER_SESSION_COUNT] map[name:SE_NODE_SESSION_TIMEOUT] map[name:SE_NODE_GRID_URL] map[name:SE_EVENT_BUS_HOST]]
    Observed Generation:   54
    Reason:                UpgradeFailed
    Status:                False
    Type:                  Ready

Operating System

Kubernetes EKS

Docker Selenium version (image tag)

4.34.0-20250707

Selenium Grid chart version (chart version)

0.45.1
