This repository was archived by the owner on Sep 21, 2021. It is now read-only.

Multithreaded, concurrent and intensive use often causes stale containers to remain up forever #808

@mad-p

Description

Zalenium Image Version(s): 3.14.0g ( also reproducible with 3.14.0c )
Docker Version: 18.09.0, build 4d60db4
If using docker-compose, version: 1.23.1, build b02f1306
OS: OSX High Sierra ( also reproducible on CentOS 7.2.1511 and latest Arch Linux )
Docker Command to start Zalenium: Executing through docker-compose.yml

Expected Behavior

Stale containers will get removed after idle timeout.
Thread AutoStartProxyPoolPoller works as expected.

Actual Behavior

Stale containers won't get removed and remain up forever even after idle timeout.
Thread AutoStartProxyPoolPoller hangs forever.
Note that those containers can still be reused as normal.

Minimal code to reproduce the problem

docker-compose.yml

  • --desiredContainers 0 makes it easy to tell whether the idle timeout is working: when the problem occurs, the number of node containers stays above zero.
  • --debugEnabled true makes it easy to tell whether thread AutoStartProxyPoolPoller is still working, i.e. whether the debug log 'Checking containers...' keeps being printed to standard output.
version: "2"

services:
  zalenium:
    image: dosel/zalenium:3.14.0g
    container_name: zalenium
    privileged: true
    tty: true
    ports:
      - "4444:4444"
    volumes:
      - /tmp/videos:/home/seluser/videos/
      - /var/run/docker.sock:/var/run/docker.sock
    command: >
      start
        --seleniumImageName elgalu/selenium:3.14.0-p16
        --desiredContainers 0
        --maxDockerSeleniumContainers 10
        --debugEnabled true
    environment:
      - PULL_SELENIUM_IMAGE=true
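
With --desiredContainers 0, the check described above boils down to counting node containers once the load has stopped. A minimal Ruby sketch of that count follows. The helper names and the "zalenium_" name prefix for node containers are assumptions for illustration, not something Zalenium documents here; adjust the prefix to whatever `docker ps` shows on your host.

```ruby
require 'open3'

# Count container names that start with the given prefix.
# names_output is the text printed by `docker ps --format '{{.Names}}'`.
# ASSUMPTION: node containers are named with a "zalenium_" prefix, while the
# hub container itself is named just "zalenium" (see container_name above).
def count_node_containers(names_output, prefix: 'zalenium_')
  names_output.each_line.map(&:strip).count { |name| name.start_with?(prefix) }
end

# Ask the local Docker daemon for the names of all running containers.
def running_container_names
  out, status = Open3.capture2('docker', 'ps', '--format', '{{.Names}}')
  raise 'docker ps failed' unless status.success?

  out
end
```

Calling `count_node_containers(running_container_names)` every few seconds after the test load has stopped should eventually report 0 if the idle timeout works; a count that stays above zero matches the problem described in this issue.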

Ruby script

  • Run this Ruby script in the background in a few concurrent threads for several minutes.
  • The appropriate number of threads depends on the number of CPU cores.
  • For example, two threads on a machine with four CPU cores works well.
  • A shorter idleTimeout, i.e. more frequent removal of stale containers, seems to reproduce the problem in less time.
    In a "real" UI testing environment where idleTimeout defaults to 90 seconds, each UI test takes about a dozen seconds, and about four tests run concurrently, it usually takes a couple of hours for the problem to occur.
require 'selenium-webdriver' # version 3.14.0 ( also reproducible with version 2.53.4 )

def exec
  caps = Selenium::WebDriver::Remote::Capabilities.chrome
  caps[:idleTimeout] = 10
  driver = Selenium::WebDriver.for :remote, url: 'http://${YOUR_ZALENIUM_HOST}:4444/wd/hub', desired_capabilities: caps
  sleep 10
  driver.quit
end

loop do
  begin
    exec
  rescue => e
    puts e
  end
end
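
The instructions above ask for the script to run in a few threads for several minutes. One way to do that from a single Ruby process is sketched below; `run_concurrently` is a hypothetical helper introduced here, not part of selenium-webdriver or of the original script.

```ruby
# Run the given block repeatedly in `threads` threads until `duration`
# seconds have elapsed. Errors raised by the block are printed and ignored,
# like the rescue in the script above. Returns the number of iterations
# that completed without an error.
def run_concurrently(threads:, duration:, &block)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + duration
  workers = Array.new(threads) do
    Thread.new do
      completed = 0
      while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
        begin
          block.call
          completed += 1
        rescue StandardError => e
          warn e.message
        end
      end
      completed
    end
  end
  workers.sum(&:value) # Thread#value joins the thread and returns its count
end
```

For example, `run_concurrently(threads: 2, duration: 300) { exec }` would replace the `loop do ... end` part of the script with two threads that keep opening and closing sessions for five minutes.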

Java thread dump taken after the problem occurs.

https://gist.github.com/mad-p/6082c9ee556ad84d1304be1c9f91b562

The Java thread dump was taken as follows.

docker exec -it ${YOUR_ZALENIUM_CONTAINER_NAME} bash
seluser@zalenium:~$ ps aux | grep java
seluser@zalenium:~$ sudo kill -3 ${YOUR_PROCESS_ID_OF_JAVA}
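
Once a dump has been saved as text, the entry for AutoStartProxyPoolPoller can be pulled out programmatically instead of by scrolling. The sketch below assumes the usual HotSpot layout, where each thread entry starts with the quoted thread name and entries are separated by blank lines; both helper names are hypothetical.

```ruby
# Return the dump entry (header line plus stack trace) for the named thread,
# or nil if no entry starts with that quoted name.
def thread_block(dump, thread_name)
  dump.split(/\n\s*\n/).find do |entry|
    entry.lstrip.start_with?(%("#{thread_name}"))
  end
end

# Return the java.lang.Thread.State of the named thread (for example
# "RUNNABLE" or "WAITING"), or nil if the thread or its state line is missing.
def thread_state(dump, thread_name)
  entry = thread_block(dump, thread_name)
  return nil unless entry

  entry[/java\.lang\.Thread\.State: (\w+)/, 1]
end
```

With the dump loaded via `File.read`, `puts thread_block(dump, 'AutoStartProxyPoolPoller')` prints the stack of the poller thread, which shows where it is stuck.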

Root cause

The root cause seems to be the issue below.

Properly close the Apache response so that connections can be reused
eclipse-ee4j/jersey#3861
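
To illustrate why an unclosed response can matter, here is a toy model in Ruby. It is not Jersey or Apache HttpClient code, and the assumption that the hang comes from an exhausted connection pool is an interpretation of the linked issue title, not something verified here. `ToyPool`, `leaky_request` and `closing_request` are hypothetical names.

```ruby
# Toy fixed-size connection pool (illustration only).
class ToyPool
  def initialize(size)
    @free = Queue.new
    size.times { |i| @free << "conn-#{i}" }
  end

  # Borrow a connection. Raises ThreadError when none is free; a blocking
  # pool would wait at this point instead, possibly forever.
  def checkout
    @free.pop(true)
  end

  # Return a connection so that later requests can reuse it.
  def checkin(conn)
    @free << conn
  end

  def available
    @free.size
  end
end

# Models a request whose response is never closed: the connection leaks.
def leaky_request(pool)
  pool.checkout
  :ok
end

# Models a request that always releases its connection.
def closing_request(pool)
  conn = pool.checkout
  :ok
ensure
  pool.checkin(conn) if conn
end
```

In this model, any number of `closing_request` calls leaves the pool full, whereas a pool of size N is empty after N `leaky_request` calls and the next checkout cannot proceed.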

Tentative workaround

Use a patched version of jersey-apache-connector.

git clone https://github.com/zalando/zalenium/
cd zalenium
git checkout 3.14.0g
mkdir -p src/main/java/org/glassfish/jersey/apache/connector
curl -o src/main/java/org/glassfish/jersey/apache/connector/ApacheConnector.java https://raw.githubusercontent.com/jersey/jersey/2.22.2/connectors/apache-connector/src/main/java/org/glassfish/jersey/apache/connector/ApacheConnector.java
# Here, manually apply the patch below to ApacheConnector.java.
# https://github.com/eclipse-ee4j/jersey/pull/3861/files
mvn clean package && (cd target && docker build -t ${YOUR_REPOSITORY}/zalenium:3.14.0g . )

Permanent workaround

Please consider upgrading com.spotify/docker-client:8.11.7 to a newer version (not yet released as of Nov 2018) in which docker-client uses jersey-apache-connector:2.29 (scheduled for release in spring 2019).

Zalenium:3.14.0g uses docker-client:8.11.7.
https://github.com/zalando/zalenium/blob/3.14.0g/pom.xml#L62

docker-client:8.11.7 uses jersey-apache-connector:2.22.2.
https://github.com/spotify/docker-client/blob/v8.11.7/pom.xml#L109-L113

See also

com.spotify/docker-client

https://github.com/spotify/docker-client
https://mvnrepository.com/artifact/com.spotify/docker-client

jersey-apache-connector

https://github.com/jersey/jersey/ (old repo)
https://github.com/eclipse-ee4j/jersey/
https://mvnrepository.com/artifact/org.glassfish.jersey.connectors/jersey-apache-connector

Issues at com.spotify/docker-client

spotify/docker-client#727
spotify/docker-client#727 (comment)
spotify/docker-client#727 (comment)

Issues at jersey-apache-connector

https://github.com/jersey/jersey/issues/3772 (old repo)
eclipse-ee4j/jersey#3772
eclipse-ee4j/jersey#3772 (comment)

Jersey release schedule and roadmap

https://projects.eclipse.org/projects/ee4j.jersey
https://github.com/eclipse-ee4j/jersey/wiki/Road-Map
