Multithreaded, concurrent and intensive use often cause stale containers to remain up forever #808
Description
Zalenium Image Version(s): 3.14.0g ( also reproducible with 3.14.0c )
Docker Version: 18.09.0, build 4d60db4
If using docker-compose, version: 1.23.1, build b02f1306
OS: OSX High Sierra ( also reproducible on CentOS 7.2.1511 and latest Arch Linux )
Docker Command to start Zalenium: Executing through docker-compose.yml
Expected Behavior
Stale containers are removed after the idle timeout.
The AutoStartProxyPoolPoller thread works as expected.
Actual Behavior
Stale containers are not removed and remain up forever, even after the idle timeout.
The AutoStartProxyPoolPoller thread hangs forever.
Note that those containers can still be reused as normal.
Minimal code to reproduce the problem
docker-compose.yml
- --desiredContainers 0 lets us tell whether the idle timeout is working. The number of node containers stays above zero when the problem occurs.
- --debugEnabled true lets us tell whether the debug log 'Checking containers...' is constantly printed to standard output, i.e. whether the AutoStartProxyPoolPoller thread is working.
version: "2"
services:
  zalenium:
    image: dosel/zalenium:3.14.0g
    container_name: zalenium
    privileged: true
    tty: true
    ports:
      - "4444:4444"
    volumes:
      - /tmp/videos:/home/seluser/videos/
      - /var/run/docker.sock:/var/run/docker.sock
    command: >
      start
      --seleniumImageName elgalu/selenium:3.14.0-p16
      --desiredContainers 0
      --maxDockerSeleniumContainers 10
      --debugEnabled true
    environment:
      - PULL_SELENIUM_IMAGE=true
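To watch whether the number of node containers drops back to zero, something like the following could be polled on the Docker host. This is only a sketch: it assumes the `docker` CLI is on the PATH and that node containers can be identified by the image given in `--seleniumImageName`; the helper names `count_containers` and `running_node_containers` are made up for illustration.

```ruby
require 'open3'

# Image name taken from --seleniumImageName in the compose file.
NODE_IMAGE = 'elgalu/selenium:3.14.0-p16'

# Counts the non-empty lines of `docker ps --format '{{.Names}}'` output,
# i.e. one line per running container.
def count_containers(docker_ps_output)
  docker_ps_output.lines.map(&:strip).reject(&:empty?).size
end

# Asks the local Docker CLI for running containers created from `image`.
# Assumption: filtering by ancestor image is enough to pick out the nodes.
def running_node_containers(image = NODE_IMAGE)
  out, status = Open3.capture2(
    'docker', 'ps', '--filter', "ancestor=#{image}", '--format', '{{.Names}}'
  )
  raise 'docker ps failed' unless status.success?

  count_containers(out)
end

# Example polling loop (commented out so the file can be loaded safely):
#   loop do
#     puts "#{Time.now} node containers: #{running_node_containers}"
#     sleep 10
#   end
```

If the count stays above zero long after the last test session ended and idleTimeout has passed, that would match the behavior described in this report.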
Ruby script
- Run this Ruby script in the background in a few concurrent threads for several minutes.
- The appropriate number of threads depends on the number of CPU cores.
- For example, two threads on four CPU cores works well.
- A shorter idleTimeout, i.e. a higher frequency of stale containers being removed, seems to reproduce the problem in less time.
In a "real" UI testing environment where idleTimeout defaults to 90 seconds, each UI test takes about a dozen seconds, and about four tests run concurrently, it usually takes a couple of hours for the problem to occur.
require 'selenium-webdriver' # version 3.14.0 ( also reproducible with version 2.53.4 )

# Opens one remote Chrome session with a short idleTimeout, holds it, then quits.
def exec
  caps = Selenium::WebDriver::Remote::Capabilities.chrome
  caps[:idleTimeout] = 10
  driver = Selenium::WebDriver.for :remote, url: 'http://${YOUR_ZALENIUM_HOST}:4444/wd/hub', desired_capabilities: caps
  sleep 10
  driver.quit
end

# Repeat forever; print errors and carry on so the load stays constant.
loop do
  begin
    exec
  rescue => e
    puts e
  end
end
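Instead of starting the script several times by hand, the concurrency could be driven from a single Ruby process. The following is a minimal sketch, not part of the original reproduction: `run_concurrently` is a hypothetical helper, and it assumes the session logic (the `exec` method above) is passed in as a block.

```ruby
# Runs `session` repeatedly in `thread_count` threads until `duration` seconds
# have elapsed. Returns the number of sessions that finished without raising.
def run_concurrently(thread_count:, duration:, &session)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + duration
  completed = 0
  lock = Mutex.new

  threads = Array.new(thread_count) do
    Thread.new do
      while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
        begin
          session.call
          lock.synchronize { completed += 1 }
        rescue => e
          puts e # same error handling as the original loop
        end
      end
    end
  end

  threads.each(&:join)
  completed
end

# Usage, assuming the `exec` method from the script is defined and its own
# `loop do ... end` is removed:
#   run_concurrently(thread_count: 2, duration: 300) { exec }
```

The deadline is only checked between sessions, so a run can overshoot `duration` by up to one session length.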
Java thread dump taken after the problem occurs.
https://gist.github.com/mad-p/6082c9ee556ad84d1304be1c9f91b562
The Java thread dump was taken as follows.
docker exec -it ${YOUR_ZALENIUM_CONTAINER_NAME} bash
seluser@zalenium:~$ ps aux | grep java
seluser@zalenium:~$ sudo kill -3 ${YOUR_PROCESS_ID_OF_JAVA}
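Once the dump text has been saved somewhere, the state of the poller thread can be pulled out with a few lines of Ruby. This sketch assumes the usual HotSpot dump layout (a quoted thread name on a header line, followed by a `java.lang.Thread.State:` line) and that the thread is named exactly `AutoStartProxyPoolPoller`; `thread_state` and the file name `dump.txt` are hypothetical.

```ruby
# Returns the java.lang.Thread.State reported for `thread_name` in a HotSpot
# thread dump, or nil if the thread (or its state line) is not found.
def thread_state(dump, thread_name)
  in_thread = false
  dump.each_line do |line|
    if line.start_with?('"')
      # A new thread header starts here; check whether it is the one we want.
      in_thread = line.start_with?(%("#{thread_name}"))
    elsif in_thread && (m = line.match(/java\.lang\.Thread\.State: (\S+)/))
      return m[1]
    end
  end
  nil
end

# Usage, assuming the dump was saved to dump.txt:
#   puts thread_state(File.read('dump.txt'), 'AutoStartProxyPoolPoller')
```

The state alone does not prove a hang; the stack frames below the state line in the dump show where the thread is blocked.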
Root cause
The root cause seems to be the issue below.
Properly close the Apache response so that connections can be reused
eclipse-ee4j/jersey#3861
Tentative workaround
Use a patched version of jersey-apache-connector.
git clone https://github.com/zalando/zalenium/
cd zalenium
git checkout 3.14.0g
mkdir -p src/main/java/org/glassfish/jersey/apache/connector
curl -o src/main/java/org/glassfish/jersey/apache/connector/ApacheConnector.java https://raw.githubusercontent.com/jersey/jersey/2.22.2/connectors/apache-connector/src/main/java/org/glassfish/jersey/apache/connector/ApacheConnector.java
# Here, manually apply the patch below to ApacheConnector.java.
# https://github.com/eclipse-ee4j/jersey/pull/3861/files
mvn clean package && (cd target && docker build -t ${YOUR_REPOSITORY}/zalenium:3.14.0g . )
Permanent workaround
Please consider upgrading com.spotify/docker-client:8.11.7 to a newer version (not released yet as of Nov 2018) in which docker-client uses jersey-apache-connector:2.29 (scheduled for release in spring 2019).
Zalenium:3.14.0g uses docker-client:8.11.7.
https://github.com/zalando/zalenium/blob/3.14.0g/pom.xml#L62
docker-client:8.11.7 uses jersey-apache-connector:2.22.2.
https://github.com/spotify/docker-client/blob/v8.11.7/pom.xml#L109-L113
See also
com.spotify/docker-client
https://github.com/spotify/docker-client
https://mvnrepository.com/artifact/com.spotify/docker-client
jersey-apache-connector
https://github.com/jersey/jersey/ (old repo)
https://github.com/eclipse-ee4j/jersey/
https://mvnrepository.com/artifact/org.glassfish.jersey.connectors/jersey-apache-connector
Issues at com.spotify/docker-client
spotify/docker-client#727
spotify/docker-client#727 (comment)
spotify/docker-client#727 (comment)
Issues at jersey-apache-connector
https://github.com/jersey/jersey/issues/3772 (old repo)
eclipse-ee4j/jersey#3772
eclipse-ee4j/jersey#3772 (comment)
Jersey release schedule and roadmap
https://projects.eclipse.org/projects/ee4j.jersey
https://github.com/eclipse-ee4j/jersey/wiki/Road-Map