[CRITICAL] boinc daemon hangs if port TCP/6006 is not an X11 server. #3405

@Technologov

Describe the bug
Tilps says:
I got a complete strace. BOINC happily does things for a while, then it looks for a domain
socket, doesn't find it, and tries to connect to port 6000: no answer. It looks for a different domain
socket, doesn't find it, then tries port 6001. It repeats this sequence until it gets to port 6006,
finds it can connect, and then hangs.
6006 is the port TensorBoard uses.
It appears to be searching for an X Window session,
but since there isn't one, it keeps searching until it hits the TensorBoard port.
So either we disable its need to find an X Window session,
or we provide an X Window session for it to find, or we reconfigure TensorBoard to use a non-default port number.

BOINC searches for an X Window session. When it finds that port 6006 is open but does not respond the way an X server would, it hangs.
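
The sequence in the strace matches the usual X11 convention, where display :N is reached through the Unix domain socket /tmp/.X11-unix/XN or, failing that, TCP port 6000 + N. The following is a minimal Python sketch of that probing order, not BOINC's actual code; the function names and the upper bound of 7 displays are assumptions made for illustration.

```python
import os
import socket

X11_TCP_BASE = 6000              # X11 convention: display :N uses TCP port 6000 + N
X11_UNIX_DIR = "/tmp/.X11-unix"  # conventional directory for X11 domain sockets


def x11_endpoints(display):
    """Return (unix_socket_path, tcp_port) for X11 display number `display`."""
    return f"{X11_UNIX_DIR}/X{display}", X11_TCP_BASE + display


def tcp_port_open(host, port, timeout=1.0):
    """True if a TCP connect succeeds; says nothing about the protocol spoken."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable_display(max_displays=7, host="127.0.0.1"):
    """Hypothetical scan: first display whose domain socket or TCP port is reachable."""
    for display in range(max_displays):
        unix_path, tcp_port = x11_endpoints(display)
        if os.path.exists(unix_path) or tcp_port_open(host, tcp_port):
            return display
    return None
```

On a host with no X server but with something listening on TCP/6006, such a scan would stop at display 6, because a successful TCP connect alone does not prove that the peer is an X server.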

Steps To Reproduce

  1. Run the boinc daemon on a CLI-only server (no GUI, no X11)
  2. Run TensorFlow with TensorBoard (TensorBoard uses port TCP/6006 by default), or any other
    software that uses port TCP/6006.

Steps to reproduce (BOINC side):
# apt-get install boinc-client
cd /var/lib/boinc-client/
boinccmd --project_attach http://www.worldcommunitygrid.org/  $KEY
boinccmd --set_network_mode always
boinccmd --set_run_mode always
boinccmd --set_gpu_mode never
# service boinc-client restart
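
To reproduce step 2 without installing TensorFlow, any process that accepts connections on TCP/6006 and never replies should do. Below is a hedged Python sketch of such a stand-in. `silent_listener` and `read_with_timeout` are hypothetical helper names, and whether BOINC 7.9.3 blocks in exactly this way is an assumption based on the strace description in this report.

```python
import socket
import threading


def silent_listener(port=6006, host="127.0.0.1"):
    """Accept TCP connections on `port` and never send a byte back.

    Returns the listening socket; closing it ends the accept loop.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(5)
    srv.settimeout(0.2)  # wake up regularly so a close() is noticed
    held = []  # keep accepted sockets open so clients see a silent peer

    def accept_loop():
        while True:
            try:
                conn, _ = srv.accept()
            except socket.timeout:
                continue  # nothing to accept yet, poll again
            except OSError:
                break  # listener was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv


def read_with_timeout(host, port, timeout=0.5):
    """Connect and wait for a reply; return None if the peer stays silent."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        try:
            return conn.recv(1)
        except socket.timeout:
            return None
```

Calling `silent_listener()` occupies TCP/6006 on the loopback interface. `read_with_timeout("127.0.0.1", 6006)` then returns None after the deadline, whereas a client that reads without a timeout would wait indefinitely.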
What actually happens?
root@rampage-107:~# boinccmd --read_global_prefs_override
Operation failed: read() failed
root@rampage-107:~#

At this stage the boinc daemon is stuck, and no work units are processed anymore.

====================================

Expected behavior
root@rampage-107:~# boinccmd --read_global_prefs_override
root@rampage-107:~#
boinccmd must run without errors.
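
The failing and expected outputs in this report suggest a simple health check: run the command with a deadline and treat a timeout, a non-zero exit status, or output containing "failed" as a sign of a stuck daemon. A minimal Python sketch follows. `command_healthy` is a hypothetical helper, the 10-second deadline is an arbitrary assumption, and it is also an assumption that boinccmd reports this failure through its exit status or through the message shown under "What actually happens".

```python
import subprocess


def command_healthy(argv, timeout=10.0):
    """Run `argv`; True only if it exits 0 in time without reporting a failure."""
    try:
        result = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the executable could not be started at all.
        return False
    output = result.stdout + result.stderr
    return result.returncode == 0 and "failed" not in output.lower()
```

For example, `command_healthy(["boinccmd", "--read_global_prefs_override"])` could be run periodically from a monitoring job to notice when the daemon stops answering.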

System Information

  • OS: Linux - Ubuntu 18.04 LTS
  • BOINC Version: 7.9.3 x86_64-pc-linux-gnu (output of boinc --version; boinc as supplied by Ubuntu)

Additional context
In practice, on any CLI-only Linux server running deep learning (TensorFlow) alongside BOINC,
boinc gets stuck after about 30 minutes.

This server has enough RAM and disk space, so those causes can be ruled out:

root@rampage-107:~# uptime
 00:31:18 up 44 days,  2:37, 16 users,  load average: 57.85, 59.09, 57.01

root@rampage-107:~# free -h
              total        used        free      shared  buff/cache   available
Mem:           125G         36G        882M        1.0G         88G         87G
Swap:          8.0G        100M        7.9G

root@rampage-107:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             63G     0   63G   0% /dev
tmpfs            13G  2.8M   13G   1% /run
/dev/sda2       916G  358G  512G  42% /
tmpfs            63G  100K   63G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/loop2       90M   90M     0 100% /snap/core/8039
tmpfs            13G     0   13G   0% /run/user/1000
/dev/loop0       90M   90M     0 100% /snap/core/8213
tmpfs            13G     0   13G   0% /run/user/0

-Technologov, 17.12.2019.
