Description
Describe the bug
Tilps says:
I got a complete strace. BOINC happily does its work for a while, then it looks for a Unix domain
socket, doesn't find it, and tries to connect to port 6000: no answer. It then looks for a different domain
socket, doesn't find it, and tries port 6001. It repeats this sequence until it reaches port 6006,
finds that it can connect, and then hangs.
6006 is the port TensorBoard uses.
It appears to be searching for an X Window session,
but since there isn't one it keeps searching until it hits the TensorBoard port.
So either we disable its need to find an X session, for whatever reason it has one,
or we give it an X session to find, or we reconfigure TensorBoard to use a non-default port number.
In short: BOINC is searching for an X session, and when it finds that port 6006 is open but does not respond the way an X server would, it hangs.
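The probe order in the strace matches the usual X11 addressing convention, where display :N maps to the Unix domain socket /tmp/.X11-unix/XN and to TCP port 6000+N. A minimal sketch of that mapping follows. The function names are hypothetical, and the range :0 to :6 is taken from the ports seen in the strace, not from BOINC's source code.

```shell
#!/bin/sh
# Sketch only: list the endpoints an X11 client would conventionally try
# for displays :0 through :6. Function names are illustrative assumptions.

# TCP port for display number $1 (convention: 6000 + display number)
x11_tcp_port() {
    echo $((6000 + $1))
}

# Unix domain socket path for display number $1
x11_unix_socket() {
    echo "/tmp/.X11-unix/X$1"
}

for n in 0 1 2 3 4 5 6; do
    sock=$(x11_unix_socket "$n")
    port=$(x11_tcp_port "$n")
    # -S is true only if the path exists and is a socket
    if [ -S "$sock" ]; then state="present"; else state="missing"; fi
    echo "display :$n  socket $sock ($state)  tcp port $port"
done
```

On a server without X11 every socket should show as missing, and display :6 lands on TCP port 6006, the TensorBoard default.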
Steps To Reproduce
- Run the boinc daemon on a CLI-only server (no GUI, no X11)
- Run TensorBoard (shipped with TensorFlow; it listens on TCP port 6006 by default), or any other software
that listens on TCP port 6006.
Steps to reproduce (BOINC side):
# apt-get install boinc-client
cd /var/lib/boinc-client/
boinccmd --project_attach http://www.worldcommunitygrid.org/ $KEY
boinccmd --set_network_mode always
boinccmd --set_run_mode always
boinccmd --set_gpu_mode never
# service boinc-client restart
What actually happens?
root@rampage-107:~# boinccmd --read_global_prefs_override
Operation failed: read() failed
root@rampage-107:~#
At this stage the boinc daemon is stuck, and no work units are processed anymore.
====================================
Expected behavior
root@rampage-107:~# boinccmd --read_global_prefs_override
root@rampage-107:~#
boinccmd must run without errors.
Screenshots
If applicable, add screenshots to help explain your problem.
System Information
- OS: Linux - Ubuntu 18.04 LTS
- BOINC Version: 7.9.3 x86_64-pc-linux-gnu (output of boinc --version; boinc as supplied by Ubuntu)
Additional context
In practice, on any CLI-only Linux server that runs deep learning workloads (TensorFlow) alongside BOINC,
boinc gets stuck after about 30 minutes or so.
This server has enough RAM and disk space, so those causes can be ruled out:
root@rampage-107:~# uptime
00:31:18 up 44 days, 2:37, 16 users, load average: 57.85, 59.09, 57.01
root@rampage-107:~# free -h
total used free shared buff/cache available
Mem: 125G 36G 882M 1.0G 88G 87G
Swap: 8.0G 100M 7.9G
root@rampage-107:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 63G 0 63G 0% /dev
tmpfs 13G 2.8M 13G 1% /run
/dev/sda2 916G 358G 512G 42% /
tmpfs 63G 100K 63G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/loop2 90M 90M 0 100% /snap/core/8039
tmpfs 13G 0 13G 0% /run/user/1000
/dev/loop0 90M 90M 0 100% /snap/core/8213
tmpfs 13G 0 13G 0% /run/user/0
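One of the workarounds mentioned in the description is to move TensorBoard off its default port. A rough sketch of how a replacement port could be checked follows. The helper name is hypothetical, and the default upper bound of display :63 (ports 6000 to 6063) is an assumption based on common firewall conventions for X11; the strace only shows BOINC probing up to 6006 because that is where it connected.

```shell
#!/bin/sh
# Sketch only: succeed if TCP port $1 falls in the range used for X11
# displays :0 through :$2. The default of 63 is an assumption, not a
# value taken from BOINC.
port_in_x11_probe_range() {
    [ "$1" -ge 6000 ] && [ "$1" -le $((6000 + ${2:-63})) ]
}

for candidate in 6006 16006; do
    if port_in_x11_probe_range "$candidate"; then
        echo "$candidate: inside the X11 display port range, avoid"
    else
        echo "$candidate: outside the X11 display port range"
    fi
done

# TensorBoard could then be started on the chosen port, for example:
#   tensorboard --logdir <your log directory> --port 16006
```

Port 16006 is only an example; any free port outside the probed range should serve the same purpose.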
-Technologov, 17.12.2019.