Skip to content

Commit 9214e78

Browse files
committed
Distributed: Delete bad test
This removes a bad test from Distributed. The failure here looks similar to JuliaLang/julia#35081, but is not quite the same (though the failure is also suppressed by the fix in JuliaLang/julia#35081). The issue in JuliaLang/julia#35081 is that the `Distributed` test tries to use the `@distributed` macro, on the main test cluster, which in turn initializes the lazy connections between the workers in the all_to_all cluster topology. Now, in JuliaLang/julia#35081, the issue was that this initialization on the other worker was delayed and in the meantime the test harness had reaped the worker running the Distributed test, so having the other worker connect to it failed (with ECONNREFUSED). Here we face a similar issue. As in JuliaLang/julia#35081, the use of the `@distributed` macro causes the lazy links to be initialized, but here the issue is that the Distributed tests call the internal function `bind_client_port` which tries to allocate a socket on the common client port. In the actual Distributed source, we only ever call this function together with the `SO_REUSEPORT` setsocketopt, and we rely this option being set on all sockets initialized on our client port. Here, however, we do not set this option, so whether or not the subsequent worker-to-worker connection (triggered by the distributed macro from JuliaLang/julia#35081) succeeds depends on whether the socket created here was reaped by GC (and likely has subsequently exited the CLOSE_WAIT state) allowing its reallocation. If not, we observe the CI error (but with EADDRINUSE rather than ECONNREFUSED). This PR just deletes the problematic test. I don't think it has much value. The tested function is an internal implementation detail and is not exported. It is exercised plenty during regular cluster construction. One could try to test the higher level `socket_reuse_port` function, but to do so properly would basically amount to building the inverse logic of that function into that test, which seems unnecessary. For future reference, to aid in debugging similar issues, here's a dtrace script I used to print the sequence of socket operations performed by Julia. It showed this sequence, though until we found JuliaLang/julia#35081, I didn't understand how it came to be. Nevertheless, it may be helpful for future similar issues: ``` inline string JULIA = "julia"; #pragma D option quiet /* From sys/socket.h */ inline int af_inet = 2; inline int af_inet6 = 30; inline int so_reuseport = 0x0200; syscall::setsockopt:entry /execname==JULIA && arg2==so_reuseport/ { self->start = timestamp; self->fd = arg0; } syscall::setsockopt:return /execname==JULIA && self->start/ { printf("%-11s %-6d %-4d %-16s %-3d SO_REUSEPORT %d\n", probefunc, pid, arg0, execname, self->fd, errno); self->start = 0; } /* Adapted from https://github.com/brendangregg/DTrace-book-scripts/blob/master/Chap6/soconnect.d */ dtrace:::BEGIN { /* Add translations as desired from /usr/include/sys/errno.h */ err[0] = "Success"; err[EINTR] = "Interrupted syscall"; err[EIO] = "I/O error"; err[EACCES] = "Permission denied"; err[ENETDOWN] = "Network is down"; err[ENETUNREACH] = "Network unreachable"; err[ECONNRESET] = "Connection reset"; err[ECONNREFUSED] = "Connection refused"; err[ETIMEDOUT] = "Timed out"; err[EHOSTDOWN] = "Host down"; err[EHOSTUNREACH] = "No route to host"; err[EINPROGRESS] = "In progress"; printf("%-11s %-6s %-4s %-16s %-3s %-16s %-5s %8s %s\n", "SYSCALL", "PID", "FD", "PROCESS", "FAM", "ADDRESS", "PORT", "LAT(us)", "RESULT"); } syscall::connect*:entry, syscall::bind:entry /execname==JULIA/ { /* assume this is sockaddr_in until we can examine family */ this->s = (struct sockaddr_in *)copyin(arg1, sizeof (struct sockaddr)); self->fd = arg0; this->f = this->s->sin_family; } syscall::connect*:entry, syscall::bind:entry /this->f == af_inet && execname==JULIA/ { self->family = this->f; self->port = ntohs(this->s->sin_port); self->address = inet_ntop(self->family, (void *)&this->s->sin_addr); self->start = timestamp; } syscall::connect*:entry, syscall::bind:entry /this->f == af_inet6 && execname==JULIA/ { /* refetch for sockaddr_in6 */ this->s6 = (struct sockaddr_in6 *)copyin(arg1, sizeof (struct sockaddr_in6)); self->family = this->f; self->port = ntohs(this->s6->sin6_port); self->address = inet_ntop(self->family, (struct in6_addr *)&this->s6->sin6_addr); self->start = timestamp; } syscall::connect*:return, syscall::bind:return /self->start && execname==JULIA/ { this->delta = (timestamp - self->start) / 1000; printf("%-11s %-6d %-4d %-16s %-3d %-16s %-5d %8d %d\n", probefunc, pid, self->fd, execname, self->family, self->address, self->port, this->delta, errno); self->family = 0; self->address = 0; self->port = 0; self->start = 0; } ```
1 parent b09b75b commit 9214e78

File tree

1 file changed

+1
-9
lines changed

1 file changed

+1
-9
lines changed

test/managers.jl

Lines changed: 1 addition & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
using Test
22
using Distributed
33
using Sockets
4-
using Distributed: parse_machine, bind_client_port, SSHManager, LocalManager
4+
using Distributed: parse_machine, SSHManager, LocalManager
55

66
@test parse_machine("127.0.0.1") == ("127.0.0.1", nothing)
77
@test parse_machine("127.0.0.1:80") == ("127.0.0.1", 80)
@@ -15,14 +15,6 @@ using Distributed: parse_machine, bind_client_port, SSHManager, LocalManager
1515
@test_throws ArgumentError parse_machine("127.0.0.1:0")
1616
@test_throws ArgumentError parse_machine("127.0.0.1:65536")
1717

18-
for ip in (IPv4(0), IPv6(0))
19-
sock = TCPSocket()
20-
@test bind_client_port(sock, typeof(ip)) === sock
21-
addr, port = getsockname(sock)
22-
@test addr === ip
23-
@test port::UInt16 === Distributed.client_port[] != 0
24-
end
25-
2618
@test occursin(r"^SSHManager\(machines=.*\)$",
2719
sprint((t,x) -> show(t, "text/plain", x), SSHManager("127.0.0.1")))
2820
@test sprint((t,x) -> show(t, "text/plain", x), LocalManager(1, true)) == "LocalManager()"

0 commit comments

Comments
 (0)