8291652: (ch) java/nio/channels/SocketChannel/VectorIO.java failed with "Exception: Server 15: Timed out" #26049

Open
wants to merge 2 commits into
base: master

jaikiran

@jaikiran jaikiran commented Jun 30, 2025

Can I please get a review of this test-only change which proposes to address an intermittent test failure in java/nio/channels/SocketChannel/VectorIO.java?

As noted in https://bugs.openjdk.org/browse/JDK-8291652, this test has been failing intermittently in our CI. Some years back the test was improved to include additional debug logs to identify the root cause https://bugs.openjdk.org/browse/JDK-8180085. In a recent failure, these test logs indicate that the Server thread hadn't yet accept()ed a Socket connection, when the client side of the test threw an exception because it had waited for 8 seconds for the server side of the test to complete.

The change in this PR updates the test to wait for the Server thread to reach a point where it is ready to accept() a Socket connection. The client side of the test is initiated only after the Server reaches this state. Furthermore, the artificial 8 second wait has been removed from this test and it now waits as long as it takes for the testing to complete. If the test waits far too long, the jtreg infrastructure will time out the test and at the same time capture the necessary artifacts to help debug unexpected timeouts.
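The handshake described above can be sketched roughly as follows. This is a minimal illustration and not the code in the PR: the class name ReadyLatchSketch, the nested Server stand-in, and the awaitReady and roundTrip methods are all hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

public class ReadyLatchSketch {

    // Hypothetical stand-in for the test's Server thread.
    static final class Server extends Thread {
        private final CountDownLatch ready = new CountDownLatch(1);
        private final ServerSocketChannel ssc;
        private volatile String received;
        private volatile Exception failure;

        Server() throws IOException {
            ssc = ServerSocketChannel.open();
            // Port 0 asks the OS for an ephemeral port.
            ssc.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
        }

        int port() throws IOException {
            return ((InetSocketAddress) ssc.getLocalAddress()).getPort();
        }

        // Blocks the caller until the server thread is about to accept().
        void awaitReady() throws InterruptedException {
            ready.await();
        }

        @Override
        public void run() {
            try (ServerSocketChannel listener = ssc) {
                ready.countDown(); // signal: the next step is accept()
                try (SocketChannel peer = listener.accept()) {
                    ByteBuffer in = ByteBuffer.allocate(64);
                    // Read until the buffer is full or the client closes.
                    while (in.hasRemaining() && peer.read(in) != -1) { }
                    in.flip();
                    received = StandardCharsets.UTF_8.decode(in).toString();
                }
            } catch (IOException e) {
                failure = e;
            }
        }
    }

    static String roundTrip(String msg) throws Exception {
        Server sv = new Server();
        sv.start();
        sv.awaitReady(); // start the client side only once the server is ready
        InetSocketAddress isa =
                new InetSocketAddress(InetAddress.getLoopbackAddress(), sv.port());
        try (SocketChannel sc = SocketChannel.open(isa)) {
            ByteBuffer out = ByteBuffer.wrap(msg.getBytes(StandardCharsets.UTF_8));
            while (out.hasRemaining()) {
                sc.write(out);
            }
        }
        sv.join(); // no artificial timeout; the harness enforces the overall limit
        if (sv.failure != null) {
            throw sv.failure;
        }
        return sv.received;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello"));
    }
}
```

The sketch waits with an untimed join() on the assumption that an outer harness such as jtreg bounds the total run time.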

While at it, the test has also been updated to use InetAddress.getLoopbackAddress() instead of localhost. This should prevent any unexpected address mappings for localhost from playing a role in this test.
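As a small illustration of that choice, the sketch below builds a socket address from InetAddress.getLoopbackAddress() rather than from the host name "localhost", so no name lookup is involved. The class name LoopbackSketch and the helper loopbackEndpoint are hypothetical and are not taken from the test.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

public class LoopbackSketch {

    // Builds an endpoint on the loopback interface without resolving "localhost".
    static InetSocketAddress loopbackEndpoint(int port) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        return new InetSocketAddress(loopback, port);
    }

    public static void main(String[] args) {
        InetSocketAddress isa = loopbackEndpoint(8080);
        System.out.println(isa.getAddress().isLoopbackAddress()); // prints "true"
    }
}
```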

With these changes, I've run the test more than 1000 times in our CI and it hasn't failed.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8291652: (ch) java/nio/channels/SocketChannel/VectorIO.java failed with "Exception: Server 15: Timed out" (Bug - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26049/head:pull/26049
$ git checkout pull/26049

Update a local copy of the PR:
$ git checkout pull/26049
$ git pull https://git.openjdk.org/jdk.git pull/26049/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26049

View PR using the GUI difftool:
$ git pr show -t 26049

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26049.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jun 30, 2025

👋 Welcome back jpai! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jun 30, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 30, 2025
@openjdk

openjdk bot commented Jun 30, 2025

@jaikiran The following label will be automatically applied to this pull request:

  • nio

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the nio nio-dev@openjdk.org label Jun 30, 2025
@mlbridge

mlbridge bot commented Jun 30, 2025

Webrevs

@msheppar

msheppar commented Jul 2, 2025

A couple of observations to consider.

setLength is a static member variable of the test, effectively a global variable, but it is accessed from multiple threads without synchronisation.
Something to consider amending: make it volatile, or use synchronized methods for access.

The use of the CountDownLatch is about the best we can do. It should mitigate the observed race conditions, but won’t absolutely rule them out.

Consider the following, slightly convoluted scenario:

The Server starts and executes as far as the countDown on the connAcceptLatch, at which point the server thread gets bumped by the OS and is placed in the RTR (ready-to-run) queue, waiting for its next scheduled time slice.
The main thread executes as far as the sv.awaitFinish call (having executed the bufferTest method), BUT it has closed the connection before the Server has executed accept or read the data from the socket.
This leaves open the possibility that data will disappear from the socket, so a bit of a race condition may still exist.

Thus, it might be more prudent to close the socket on the client or initiator side (i.e. in the main test thread) only after the Server has finished, that is, after the sv.awaitFinish call.

In this case the Server will have closed its end of the socket connection also, at that point in time.

To accommodate this logic, pass the Server reference to the bufferTest method to invoke the sv.awaitFinish,
or arrange for the bufferTest method to return the SocketChannel reference
and invoke the close of the SocketChannel after sv.awaitFinish call in the main method.
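The ordering suggested here, where the client closes its channel only after the server side has finished, might look like the following sketch. It is not the test's code: the class CloseAfterFinishSketch and the method exchange are hypothetical, and server.join() merely stands in for the test's sv.awaitFinish call.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class CloseAfterFinishSketch {

    // Sends payload to a one-shot server and returns the number of bytes it read.
    static int exchange(byte[] payload) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocketChannel ssc = ServerSocketChannel.open()) {
            ssc.bind(new InetSocketAddress(loopback, 0));
            int port = ((InetSocketAddress) ssc.getLocalAddress()).getPort();

            int[] total = new int[1];
            Exception[] failure = new Exception[1];
            Thread server = new Thread(() -> {
                try (SocketChannel peer = ssc.accept()) {
                    ByteBuffer in = ByteBuffer.allocate(payload.length);
                    // Stop once the expected number of bytes has arrived.
                    while (in.hasRemaining() && peer.read(in) != -1) { }
                    total[0] = in.position();
                } catch (Exception e) {
                    failure[0] = e;
                }
            });
            server.start();

            SocketChannel sc = SocketChannel.open(new InetSocketAddress(loopback, port));
            try {
                ByteBuffer out = ByteBuffer.wrap(payload);
                while (out.hasRemaining()) {
                    sc.write(out);
                }
                server.join(); // stands in for sv.awaitFinish()
            } finally {
                sc.close();    // the client closes only after the server is done
            }
            if (failure[0] != null) {
                throw failure[0];
            }
            return total[0];
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(exchange(new byte[] {1, 2, 3, 4}));
    }
}
```

Because the server stops reading after the expected byte count rather than waiting for end-of-stream, the client can safely keep its channel open until the join returns.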

Another alternative is an extract-method refactoring at line 92:

SocketChannel openConnection(int port) throws Exception {
    // Get a connection to the server
    InetAddress loopback = InetAddress.getLoopbackAddress();
    InetSocketAddress isa = new InetSocketAddress(loopback, port);
    SocketChannel sc = SocketChannel.open();
    sc.connect(isa);
    // Randomly choose blocking or non-blocking mode, as the test does today
    sc.configureBlocking(generator.nextBoolean());
    return sc;
}

Call openConnection before bufferTest and pass the SocketChannel reference to bufferTest, with the sc.close() call removed from bufferTest.

Then, after the awaitFinish call, invoke sc.close() in main.

In any case, back to the main point, that is to close the client SocketChannel after the sv.awaitFinish call.

The method waitToStartTest is on the Server class; maybe rename it to waitServerStart. It is declared private on Server but is really part of the public interface to the Server (the fact that Server is a static nested class is what gives main access to the private method).

At line 174, the invocation of connAcceptLatch.countDown, which sets the test in motion, could, for the sake of symmetry, be encapsulated in a method:

void signalServerStarted() {
    connAcceptLatch.countDown();
}

Another aspect of the test that caught the eye is that the ServerSocketChannel bind is invoked with just the SocketAddress and doesn’t specify any backlog. IIRC, this results in a backlog of 0 being used.

It has long been best practice not to specify a backlog of 0, especially for portability, as backlog 0 semantics are ill-defined (or in most cases not defined at all) on most OS platforms.
So, maybe add a backlog of 5, in homage to the original in BSD 4.2.
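Passing an explicit backlog could be sketched as below, using the two-argument ServerSocketChannel.bind(SocketAddress, int). The class BacklogSketch and the helper bindWithBacklog are hypothetical names for illustration, and the value 5 is simply the one suggested above. For reference, the Javadoc for the two-argument bind states that a backlog of 0 or a negative value selects an implementation-specific default, so an explicit positive value removes that dependence.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

public class BacklogSketch {

    // Binds a listener on the loopback address with an explicit backlog
    // and reports whether the channel ended up bound.
    static boolean bindWithBacklog(int backlog) throws IOException {
        try (ServerSocketChannel ssc = ServerSocketChannel.open()) {
            InetSocketAddress local =
                    new InetSocketAddress(InetAddress.getLoopbackAddress(), 0);
            ssc.bind(local, backlog); // explicit backlog instead of the default
            return ssc.getLocalAddress() != null;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(bindWithBacklog(5));
    }
}
```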

A slight digression from this PR, and a general comment on the ServerSocketChannel::bind(SocketAddress local)

I think it would be better if the ServerSocketChannel implementation used a non-zero default backlog value out of the box, e.g. 5 rather than 0. This could also be overridden with a system property, java.nio.DefaultSocketBacklog, for use when the single-arg ServerSocketChannel::bind method is called. This would give uniform semantics across all OS platforms, rather than the nebulous semantics of a backlog value of zero.
