8358343: [leyden] Drop notify_all in CompilationPolicyUtils::Queue::pop #74

Open: 2 commits to be merged into base branch premain

Conversation

shipilev
Member

@shipilev shipilev commented Jun 3, 2025

Found this when reading premain-vs-mainline webrev. Mainline does not have notify_all in this method:
https://github.com/openjdk/jdk/blob/c382da579884c28f2765b2c6ba68c0ad4fdcb2ce/src/hotspot/share/compiler/compilationPolicy.hpp#L85-L92

But if you remove notify_all() in premain, then tests start to deadlock, see bug for a sample. The culprit is CompilationPolicy::flush_replay_training_at_init, which is only present in premain. I fixed it by using timed waits, which obviates the need for extra notifications. We only enter this method with -XX:+AOTVerifyTrainingData, so we don't care much about its performance. This is IMO better than doing a questionable notify_all followed by wait in load-bearing code.
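To illustrate the timed-wait idea in isolation, here is a minimal sketch in standard C++ rather than HotSpot code. `ReplayQueue`, `wait_until_empty`, and `flush_demo` are hypothetical names made up for this example, and `std::condition_variable` stands in for the HotSpot monitor. The point it shows: `pop()` does not notify anyone, and the diagnostic flusher still makes progress because its wait times out and rechecks the queue.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the training replay queue; not HotSpot code.
class ReplayQueue {
  std::mutex lock_;
  std::condition_variable cv_;
  std::deque<int> items_;

public:
  void push(int item) {
    {
      std::lock_guard<std::mutex> g(lock_);
      items_.push_back(item);
    }
    cv_.notify_all();  // producers wake waiting consumers, as usual
  }

  // Consumer side: blocks until an item is available. No notify here.
  int pop() {
    std::unique_lock<std::mutex> g(lock_);
    cv_.wait(g, [this] { return !items_.empty(); });
    assert(!items_.empty());
    int item = items_.front();
    items_.pop_front();
    return item;
  }

  // Diagnostic-only flush: the timed wait rechecks the queue every `poll`,
  // so it does not depend on pop() sending a notification.
  void wait_until_empty(
      std::chrono::milliseconds poll = std::chrono::milliseconds(10)) {
    std::unique_lock<std::mutex> g(lock_);
    while (!items_.empty()) {
      cv_.wait_for(g, poll);
    }
  }
};

// Usage sketch: a consumer thread drains the queue while the caller flushes.
int flush_demo() {
  ReplayQueue q;
  for (int i = 0; i < 3; i++) q.push(i);
  int consumed = 0;
  std::thread consumer([&] {
    for (int i = 0; i < 3; i++) { q.pop(); consumed++; }
  });
  q.wait_until_empty();  // returns once all three items have been popped
  consumer.join();
  return consumed;
}
```

The trade-off is latency: the flusher may notice an empty queue up to one poll interval late, which is acceptable for a path that only runs under a diagnostic flag.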

Additional testing:

  • Linux x86_64 server fastdebug, runtime/cds (5x, no timeouts yet; still running more iterations)

Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

  • JDK-8358343: [leyden] Drop notify_all in CompilationPolicyUtils::Queue::pop (Bug - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/leyden.git pull/74/head:pull/74
$ git checkout pull/74

Update a local copy of the PR:
$ git checkout pull/74
$ git pull https://git.openjdk.org/leyden.git pull/74/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 74

View PR using the GUI difftool:
$ git pr show -t 74

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/leyden/pull/74.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jun 3, 2025

👋 Welcome back shade! A progress list of the required criteria for merging this PR into premain will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jun 3, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 3, 2025
@mlbridge

mlbridge bot commented Jun 3, 2025

Webrevs

@shipilev
Member Author

shipilev commented Jun 3, 2025

@veresov @iwanowww ^

@shipilev shipilev requested review from veresov and iwanowww June 5, 2025 17:54
@veresov
Collaborator

veresov commented Jun 6, 2025

Is there a reason why we can't just do the processing work in the thread calling the flush instead of waiting for the replay thread? That is, why not make it like this:

void CompilationPolicy::flush_replay_training_at_init(TRAPS) {
  InstanceKlass* ik;
  do {
    // Non-blocking pop: yields nullptr once the queue is drained.
    ik = _training_replay_queue.try_pop(TrainingReplayQueue_lock, THREAD);
    if (ik != nullptr) {
      // Process the item in the flushing thread itself.
      replay_training_at_init_impl(ik, THREAD);
    }
  } while (ik != nullptr);
}
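For readers outside HotSpot, the drain-in-caller proposal above can be sketched in standard C++. Everything here is an assumption for illustration: `Queue`, `Klass`, `replay_impl`, and `flush` are hypothetical names, and `try_pop` is modeled as a non-blocking pop that returns `nullptr` when the queue is empty. Whether such a method exists or works in the real `CompilationPolicyUtils::Queue` is not established in this thread.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical queue with a non-blocking try_pop; not the HotSpot Queue.
template <typename T>
class Queue {
  std::mutex lock_;
  std::deque<T*> items_;

public:
  void push(T* item) {
    std::lock_guard<std::mutex> g(lock_);
    items_.push_back(item);
  }

  // Returns the next item, or nullptr immediately if the queue is empty.
  T* try_pop() {
    std::lock_guard<std::mutex> g(lock_);
    if (items_.empty()) return nullptr;
    T* item = items_.front();
    items_.pop_front();
    return item;
  }
};

// Hypothetical stand-in for InstanceKlass.
struct Klass {
  bool replayed = false;
};

// Hypothetical stand-in for replay_training_at_init_impl.
void replay_impl(Klass* k) {
  assert(k != nullptr);
  k->replayed = true;
}

// The calling thread drains and processes the queue itself, no waiting.
void flush(Queue<Klass>& q) {
  while (Klass* k = q.try_pop()) {
    replay_impl(k);
  }
}
```

With this shape the flushing thread never blocks, but it only covers items still in the queue; anything another thread has already popped is outside its view.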

@shipilev
Member Author

shipilev commented Jun 6, 2025

> Is there a reason why we can't just do the processing work in the thread calling the flush instead of waiting for the replay thread? That is, why not make it like this:

You tell me :) I guess one upside of the current code is that it leaves draining/processing in one place/thread, so we never run into false positives/negatives from diagnostic code (this hunk, gated by -XX:+AOTVerifyTrainingData) doing something that production code does not. So I mildly prefer it in its current form. I can change it if you want.

@veresov
Collaborator

veresov commented Jun 6, 2025

I don't remember. :) I'm just not a big fan of spin-waits even with sleeps inside...

@veresov
Collaborator

veresov commented Jun 6, 2025

It'd also be a potentially nicely reusable method. Provided that it works, of course...

@veresov
Collaborator

veresov commented Jun 6, 2025

Actually, neither the current approach (even with the spin-wait) nor my solution is really correct. The fact that the queue is empty doesn't mean that every last item has been processed. The last item may have been popped but is still being worked on. So however you look at it, we should set up some kind of handshake to make sure the replay thread is done processing, not just done popping.
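One possible shape for such a handshake, sketched in standard C++ under stated assumptions: the queue counts items that have been popped but not yet finished, and the flusher waits until the queue is empty and that count is zero. `TrackedQueue`, `done`, `wait_until_idle`, and `handshake_demo` are hypothetical names, and this is not what the PR implements.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical queue that tracks in-flight items; not HotSpot code.
class TrackedQueue {
  std::mutex lock_;
  std::condition_variable cv_;
  std::deque<int> items_;
  int in_flight_ = 0;  // popped but not yet reported done

public:
  void push(int item) {
    {
      std::lock_guard<std::mutex> g(lock_);
      items_.push_back(item);
    }
    cv_.notify_all();
  }

  // Blocks for an item and marks it in flight under the same lock.
  int pop() {
    std::unique_lock<std::mutex> g(lock_);
    cv_.wait(g, [this] { return !items_.empty(); });
    int item = items_.front();
    items_.pop_front();
    in_flight_++;
    return item;
  }

  // The worker calls this after it has finished processing a popped item.
  void done() {
    {
      std::lock_guard<std::mutex> g(lock_);
      assert(in_flight_ > 0);
      in_flight_--;
    }
    cv_.notify_all();
  }

  // Returns only when nothing is queued and nothing is being processed.
  void wait_until_idle() {
    std::unique_lock<std::mutex> g(lock_);
    cv_.wait(g, [this] { return items_.empty() && in_flight_ == 0; });
  }
};

// Usage sketch: the worker is slow, yet the flusher sees all items processed.
int handshake_demo() {
  TrackedQueue q;
  std::atomic<int> processed{0};
  for (int i = 0; i < 3; i++) q.push(i);
  std::thread worker([&] {
    for (int i = 0; i < 3; i++) {
      q.pop();
      std::this_thread::sleep_for(std::chrono::milliseconds(5));  // slow work
      processed++;
      q.done();
    }
  });
  q.wait_until_idle();
  int seen = processed.load();
  worker.join();
  return seen;
}
```

Because the in-flight count is incremented inside `pop()` under the queue lock, there is no window in which the queue looks empty while an item is unaccounted for. The notification moves from `pop()` to completion of processing, which is the event the flusher actually cares about.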

Labels
rfr Pull request is ready for review