Skip to content

Add support for Native multithreaded execution #4201

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 116 commits into from
Apr 4, 2025

Conversation

durban
Copy link
Contributor

@durban durban commented Dec 15, 2024

This is on top of @djspiewak's wip/multithreaded-wstp branch. I did nothing so far, except default to SleepSystem in 489fbb9. This makes it possible to run IOSpec on scala-native; it (sometimes) passes on my machine ;-). It's probably too early to have this as a PR, but I'm doing it anyway to avoid duplicating work, and maybe to have a discussion about EpollSystem (see below). (@djspiewak feel free to close this PR if you have other plans with your branch.)

@durban
Copy link
Contributor Author

durban commented Dec 15, 2024

So, I'm running this on Linux, and without 489fbb9 it tries to use EpollSystem. But starting the tests (e.g., testsNative/testOnly cats.effect.IOSpec) only leads to a hanging process. It seems to me, that the WSTP threads are waiting in epoll_wait, and a (the?) GC thread seems to wait for mutator threads to reach a safepoint(?). (Details: the GC is in thread_yield, called by Synchronizer_acquire.) I'm speculating here, but maybe Thread.sleep() does "something", which EpollSystem doesn't, before epoll_waiting?

EDIT: what I wrote above is probably completely wrong... 🤷‍♂️

@djspiewak
Copy link
Member

djspiewak commented Dec 15, 2024

This is great! Thank you for pushing this forward to the next obstacle.

One of the things that occurred to me as I poked at my branch originally is SN's tooling for introspecting thread state is really really limited so far as I understand it. Maybe this is just my ignorance and LLVM has some magic we could turn on, but I strongly suspect we're going to need better introspection to run down some of these problems.

What I'm thinking is we're probably going to end up building that, or at least leaning in heavily to do so, and that's probably a large part of what we'll need to do to get this off the ground. We should chat with the SN folks.

@armanbilge
Copy link
Member

armanbilge commented Dec 15, 2024

The reason it's hanging is because we haven't implemented interruption yet for the Native I/O-polling systems. This wasn't necessary when it was single-threaded, but now it's critical :)

def interrupt(targetThread: Thread, targetPoller: Poller): Unit = ()

Compare with:

def interrupt(targetThread: Thread, targetPoller: Poller): Unit = {
targetPoller.selector.wakeup()
()
}

@armanbilge
Copy link
Member

Oh, the other reason it may be hanging is indeed related to GC. On Scala Native, blocking native calls need to be annotated explicitly with the @blocking annotation, so that it does the necessary book-keeping so it's possible to GC while a thread is stuck in that blocking call.

https://github.com/scala-native/scala-native/blob/c7b54a18e3ff11d8b2792f16fbb6e97780314014/nativelib/src/main/scala/scala/scalanative/unsafe/package.scala#L103-L106

For now it's fine to just mark it @blocking, but b/c this comes at a performance cost, we should actually make two separate epoll_wait bindings. One will use @blocking for when the timeout is > 0, and the other will not, for when the timeout == 0.

@djspiewak
Copy link
Member

The reason it's hanging is because we haven't implemented interruption yet for the Native I/O-polling systems. This wasn't necessary when it was single-threaded, but now it's critical :)

You know, I didn't even think about this. Makes loads of sense though. Pipes time!

@durban
Copy link
Contributor Author

durban commented Dec 20, 2024

@armanbilge Thanks, I've tried to do the 2 things you mentioned. In 62b8141 I turned on the EpollSystem again, and tried implementing interrupt, and added the scala-native blocking annotation. This way testsNative/testOnly cats.effect.IOSpec passes on my machine. It obviously needs more work (e.g., I think interrupt is not threadsafe), but at least it's a step in the right direction.

@djspiewak
Copy link
Member

djspiewak commented Dec 26, 2024

Interesting. So I merged your branch with series/3.x and now I'm getting the following:

[error] Unknown DWARF abbrev code: 26
[error] 
[error] STACKTRACE
[error] 
[error] java.lang.RuntimeException: Unknown DWARF abbrev code: 26
[error] 
[error] 
[error] This looks like a specs2 exception...
[error] Please report it with the preceding stacktrace at http://github.com/etorreborre/specs2/issues
[error]  
[error] Error: Total 1, Failed 0, Errors 1, Passed 0
[error] Error during tests:
[error] 	cats.effect.IOSpec

Edit: Appears to be a macOS only thing. Compiles and runs on Linux. Lovely.

@djspiewak
Copy link
Member

Okay got around the issue with Lorenzo's help. It's fixed in SN main, so I updated to a local snapshot (lol) on my branch and made progress. I'll dig into interruption for kqueue

@djspiewak
Copy link
Member

Update:

  1. We no longer need the snapshot. I've pushed the magic compiler settings incantation on my branch
  2. Also pushed is an initial stab at using EVFILT_USER to handle kqueue interrupts. I implemented it by basically putting that event permanently into the front of the changes array and then triggering it on the kqfd on interrupt(). In principle, this makes sense, but it doesn't actually work. The threads wake up but things spin forever. I'm probably being dumb. Have fun.

Will get back to this later.

@djspiewak
Copy link
Member

I pushed more. Kqueue is pretty close to working I think.

@djspiewak
Copy link
Member

Unexpected (expected?) bonus from @durban's fix: CI is much faster now!

djspiewak
djspiewak previously approved these changes Apr 4, 2025
Copy link
Member

@djspiewak djspiewak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted previously, looking for all three approvals here before merge.

@djspiewak
Copy link
Member

For those of you with more time than sense, here's another snapshot: 3.7-4972921

@durban durban dismissed djspiewak’s stale review April 4, 2025 02:10

The merge-base changed after approval.

val next = if (diff != java.lang.Double.POSITIVE_INFINITY) {
val next = if (diff != java.lang.Float.POSITIVE_INFINITY) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bugfix that's not specific to Native?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really a bugfix, more like a "stlye" thing; there is no real bug (i.e., it doesn't need to go into 3.6.x), because Double.POSITIVE_INFINITY == Float.POSITIVE_INFINITY.

armanbilge
armanbilge previously approved these changes Apr 4, 2025
Copy link
Member

@armanbilge armanbilge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's good (enough ;)! Opened a bunch of follow-up issues, but they are essentially just optimizations. We seem to be getting green CI consistently and on Cirrus as well which is amazing.

Huge props to Daniel and Daniel for driving this effort from start-to-finish.

@durban durban dismissed armanbilge’s stale review April 4, 2025 04:26

The merge-base changed after approval.

djspiewak
djspiewak previously approved these changes Apr 4, 2025
Copy link
Member

@djspiewak djspiewak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch

@durban durban dismissed djspiewak’s stale review April 4, 2025 14:13

The merge-base changed after approval.

armanbilge
armanbilge previously approved these changes Apr 4, 2025
Copy link
Member

@armanbilge armanbilge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@durban ready to land this? 😃

@durban durban dismissed armanbilge’s stale review April 4, 2025 19:21

The merge-base changed after approval.

@durban
Copy link
Contributor Author

durban commented Apr 4, 2025

I went over this whole PR, and I have one real concern: whether we're invoking undefined behavior in TimerHeap (scala-native/scala-native/issues/4288; #4357). But even if we are, that sounds like something maybe worth fixing in scala-native, and probably shouldn't block this (already long) PR. I can't github-approve, because it's "my" PR, but I'm approving.

Various things that we should look at (not in this PR, but probably before 3.7.0):

@djspiewak
Copy link
Member

djspiewak commented Apr 4, 2025

whether we're invoking undefined behavior in TimerHeap

Oh that's fascinating. Can you open an issue on our side to track this so we don't lose it? I wouldn't want to do a proper 3.7 with that question outstanding.

FiberMonitor on scala-native

Yeah I think @armanbilge has an issue for this. It's going to be dependent on getting WeakReference working correctly on SN.

TracingPlatform: I think the ConcurrentHashMap grows forever, is this a memory leak?

I don't think so. It's bounded by the number of distinct lambdas (not closures), no?

Double check stackallocs (there have been problems with them):

Can you make an issue for this one too? Agreed we should look really carefully at them.

IOApp: there used to be a "keepalive" on scala-native. Was this a mistake? Is this only a JS thing? If it wasn't, do we not need it any more?

I think it was cargo-culted from Scala.js, where in turn I cargo-culted it from the original IOApp in CE2, where it was originally added by @alexandru. I still don't actually know what purpose it serves, but our ancestors work in mysterious ways.

I don't think it was ever needed on Native.

IORuntimeBuilder#addPoller seems to exist on JS, and it's silently ignored(?).

Oh, that feels like a bug. I wonder if we should pull that back?

@armanbilge
Copy link
Member

I went over this whole PR, and I have one real concern: whether we're invoking undefined behavior in TimerHeap (scala-native/scala-native/issues/4288)

Can you share more details (maybe a follow-up issue)? I saw your SN issue but I didn't have a chance to think about it carefully.


  • FiberMonitor on scala-native

  • TracingPlatform: I think the ConcurrentHashMap grows forever, is this a memory leak?

It's keyed by classes, and there's a finite number of classes in a compiled artifact.


  • IOApp: there used to be a "keepalive" on scala-native. Was this a mistake? Is this only a JS thing? If it wasn't, do we not need it any more?

It wasn't a mistake. We don't need it anymore. We needed it before because we couldn't block on completion of the main fiber in the main method. e.g. consider an application that is just IO.never, how to keep the runtime from terminating? This is what "keepalive" helped with. But now we can just block.


  • IORuntimeBuilder#addPoller seems to exist on JS, and it's silently ignored(?).

JS doesn't really support pollers ... you're relying on the runtime for this. Did we put addPoller on IORuntimeBuilder in this PR? If so, let's undo it, if not, let's make an issue and deal with it separately.

@djspiewak djspiewak merged commit 97f8feb into typelevel:series/3.x Apr 4, 2025
38 checks passed
@durban
Copy link
Contributor Author

durban commented Apr 4, 2025

@djspiewak @armanbilge Okay, I promise I'll open an issue for everything you've mentioned, just not right now...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants