Replies: 15 comments 33 replies
-
Another thing that we considered (but that I forgot to mention) was a simple helper on the Service interface for executing a lambda in the equivalent of a critical section (by tagging the method with the
And/or:
-
I like it very much. I do still feel a bit uneasy about @Synchronized and @Concurrent being "inherited" from the parent when unspecified, but I imagine with usage we'll get a feel for whether it is better that way or not.
-
Regarding "directly traceable", does this mean just one level (service a -> service b -> service a), or can it be more indirect?
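For concreteness, a hedged sketch of the one-level case being asked about (the service and method names are invented, and the `a`/`b` references are assumed to be wired up elsewhere):

```
// a -> b -> a: while ServiceA.foo() is blocked on its call to
// ServiceB.bar(), ServiceB calls back into ServiceA. That incoming call
// is "directly traceable" back to ServiceA's pending blocking call.
// The question is whether a longer chain, e.g. a -> b -> c -> a,
// also qualifies.
service ServiceA
    {
    void foo()
        {
        b.bar();  // blocking cross-service call
        }
    void baz() {}
    }

service ServiceB
    {
    void bar()
        {
        a.baz();  // call back into ServiceA
        }
    }
```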
-
Gene didn't like the wording of this statement: "The presence within the fiber of any execution frame that is not safe, means that the fiber is not safe; each new execution frame is infected by the unsafeness of its caller." His clarification is something like this: "A callee cannot transition from concurrent-unsafe to concurrent-safe." In other words, a callee cannot somehow mark its caller as safe for concurrent/reentrant execution. So, once a fiber is concurrent-unsafe, the only way to make it concurrent-safe is to unwind back out of the concurrent-unsafe frames. In other words: "If the current method (i.e. execution frame) is not allowing service reentrancy/concurrency to occur, then nothing that the current method calls will be allowed to cause service reentrancy/concurrency to occur."
-
Yo, calling "Service.yield" in unsafe code is at best a code-smell, and very likely a real bug. Why yield? It's generally because you need something else to happen before you can make progress. So you yield, and upon return you check that the condition has changed, and if not, yield again. If you're in unsafe code, nobody else can run, so yielding never lets anything change, so you're in an infinite loop, i.e. live-lock. Even if you are in safe code, there are no scheduling guarantees from the OS that the correct other fiber/service executes and progress is made. If 1000 runnable fibers all call yield, I can pretty much guarantee you that the 1001st fiber will NEVER run and you'll just live-lock spinning yields on the "wrong" fibers. Another problem (for XTC) is that you can't (statically) know you're in unsafe code. IMHO (and fairly extensive experience here), calling yield is just asking for a livelock bug. Cliff
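The livelock failure mode being described can be sketched in a few lines (the `dataReady` condition is hypothetical):

```
// Anti-pattern: busy-waiting via yield. If the current frame is
// concurrent-unsafe, no other fiber can be scheduled during yield(),
// so dataReady can never change, and this loop spins forever.
while (!dataReady)
    {
    this:service.yield();
    }
```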
-
Quick summary of what I believe was agreed to on today's call:
-
Again on Yield: why do you have a "service event backlog"? Why are you even contemplating busy-waiting? Disallow it by fiat, and don't hand out any API support. Relying on it at any point is a ticket to a hellish live-lock debugging session.

Re checking for completed Futures: don't bother (semantically; optimistically it's useful to peek at just the first few prior ones when you are installing a new Future, but this code should never be exposed to any user). Futures checking happens when you block because, e.g., you need that future result. If nobody needs it right now, why check right now? Just wait for the need to arise.

Re checking for completed Futures: there's a real Control Theory problem here. If the count of uncompleted Futures grows large, and you start by checking them ALL at any given step, and the actual work they represent is small, you quickly go O(n^2) under load instead of O(n). Looks good In The Small, sucks big In The Large, and is very performance unstable. Been there, done that: had my 100M+ Futures in a doubling Java array and had it run out of doubling, go to a negative array length, and die. The fix was to quit looking for uncompleted Futures and just go complete some.
-
The more I think about this, the more I think you're just about an H2O cluster, with all the foibles that entails. I beat my way through a solid design, with no deadlock or live-lock possible, and near-100% throughput levels achievable as the normal course of events.... BUT
As a consequence, I routinely achieved very high throughput, with typically very short latencies. My queues, which in theory could grow to millions (and did, until I got the priority thing sorted out), typically stay very small. There is a hard, formal Control Theory mechanism in place to keep them that way: that's the "do old work before starting something new" concept, which is what that priority gig is about. Probably better to talk about this F2F... it gets deep quickly, and there are a lot of things buried in my memory about how to make this work, and about the hidden gotchas that I fell over.
-
That's probably a bit too kind 🤣 ... we certainly want to make it relatively easy to avoid deadlocks, and if one happens to occur, we want it to be evidenced as an exception, not as an actual deadlock condition. But what we are trying to do is to make safe concurrent code easy. Because there just aren't that many Martin Thompsons in the world, and there are lots of multi-core chips. 🤣
Right. The idea is that we won't know what the ideal mechanism to use is until we see the runtime profile data. So we would probably start (from the initial code-gen pass) optimistically, assuming no contention. Maybe using some sort of sticky, biased lock (zero cost until contention actually shows up). Then switch (deopt) to a short spin, and then maybe to a single-waiter model where the waiter simply deposits its "unpark" as a continuation on the current lock owner (contention, but low). Eventually, if contention is high, a highly concurrent queue implementation (something like the Disruptor approach, but tuned for multi-producer, single-consumer) becomes tempting.
-
The H2O version, plus typos. The main difference is that we usually lump all the Futures into the same box for bulk waiting, and there's a lack of language support to cut down on the verbiage (outside of lambdas, which help a lot).
-
Cliff, I was not suggesting "better or worse"; I was trying to draw your eye to an interesting design choice that we made (which will be expensive if we are not very careful in how we generate code for it). To some extent, it is similar to Jim's comment about auto-boxing and escaping ... there is an implicit second conduit for the return mechanism (i.e. it's not just a simple value-return-on-the-stack mechanism), and that second conduit (for the

You can imagine it in an example as simple as:

In theory, an instance of "b" could be a service, returning a future Int. So it needs a "reverse trampoline" (I just made up the term, so use your imagination) to handle something that it wasn't originally expecting. We do rely on this concept elsewhere, and not just for futures; in other words, we rely on being able to make incorrect assumptions that can be violated at a later point, and on rejiggering everything at that point to make sure that the unexpected is handled correctly, and that the cost is minimized after the first time. You and I have discussed this on preambles (multiple entries to the same function, with increasing specificity of assumptions on each), but I don't think we've discussed it going the other way (hence the term "reverse trampoline").
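The elided example was presumably along these lines (this snippet is a guess; the method name `calc` is invented):

```
// Looks like a plain value return at the call site...
Int i = b.calc();

// ...but if "b" is actually backed by a service, what arrives is a
// @Future Int, and the runtime needs the "reverse trampoline" to
// reconcile that future with the plain Int the caller compiled against.
```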
-
Yeah, no problem. I've seen the term used elsewhere. For what it's worth, HotSpot uses them in a number of places.
-
So after a few weeks with this new design, we ran into a wall. Or perhaps a hand grenade. Specifically, when a service is already "red" (a fiber is in a red method), and that fiber calls another service (allowing other "green" fibers to be scheduled), those "green" fibers can run up until exactly the point that they attempt to invoke a "red" method. The problem? That sudden halt can easily happen in the middle of a line of code. Long story short: after discussing with Mark and Gene, we reverted to a simpler form of the new model, in which other fibers (even green ones) will not be scheduled so long as a fiber is in red territory. We'll cover this in detail on tomorrow's call.
-
Ecstasy uses mutable domains of information, called "services". Like any other object, a service has a callable surface area, i.e. its API. When a call is made "into" a service (i.e. a call from a different service), that call creates a new fiber in the service being called, and that new fiber will then perform the call's work "inside of" the service. A service can only execute one fiber at a time; when multiple calls are made into a service, the reentrancy rules determine whether other fibers may be scheduled before the first completes, and under what conditions those fibers may be scheduled. If a fiber is allowed to be scheduled, then it can only be scheduled when the current fiber either (a) completes, (b) makes a blocking call into another service, or (c) explicitly calls Service.yield().
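As a hedged illustration of the model (the service and its members are invented):

```
service Counter
    {
    Int count = 0;

    Int next()
        {
        return ++count;
        }
    }

// A call from a different service creates a new fiber inside Counter.
// The caller can receive the result asynchronously as a future:
@Future Int n = counter.next();
```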
A recent bug in the JSON database prototype resulted in several long days of painful debugging. Basically, the Client instances (any number thereof) are services that manage their transactional boundaries using the TxManager, which is also a service. The Client instances also use the ObjectStore instances directly (one per database object), and each ObjectStore is a separate service. So there are `m` Client instances and `n` ObjectStore instances.

But there's one more twist to this puzzle. In addition to the `m` Client instances hammering away on the TxManager, the `n` ObjectStore instances also use the TxManager (for example, to enlist themselves), and these uses are almost always caused by an invocation from one of those Client instances.

The services themselves are designed to deal with this chaos. The TxManager, for example, explicitly uses the `Open` `Reentrancy` setting, because it is a busy hub of work, and it wants to push as much through as it possibly can (almost all of which is turned into asynchronous requests, to be performed by the other services). In a few cases, mostly related to life-cycle management, it uses explicit `CriticalSection` blocks, in order to "close the front door", i.e. so that while it is doing something important, it knows that no other unrelated call will come in to the service and get scheduled to run.

Ultimately, there were two bugs that surfaced in the course of debugging this issue:

The first, and most obvious one, was caused by the SkiplistMap used by the ObjectStore to track in-flight transactions. We were using the computeIfAbsent method, and in the lambda, the TxManager service was being called to enlist the ObjectStore into the transaction. This call, going across a service boundary and blocking, allowed other fibers to be scheduled. A quick fix was simple enough: use a "`using new CriticalSection()`" block inside of the lambda, wrapped around the call to the TxManager. However, it was worrisome that such a thing could occur deep inside a mutable data structure, one that was not designed with concurrency and reentrancy in mind.

The second, and more insidious one, was caused by the same SkiplistMap. Inside the SkiplistMap, there is a coin-flip that occurs (it's part of Bill Pugh's original design for the skiplist data structure). That coin-flip is performed by a call to a `Random` object, which was injected into the SkiplistMap. The injected `Random` just happened to be a `service` (all injected references are either services or immutable objects), and when the SkiplistMap called to that service (not knowing that it was a service!), the blocking call allowed the active fiber to be suspended -- in the middle of an operation deep inside SkiplistMap! -- and that, in turn, caused data corruption when another fiber started executing and mutating that same SkiplistMap. Again, why should the SkiplistMap (or any other low-level, mutable data structure) even have to be aware of this possibility?!

In our current design, the reentrancy for a service is managed as a property on the service object itself. (And the handy CriticalSection simply sets this value on the way in, and unsets it on the way out.)
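The quick fix described above might look roughly like this sketch (the map name, key, record type, and TxManager method are all assumptions, not the actual prototype code):

```
txById.computeIfAbsent(txId, () ->
    {
    // While the CriticalSection is in scope, no other fiber can be
    // scheduled into this service, even though enlist() blocks on the
    // TxManager service; the SkiplistMap mutation cannot be interleaved.
    using (new CriticalSection())
        {
        txManager.enlist(this, txId);
        }
    return new TxRecord(txId);
    });
```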
But what about simple, mutable data structures being used within the service that could inadvertently make a blocking call to a service, either because one was injected (surprise!), or perhaps one was passed in as a lambda, or as an interface whose invocation directly -- or indirectly -- calls a service? Mark made the simple and astute observation that one should have to opt in to concurrency, i.e. to allowing a switch to another fiber to occur.
The more that we have investigated this idea, the more it grows on us. The conceptual net result (i.e. the goal of the re-design) is that developers will only be forced to think about concurrency when they are trying to increase concurrency by addressing mutable data structures that are not explicitly concurrent-safe (i.e. reentrancy-safe).
The new design is based on a few, fundamental observations:
The reentrancy problem almost always occurs within methods on mutable data structures;
With rare exception, neither functions nor immutable objects suffer from this malady;
The presence within the fiber of any execution frame that is not safe, means that the fiber is not safe; each new execution frame is infected by the unsafeness of its caller.
The new design has several purposeful attributes that are common in Ecstasy:
Reasonable defaults: if something is obvious, a developer should not be required to type it, unless typing it makes the code more readable.
Hierarchical configuration: For example, in the absence of explicit information on a method, the information can be provided by the parent of the method, and so on, all the way up to the top level class containing the method.
Safety and predictability are considered non-negotiable; the ability to maximize concurrency and performance must respect safety and predictability.
Proposed changes:
Introduce two new method/property/class annotations, `@Synchronized` and `@Concurrent`. The first, `@Synchronized`, explicitly marks a method/property/class as being unsafe for concurrent reentrancy. The second, `@Concurrent`, explicitly marks a method/property/class as being safe for concurrent reentrancy. When a conflict exists, `@Synchronized` overrides `@Concurrent`.

Change the `Reentrancy reentrancy` property on Service to `@RO Boolean reentrant`, which evaluates to True iff the current fiber can safely be "interrupted" (by the scheduling of another concurrent fiber) when it makes a blocking call to another service. It evaluates to True iff each execution frame for the fiber is concurrent-safe (i.e. reentrancy-safe for new fibers), and no CriticalSection has been registered.

An execution frame is considered concurrent-safe based on a set of rules (below), but in general: anything marked with `@Synchronized` is unsafe, anything marked with `@Concurrent` is assumed safe, and any execution frame whose invocation target (i.e. `this`) is a mutable object is assumed unsafe.

When a service makes a blocking call (or a call to `Service.yield()`), another fiber must not be scheduled unless (a) `Service.reentrant` evaluates to True, or (b) the incoming call is directly traceable back to this same blocking call.

Rules for determining the concurrent-safeness of an execution frame:

A class/property/method is said to be explicitly unsafe iff the class/property/method is `@Synchronized`.

A class/property/method is said to be explicitly safe iff the class/property/method is `@Concurrent`.

A class is considered concurrent-safe iff all of the following hold true:

A property is considered concurrent-safe iff all of the following hold true:

A method is considered concurrent-safe iff all of the following hold true:

A frame is concurrent-safe iff all of the following hold true:
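As a hedged sketch of how the proposed annotations compose (the class and member names are invented):

```
@Concurrent
service Cache
    {
    // No explicit annotation: inherits @Concurrent from the class, so a
    // blocking call made from here allows other fibers to be scheduled.
    Int lookup(String key)
        {
        return key.size;  // placeholder body
        }

    // Explicitly unsafe: @Synchronized overrides the class-level
    // @Concurrent, so no other fiber is scheduled while a fiber is in
    // this method, even across blocking calls.
    @Synchronized
    void rebuild()
        {
        // rebuild internal mutable structures without interruption
        }
    }
```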