Replies: 15 comments 33 replies
-
Another thing that we considered (but that I forgot to mention) was a simple helper on the Service interface for executing a lambda in the equivalent of a critical section (by tagging the method with the
And/or:
-
I like it very much. I do still feel a bit uneasy about @Synchronized and @Concurrent being "inherited" from the parent when unspecified, but I imagine with usage we'll get a feel for whether it is better that way or not.
-
Regarding "directly traceable", does this mean just one level (service a -> service b -> service a), or can it be more indirect?
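For concreteness, a hedged sketch of the one-level case being asked about (the service and method names are invented, and the `a`/`b` references are assumed to be wired up elsewhere):

```
// a -> b -> a: while ServiceA.foo() is blocked on its call to
// ServiceB.bar(), ServiceB calls back into ServiceA. That incoming call
// is "directly traceable" back to ServiceA's pending blocking call.
// The question is whether a longer chain, e.g. a -> b -> c -> a,
// also qualifies.
service ServiceA
    {
    void foo()
        {
        b.bar();  // blocking cross-service call
        }
    void baz() {}
    }

service ServiceB
    {
    void bar()
        {
        a.baz();  // call back into ServiceA
        }
    }
```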
-
Gene didn't like the wording of this statement: "The presence within the fiber of any execution frame that is not safe, means that the fiber is not safe; each new execution frame is infected by the unsafeness of its caller." His clarification is something like this: "A callee cannot transition from concurrent-unsafe to concurrent-safe." In other words, a callee cannot somehow mark its caller as safe for concurrent/reentrant execution. So, once a fiber is concurrent-unsafe, the only way to make it concurrent-safe is to unwind back out of the concurrent-unsafe frames. In other words: "If the current method (i.e. execution frame) is not allowing service reentrancy/concurrency to occur, then nothing that the current method calls will be allowed to cause service reentrancy/concurrency to occur."
-
Yo, calling "Service.yield" in unsafe code is at best a code-smell, and very likely a real bug. Why yield? It's generally because you need something else to happen before you can make progress. So you yield, and upon return you check that the condition has changed, and if not, yield again. If you're in unsafe code, nobody else can run, so yielding never lets anything change, so you're in an infinite loop, i.e. live-lock. Even if you are in safe code, there are no scheduling guarantees from the OS that the correct other fiber/service executes and progress is made. If 1000 runnable fibers all call yield, I can pretty much guarantee you that the 1001st fiber will NEVER run and you'll just live-lock spinning yields on the "wrong" fibers. Another problem (for XTC) is that you can't (statically) know you're in unsafe code. IMHO (and fairly extensive experience here), calling yield is just asking for a livelock bug. Cliff
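The livelock failure mode being described can be sketched in a few lines (the `dataReady` condition is hypothetical):

```
// Anti-pattern: busy-waiting via yield. If the current frame is
// concurrent-unsafe, no other fiber can be scheduled during yield(),
// so dataReady can never change, and this loop spins forever.
while (!dataReady)
    {
    this:service.yield();
    }
```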
-
Quick summary of what I believe was agreed to on today's call:
-
Again on Yield: why do you have a "service event backlog"? Why are you even contemplating busy-waiting? Disallow it by fiat, and don't hand out any API support. Relying on it at any point is a ticket to a hellish live-lock debugging session.

Re checking for completed Futures: don't bother (semantically; optimistically it's useful to peek at just the first few prior ones when you are installing a new Future, but this code should never be exposed to any user). Futures checking happens when you block because, e.g., you need that future result. If nobody needs it right now, why check right now? Just wait for the need to arise.

Re checking for completed Futures: there's a real Control Theory problem here. If the count of uncompleted Futures grows large, and you start by checking them ALL at any given step, and the actual work they represent is small, you quickly go O(n^2) under load instead of O(n). Looks good In The Small, sucks big In The Large, and is very performance unstable. Been there, done that: had my 100M+ Futures in a doubling Java array and had it run out of doubling, go to a negative array length, and die. The fix was to quit looking for uncompleted Futures and just go complete some.
-
The more I think about this, the more I think you're just about an H2O cluster, with all the foibles that entails. I beat my way through a solid design, with no deadlock or live-lock possible, and near-100% throughput levels achievable as the normal course of events.... BUT
As a consequence, I routinely achieved very high throughput, with typically very short latencies. My queues, which in theory could grow to millions (and did, until I got the priority thing sorted out), typically stay very small. There is a hard, formal Control Theory mechanism in place to keep them that way: that's the "do old work before starting something new" concept, which is what that priority gig is about. Probably better to talk about this F2F... it gets deep quickly, and there are a lot of things buried in my memory about how to make this work, and about the hidden gotchas that I fell over.
-
That's probably a bit too kind 🤣 ... we certainly want to make it relatively easy to avoid deadlocks, and if one happens to occur, we want it to be evidenced as an exception, not as an actual deadlock condition. But what we are trying to do is to make safe concurrent code easy. Because there just aren't that many Martin Thompsons in the world, and there are lots of multi-core chips. 🤣
Right. The idea is that we won't know what the ideal mechanism to use is until we see the runtime profile data. So we would probably start (from the initial code-gen pass) optimistically, assuming no contention. Maybe using some sort of sticky, biased lock (zero cost until contention actually shows up). Then switch (deopt) to a short spin, and then maybe to a single-waiter model where the waiter simply deposits its "unpark" as a continuation on the current lock owner (contention, but low). Eventually, if contention is high, a highly concurrent queue implementation (something like the Disruptor approach, but tuned for multi-producer, single-consumer) becomes tempting.
-
The H2O version, plus typos. The main difference is that we usually lump all the Futures into the same box for bulk waiting, and there's a lack of language support to cut down on the verbiage (outside of lambdas, which help a lot).
-
Cliff, I was not suggesting "better or worse"; I was trying to draw your eye to an interesting design choice that we made (which will be expensive if we are not very careful in how we generate code for it). To some extent, it is similar to Jim's comment about auto-boxing and escaping ... there is an implicit second conduit for the return mechanism (i.e. it's not just a simple value-return-on-the-stack mechanism), and that second conduit (for the

You can imagine it in an example as simple as:

In theory, an instance of "b" could be a service, returning a future Int. So it needs a "reverse trampoline" (I just made up the term, so use your imagination) to handle something that it wasn't originally expecting. We do rely on this concept elsewhere, and not just for futures; in other words, we rely on being able to make incorrect assumptions that can be violated at a later point, and on rejiggering everything at that point to make sure that the unexpected is handled correctly, and that the cost is minimized after the first time. You and I have discussed this on preambles (multiple entries to the same function, with increasing specificity of assumptions on each), but I don't think we've discussed it going the other way (hence the term "reverse trampoline").
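The elided example was presumably along these lines (this snippet is a guess; the method name `calc` is invented):

```
// Looks like a plain value return at the call site...
Int i = b.calc();

// ...but if "b" is actually backed by a service, what arrives is a
// @Future Int, and the runtime needs the "reverse trampoline" to
// reconcile that future with the plain Int the caller compiled against.
```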
-
Yeah, no problem. I've seen the term used elsewhere. For what it's worth, HotSpot uses them in a number of places.
-
So after a few weeks with this new design, we ran into a wall. Or perhaps a hand grenade. Specifically, when a service is already "red" (a fiber is in a red method), and that fiber calls another service (allowing other "green" fibers to be scheduled), those "green" fibers can run up until exactly the point that they attempt to invoke a "red" method. The problem? That sudden halt can easily happen in the middle of a line of code. Long story short: after discussing with Mark and Gene, we reverted to a simpler form of the new model, in which other fibers (even green ones) will not be scheduled so long as a fiber is in red territory. We'll cover this in detail on tomorrow's call.
-
Ecstasy uses mutable domains of information, called "services". Like any other object, a service has a callable surface area, i.e. its API. When a call is made "into" a service (i.e. a call from a different service), that call creates a new fiber in the service being called, and that new fiber will then perform the call's work "inside of" the service. A service can only execute one fiber at a time; when multiple calls are made into a service, the reentrancy rules determine whether other fibers may be scheduled before the first completes, and under what conditions those fibers may be scheduled. If a fiber is allowed to be scheduled, then it can only be scheduled when the current fiber either (a) completes, (b) makes a blocking call into another service, or (c) explicitly calls Service.yield().
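As a hedged illustration of the model (the service and its members are invented):

```
service Counter
    {
    Int count = 0;

    Int next()
        {
        return ++count;
        }
    }

// A call from a different service creates a new fiber inside Counter.
// The caller can receive the result asynchronously as a future:
@Future Int n = counter.next();
```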
A recent bug in the JSON database prototype resulted in several long days of painful debugging. Basically, the Client instances (any number thereof) are services that manage their transactional boundaries using the TxManager, which is also a service. The Client instances also use the ObjectStore instances directly (one per database object), and each ObjectStore is a separate service. So there are `m` Client instances and `n` ObjectStore instances.

But there's one more twist to this puzzle. In addition to the `m` Client instances hammering away on the TxManager, the `n` ObjectStore instances also use the TxManager (for example, to enlist themselves), and these uses are almost always caused by an invocation from one of those Client instances.

The services themselves are designed to deal with this chaos. The TxManager, for example, explicitly uses the `Open` `Reentrancy` setting, because it is a busy hub of work, and it wants to push as much through as it possibly can (almost all of which is turned into asynchronous requests, to be performed by the other services). In a few cases, mostly related to life-cycle management, it uses explicit `CriticalSection` blocks, in order to "close the front door", i.e. so that while it is doing something important, it knows that no other unrelated call will come in to the service and get scheduled to run.

Ultimately, there were two bugs that surfaced in the course of debugging this issue:

The first, and most obvious one, was caused by the SkiplistMap used by the ObjectStore to track in-flight transactions. We were using the computeIfAbsent method, and in the lambda, the TxManager service was being called to enlist the ObjectStore into the transaction. This call, going across a service boundary and blocking, allowed other fibers to be scheduled. A quick fix was simple enough: use a "`using new CriticalSection()`" block inside of the lambda, wrapped around the call to the TxManager. However, it was worrisome that such a thing could occur deep inside a mutable data structure, one that was not designed with concurrency and reentrancy in mind.

The second, and more insidious one, was caused by the same SkiplistMap. Inside the SkiplistMap, there is a coin-flip that occurs (it's part of Bill Pugh's original design for the skiplist data structure). That coin-flip is performed by a call to a `Random` object, which was injected into the SkiplistMap. The injected `Random` just happened to be a `service` (all injected references are either services or immutable objects), and when the SkiplistMap called to that service (not knowing that it was a service!), the blocking call allowed the active fiber to be suspended -- in the middle of an operation deep inside SkiplistMap! -- and that, in turn, caused data corruption when another fiber started executing and mutating that same SkiplistMap. Again, why should the SkiplistMap (or any other low-level, mutable data structure) even have to be aware of this possibility?!

In our current design, the reentrancy for a service is managed as a property on the service object itself. (And the handy CriticalSection simply sets this value on the way in, and unsets it on the way out.)
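The quick fix described above might look roughly like this sketch (the map name, key, record type, and TxManager method are all assumptions, not the actual prototype code):

```
txById.computeIfAbsent(txId, () ->
    {
    // While the CriticalSection is in scope, no other fiber can be
    // scheduled into this service, even though enlist() blocks on the
    // TxManager service; the SkiplistMap mutation cannot be interleaved.
    using (new CriticalSection())
        {
        txManager.enlist(this, txId);
        }
    return new TxRecord(txId);
    });
```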
But what about simple, mutable data structures being used within the service that could inadvertently make a blocking call to a service, either because one was injected (surprise!), or perhaps one was passed in as a lambda, or as an interface whose invocation directly -- or indirectly -- calls a service? Mark made the simple and astute observation that one should have to opt in to concurrency, i.e. to allowing a switch to another fiber to occur.
The more that we have investigated this idea, the more it grows on us. The conceptual net result (i.e. the goal of the re-design) is that developers will only be forced to think about concurrency when they are trying to increase concurrency by addressing mutable data structures that are not explicitly concurrent-safe (i.e. reentrancy-safe).
The new design is based on a few, fundamental observations:
The reentrancy problem almost always occurs within methods on mutable data structures;
With rare exception, neither functions nor immutable objects suffer from this malady;
The presence within the fiber of any execution frame that is not safe, means that the fiber is not safe; each new execution frame is infected by the unsafeness of its caller.
The new design has several purposeful attributes that are common in Ecstasy:
Reasonable defaults: if something is obvious, a developer should not be required to type it, unless typing it makes the code more readable.
Hierarchical configuration: For example, in the absence of explicit information on a method, the information can be provided by the parent of the method, and so on, all the way up to the top level class containing the method.
Safety and predictability are considered non-negotiable; the ability to maximize concurrency and performance must respect safety and predictability.
Proposed changes:
Introduce two new method/property/class annotations, `@Synchronized` and `@Concurrent`. The first, `@Synchronized`, explicitly marks a method/property/class as being unsafe for concurrent reentrancy. The second, `@Concurrent`, explicitly marks a method/property/class as being safe for concurrent reentrancy. When a conflict exists, `@Synchronized` overrides `@Concurrent`.

Change the `Reentrancy reentrancy` property on Service to `@RO Boolean reentrant`, which evaluates to True iff the current fiber can safely be "interrupted" (by the scheduling of another concurrent fiber) when it makes a blocking call to another service. It evaluates to True iff each execution frame for the fiber is concurrent-safe (i.e. reentrancy-safe for new fibers), and no CriticalSection has been registered.

An execution frame is considered concurrent-safe based on a set of rules (below), but in general: anything marked with `@Synchronized` is unsafe, anything marked with `@Concurrent` is assumed safe, and any execution frame whose invocation target (i.e. `this`) is a mutable object is assumed unsafe.

When a service makes a blocking call (or a call to `Service.yield()`), another fiber must not be scheduled unless (a) `Service.reentrant` evaluates to True, or (b) the incoming call is directly traceable back to this same blocking call.

Rules for determining the concurrent-safeness of an execution frame:

A class/property/method is said to be explicitly unsafe iff the class/property/method is `@Synchronized`.

A class/property/method is said to be explicitly safe iff the class/property/method is `@Concurrent`.

A class is considered concurrent-safe iff all of the following hold true:

A property is considered concurrent-safe iff all of the following hold true:

A method is considered concurrent-safe iff all of the following hold true:

A frame is concurrent-safe iff all of the following hold true:
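As a hedged sketch of how the proposed annotations compose (the class and member names are invented):

```
@Concurrent
service Cache
    {
    // No explicit annotation: inherits @Concurrent from the class, so a
    // blocking call made from here allows other fibers to be scheduled.
    Int lookup(String key)
        {
        return key.size;  // placeholder body
        }

    // Explicitly unsafe: @Synchronized overrides the class-level
    // @Concurrent, so no other fiber is scheduled while a fiber is in
    // this method, even across blocking calls.
    @Synchronized
    void rebuild()
        {
        // rebuild internal mutable structures without interruption
        }
    }
```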