| 1 | -Functions and types for process recovery. |
| 2 | - |
| 3 | -* Only communicators derived from the communicator returned by |
| 4 | -Fenix_Init are eligible for reconstruction. |
| 5 | -After communicators have been repaired, they contain the same |
| 6 | -number of ranks as before the failure occurred, unless the user |
| 7 | -did not allocate sufficient redundant resources (*spare ranks*) |
| 8 | -and instructed Fenix not to create new ranks. In this case |
| 9 | -communicators will still be repaired, but will contain fewer |
| 10 | -ranks than before the failure occurred. |
| 11 | - |
| 12 | -* To ease adoption of MPI fault tolerance, Fenix automatically |
| 13 | -captures any errors resulting from MPI library calls that are a |
| 14 | -result of a damaged communicator (other errors reported by the |
| 15 | -MPI runtime are ignored by Fenix and are returned to the |
| 16 | -application, for handling by the application writer). In other |
| 17 | -words, programmers do not need to replace calls to the MPI library |
| 18 | -with calls to Fenix (for example, *Fenix_Send* instead of |
| 19 | -*MPI_Send*). |
| 1 | +Process recovery within Fenix can be broken down into three steps: detection, |
| 2 | +communicator recovery, and application recovery. |
| 3 | + |
| 4 | +--- |
| 5 | + |
| 6 | +## Detecting Failures |
| 7 | + |
| 8 | +Fenix is built on top of ULFM MPI, so specific fault detection mechanisms and |
| 9 | +options can be found in the [ULFM |
| 10 | +documentation](https://docs.open-mpi.org/en/v5.0.x/features/ulfm.html#). At a |
| 11 | +high level, this means that Fenix will detect failures when an MPI function |
| 12 | +call is made which involves a failed rank. Detection is not collectively |
| 13 | +consistent, meaning some ranks may fail to complete a collective while other |
| 14 | +ranks finish successfully. Once a failure is detected, Fenix will 'revoke' the |
| 15 | +communicator that the failed operation was using and the top-level communicator |
| 16 | +output by #Fenix_Init (these communicators are usually the same). The |
| 17 | +revocation is permanent, and means that all future operations on the |
| 18 | +communicator by any rank will fail. This allows knowledge of the failed rank to |
| 19 | +be propagated to all ranks in the communicator, even if some ranks would never |
| 20 | +have directly communicated with the failed rank. |
| 21 | + |
| 22 | +Since failures can only be detected during MPI function calls, applications with |
| 23 | +long periods of communication-free computation will experience delays in beginning |
| 24 | +recovery. Such applications may benefit from inserting periodic calls to |
| 25 | +#Fenix_Process_detect_failures to allow ranks to participate in global recovery |
| 26 | +operations with less delay. |
| 27 | + |
| 28 | +Fenix will only detect and respond to failures that occur on the communicator |
| 29 | +provided by #Fenix_Init or any communicators derived from it. Faults on other |
| 30 | +communicators will, by default, abort the application. Note that having |
| 31 | +multiple derived communicators is not currently recommended, and may lead to |
| 32 | +deadlock. In fact, even one derived communicator may lead to deadlock if not |
| 33 | +used carefully. If you have a use case that requires multiple communicators, |
| 34 | +please contact us about your use case -- we can provide guidance and may be |
| 35 | +able to update Fenix to support it. |
| 36 | + |
| 37 | +**Advanced:** Applications may wish to handle some failures themselves - either |
| 38 | +ignoring them or implementing custom recovery logic in certain code regions. |
| 39 | +This is not generally recommended. Significant care must be taken to ensure |
| 40 | +that the application does not attempt to enter two incompatible recovery steps. |
| 41 | +However, if you wish to do this, you can include "fenix_ext.h" and manually set |
| 42 | +`fenix.ignore_errs` to a non-zero value. This will cause Fenix's error handler |
| 43 | +to simply return any errors it encounters as the return code of the application's |
| 44 | +MPI function call. Alternatively, applications may temporarily replace the |
| 45 | +communicator's error handler to avoid Fenix recovery. If you have a use case |
| 46 | +that would benefit from this, you can contact us for guidance and/or to request |
| 47 | +some specific error handling features. |
| 48 | + |
| 49 | +--- |
| 50 | + |
| 51 | +## Communicator Recovery |
| 52 | + |
| 53 | +Once a failure has been detected, Fenix will begin the collective process of |
| 54 | +rebuilding the resilient communicator provided by #Fenix_Init. There are two |
| 55 | +ways to rebuild: replacing failed ranks with spares, or shrinking the |
| 56 | +communicator to exclude the failed ranks. If there are any spares available, |
| 57 | +Fenix will use those to replace the failed ranks, which maintains the original |
| 58 | +communicator size and guarantees that surviving processes keep the same rank ID. |
| 59 | +If there are not enough spares, some processes may have a different rank ID on |
| 60 | +the new communicator, and Fenix will warn the user about this by setting the |
| 61 | +error code for #Fenix_Init to #FENIX_WARNING_SPARE_RANKS_DEPLETED. |
| 62 | + |
| 63 | +**Advanced:** Communicator recovery is collective, blocking, and not |
| 64 | +interruptible. ULFM exposes some functions (e.g. MPIX_Comm_agree, |
| 65 | +MPIX_Comm_shrink) that are also not interruptible -- meaning they will continue |
| 66 | +despite any failures or revocations. If multiple collective, non-interruptible |
| 67 | +operations are started by different ranks in different orders, the application |
| 68 | +will deadlock. This is similar to what would happen if a non-resilient |
| 69 | +application called multiple collectives (e.g. `MPI_Allreduce`) in different |
| 70 | +orders. However, the preemptive and inconsistent nature of failure recovery |
| 71 | +makes it more complex to reason about ordering between ranks. Fenix uses these |
| 72 | +ULFM functions internally, so care is taken to ensure that the order of |
| 73 | +operations is consistent across ranks. Before any such operation begins, Fenix |
| 74 | +first uses MPIX_Comm_agree on the resilient communicator provided by |
| 75 | +#Fenix_Init to agree on which 'location' will proceed - if there is any |
| 76 | +disagreement, all ranks will enter recovery as if they had detected a failure. |
| 77 | +Applications which wish to use these functions themselves should follow this |
| 78 | +pattern, providing a unique 'location' value for any operations that may be |
| 79 | +interrupted. |
| 80 | + |
| 81 | +--- |
| 82 | + |
| 83 | +## Application Recovery |
| 84 | + |
| 85 | +Once a new communicator has been constructed, application recovery begins. |
| 86 | +There are two recovery modes: jumping (default) and non-jumping. With jumping |
| 87 | +recovery, Fenix will automatically `longjmp` to the #Fenix_Init call site once |
| 88 | +communicator recovery is complete. This allows for very simple recovery logic, |
| 89 | +since it mimics the traditional teardown-restart pattern. However, `longjmp` |
| 90 | +has undefined behavior in many situations under the C and C++ standards and may |
| 91 | +result in unexpected behavior due to compiler assumptions and optimizations. |
| 92 | +Additionally, some applications may be able to more efficiently recover by |
| 93 | +continuing inline. Users can initialize Fenix as non-jumping (see test/no_jump) |
| 94 | +to instead return an error code from the triggering MPI function call after |
| 95 | +communicator recovery. This may require more intrusive code changes (checking |
| 96 | +return statuses of each MPI call). |
| 97 | + |
| 98 | +Fenix also allows applications to register one or more callback functions with |
| 99 | +#Fenix_Callback_register, and provides #Fenix_Callback_pop, which removes the most |
| 100 | +recently registered callback. These callbacks are invoked after communicator |
| 101 | +recovery, just before control returns to the application. Callbacks are |
| 102 | +executed in the reverse order they were registered. |
| 103 | + |
| 104 | +For C++ applications, it is recommended to use Fenix in non-jumping mode and to |
| 105 | +register a callback that throws an exception. At its simplest, wrapping |
| 106 | +everything between #Fenix_Init and #Fenix_Finalize in a single try-catch can |
| 107 | +give the same simple recovery logic as jumping mode, but without the undefined |
| 108 | +behavior of `longjmp`. |
| 109 | + |
| 110 | +#Fenix_Init outputs a role, from #Fenix_Rank_role, which helps inform the |
| 111 | +application about the recovery state of the rank. It is important to note that |
| 112 | +all spare ranks are captured inside #Fenix_Init until they are used for |
| 113 | +recovery. Therefore, after recovery, recovered ranks will not have the same |
| 114 | +callbacks registered -- recovered ranks will need to manually invoke any |
| 115 | +callbacks that use MPI functions. These roles also help the application more |
| 116 | +generally modify its behavior based on each rank's recovery state. |