
Commit 1bc0e81

Improved Process Recovery docs writeup
1 parent 34a76f1 commit 1bc0e81

File tree: 5 files changed (+128, -27 lines)


CMakeLists.txt

Lines changed: 3 additions & 3 deletions

@@ -16,9 +16,9 @@ set(FENIX_VERSION_MAJOR 1)
 set(FENIX_VERSION_MINOR 0)

 option(BUILD_EXAMPLES "Builds example programs from the examples directory" OFF)
-option(BUILD_TESTING "Builds tests and test modes of files" ON)
-option(BUILD_DOCS "Builds documentation if is doxygen found" ON)
-option(DOCS_ONLY "Only build documentation" OFF)
+option(BUILD_TESTING "Builds tests and test modes of files" ON)
+option(BUILD_DOCS "Builds documentation if is doxygen found" ON)
+option(DOCS_ONLY "Only build documentation" OFF)

 #Solves an issue with some system environments putting their MPI headers before
 #the headers CMake includes. Forces non-system MPI headers when incorrect headers

doc/CMakeLists.txt

Lines changed: 3 additions & 3 deletions

@@ -1,8 +1,8 @@
 find_package(Doxygen)

-set(FENIX_DOCS_OUTPUT ${CMAKE_CURRENT_BINARY_DIR} CACHE INTERNAL "Documentation output directory")
-set(FENIX_DOCS_MAN "YES" CACHE INTERNAL "Option to disable man page generation for CI builds")
-set(FENIX_BRANCH "local" CACHE INTERNAL "Git branch being documented, or local if not building for Github Pages")
+set(FENIX_DOCS_OUTPUT ${CMAKE_CURRENT_BINARY_DIR} CACHE PATH "Documentation output directory")
+set(FENIX_DOCS_MAN "YES" CACHE BOOL "Option to disable man page generation for CI builds")
+set(FENIX_BRANCH "local" CACHE BOOL "Git branch being documented, or local if not building for Github Pages")

 if(NOT DOXYGEN_FOUND)
 message(STATUS "Doxygen not found, `make docs` disabled")

doc/markdown/DataRecovery.md

Lines changed: 4 additions & 2 deletions

@@ -35,9 +35,10 @@ redundant storage, can be associated with a specific instance of
 a Fenix *data group* to form a semantic unit. Committing such a
 group ensures that the data involved is available for recovery.

+-----
+
 ## Data Groups

------
 A Fenix *data group* provides dual functionality. First, it serves
 as a container for a set of data objects (*members*) that are
 committed together, and hence provides transaction semantics.

@@ -55,9 +56,10 @@ conditionally skip the creation after initialization).
 See #Fenix_Data_group_create
 for creating a data group.

+-----
+
 ## Data Redundancy Policies

------
 Fenix internally uses an extensible system for defining data
 policies to keep the door open to easily adding new data policies
 and configuring them on a per-data-group basis. We currently

doc/markdown/ProcessRecovery.md

Lines changed: 116 additions & 19 deletions

@@ -1,19 +1,116 @@
-Functions and types for process recovery.
-
-* Only communicators derived from the communicator returned by
-Fenix_Init are eligible for reconstruction.
-After communicators have been repaired, they contain the same
-number of ranks as before the failure occurred, unless the user
-did not allocate sufficient redundant resources (*spare ranks*)
-and instructed Fenix not to create new ranks. In this case
-communicators will still be repaired, but will contain fewer
-ranks than before the failure occurred.
-
-* To ease adoption of MPI fault tolerance, Fenix automatically
-captures any errors resulting from MPI library calls that are a
-result of a damaged communicator (other errors reported by the
-MPI runtime are ignored by Fenix and are returned to the
-application, for handling by the application writer). In other
-words, programmers do not need to replace calls to the MPI library
-with calls to Fenix (for example, *Fenix_Send* instead of
-*MPI_Send*).

Process recovery within Fenix can be broken down into three steps: detection,
communicator recovery, and application recovery.

---

## Detecting Failures

Fenix is built on top of ULFM MPI, so specific fault detection mechanisms and
options can be found in the [ULFM
documentation](https://docs.open-mpi.org/en/v5.0.x/features/ulfm.html#). At a
high level, this means that Fenix will detect failures when an MPI function
call is made that involves a failed rank. Detection is not collectively
consistent, meaning some ranks may fail to complete a collective while other
ranks finish successfully. Once a failure is detected, Fenix will 'revoke' the
communicator that the failed operation was using and the top-level communicator
output by #Fenix_Init (these communicators are usually the same). The
revocation is permanent and means that all future operations on the
communicator by any rank will fail. This allows knowledge of the failed rank to
be propagated to all ranks in the communicator, even if some ranks would never
have directly communicated with the failed rank.

Since failures can only be detected during MPI function calls, applications with
long periods of communication-free computation will experience delays in
beginning recovery. Such applications may benefit from inserting periodic calls
to #Fenix_Process_detect_failures to allow ranks to participate in global
recovery operations with less delay.
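
A minimal sketch of such a periodic check, during a long communication-free
phase, follows. This is not code from the Fenix sources: the single-argument
form of #Fenix_Process_detect_failures and the `do_local_work` kernel are
assumptions, so check fenix.h for the real prototype before using it.

```cpp
#include <mpi.h>
#include "fenix.h"

// Hypothetical communication-free kernel, stands in for real local work.
static void do_local_work(int iter) { (void)iter; }

void long_local_phase()
{
    const int check_interval = 1000;            // iterations between detection polls
    for (int iter = 0; iter < 1000000; ++iter) {
        do_local_work(iter);

        // Periodically poll for failures so this rank can join global recovery
        // promptly instead of waiting for its next "real" MPI call.
        if (iter % check_interval == 0) {
            // Assumed signature; consult fenix.h for the actual prototype.
            Fenix_Process_detect_failures(1);
        }
    }
}
```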

Fenix will only detect and respond to failures that occur on the communicator
provided by #Fenix_Init or any communicators derived from it. Faults on other
communicators will, by default, abort the application. Note that having
multiple derived communicators is not currently recommended and may lead to
deadlock; in fact, even one derived communicator may lead to deadlock if not
used carefully. If you have a use case that requires multiple communicators,
please contact us -- we can provide guidance and may be able to update Fenix
to support it.

**Advanced:** Applications may wish to handle some failures themselves --
either ignoring them or implementing custom recovery logic in certain code
regions. This is not generally recommended, and significant care must be taken
to ensure that the application does not attempt to enter two incompatible
recovery steps. However, if you wish to do this, you can include "fenix_ext.h"
and manually set `fenix.ignore_errs` to a non-zero value. This will cause
Fenix's error handler to simply return any errors it encounters as the return
code of the application's MPI function call. Alternatively, applications may
temporarily replace the communicator's error handler to avoid Fenix recovery.
If you have a use case that would benefit from this, you can contact us for
guidance and/or to request specific error handling features.
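
A minimal sketch of that advanced path, assuming "fenix_ext.h" and the
`fenix.ignore_errs` field behave exactly as described above; the helper
function and its error handling are hypothetical and not part of the Fenix API.

```cpp
#include <mpi.h>
#include "fenix_ext.h"   // exposes Fenix internals, including the global `fenix` state

// Hypothetical helper: send without letting Fenix start recovery on failure.
int try_send_without_recovery(const double *buf, int count, int dest, MPI_Comm comm)
{
    fenix.ignore_errs = 1;   // Fenix's error handler now just returns MPI errors
    int rc = MPI_Send(buf, count, MPI_DOUBLE, dest, /*tag=*/0, comm);
    fenix.ignore_errs = 0;   // restore normal Fenix recovery behavior

    if (rc != MPI_SUCCESS) {
        // Custom handling goes here; take care not to conflict with ranks that
        // may already have entered Fenix's own recovery path.
    }
    return rc;
}
```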

---

## Communicator Recovery

Once a failure has been detected, Fenix begins the collective process of
rebuilding the resilient communicator provided by #Fenix_Init. There are two
ways to rebuild: replacing failed ranks with spares, or shrinking the
communicator to exclude the failed ranks. If any spares are available, Fenix
uses them to replace the failed ranks, maintaining the original communicator
size and guaranteeing that surviving processes keep the same rank ID. If there
are not enough spares, some processes may have a different rank ID on the new
communicator, and Fenix will warn the user by setting the error code of
#Fenix_Init to #FENIX_WARNING_SPARE_RANKS_DEPLETED.
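
A minimal sketch of checking for that warning. The argument list of #Fenix_Init
shown here follows the pattern used in the Fenix example programs and should be
verified against fenix.h; the number of spare ranks is arbitrary.

```cpp
#include <cstdio>
#include <mpi.h>
#include "fenix.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int role = 0, error = 0;
    MPI_Comm resilient_comm = MPI_COMM_NULL;

    // Argument order follows the Fenix examples; verify against fenix.h.
    Fenix_Init(&role, MPI_COMM_WORLD, &resilient_comm,
               &argc, &argv, /*spare_ranks=*/2, 0, MPI_INFO_NULL, &error);

    if (error == FENIX_WARNING_SPARE_RANKS_DEPLETED) {
        // Recovery had to shrink the communicator: rank IDs may have changed,
        // so rank-indexed data structures must be rebuilt from resilient_comm.
        int rank = -1;
        MPI_Comm_rank(resilient_comm, &rank);
        std::fprintf(stderr, "rank %d: spare ranks depleted, communicator shrunk\n", rank);
    }

    /* ... application work on resilient_comm ... */

    Fenix_Finalize();
    MPI_Finalize();
    return 0;
}
```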
62+
63+
**Advanced:** Communicator recovery is collective, blocking, and not
64+
interruptable. ULFM exposes some functions (e.g. MPIX_Comm_agree,
65+
MPIX_Comm_shrink) that are also not interrupable -- meaning they will continue
66+
despite any failures or revocations. If multiple collective, non-interruptable
67+
operations are started by different ranks in different orders, the application
68+
will deadlock. This is similar to what would happen if a non-resilient
69+
application called multiple collectives (e.g. `MPI_Allreduce`) in different
70+
orders. However, the preemptive and inconsistent nature of failure recovery
71+
makes it more complex to reason about ordering between ranks. Fenix uses these
72+
ULFM functions internally, so care is taken to ensure that the order of
73+
operations is consistent across ranks. Before any such operation begins, Fenix
74+
first uses MPIX_Comm_agree on the resilient communicator provided by
75+
#Fenix_Init to agree on which 'location' will proceed - if there is any
76+
disagreement, all ranks will enter recovery as if they had detected a failure.
77+
Applications which wish to use these functions themselves should follow this
78+
pattern, providing a unique 'location' value for any operations that may be
79+
interrupted.
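
One possible realization of that 'location' handshake for application code that
wants to call an uninterruptible ULFM operation itself is sketched below.
MPIX_Comm_agree computes a bitwise AND of the contributed flags, so one-hot
location values make any disagreement visible to every rank; the location
constants and the revoke-on-disagreement response are illustrative assumptions,
not Fenix internals.

```cpp
#include <mpi.h>
#include <mpi-ext.h>   // ULFM extensions: MPIX_Comm_agree, MPIX_Comm_revoke, ...

// One-hot location tags: the AND of two different locations is 0, which no
// rank contributed, so every rank can see the disagreement.
constexpr int LOC_SHRINK_PHASE     = 0x1;
constexpr int LOC_CHECKPOINT_AGREE = 0x2;

// Returns true if every rank agreed to run the operation tagged `my_location`.
bool agree_on_location(MPI_Comm resilient_comm, int my_location)
{
    int flag = my_location;
    MPIX_Comm_agree(resilient_comm, &flag);   // uninterruptible, collective

    if (flag != my_location) {
        // Some rank wanted a different (or no) operation: treat it like a
        // detected failure so everyone falls back into a consistent recovery path.
        MPIX_Comm_revoke(resilient_comm);
        return false;
    }
    return true;   // safe to start the uninterruptible operation
}
```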

---

## Application Recovery

Once a new communicator has been constructed, application recovery begins.
There are two recovery modes: jumping (the default) and non-jumping. With
jumping recovery, Fenix automatically `longjmp`s to the #Fenix_Init call site
once communicator recovery is complete. This allows for very simple recovery
logic, since it mimics the traditional teardown-restart pattern. However,
`longjmp` leaves many semantics undefined in the C and C++ specifications and
may result in unexpected behavior due to compiler assumptions and
optimizations. Additionally, some applications may be able to recover more
efficiently by continuing inline. Users can initialize Fenix as non-jumping
(see test/no_jump) so that, after communicator recovery, the triggering MPI
function call instead returns an error code. This may require more intrusive
code changes (checking the return status of each MPI call).

Fenix also allows applications to register one or more callback functions with
#Fenix_Callback_register; #Fenix_Callback_pop removes the most recently
registered callback. Callbacks are invoked after communicator recovery, just
before control returns to the application, and are executed in the reverse
order of their registration.
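
A minimal sketch of callback registration follows. The three-argument callback
signature shown (repaired communicator, error code, user data) is an assumption
about the Fenix API, as is the zero-argument form of #Fenix_Callback_pop;
confirm both against fenix.h before use.

```cpp
#include <mpi.h>
#include "fenix.h"

struct SolverState { int step; /* ... application state ... */ };

// Runs after communicator recovery, just before control returns to the app.
static void rebuild_after_recovery(MPI_Comm repaired_comm, int error, void *user_data)
{
    SolverState *state = static_cast<SolverState *>(user_data);
    // Re-derive anything that depended on the old communicator or failed ranks.
    (void)repaired_comm; (void)error; (void)state;
}

void setup_callbacks(SolverState *state)
{
    Fenix_Callback_register(rebuild_after_recovery, state);  // invoked LIFO after recovery
    // ... later, if the callback is no longer wanted:
    // Fenix_Callback_pop();
}
```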

For C++ applications, it is recommended to use Fenix in non-jumping mode and to
register a callback that throws an exception. At its simplest, wrapping
everything between #Fenix_Init and #Fenix_Finalize in a single try-catch gives
the same simple recovery logic as jumping mode, but without the undefined
behavior of `longjmp`.
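
A sketch of that pattern under the same assumptions as above: the callback
signature, the #Fenix_Init argument list, the exception type, and the exact
MPI_Info settings needed to request non-jumping mode (see test/no_jump) are all
placeholders, and `run_application` is a hypothetical driver.

```cpp
#include <mpi.h>
#include "fenix.h"
#include <stdexcept>

struct FenixRecoveryError : std::runtime_error {
    explicit FenixRecoveryError(int err)
        : std::runtime_error("Fenix recovered from a process failure"), error(err) {}
    int error;
};

// Registered with Fenix; unwinds the stack instead of relying on longjmp.
static void throw_on_recovery(MPI_Comm, int error, void *)
{
    throw FenixRecoveryError(error);
}

static void run_application(MPI_Comm) { /* hypothetical application driver */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int role = 0, error = 0;
    MPI_Comm comm = MPI_COMM_NULL;
    MPI_Info info;
    MPI_Info_create(&info);
    // Request non-jumping behavior here (exact key/value: see test/no_jump).

    Fenix_Init(&role, MPI_COMM_WORLD, &comm, &argc, &argv,
               /*spare_ranks=*/2, 0, info, &error);
    Fenix_Callback_register(throw_on_recovery, nullptr);

    bool done = false;
    while (!done) {
        try {
            run_application(comm);
            done = true;
        } catch (const FenixRecoveryError &) {
            // The communicator has already been repaired; restore state and retry.
        }
    }

    Fenix_Finalize();
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```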

#Fenix_Init outputs a role, from #Fenix_Rank_role, which informs the
application about the recovery state of the rank. It is important to note that
all spare ranks are held inside #Fenix_Init until they are used for recovery.
Therefore, after recovery, recovered ranks will not have the same callbacks
registered -- recovered ranks will need to manually invoke any callbacks that
use MPI functions. These roles also help the application more generally modify
its behavior based on each rank's recovery state.
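
To make the role handling concrete, a short sketch of branching on the role
returned by #Fenix_Init is given below. The FENIX_ROLE_* names follow Fenix's
naming convention but should be confirmed against fenix.h, and the two helper
functions are hypothetical application code.

```cpp
#include <mpi.h>
#include "fenix.h"

static void restore_from_checkpoint(MPI_Comm) { /* hypothetical helper */ }
static void register_recovery_callbacks()     { /* wraps Fenix_Callback_register calls */ }

void start_or_resume(int role, MPI_Comm resilient_comm)
{
    if (role == FENIX_ROLE_INITIAL_RANK) {
        // Fresh start: no failure has happened yet on this rank.
        register_recovery_callbacks();
    } else if (role == FENIX_ROLE_SURVIVOR_RANK) {
        // This rank lived through the failure: its callbacks are still registered,
        // and in jump mode it has just returned to the Fenix_Init call site.
        restore_from_checkpoint(resilient_comm);
    } else if (role == FENIX_ROLE_RECOVERED_RANK) {
        // A former spare standing in for a failed rank: it was parked inside
        // Fenix_Init, so callbacks registered before the failure are not present
        // here and any MPI-using ones must be handled by hand (see above).
        register_recovery_callbacks();
        restore_from_checkpoint(resilient_comm);
    }
}
```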

include/fenix.h

Lines changed: 2 additions & 0 deletions

@@ -75,6 +75,7 @@ extern "C" {
 /**
  * @defgroup ReturnCodes Return Codes
  * @brief All possible return codes from Fenix functions.
+ * Errors are negative, warnings are positive.
  * @{
  */
 #define FENIX_SUCCESS 0

@@ -115,6 +116,7 @@ extern "C" {

 /**
  * @defgroup ProcessRecovery Process Recovery
+ * @brief Functions for managing process recovery in Fenix.
  * @details @include{doc} ProcessRecovery.md
  * @{
  */
