Improve failure mode, add multiple DCs (#1273)

as51340 · antejavor · matea16 · web-flow · commit 6092e0105ac7 · 2025-06-10T09:54:38.000+02:00
* Improve failure mode, add multiple DCs

* Remove crash-stop, add omission faults

* Document slower failover time

---------

Co-authored-by: Ante Javor &lt;ante.javor@memgraph.io&gt;
Co-authored-by: Matea Pesic &lt;80577904+matea16@users.noreply.github.com&gt;
diff --git a/pages/clustering/high-availability.mdx b/pages/clustering/high-availability.mdx
@@ -635,8 +635,8 @@ the wrong state of other clusters, it can become a leader without being connecte
 
 ## Recovering from errors
 
-Distributed systems can fail in numerous ways. With the current implementation, Memgraph instances are resilient to occasional network
-failures and independent machine failures. Byzantine failures aren't handled since the Raft consensus protocol cannot deal with them either. 
+Distributed systems can fail in numerous ways. Memgraph processes are resilient to network
+failures, omission faults and independent machine failures. Byzantine failures aren't tolerated since the Raft consensus protocol cannot deal with them either.
 
 Recovery Time Objective (RTO) is an often used term for measuring the maximum tolerable length of time that an instance or cluster can be down.
 Since every highly available Memgraph cluster has two types of instances, we need to analyze the failures of each separately.
@@ -652,9 +652,6 @@ and the time needed to realize the instance is down (`--instance-down-timeout-se
 using just a handful of RPC messages (correct time depends on the distance between instances). It is important to mention that the whole failover is performed without the loss of committed data
 if the newly chosen MAIN (previously REPLICA) had all up-to-date data.
 
-Current deployment assumes the existence of only one datacenter, which automatically means that Memgraph won't be available in the case the whole datacenter goes down. We are actively
-working on 2 datacenter (2-DC) architecture.
-
 ## Raft configuration parameters
 
 Several Raft-related parameters are important for the correct functioning of the cluster. The leader coordinator sends a heartbeat
@@ -664,9 +661,14 @@ expiration is set to 2000ms so that cluster can never get into situation where m
 the ability to survive occasional network hiccups without triggering leadership changes.
 
 
+## Data center failure
+
+The architecture we currently use allows us to deploy coordinators in 3 data centers and hence tolerate a failure of the whole data center. Data instances can be freely
+distributed in any way you want between data centers. The failover time will be slighlty increased due to the network communication needed.
+
 ## Kubernetes
 
-We support deploying Memgraph HA instances as part of the Kubernetes cluster. 
+We support deploying Memgraph HA as part of the Kubernetes cluster through Helm charts.
 You can see example configurations [here](/getting-started/install-memgraph/kubernetes#memgraph-high-availability-helm-chart).
 
 ## Docker Compose