Commit 6092e01

as51340, antejavor, and matea16 authored
Improve failure mode, add multiple DCs (#1273)
* Improve failure mode, add multiple DCs
* Remove crash-stop, add omission faults
* Document slower failover time

---------

Co-authored-by: Ante Javor <ante.javor@memgraph.io>
Co-authored-by: Matea Pesic <80577904+matea16@users.noreply.github.com>
1 parent 0f33d57 commit 6092e01

1 file changed (+8, -6 lines changed)


pages/clustering/high-availability.mdx

Lines changed: 8 additions & 6 deletions
@@ -635,8 +635,8 @@ the wrong state of other clusters, it can become a leader without being connecte
 
 ## Recovering from errors
 
-Distributed systems can fail in numerous ways. With the current implementation, Memgraph instances are resilient to occasional network
-failures and independent machine failures. Byzantine failures aren't handled since the Raft consensus protocol cannot deal with them either.
+Distributed systems can fail in numerous ways. Memgraph processes are resilient to network
+failures, omission faults and independent machine failures. Byzantine failures aren't tolerated since the Raft consensus protocol cannot deal with them either.
 
 Recovery Time Objective (RTO) is an often used term for measuring the maximum tolerable length of time that an instance or cluster can be down.
 Since every highly available Memgraph cluster has two types of instances, we need to analyze the failures of each separately.
@@ -652,9 +652,6 @@ and the time needed to realize the instance is down (`--instance-down-timeout-se
 using just a handful of RPC messages (correct time depends on the distance between instances). It is important to mention that the whole failover is performed without the loss of committed data
 if the newly chosen MAIN (previously REPLICA) had all up-to-date data.
 
-Current deployment assumes the existence of only one datacenter, which automatically means that Memgraph won't be available in the case the whole datacenter goes down. We are actively
-working on 2 datacenter (2-DC) architecture.
-
 ## Raft configuration parameters
 
 Several Raft-related parameters are important for the correct functioning of the cluster. The leader coordinator sends a heartbeat
@@ -664,9 +661,14 @@ expiration is set to 2000ms so that cluster can never get into situation where m
 the ability to survive occasional network hiccups without triggering leadership changes.
 
 
+## Data center failure
+
+The architecture we currently use allows us to deploy coordinators in 3 data centers and hence tolerate a failure of the whole data center. Data instances can be freely
+distributed in any way you want between data centers. The failover time will be slightly increased due to the network communication needed.
+
 ## Kubernetes
 
-We support deploying Memgraph HA instances as part of the Kubernetes cluster.
+We support deploying Memgraph HA as part of the Kubernetes cluster through Helm charts.
 You can see example configurations [here](/getting-started/install-memgraph/kubernetes#memgraph-high-availability-helm-chart).
 
 ## Docker Compose
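
Note on the "Data center failure" section added in this commit: the claim that coordinators spread over 3 data centers tolerate the loss of a whole data center follows from Raft's majority rule, since a Raft group stays available only while more than half of its members can reach each other. The sketch below checks that arithmetic for a given placement. It is an illustration only and is not taken from Memgraph; the function names, coordinator names, and data center names are hypothetical.

```python
from collections import Counter


def majority(n: int) -> int:
    """Smallest number of members that forms a Raft majority out of n."""
    return n // 2 + 1


def survives_single_dc_loss(placement: dict[str, str]) -> bool:
    """Return True if a majority of coordinators remains after losing any one data center.

    `placement` maps a (hypothetical) coordinator name to the data center it runs in.
    """
    total = len(placement)
    if total == 0:
        return False
    quorum = majority(total)
    per_dc = Counter(placement.values())
    # Losing a data center removes every coordinator placed in it.
    return all(total - lost >= quorum for lost in per_dc.values())


# Hypothetical placements: one coordinator per data center vs. two in the same one.
three_dcs = {"coord_1": "dc_a", "coord_2": "dc_b", "coord_3": "dc_c"}
two_dcs = {"coord_1": "dc_a", "coord_2": "dc_a", "coord_3": "dc_b"}

print(survives_single_dc_loss(three_dcs))  # True
print(survives_single_dc_loss(two_dcs))    # False
```

With three coordinators the majority is two, so one coordinator per data center keeps a majority alive after any single data center goes down, while placing two of the three in the same data center does not.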
