Bugfix: force reset causes data loss and deletion of durability files in HA environment #1174

Merged
merged 4 commits into from Mar 6, 2025
45 changes: 30 additions & 15 deletions pages/clustering/high-availability.mdx
@@ -470,28 +470,43 @@ listening to MAIN with the given UUID.

#### Force sync of data

On failover, the current logic chooses the most up-to-date instance from all available instances and promotes it to the new MAIN. For the promotion to the MAIN
to succeed, the new MAIN determines whether each REPLICA is behind (has less up-to-date data) or has data that the new MAIN doesn't have.
If a REPLICA has data that the MAIN doesn't have, that REPLICA is in a diverged-from-MAIN state. If at least one REPLICA is in the diverged-from-MAIN
state, failover won't succeed because the MAIN can't replicate data to a diverged-from-MAIN REPLICA.
When choosing a new MAIN in the failover procedure, the instance with the latest commit timestamp for the default database is chosen from the list of available REPLICAs.
If some other instance had more up-to-date data but was down while the failover procedure was choosing the new MAIN, the new MAIN sends it a force sync RPC request
when it rejoins the cluster. The force sync RPC request deletes all current data in all databases on that instance, which then accepts data from the
current MAIN. This way the cluster always follows the current MAIN.
During a failover event, Memgraph selects the most up-to-date, alive instance to
become the new MAIN. The selection process works as follows:
1. From the list of available REPLICA instances, Memgraph chooses the one with
the latest commit timestamp for the default database.
2. If an instance that had more recent data was down during this selection
process, it will not be considered for promotion to MAIN.
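
The selection rule can be pictured with a minimal sketch. This is a conceptual
illustration only, not Memgraph's internal implementation; the instance names
and the commit-timestamp field are assumptions made for the example:

```python
# Conceptual sketch of the failover selection described above (not Memgraph's
# actual code). Names and fields are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    name: str
    alive: bool
    default_db_commit_timestamp: int  # latest commit timestamp for the default database


def pick_new_main(replicas: list[Replica]) -> Optional[Replica]:
    # Only alive replicas are considered; an instance that is down right now
    # is skipped even if it actually holds more recent data.
    candidates = [r for r in replicas if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.default_db_commit_timestamp)


replicas = [
    Replica("instance_1", alive=True, default_db_commit_timestamp=120),
    Replica("instance_2", alive=True, default_db_commit_timestamp=118),
    Replica("instance_3", alive=False, default_db_commit_timestamp=125),  # newest data, but down
]
print(pick_new_main(replicas).name)  # instance_1 is promoted
```

When `instance_3` comes back online it holds commits the new MAIN never
received, which is exactly the situation the recovery process described next
handles.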

If a previously down instance had more up-to-date data but was unavailable
during failover, it will go through a specific recovery process upon rejoining
the cluster:
- The new MAIN will clear the returning replica’s storage.
- The returning replica will then receive all commits from the new MAIN to
synchronize its state.
- The replica's old durability files will be preserved in `.old` directories
  inside the `data_directory/snapshots` and `data_directory/wal` folders,
  allowing admins to manually recover data if needed (see the sketch below for
  one way to locate them).
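
If you need to inspect those preserved files, a short script like the one below
can list them. This is a sketch under the assumptions that the layout matches
the description above (a `.old` directory under both `snapshots` and `wal`) and
that the data directory is at the path shown; adjust it to your deployment:

```python
# List durability files preserved after a force sync, assuming the layout
# described above: data_directory/snapshots/.old and data_directory/wal/.old.
from pathlib import Path

data_directory = Path("/var/lib/memgraph")  # assumed --data-directory value

for subdir in ("snapshots", "wal"):
    old_dir = data_directory / subdir / ".old"
    if not old_dir.is_dir():
        print(f"No preserved files under {old_dir}")
        continue
    for f in sorted(old_dir.iterdir()):
        print(f)  # files an admin can use to manually recover data
```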

Memgraph prioritizes availability over strict consistency (leaning towards AP in
the CAP theorem). While it aims to maintain consistency as much as possible, the
current failover logic can result in a non-zero Recovery Point Objective (RPO),
that is, data loss, because:
- The promoted MAIN might not have received all commits from the previous MAIN
before the failure.
- This design ensures that the MAIN remains writable for the maximum possible
time.

If your environment requires strong consistency and can tolerate write
unavailability, [reach out to
us](https://github.com/memgraph/memgraph/discussions). We are actively exploring
support for a fully synchronous mode.


## Actions on follower coordinators

From follower coordinators you can only execute `SHOW INSTANCES`. Registering data instances, unregistering data instances, demoting an instance, setting an instance to MAIN, and
force resetting the cluster state are all disabled.
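
For example, you can still inspect the cluster through a follower coordinator
over Bolt. The sketch below uses the Neo4j Python driver; the coordinator
address and the empty credentials are assumptions about your deployment, and
any Bolt client (such as mgconsole) works the same way:

```python
# Sketch: inspecting the cluster from a follower coordinator over Bolt.
# The address, port, and empty credentials are deployment-specific assumptions.
from neo4j import GraphDatabase

FOLLOWER_COORDINATOR = "bolt://localhost:7692"  # hypothetical follower coordinator address

with GraphDatabase.driver(FOLLOWER_COORDINATOR, auth=("", "")) as driver:
    with driver.session() as session:
        # SHOW INSTANCES is the only cluster command allowed on a follower.
        for record in session.run("SHOW INSTANCES"):
            print(record.data())
        # Cluster-modifying commands (registering or unregistering instances,
        # demoting an instance, setting an instance to MAIN, or force resetting
        # the cluster state) are disabled here and will fail.
```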

<Callout type="warning">

Under certain extreme scenarios, the current implementation of HA could lead to a Recovery Point Objective (RPO) != 0 (i.e., data loss). These are environments with a high volume of
transactions where data is constantly changed, added, and deleted. If you are operating in such scenarios, please open an issue on [GitHub](https://github.com/memgraph/memgraph/issues),
as we are eager to expand our support for this kind of workload.

</Callout>

## Instances' restart
