Skip to content

Conversation

@samliok
Copy link
Collaborator

@samliok samliok commented Mar 4, 2025

the core problem was runCrashAndRestartExecution was being called too early. The test flakes occur when the crashed node clones the WAL before the original epoch finishes writing to it.

Here's an example of when our node would stall. The println "wal cloned" is called right before we clone the wal. Reboot started is printed before e.Start() on the crashed node is called. Even after Reboot started is printed, the original epoch is still performing actions.

wal cloned
=== RUN   TestEpochLeaderFailoverWithEmptyNotarization/TestEpochLeaderFailoverWithEmptyNotarization-with-crash#19
[03-04|15:38:48.817] DEBUG Simplex/epoch.go:1657 Wrote block to WAL {"test": "TestEpochLeaderFailoverWithEmptyNotarization", "node": 1, "round": 4, "size": 125, "digest": "dbd816d0a0876d8ee5e6..."}
reboot started
[03-04|15:38:48.817] DEBUG Simplex/epoch.go:2001 No future messages received for this round {"test": "TestEpochLeaderFailoverWithEmptyNotarization", "node": 1, "from": "01", "round": 4}
[03-04|15:38:48.817] DEBUG Simplex/epoch.go:2001 No future messages received for this round {"test": "TestEpochLeaderFailoverWithEmptyNotarization", "node": 1, "from": "02", "round": 4}
[03-04|15:38:48.817] DEBUG Simplex/epoch.go:2001 No future messages received for this round {"test": "TestEpochLeaderFailoverWithEmptyNotarization", "node": 1, "from": "03", "round": 4}
[03-04|15:38:48.817] INFO Simplex/epoch.go:208 Block Proposal Recovered From WAL {"test": "TestEpochLeaderFailoverWithEmptyNotarization/TestEpochLeaderFailoverWithEmptyNotarization-with-crash#19", "node": 0, "Round": 0}
[03-04|15:38:48.817] DEBUG Simplex/epoch.go:2001 No future messages received for this round {"test": "TestEpochLeaderFailoverWithEmptyNotarization", "node": 1, "from": "04", "round": 4}
[03-04|15:38:48.817] DEBUG Simplex/epoch.go:1679 Proposal broadcast {"test": "TestEpochLeaderFailoverWithEmptyNotarization", "node": 1, "round": 4, "size": 125, "digest": "dbd816d0a0876d8ee5e6..."}
[03-04|15:38:48.817] DEBUG Simplex/sched.go:95 No ready tasks, going to sleep {"test": "TestEpochLeaderFailoverWithEmptyNotarization/TestEpochLeaderFailoverWithEmptyNotarization-with-crash#19", "node": 0}
[03-04|15:38:48.817] INFO Simplex/epoch.go:222 Notarization Recovered From WAL {"test": "TestEpochLeaderFailoverWithEmptyNotarization/TestEpochLeaderFailoverWithEmptyNotarization-with-crash#19", "node": 0, "Round": 0}
[03-04|15:38:48.817] INFO Simplex/epoch.go:208 Block Proposal Recovered From WAL {"test": "TestEpochLeaderFailoverWithEmptyNotarization/TestEpochLeaderFailoverWithEmptyNotarization-with-crash#19", "node": 0, "Round": 1}
[03-04|15:38:48.817] INFO Simplex/epoch.go:222 Notarization Recovered From WAL {"test": "TestEpochLeaderFailoverWithEmptyNotarization/TestEpochLeaderFailoverWithEmptyNotarization-with-crash#19", "node": 0, "Round": 1}
[03-04|15:38:48.817] DEBUG Simplex/epoch.go:652 Received vote message {"test": "TestEpochLeaderFailoverWithEmptyNotarization", "node": 1, "from": "01", "round": 4, "digest": "dbd816d0a0876d8ee5e6..."}
[03-04|15:38:48.817] INFO Simplex/epoch.go:208 Block Proposal Recovered From WAL {"test": "TestEpochLeaderFailoverWithEmptyNotarization/TestEpochLeaderFailoverWithEmptyNotarization-with-crash#19", "node": 0, "Round": 2}
[03-04|15:38:48.817] INFO Simplex/epoch.go:222 Notarization Recovered From WAL {"test": "TestEpochLeaderFailoverWithEmptyNotarization/TestEpochLeaderFailoverWithEmptyNotarization-with-crash#19", "node": 0, "Round": 2}
[03-04|15:38:48.817] DEBUG Simplex/epoch.go:1011 Counting votes {"test": "TestEpochLeaderFailoverWithEmptyNotarization", "node": 1, "round": 4, "votes": 1, "from": "[01]"}
[03-04|15:38:48.817] DEBUG Simplex/epoch.go:2001 No future messages received for this round {"test": "TestEpochLeaderFailoverWithEmptyNotarization", "node": 1, "from": "01", "round": 4}
[03-04|15:38:48.817] DEBUG Simplex/epoch.go:2001 No future messages received for this round {"test": "TestEpochLeaderFailoverWithEmptyNotarization", "node": 1, "from": "02", "round": 4}
[03-04|15:38:48.817] DEBUG Simplex/epoch.go:2001 No future messages received for this round {"test": "TestEpochLeaderFailoverWithEmptyNotarization", "node": 1, "from": "03", "round": 4}
[03-04|15:38:48.817] DEBUG Simplex/epoch.go:1624 Scheduling block building {"test": "TestEpochLeaderFailoverWithEmptyNotarization/TestEpochLeaderFailoverWithEmptyNotarization-with-crash#19", "node": 0, "round": 4}
[03-04|15:38:48.817] DEBUG Simplex/sched.go:137 Scheduling new ready task {"test": "TestEpochLeaderFailoverWithEmptyNotarization/TestEpochLeaderFailoverWithEmptyNotarization-with-crash#19", "node": 0, "dependency": "6a1e9e62e42add29c8e2..."}
reboot complete

// notarize and finalize block for round 1
// we expect the future empty notarization for round 2 to increment the round
func TestEpochLeaderFailoverWithEmptyNotarization(t *testing.T) {
for range 100 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to the log, the test fails all the time:

2025-03-05T00:37:57.0108053Z         epoch_test.go:314: 
2025-03-05T00:37:57.0108548Z             	Error Trace:	/home/runner/work/Simplex/Simplex/epoch_test.go:314
2025-03-05T00:37:57.0109241Z             	            				/home/runner/work/Simplex/Simplex/epoch_failover_test.go:125
2025-03-05T00:37:57.0109920Z             	            				/home/runner/work/Simplex/Simplex/epoch_failover_test.go:783
2025-03-05T00:37:57.0110104Z             	Error:      	Not equal: 
2025-03-05T00:37:57.0110332Z             	            	expected: 0x3
2025-03-05T00:37:57.0110593Z             	            	actual  : 0x2
2025-03-05T00:37:57.0111325Z             	Test:       	TestEpochLeaderFailoverWithEmptyNotarization

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh oops, i had that fix but it wasn't pushed. also didn't mean to push the for loop

md.Seq = seq
_, ok := bb.BuildBlock(context.Background(), md)
require.True(t, ok)
require.Equal(t, md.Round, md.Seq)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since empty notarizations will cause these to go out of sync

@yacovm yacovm merged commit ff612ce into main Mar 5, 2025
5 checks passed
@samliok samliok deleted the flake-fix branch March 20, 2025 22:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants