
Commit 4ff0f39

Merge pull request #302 from iotexproject/guo-patch-1
Create 2025-06-19.md
2 parents f29d347 + 2b35e03 commit 4ff0f39

File tree

1 file changed: +66 -0 lines changed


postmortem/2025-06-19.md

Lines changed: 66 additions & 0 deletions
# 2025-06-19 IoTeX Mainnet Hard Fork Incident Postmortem

## Summary

During the mainnet (wake) hard fork upgrade on June 19, 2025, some delegates ended up with inconsistent staking states, preventing the IoTeX mainnet from reaching consensus. This caused block production to stop for about 3.5 hours. The network was restored by restarting a number of delegate nodes, which forced them to recalculate the staking state and reestablish consensus.
## Background

This hard fork upgrade was designed to reduce the block interval to 2.5 seconds and introduce system staking version 3 ([details](https://github.com/iotexproject/iotex-core/releases/tag/v2.2.0)). The upgrade was scheduled for block height `36,893,881` (Jun-19-2025 01:49:27 AM UTC). However, unexpected issues cropped up during the process, and the chain stopped producing new blocks at height `36,894,600` (Jun-19-2025 02:21:22 AM UTC).

## Root Cause Analysis
### Primary Cause

- While producing block `36,894,601`, nodes had inconsistent in-memory staking states, which led to validation failures for the `PutPollResult` system action in the proposed blocks. As a result, nodes couldn’t get enough endorsements, and consensus broke down.
- The inconsistent staking states happened because the v2.2.0 code didn’t properly initialize the in-memory state of the v3 staking contract: it was missing pre-upgrade v3 staking data. Since nodes upgraded at different times, their in-memory staking states ended up mismatched across the network (a minimal sketch of this failure mode follows this list).
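To make the failure mode concrete, here is a minimal, self-contained Go sketch (illustrative only, not iotex-core code; `StakingView`, `topDelegates`, and `validatePollResult` are made-up names): two nodes derive different delegate lists from different in-memory staking views, so the validating node rejects the proposed `PutPollResult` and withholds its endorsement.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
	"sort"
)

// StakingView is a node's in-memory picture of stake per delegate (hypothetical).
type StakingView map[string]uint64

// topDelegates derives the delegate list a node expects for the next epoch:
// highest stake first, with the name as a tie-breaker for a stable order.
func topDelegates(view StakingView, n int) []string {
	names := make([]string, 0, len(view))
	for name := range view {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		if view[names[i]] != view[names[j]] {
			return view[names[i]] > view[names[j]]
		}
		return names[i] < names[j]
	})
	if len(names) > n {
		names = names[:n]
	}
	return names
}

// validatePollResult mimics the check that failed: the validator recomputes
// the delegate list from its own view and rejects the proposal on mismatch.
func validatePollResult(local StakingView, proposed []string) error {
	if !reflect.DeepEqual(topDelegates(local, len(proposed)), proposed) {
		return errors.New("delegates are not as expected")
	}
	return nil
}

func main() {
	// The proposer loaded the pre-upgrade v3 staking data; this validator did not.
	proposerView := StakingView{"a": 300, "b": 200, "c": 100}
	validatorView := StakingView{"a": 300, "b": 200} // missing delegate "c"

	proposed := topDelegates(proposerView, 3)
	if err := validatePollResult(validatorView, proposed); err != nil {
		// Endorsement is withheld; with enough nodes in this state, consensus stalls.
		fmt.Println("block rejected:", err)
	}
}
```

Real nodes derive their views from the staking indexer rather than a literal map, but the shape of the failure is the same: same proposed block, different local views, not enough endorsements.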
### Technical Details

The problem appeared as a “delegates are not as expected” error during block proposal checks. Logs showed that the delegate arrays in memory didn’t match what was expected, making it clear that staking states weren’t consistent across nodes.
#### Error Log Sample

```
{"level":"error","ts":"2025-06-19T02:21:30.123Z","caller":"factory/workingset.go:936","msg":"Failed to update state.","height":36894601,"error":"delegates are not as expected"}
```
## Impact Assessment

The mainnet stopped producing blocks for about 3.5 hours, so no user transactions could be processed or confirmed during that time.
## Solution

The issue was fixed by restarting the node services, which forced them to reload the correct staking data from the local database. This got all nodes back in sync and restored consensus.
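To illustrate why a plain restart was sufficient, here is a small Go sketch under stated assumptions (the `bucketDB` interface and `rebuildStakingCache` function are hypothetical, not the iotex-core implementation): on startup the node replays the staking buckets already persisted in its local database into a fresh in-memory view, so every restarted node converges on the same state for the same height.

```go
package main

import "fmt"

// bucket is a simplified persisted v3 staking bucket.
type bucket struct {
	Delegate string
	Amount   uint64
}

// bucketDB abstracts the local database that survives a restart.
type bucketDB interface {
	// BucketsUpTo returns all buckets created at or before the given height.
	BucketsUpTo(height uint64) []bucket
}

// rebuildStakingCache replays persisted buckets into a fresh in-memory view,
// which is what a restart effectively forces each node to do.
func rebuildStakingCache(db bucketDB, height uint64) map[string]uint64 {
	view := make(map[string]uint64)
	for _, b := range db.BucketsUpTo(height) {
		view[b.Delegate] += b.Amount
	}
	return view
}

// memDB is a toy stand-in for the local database.
type memDB struct{ buckets []bucket }

func (m memDB) BucketsUpTo(uint64) []bucket { return m.buckets }

func main() {
	db := memDB{buckets: []bucket{{"a", 300}, {"b", 200}, {"c", 100}}}
	fmt.Println(rebuildStakingCache(db, 36894600)) // map[a:300 b:200 c:100]
}
```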
## Timeline

| Time (UTC) | Event | Details |
|-------------------------|-----------------------------|---------------------------------------------------------------------------|
| Jun-09-2025 09:06:15 AM | Deploy v3 staking contract | System staking v3 contract deployed to mainnet |
| Jun-10-2025 02:28:00 AM | Hard fork release | v2.2.0 release announced, notified delegates to upgrade |
| Jun-17-2025 08:05:50 AM | First v3 stake transaction | Stake tx: 824609a46742a07597dc1565c303a056e602292e84378c9a89ab4c271807766e |
| Jun-17-2025 08:24:15 AM | Second v3 stake transaction | Stake tx: d5f88dee74004b3d3f032f6f6104596f7a0d724019c3bffe0c88e5181fc8988f |
| Jun-19-2025 00:45:45 AM | Third v3 stake transaction | Stake tx: 53a9501a7aa470187ff6e617c569d255b262a7bced44143e9a7ee44ff31c8b4b |
| Jun-19-2025 01:49:27 AM | Hard fork activation | Hard fork activated at block height 36,893,881 |
| Jun-19-2025 02:21:22 AM | Block production stopped | Block height 36,894,600 |
| Jun-19-2025 02:31:41 AM | Issue confirmed | An alert went off and the dev team jumped on it right away |
| Jun-19-2025 05:55:00 AM | Solution implemented | The root cause was found and the node services were restarted to fix it |
| Jun-19-2025 05:58:40 AM | Network recovery | Block production resumed successfully |
| Jun-19-2025 07:00:00 AM | System stability confirmed | Network stability and functionality verified |
## What Went Well

- **Debugging Infrastructure:** Having a dedicated server with binary deployment proved valuable for efficient debugging. This setup allowed faster troubleshooting compared to waiting for container image builds (which took ~10 minutes).
- **Data Migration Skills:** The team successfully demonstrated the ability to copy and mount node data from Kubernetes deployments to virtual machines for testing purposes, significantly improving debugging efficiency.
- **Monitoring System:** The alerting system quickly detected that block production had stopped, enabling a prompt response to the incident.
- **Node Auto Restart:** A node automatically restarts itself when it detects a clear inconsistency between its local state and consensus (sketched below).
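The auto-restart behavior might look roughly like the following Go sketch (purely illustrative; `localDelegates` and `consensusDelegates` are placeholders, not iotex-core functions): when the locally derived delegate list clearly disagrees with what consensus has finalized, the process exits so its supervisor (systemd, Kubernetes, etc.) restarts it with a freshly rebuilt state.

```go
package main

import (
	"log"
	"os"
	"reflect"
)

// Placeholders for the node's own delegate list and the one consensus finalized.
func localDelegates() []string     { return []string{"a", "b"} }
func consensusDelegates() []string { return []string{"a", "b", "c"} }

func main() {
	if !reflect.DeepEqual(localDelegates(), consensusDelegates()) {
		log.Println("state/consensus mismatch detected, restarting node")
		os.Exit(1) // the supervisor restarts the service, forcing a state reload
	}
}
```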
## What Went Wrong

- The new staking infrastructure should have been activated only after the hard fork, which would have made the upgrade smoother.
- Some delegate nodes upgraded too late to pick up the new staking state correctly.
## Follow-up Action Plan

- [ ] Improve the v3 staking indexer so that starting from a lagging block height no longer results in an incorrect in-memory staking state (see the sketch below).
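One possible shape for that fix, sketched in Go under heavy assumptions (the `stakingIndexer` type and `catchUp` method are hypothetical, not an existing iotex-core API): an indexer whose tip lags behind the chain tip first replays the missing blocks into its in-memory view before serving any state.

```go
package main

import "fmt"

// stakingIndexer tracks an in-memory staking view and the height it reflects.
type stakingIndexer struct {
	tip  uint64
	view map[string]uint64
}

// catchUp replays the blocks between the indexer's tip and the chain tip.
// applyBlock stands in for extracting staking actions from each block.
func (idx *stakingIndexer) catchUp(chainTip uint64, applyBlock func(h uint64, view map[string]uint64)) {
	for h := idx.tip + 1; h <= chainTip; h++ {
		applyBlock(h, idx.view)
		idx.tip = h
	}
}

func main() {
	// The indexer starts three blocks behind the chain tip.
	idx := &stakingIndexer{tip: 36893880, view: map[string]uint64{}}
	idx.catchUp(36893883, func(h uint64, view map[string]uint64) {
		view["delegate-x"]++ // placeholder for the real staking actions in block h
	})
	fmt.Println(idx.tip, idx.view) // 36893883 map[delegate-x:3]
}
```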
