
Commit 4ff0f39

Merge pull request #302 from iotexproject/guo-patch-1
Create 2025-06-19.md
2 parents f29d347 + 2b35e03 commit 4ff0f39

File tree

1 file changed: +66 -0 lines changed


postmortem/2025-06-19.md

Lines changed: 66 additions & 0 deletions
# 2025-06-19 IoTeX Mainnet Hard Fork Incident Postmortem

## Summary

During the mainnet (wake) hard fork upgrade on June 19, 2025, some delegates ended up with inconsistent staking states, preventing the IoTeX mainnet from reaching consensus. This caused block production to stop for about 3.5 hours. The network was restored by restarting a number of delegate nodes, which forced them to recalculate the staking state and reestablish consensus.
## Background

This hard fork upgrade was designed to reduce the block interval to 2.5 seconds and introduce system staking version 3 ([details](https://github.com/iotexproject/iotex-core/releases/tag/v2.2.0)). The upgrade was scheduled for block height `36,893,881` (Jun-19-2025 01:49:27 AM UTC). However, unexpected issues cropped up during the process, and the chain stopped producing new blocks at height `36,894,600` (Jun-19-2025 02:21:22 AM UTC).

## Root Cause Analysis
### Primary Cause

- While producing block `36,894,601`, nodes had inconsistent in-memory staking states, which led to validation failures for the `PutPollResult` system action in the proposed blocks. As a result, nodes couldn’t get enough endorsements, and consensus broke down.
- The inconsistent staking states happened because the v2.2.0 code didn’t properly initialize the in-memory state of the v3 staking contract: it was missing pre-upgrade v3 staking data. Since nodes upgraded at different times, their in-memory staking states ended up mismatched across the network (a minimal sketch of this failure mode follows this list).
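To make the failure mode concrete, here is a minimal, self-contained Go sketch (illustrative only, not iotex-core code; `StakingView`, `topDelegates`, and `validatePollResult` are made-up names): two nodes derive different delegate lists from different in-memory staking views, so the validating node rejects the proposed `PutPollResult` and withholds its endorsement.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
	"sort"
)

// StakingView is a node's in-memory picture of stake per delegate (hypothetical).
type StakingView map[string]uint64

// topDelegates derives the delegate list a node expects for the next epoch:
// highest stake first, with the name as a tie-breaker for a stable order.
func topDelegates(view StakingView, n int) []string {
	names := make([]string, 0, len(view))
	for name := range view {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		if view[names[i]] != view[names[j]] {
			return view[names[i]] > view[names[j]]
		}
		return names[i] < names[j]
	})
	if len(names) > n {
		names = names[:n]
	}
	return names
}

// validatePollResult mimics the check that failed: the validator recomputes
// the delegate list from its own view and rejects the proposal on mismatch.
func validatePollResult(local StakingView, proposed []string) error {
	if !reflect.DeepEqual(topDelegates(local, len(proposed)), proposed) {
		return errors.New("delegates are not as expected")
	}
	return nil
}

func main() {
	// The proposer loaded the pre-upgrade v3 staking data; this validator did not.
	proposerView := StakingView{"a": 300, "b": 200, "c": 100}
	validatorView := StakingView{"a": 300, "b": 200} // missing delegate "c"

	proposed := topDelegates(proposerView, 3)
	if err := validatePollResult(validatorView, proposed); err != nil {
		// Endorsement is withheld; with enough nodes in this state, consensus stalls.
		fmt.Println("block rejected:", err)
	}
}
```

Real nodes derive their views from the staking indexer rather than a literal map, but the shape of the failure is the same: same proposed block, different local views, not enough endorsements.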
### Technical Details

The problem appeared as a “delegates are not as expected” error during block proposal checks. Logs showed that the delegate arrays in memory didn’t match what was expected, making it clear that staking states weren’t consistent across nodes.
#### Error Log Sample

```
{"level":"error","ts":"2025-06-19T02:21:30.123Z","caller":"factory/workingset.go:936","msg":"Failed to update state.","height":36894601,"error":"delegates are not as expected"}
```
## Impact Assessment

The mainnet stopped producing blocks for about 3.5 hours, so no user transactions could be processed or confirmed during that time.
## Solution

The issue was fixed by restarting the node services, which forced them to reload the correct staking data from the local database. This got all nodes back in sync and restored consensus.
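To illustrate why a plain restart was sufficient, here is a small Go sketch under stated assumptions (the `bucketDB` interface and `rebuildStakingCache` function are hypothetical, not the iotex-core implementation): on startup the node replays the staking buckets already persisted in its local database into a fresh in-memory view, so every restarted node converges on the same state for the same height.

```go
package main

import "fmt"

// bucket is a simplified persisted v3 staking bucket.
type bucket struct {
	Delegate string
	Amount   uint64
}

// bucketDB abstracts the local database that survives a restart.
type bucketDB interface {
	// BucketsUpTo returns all buckets created at or before the given height.
	BucketsUpTo(height uint64) []bucket
}

// rebuildStakingCache replays persisted buckets into a fresh in-memory view,
// which is what a restart effectively forces each node to do.
func rebuildStakingCache(db bucketDB, height uint64) map[string]uint64 {
	view := make(map[string]uint64)
	for _, b := range db.BucketsUpTo(height) {
		view[b.Delegate] += b.Amount
	}
	return view
}

// memDB is a toy stand-in for the local database.
type memDB struct{ buckets []bucket }

func (m memDB) BucketsUpTo(uint64) []bucket { return m.buckets }

func main() {
	db := memDB{buckets: []bucket{{"a", 300}, {"b", 200}, {"c", 100}}}
	fmt.Println(rebuildStakingCache(db, 36894600)) // map[a:300 b:200 c:100]
}
```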
## Timeline

| Time (UTC) | Event | Details |
|-------------------------|-----------------------------|---------------------------------------------------------------------------|
| Jun-09-2025 09:06:15 AM | Deploy v3 staking contract | System staking v3 contract deployed to mainnet |
| Jun-10-2025 02:28:00 AM | Hard fork release | v2.2.0 release announced, notified delegates to upgrade |
| Jun-17-2025 08:05:50 AM | First v3 stake transaction | Stake tx: 824609a46742a07597dc1565c303a056e602292e84378c9a89ab4c271807766e |
| Jun-17-2025 08:24:15 AM | Second v3 stake transaction | Stake tx: d5f88dee74004b3d3f032f6f6104596f7a0d724019c3bffe0c88e5181fc8988f |
| Jun-19-2025 00:45:45 AM | Third v3 stake transaction | Stake tx: 53a9501a7aa470187ff6e617c569d255b262a7bced44143e9a7ee44ff31c8b4b |
| Jun-19-2025 01:49:27 AM | Hard fork activation | Hard fork activated at block height 36,893,881 |
| Jun-19-2025 02:21:22 AM | Block production stopped | Block height 36,894,600 |
| Jun-19-2025 02:31:41 AM | Issue confirmed | An alert went off and the dev team jumped on it right away |
| Jun-19-2025 05:55:00 AM | Solution implemented | The root cause was found and the node services were restarted to fix it |
| Jun-19-2025 05:58:40 AM | Network recovery | Block production resumed successfully |
| Jun-19-2025 07:00:00 AM | System stability confirmed | Network stability and functionality verified |
## What Went Well

- **Debugging Infrastructure:** Having a dedicated server with binary deployment proved valuable for efficient debugging. This setup allowed faster troubleshooting compared to waiting for container image builds (which took ~10 minutes).
- **Data Migration Skills:** The team successfully demonstrated the ability to copy and mount node data from Kubernetes deployments to virtual machines for testing purposes, significantly improving debugging efficiency.
- **Monitoring System:** The alerting system quickly detected that block production had stopped, enabling a prompt response to the incident.
- **Node Auto Restart:** A node automatically restarts itself when it detects a clear inconsistency between its local state and consensus (sketched below).
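The auto-restart behavior might look roughly like the following Go sketch (purely illustrative; `localDelegates` and `consensusDelegates` are placeholders, not iotex-core functions): when the locally derived delegate list clearly disagrees with what consensus has finalized, the process exits so its supervisor (systemd, Kubernetes, etc.) restarts it with a freshly rebuilt state.

```go
package main

import (
	"log"
	"os"
	"reflect"
)

// Placeholders for the node's own delegate list and the one consensus finalized.
func localDelegates() []string     { return []string{"a", "b"} }
func consensusDelegates() []string { return []string{"a", "b", "c"} }

func main() {
	if !reflect.DeepEqual(localDelegates(), consensusDelegates()) {
		log.Println("state/consensus mismatch detected, restarting node")
		os.Exit(1) // the supervisor restarts the service, forcing a state reload
	}
}
```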
## What Went Wrong

- The new staking infrastructure should have been activated only after the hard fork, which would have made the upgrade smoother.
- Some delegate nodes upgraded too late to pick up the new staking state correctly.
## Follow-up Action Plan

- [ ] Improve the v3 staking indexer so that starting from a lagging block height no longer results in an incorrect in-memory staking state (see the sketch below).
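One possible shape for that fix, sketched in Go under heavy assumptions (the `stakingIndexer` type and `catchUp` method are hypothetical, not an existing iotex-core API): an indexer whose tip lags behind the chain tip first replays the missing blocks into its in-memory view before serving any state.

```go
package main

import "fmt"

// stakingIndexer tracks an in-memory staking view and the height it reflects.
type stakingIndexer struct {
	tip  uint64
	view map[string]uint64
}

// catchUp replays the blocks between the indexer's tip and the chain tip.
// applyBlock stands in for extracting staking actions from each block.
func (idx *stakingIndexer) catchUp(chainTip uint64, applyBlock func(h uint64, view map[string]uint64)) {
	for h := idx.tip + 1; h <= chainTip; h++ {
		applyBlock(h, idx.view)
		idx.tip = h
	}
}

func main() {
	// The indexer starts three blocks behind the chain tip.
	idx := &stakingIndexer{tip: 36893880, view: map[string]uint64{}}
	idx.catchUp(36893883, func(h uint64, view map[string]uint64) {
		view["delegate-x"]++ // placeholder for the real staking actions in block h
	})
	fmt.Println(idx.tip, idx.view) // 36893883 map[delegate-x:3]
}
```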
