What happened:
Today we had a production outage of a couple of minutes, which we suspect happened because our stolon cluster did not fail over correctly.
I manually fixed the problem by restarting a keeper, but another keeper is still in an unhealthy state from the cluster's perspective, while the keeper itself seems to think it is fine.
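For reference, the manual fix was simply restarting the keeper unit on the affected node, roughly like this (the unit name is from our systemd setup, shown further below):

```sh
# Restart the stuck stolon keeper on the affected node (unit name from our setup).
systemctl restart stolon-keeper.service
```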
`stolonctl status` shows keeper `node_6` as `HEALTHY = false`, `PG HEALTHY = false`, and `PG LISTENADDRESS = (unknown)`:
Click to expand full `stolonctl status` output
# stolonctl status
=== Active sentinels ===
ID LEADER
778771af false
7c12d47f false
bfe3a503 true
d5cd2ad2 false
=== Active proxies ===
ID
26803fc5
295e41dc
62ae83e9
651751d3
c431e17d
c939976b
e0d69a43
=== Keepers ===
UID HEALTHY PG LISTENADDRESS PG HEALTHY PG WANTEDGENERATION PG CURRENTGENERATION
node_4 true 10.0.0.4:5432 true 2 2
node_5 true 10.0.0.5:5432 true 5 5
node_6 false (unknown) false 2 0
=== Cluster Info ===
Master Keeper: node_5
===== Keepers/DB tree =====
node_5 (master)
├─node_6
└─node_4
For reading convenience, here is the Keepers table formatted nicely:
=== Keepers ===
UID     HEALTHY  PG LISTENADDRESS  PG HEALTHY  PG WANTEDGENERATION  PG CURRENTGENERATION
node_4  true     10.0.0.4:5432     true        2                    2
node_5  true     10.0.0.5:5432     true        5                    5
node_6  false    (unknown)         false       2                    0
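The same health fields can also be pulled straight out of the cluster data; a sketch (cluster name and store backend taken from our keeper command line below, adjust as needed):

```sh
# List each keeper's health as recorded in the cluster data.
stolonctl --cluster-name stolon-cluster --store-backend=consul clusterdata read \
  | jq '.keepers | to_entries[]
        | {uid: .key,
           healthy: (.value.status.healthy // false),
           lastHealthyTime: .value.status.lastHealthyTime}'
```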
Click to expand full `stolonctl spec` output
{
"sleepInterval": "2s",
"requestTimeout": "4s",
"failInterval": "8s",
"synchronousReplication": true,
"minSynchronousStandbys": 1,
"maxSynchronousStandbys": 2,
"usePgrewind": true,
"initMode": "new",
"pgParameters": {
"datestyle": "iso, mdy",
"default_text_search_config": "pg_catalog.english",
"dynamic_shared_memory_type": "posix",
"lc_messages": "en_US.UTF-8",
"lc_monetary": "en_US.UTF-8",
"lc_numeric": "en_US.UTF-8",
"lc_time": "en_US.UTF-8",
"log_min_duration_statement": "10",
"log_timezone": "UTC",
"logging_collector": "on",
"max_connections": "100",
"max_wal_size": "1GB",
"min_wal_size": "80MB",
"random_page_cost": "1.1",
"shared_buffers": "128MB",
"timezone": "UTC",
"wal_level": "replica",
"work_mem": "65536"
}
}
Click to expand full `stolonctl clusterdata read | jq .` output
{
"formatVersion": 1,
"changeTime": "2023-07-26T12:41:50.048697008Z",
"cluster": {
"uid": "735e8856",
"generation": 1,
"changeTime": "2023-07-26T11:41:07.029419632Z",
"spec": {
"sleepInterval": "2s",
"requestTimeout": "4s",
"failInterval": "8s",
"synchronousReplication": true,
"minSynchronousStandbys": 1,
"maxSynchronousStandbys": 2,
"additionalWalSenders": null,
"additionalMasterReplicationSlots": null,
"usePgrewind": true,
"initMode": "new",
"pgParameters": {
"datestyle": "iso, mdy",
"default_text_search_config": "pg_catalog.english",
"dynamic_shared_memory_type": "posix",
"lc_messages": "en_US.UTF-8",
"lc_monetary": "en_US.UTF-8",
"lc_numeric": "en_US.UTF-8",
"lc_time": "en_US.UTF-8",
"log_min_duration_statement": "10",
"log_timezone": "UTC",
"logging_collector": "on",
"max_connections": "100",
"max_wal_size": "1GB",
"min_wal_size": "80MB",
"random_page_cost": "1.1",
"shared_buffers": "128MB",
"timezone": "UTC",
"wal_level": "replica",
"work_mem": "65536"
},
"pgHBA": null,
"automaticPgRestart": null
},
"status": {
"phase": "normal",
"master": "3397fb23"
}
},
"keepers": {
"node_4": {
"uid": "node_4",
"generation": 1,
"changeTime": "2023-07-26T12:41:50.048761009Z",
"spec": {},
"status": {
"healthy": true,
"lastHealthyTime": "2023-07-26T12:41:50.046055293Z",
"bootUUID": "e687f779-361c-4213-a5f4-157f352ed63a",
"postgresBinaryVersion": {
"Maj": 14,
"Min": 1
},
"canBeMaster": true,
"canBeSynchronousReplica": true
}
},
"node_5": {
"uid": "node_5",
"generation": 1,
"changeTime": "2023-07-26T12:41:50.048755779Z",
"spec": {},
"status": {
"healthy": true,
"lastHealthyTime": "2023-07-26T12:41:50.046055443Z",
"bootUUID": "9b604f18-c48a-42f4-a774-cd49795e0e89",
"postgresBinaryVersion": {
"Maj": 14,
"Min": 1
},
"canBeMaster": true,
"canBeSynchronousReplica": true
}
},
"node_6": {
"uid": "node_6",
"generation": 1,
"changeTime": "2023-07-24T14:10:46.921750869Z",
"spec": {},
"status": {
"lastHealthyTime": "2023-07-24T14:10:44.902643198Z",
"bootUUID": "2c4a6ad5-1ed8-4a55-aa25-0c30cb2a4acc",
"postgresBinaryVersion": {
"Maj": 14,
"Min": 1
},
"canBeMaster": true,
"canBeSynchronousReplica": true
}
}
},
"dbs": {
"32e06e90": {
"uid": "32e06e90",
"generation": 2,
"changeTime": "2023-07-26T11:41:11.057249694Z",
"spec": {
"keeperUID": "node_6",
"requestTimeout": "4s",
"maxStandbys": 20,
"usePgrewind": true,
"additionalWalSenders": 5,
"additionalReplicationSlots": null,
"initMode": "resync",
"pgParameters": {
"datestyle": "iso, mdy",
"default_text_search_config": "pg_catalog.english",
"dynamic_shared_memory_type": "posix",
"lc_messages": "en_US.UTF-8",
"lc_monetary": "en_US.UTF-8",
"lc_numeric": "en_US.UTF-8",
"lc_time": "en_US.UTF-8",
"log_min_duration_statement": "10",
"log_timezone": "UTC",
"logging_collector": "on",
"max_connections": "100",
"max_wal_size": "1GB",
"min_wal_size": "80MB",
"random_page_cost": "1.1",
"shared_buffers": "128MB",
"timezone": "UTC",
"wal_level": "replica",
"work_mem": "65536"
},
"pgHBA": null,
"role": "standby",
"followConfig": {
"type": "internal",
"dbuid": "3397fb23"
},
"followers": [],
"synchronousStandbys": null,
"externalSynchronousStandbys": null
},
"status": {
"synchronousStandbys": null
}
},
"3397fb23": {
"uid": "3397fb23",
"generation": 5,
"changeTime": "2023-07-26T12:41:50.048753905Z",
"spec": {
"keeperUID": "node_5",
"requestTimeout": "4s",
"maxStandbys": 20,
"synchronousReplication": true,
"usePgrewind": true,
"additionalWalSenders": 5,
"additionalReplicationSlots": null,
"initMode": "none",
"pgParameters": {
"datestyle": "iso, mdy",
"default_text_search_config": "pg_catalog.english",
"dynamic_shared_memory_type": "posix",
"lc_messages": "en_US.UTF-8",
"lc_monetary": "en_US.UTF-8",
"lc_numeric": "en_US.UTF-8",
"lc_time": "en_US.UTF-8",
"log_min_duration_statement": "10",
"log_timezone": "UTC",
"logging_collector": "on",
"max_connections": "100",
"max_wal_size": "1GB",
"min_wal_size": "80MB",
"random_page_cost": "1.1",
"shared_buffers": "128MB",
"timezone": "UTC",
"wal_level": "replica",
"work_mem": "65536"
},
"pgHBA": null,
"role": "master",
"followers": [
"32e06e90",
"c6cb2f35"
],
"synchronousStandbys": [
"c6cb2f35"
],
"externalSynchronousStandbys": []
},
"status": {
"healthy": true,
"currentGeneration": 5,
"listenAddress": "10.0.0.5",
"port": "5432",
"systemdID": "7191260299357525396",
"timelineID": 31,
"xLogPos": 411870652720,
"timelinesHistory": [
{
"timelineID": 29,
"switchPoint": 365271653264,
"reason": "no recovery target specified"
},
{
"timelineID": 30,
"switchPoint": 409000508648,
"reason": "no recovery target specified"
}
],
"pgParameters": {
"datestyle": "iso, mdy",
"default_text_search_config": "pg_catalog.english",
"dynamic_shared_memory_type": "posix",
"lc_messages": "en_US.UTF-8",
"lc_monetary": "en_US.UTF-8",
"lc_numeric": "en_US.UTF-8",
"lc_time": "en_US.UTF-8",
"log_min_duration_statement": "10",
"log_timezone": "UTC",
"logging_collector": "on",
"max_connections": "100",
"max_wal_size": "1GB",
"min_wal_size": "80MB",
"random_page_cost": "1.1",
"shared_buffers": "128MB",
"timezone": "UTC",
"wal_level": "replica",
"work_mem": "65536"
},
"synchronousStandbys": [
"c6cb2f35"
],
"olderWalFile": "0000001F0000005F000000D1"
}
},
"c6cb2f35": {
"uid": "c6cb2f35",
"generation": 2,
"changeTime": "2023-07-26T12:41:50.048725131Z",
"spec": {
"keeperUID": "node_4",
"requestTimeout": "4s",
"maxStandbys": 20,
"usePgrewind": true,
"additionalWalSenders": 5,
"additionalReplicationSlots": null,
"initMode": "none",
"pgParameters": {
"datestyle": "iso, mdy",
"default_text_search_config": "pg_catalog.english",
"dynamic_shared_memory_type": "posix",
"lc_messages": "en_US.UTF-8",
"lc_monetary": "en_US.UTF-8",
"lc_numeric": "en_US.UTF-8",
"lc_time": "en_US.UTF-8",
"log_min_duration_statement": "10",
"log_timezone": "UTC",
"logging_collector": "on",
"max_connections": "100",
"max_wal_size": "1GB",
"min_wal_size": "80MB",
"random_page_cost": "1.1",
"shared_buffers": "128MB",
"timezone": "UTC",
"wal_level": "replica",
"work_mem": "65536"
},
"pgHBA": null,
"role": "standby",
"followConfig": {
"type": "internal",
"dbuid": "3397fb23"
},
"followers": [],
"synchronousStandbys": null,
"externalSynchronousStandbys": null
},
"status": {
"healthy": true,
"currentGeneration": 2,
"listenAddress": "10.0.0.4",
"port": "5432",
"systemdID": "7191260299357525396",
"timelineID": 31,
"xLogPos": 411870652048,
"timelinesHistory": [
{
"timelineID": 29,
"switchPoint": 365271653264,
"reason": "no recovery target specified"
},
{
"timelineID": 30,
"switchPoint": 409000508648,
"reason": "no recovery target specified"
}
],
"pgParameters": {
"datestyle": "iso, mdy",
"default_text_search_config": "pg_catalog.english",
"dynamic_shared_memory_type": "posix",
"lc_messages": "en_US.UTF-8",
"lc_monetary": "en_US.UTF-8",
"lc_numeric": "en_US.UTF-8",
"lc_time": "en_US.UTF-8",
"log_min_duration_statement": "10",
"log_timezone": "UTC",
"logging_collector": "on",
"max_connections": "100",
"max_wal_size": "1GB",
"min_wal_size": "80MB",
"random_page_cost": "1.1",
"shared_buffers": "128MB",
"timezone": "UTC",
"wal_level": "replica",
"work_mem": "65536"
},
"synchronousStandbys": null,
"olderWalFile": "0000001F0000005F000000C5"
}
}
},
"proxy": {
"generation": 430,
"changeTime": "2023-07-26T11:41:13.06941116Z",
"spec": {
"masterDbUid": "3397fb23",
"enabledProxies": [
"26803fc5",
"295e41dc",
"62ae83e9",
"651751d3",
"c431e17d",
"c939976b",
"e0d69a43"
]
},
"status": {}
}
}
The relevant part here is:
"node_6": {
"status": {
"lastHealthyTime": "2023-07-24T14:10:44.902643198Z",
The `stolon-keeper` process seems to be running fine according to `systemctl status stolon-keeper.service`:
● stolon-keeper.service
Loaded: loaded (/etc/systemd/system/stolon-keeper.service; enabled; vendor preset: enabled)
Active: active (running) since Sat 2023-04-29 11:39:35 UTC; 2 months 27 days ago
Main PID: 2105 (keeper)
IP: 859.8G in, 11.9T out
IO: 6.8T read, 2.7T written
Tasks: 32 (limit: 154456)
Memory: 1.5G
CPU: 6d 23h 5min 51.631s
CGroup: /system.slice/stolon-keeper.service
├─ 2105 /nix/store/61syngpz7igk7hh84fcpxn1ywpdgbprb-stolon/bin/keeper --cluster-name stolon-cluster --store-backend=consul --uid node_6 --data-dir /var/lib/postgres/node_6 --pg-listen-address=10.0.0.6 --pg-bin-path=/nix/store/swc4d2nvp800dibijq0>
├─3516988 /nix/store/swc4d2nvp800dibijq09ksbw26nwg6yp-postgresql-14.1/bin/postgres -D /var/lib/postgres/node_6/postgres -c unix_socket_directories=/tmp
├─3516989 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ">
├─3516990 "postgres: startup recovering 0000001F0000005F000000AF waiting" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "">
├─3516991 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ">
├─3516992 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "">
├─3516993 "postgres: stats collector " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ">
├─3517052 "postgres: walreceiver streaming 5F/AFACADF8" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "">
└─3542888 "postgres: postgres postgres 10.0.0.6(43110) COPY" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" >
Jul 26 12:20:19 node-6 keeper[2105]: 2023-07-26T12:20:19.821Z INFO cmd/keeper.go:1676 postgres parameters not changed
Jul 26 12:20:19 node-6 keeper[2105]: 2023-07-26T12:20:19.821Z INFO cmd/keeper.go:1703 postgres hba entries not changed
Jul 26 12:20:21 node-6 keeper[2105]: 2023-07-26T12:20:21.851Z INFO cmd/keeper.go:1557 our db requested role is standby {"followedDB": "3397fb23"}
Jul 26 12:20:21 node-6 keeper[2105]: 2023-07-26T12:20:21.851Z INFO cmd/keeper.go:1576 already standby
Jul 26 12:20:21 node-6 keeper[2105]: 2023-07-26T12:20:21.865Z INFO cmd/keeper.go:1676 postgres parameters not changed
Jul 26 12:20:21 node-6 keeper[2105]: 2023-07-26T12:20:21.866Z INFO cmd/keeper.go:1703 postgres hba entries not changed
Jul 26 12:20:23 node-6 keeper[2105]: 2023-07-26T12:20:23.895Z INFO cmd/keeper.go:1557 our db requested role is standby {"followedDB": "3397fb23"}
Jul 26 12:20:23 node-6 keeper[2105]: 2023-07-26T12:20:23.895Z INFO cmd/keeper.go:1576 already standby
Jul 26 12:20:23 node-6 keeper[2105]: 2023-07-26T12:20:23.913Z INFO cmd/keeper.go:1676 postgres parameters not changed
Jul 26 12:20:23 node-6 keeper[2105]: 2023-07-26T12:20:23.913Z INFO cmd/keeper.go:1703 postgres hba entries not changed
Its logs show some earlier errors while `pg_rewind` was copying data over, but no indication that anything failed permanently:
- Full logs: node-6-stolon-keeper-logs-journalctl.txt
Relevant sections:
Jul 24 14:10:35 node-6 keeper[2105]: 2023-07-24T14:10:35.171Z INFO cmd/keeper.go:1543 already master
Jul 24 14:10:35 node-6 keeper[2105]: 2023-07-24T14:10:35.176Z INFO cmd/keeper.go:997 dropping replication slot {"slot": "stolon_93596109"}
Jul 24 14:10:35 node-6 keeper[2105]: 2023-07-24T14:10:35.182Z INFO cmd/keeper.go:997 dropping replication slot {"slot": "stolon_97eba8d4"}
Jul 24 14:10:35 node-6 keeper[2105]: 2023-07-24T14:10:35.195Z INFO cmd/keeper.go:1676 postgres parameters not changed
Jul 24 14:10:35 node-6 keeper[2105]: 2023-07-24T14:10:35.195Z INFO cmd/keeper.go:1703 postgres hba entries not changed
Jul 24 14:10:37 node-6 keeper[2105]: 2023-07-24T14:10:37.212Z INFO cmd/keeper.go:1141 current db UID different than cluster data db UID {"db": "cbcb41e1", "cdDB": "32e06e90"}
Jul 24 14:10:37 node-6 keeper[2105]: 2023-07-24T14:10:37.212Z INFO cmd/keeper.go:1296 resyncing the database cluster
Jul 24 14:10:37 node-6 keeper[2105]: 2023-07-24T14:10:37.213Z INFO postgresql/postgresql.go:384 stopping database
Jul 24 14:10:37 node-6 keeper[1690207]: waiting for server to shut down....
Jul 24 14:10:37 node-6 keeper[2105]: 2023-07-24T14:10:37.276Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
Jul 24 14:10:37 node-6 keeper[1690207]: done
Jul 24 14:10:37 node-6 keeper[1690207]: server stopped
Jul 24 14:10:37 node-6 keeper[2105]: 2023-07-24T14:10:37.324Z INFO cmd/keeper.go:899 syncing using pg_rewind {"masterDB": "97eba8d4", "keeper": "node_4"}
Jul 24 14:10:37 node-6 keeper[2105]: 2023-07-24T14:10:37.325Z INFO postgresql/postgresql.go:921 running pg_rewind
Jul 24 14:10:37 node-6 keeper[1690211]: pg_rewind: fetched file "global/pg_control", length 8192
Jul 24 14:10:37 node-6 keeper[1690211]: pg_rewind: fetched file "pg_wal/0000001E.history", length 1289
Jul 24 14:10:37 node-6 keeper[1690211]: pg_rewind: Source timeline history:
Jul 24 14:10:37 node-6 keeper[1690211]: pg_rewind: Target timeline history:
Jul 24 14:10:37 node-6 keeper[1690211]: pg_rewind: 1: 0/0 - 5/B8C7E6D0
...
Jul 24 14:11:18 node-6 keeper[1690211]: pg_rewind: fetched file "global/pg_control", length 8192
Jul 24 14:11:18 node-6 keeper[2105]: 2023-07-24T14:11:18.308Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
Jul 24 14:11:19 node-6 keeper[2105]: 2023-07-24T14:11:19.309Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
Jul 24 14:11:20 node-6 keeper[2105]: 2023-07-24T14:11:20.310Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
Jul 24 14:11:20 node-6 keeper[1690211]: pg_rewind: Done!
Jul 24 14:11:20 node-6 keeper[2105]: 2023-07-24T14:11:20.342Z INFO postgresql/postgresql.go:844 writing standby signal file
Jul 24 14:11:20 node-6 keeper[2105]: 2023-07-24T14:11:20.342Z INFO postgresql/postgresql.go:319 starting database
Jul 24 14:11:20 node-6 keeper[1690299]: 2023-07-24 14:11:20.352 UTC [1690299] LOG: redirecting log output to logging collector process
Jul 24 14:11:20 node-6 keeper[1690299]: 2023-07-24 14:11:20.352 UTC [1690299] HINT: Future log output will appear in directory "log".
Jul 24 14:11:20 node-6 keeper[2105]: 2023-07-24T14:11:20.600Z INFO cmd/keeper.go:1557 our db requested role is standby {"followedDB": "97eba8d4"}
Jul 24 14:11:20 node-6 keeper[2105]: 2023-07-24T14:11:20.600Z INFO cmd/keeper.go:1576 already standby
What you expected to happen:
- Either `stolonctl status` finds that the keeper is up and working, or
- the keeper logs point out that there's a problem, and what it is.
How to reproduce it (as minimally and precisely as possible):
Unclear.
Stolon had been running uninterrupted for 2 months until this happened.
Environment:
- Stolon version: Stolon `master`, commit 4bb4107