Description
Name and Version
bitnami/postgresql-repmgr 16
What steps will reproduce the bug?
- Deploy a 3-node PostgreSQL cluster using the Docker Compose config below
- Wait for the cluster to be fully initialized and synchronized
- Stop one standby node:
  docker exec -ti <standby_container> pg_ctl stop -D /bitnami/postgresql/data -m fast
- Make some changes on the primary node to create WAL divergence
- Attempt to run pg_rewind manually:
  docker exec -ti <standby_container> pg_rewind \
    --target-pgdata=/bitnami/postgresql/data \
    --source-server="host=postgres-0 port=5432 user=repmgr dbname=repmgr" \
    --progress
Are you using any custom parameters or values?
Docker Compose Configuration:
version: '3.8'

x-version-common: &service-common
  image: bitnami/postgresql-repmgr:${POSTGRES_RELEASE:-latest}
  volumes:
    - ${POSTGRES_VOLUME_PATH:?err}/postgres:${POSTGRESQL_DATA_DIR:-/bitnami/postgresql}:Z

x-common-env: &common-env
  BITNAMI_DEBUG: "true"
  POSTGRESQL_FSYNC: "on"
  POSTGRESQL_PASSWORD: ${POSTGRES_PASSWORD:-odoo}
  POSTGRESQL_POSTGRES_PASSWORD: ${ADMIN_POSTGRES_PASSWORD:-postgres}
  POSTGRESQL_USERNAME: ${POSTGRES_USERNAME:-odoo}
  POSTGRESQL_WAL_LEVEL: replica
  POSTGRESQL_SYNCHRONOUS_COMMIT_MODE: "on"
  POSTGRESQL_NUM_SYNCHRONOUS_REPLICAS: 1
  POSTGRESQL_SYNCHRONOUS_REPLICAS_MODE: "FIRST"
  POSTGRESQL_CLUSTER_APP_NAME: "*"
  POSTGRESQL_CONF_DIR: /bitnami/postgresql/data
  REPMGR_DEGRADED_MONITORING_TIMEOUT: 300
  REPMGR_FAILOVER: automatic
  REPMGR_MASTER_RESPONSE_TIMEOUT: 30
  REPMGR_MONITORING_HISTORY: "yes"
  REPMGR_PARTNER_NODES: postgres-0,postgres-1,postgres-2
  REPMGR_PASSWORD: ${REPMGR_PASSWORD:-repmgr}
  REPMGR_PRIMARY_HOST: postgres-0
  REPMGR_PRIMARY_VISIBILITY_CONSENSUS: "true"
  REPMGR_RECONNECT_ATTEMPTS: 10
  REPMGR_RECONNECT_INTERVAL: 10
  REPMGR_USE_PGREWIND: "yes"
  REPMGR_USE_REPLICATION_SLOTS: 1

services:
  postgres-0:
    <<: *service-common
    environment:
      <<: *common-env
      REPMGR_NODE_NAME: postgres-0
      REPMGR_NODE_NETWORK_NAME: postgres-0
      REPMGR_NODE_PRIORITY: 100
    deploy:
      placement:
        constraints: [node.labels.postgres-0 == true]
  postgres-1:
    <<: *service-common
    environment:
      <<: *common-env
      REPMGR_NODE_NAME: postgres-1
      REPMGR_NODE_NETWORK_NAME: postgres-1
      REPMGR_NODE_PRIORITY: 90
    deploy:
      placement:
        constraints: [node.labels.postgres-1 == true]
  postgres-2:
    <<: *service-common
    environment:
      <<: *common-env
      REPMGR_NODE_NAME: postgres-2
      REPMGR_NODE_NETWORK_NAME: postgres-2
      REPMGR_NODE_PRIORITY: 80
    deploy:
      placement:
        constraints: [node.labels.postgres-2 == true]
What is the expected behavior?
- pg_rewind should synchronize a diverged standby node successfully without wiping existing data
- When resync is needed, repmgr tries pg_rewind first
- pg_basebackup is only used as a fallback if rewind fails or on explicit user command
What do you see instead?
Current Error:
pg_rewind: error: could not open file "/bitnami/postgresql/data/global/pg_control" for reading: No such file or directory
Previous error (fixed by setting POSTGRESQL_CONF_DIR):
postgres: could not access the server configuration file "/bitnami/postgresql/data/postgresql.conf": No such file or directory
Fallback behavior:
After pg_rewind fails, repmgr automatically runs pg_basebackup, which deletes and overwrites the entire data directory, risking data loss or unnecessary reinitialization.
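
Since the current error says that global/pg_control cannot be opened, it may help to confirm that the path given as --target-pgdata really is an initialized data directory before pg_rewind is attempted. The following is a minimal sketch of such a preflight check; the function name is_pgdata is hypothetical and is not part of the Bitnami image or repmgr.

```shell
#!/bin/sh
# Hypothetical helper (not part of Bitnami or repmgr).
# Returns 0 if the directory contains the two files that every initialized
# PostgreSQL data directory has, 1 otherwise. It only checks that the files
# exist; it does not validate their contents.
is_pgdata() {
    dir="$1"
    [ -f "$dir/PG_VERSION" ] && [ -f "$dir/global/pg_control" ]
}
```

Run inside the standby container, `is_pgdata /bitnami/postgresql/data || echo "not a data directory"` would show whether the directory passed to pg_rewind is the one that actually holds the cluster files.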
Additional information
- Environment: Docker Swarm mode, 3 nodes
- Volume mount: ${POSTGRES_VOLUME_PATH}/postgres:${POSTGRESQL_DATA_DIR:-/bitnami/postgresql}:Z
- PostgreSQL config:
  wal_log_hints = on (required for pg_rewind)
  data_checksums = off
- Replication works normally otherwise
- Manual pg_rewind run returns "no rewind required" when nodes are synchronized
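
pg_rewind needs either wal_log_hints = on or data checksums enabled on the target. One way to confirm what the cluster itself has recorded, rather than what the config file says, is to read the output of pg_controldata. The sketch below assumes the usual "wal_log_hints setting:" and "Data page checksum version:" lines in that output; the function name rewind_prereqs_met is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `pg_controldata` output on stdin and succeeds
# if at least one pg_rewind prerequisite is recorded in the control file:
#   - wal_log_hints setting is "on", or
#   - the data page checksum version is greater than 0 (checksums enabled).
rewind_prereqs_met() {
    awk -F': *' '
        /^wal_log_hints setting:/      { hints = $2 }
        /^Data page checksum version:/ { sums = $2 }
        END { exit !((hints == "on") || (sums + 0 > 0)) }
    '
}
```

For example, `pg_controldata /bitnami/postgresql/data | rewind_prereqs_met && echo ok`. Note that pg_controldata reads the same global/pg_control file, so it will fail in the same way as pg_rewind if that file is not where the -D path points.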
Questions
- How to properly structure/configure data directories so pg_rewind can locate the pg_control file?
- Can automatic fallback to pg_basebackup be disabled to avoid data loss after failed rewind?
- Is the use of POSTGRESQL_CONF_DIR=/bitnami/postgresql/data correct or should it point elsewhere?
Root Cause Analysis
The core issue stems from a configuration file location mismatch between Bitnami's PostgreSQL structure and pg_rewind's expectations.
The Problem:
pg_rewind is executed with these parameters:
pg_rewind -D "$POSTGRESQL_DATA_DIR" --source-server "host=${REPMGR_CURRENT_PRIMARY_HOST} port=${REPMGR_CURRENT_PRIMARY_PORT} user=${REPMGR_USERNAME} dbname=${REPMGR_DATABASE}"
where POSTGRESQL_DATA_DIR = /bitnami/postgresql/data
Why it fails:
- pg_rewind internally launches PostgreSQL in single-user mode to perform crash recovery
- PostgreSQL in single-user mode automatically searches for postgresql.conf in the directory specified by the -D parameter
- In Bitnami's structure:
  - Data directory: /bitnami/postgresql/data
  - Config files: /opt/bitnami/postgresql/conf/postgresql.conf
- When PostgreSQL (launched by pg_rewind) looks for /bitnami/postgresql/data/postgresql.conf, it doesn't exist
- This causes the single-user PostgreSQL process to fail, preventing pg_rewind from completing
The sequence:
pg_rewind -D /bitnami/postgresql/data
└── Launches PostgreSQL in single-user mode
    └── PostgreSQL looks for /bitnami/postgresql/data/postgresql.conf
        └── File not found → Process fails
            └── pg_rewind fails
                └── repmgr falls back to pg_basebackup (data loss risk)
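
One possible way around the mismatch described above, instead of redirecting POSTGRESQL_CONF_DIR, is to tell pg_rewind where the configuration file lives. This sketch assumes PostgreSQL 15 or later, where pg_rewind accepts --config-file for the internal postgres invocation; it has not been verified against the Bitnami image, and whether repmgr can be made to pass this flag is not established here. The function name rewind_command is hypothetical, and it only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: prints a pg_rewind command line for manual use.
# --config-file is appended only when postgresql.conf is NOT located
# inside the data directory (the Bitnami layout described above).
# Assumption: pg_rewind from PostgreSQL 15+ (supports --config-file).
rewind_command() {
    data_dir="$1"
    conf_file="$2"
    conninfo="$3"
    cmd="pg_rewind --target-pgdata=$data_dir --source-server=\"$conninfo\" --progress"
    case "$conf_file" in
        "$data_dir"/*) ;;                              # config is inside the data dir
        *) cmd="$cmd --config-file=$conf_file" ;;      # config lives elsewhere
    esac
    printf '%s\n' "$cmd"
}
```

Called as `rewind_command /bitnami/postgresql/data /opt/bitnami/postgresql/conf/postgresql.conf "host=postgres-0 port=5432 user=repmgr dbname=repmgr"`, it prints a command that can be reviewed before being run in the standby container.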