PostgreSQL repmgr: pg_rewind fails with pg_control error, pg_basebackup overwrites data #83755

@sowabdoul

Name and Version

bitnami/postgresql-repmgr 16


What steps will reproduce the bug?

  1. Deploy a 3-node PostgreSQL cluster using the Docker Compose config below

  2. Wait for the cluster to be fully initialized and synchronized

  3. Stop one standby node:

    docker exec -ti <standby_container> pg_ctl stop -D /bitnami/postgresql/data -m fast

  4. Make some changes on the primary node to create WAL divergence

  5. Attempt to run pg_rewind manually:

    docker exec -ti <standby_container> pg_rewind \
      --target-pgdata=/bitnami/postgresql/data \
      --source-server="host=postgres-0 port=5432 user=repmgr dbname=repmgr" \
      --progress
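
Before step 5 it may help to confirm that the target directory really has the layout pg_rewind expects. The following is only a minimal sketch: check_pgdata is a hypothetical helper name (not part of the Bitnami image), and it assumes that an initialized cluster always contains PG_VERSION and global/pg_control directly under the data directory.

```shell
#!/bin/sh
# Sketch: check that a directory looks like an initialized PostgreSQL
# data directory before pointing pg_rewind at it.
# check_pgdata is a hypothetical helper, not part of the Bitnami image.
check_pgdata() {
    pgdata=$1
    status=0
    # Both files are expected in any initialized cluster.
    for f in PG_VERSION global/pg_control; do
        if [ ! -f "$pgdata/$f" ]; then
            echo "missing: $pgdata/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

Run inside the standby container as check_pgdata /bitnami/postgresql/data. If it reports global/pg_control as missing, the path passed to --target-pgdata is probably not the real data directory inside the mounted volume.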

Are you using any custom parameters or values?

Docker Compose Configuration:

version: '3.8'

x-version-common: &service-common
  image: bitnami/postgresql-repmgr:${POSTGRES_RELEASE:-latest}
  volumes:
    - ${POSTGRES_VOLUME_PATH:?err}/postgres:${POSTGRESQL_DATA_DIR:-/bitnami/postgresql}:Z

x-common-env: &common-env
  BITNAMI_DEBUG: "true"
  POSTGRESQL_FSYNC: "on"
  POSTGRESQL_PASSWORD: ${POSTGRES_PASSWORD:-odoo}
  POSTGRESQL_POSTGRES_PASSWORD: ${ADMIN_POSTGRES_PASSWORD:-postgres}
  POSTGRESQL_USERNAME: ${POSTGRES_USERNAME:-odoo}
  POSTGRESQL_WAL_LEVEL: replica
  POSTGRESQL_SYNCHRONOUS_COMMIT_MODE: "on"
  POSTGRESQL_NUM_SYNCHRONOUS_REPLICAS: 1
  POSTGRESQL_SYNCHRONOUS_REPLICAS_MODE: "FIRST"
  POSTGRESQL_CLUSTER_APP_NAME: "*"
  POSTGRESQL_CONF_DIR: /bitnami/postgresql/data
  REPMGR_DEGRADED_MONITORING_TIMEOUT: 300
  REPMGR_FAILOVER: automatic
  REPMGR_MASTER_RESPONSE_TIMEOUT: 30
  REPMGR_MONITORING_HISTORY: "yes"
  REPMGR_PARTNER_NODES: postgres-0,postgres-1,postgres-2
  REPMGR_PASSWORD: ${REPMGR_PASSWORD:-repmgr}
  REPMGR_PRIMARY_HOST: postgres-0
  REPMGR_PRIMARY_VISIBILITY_CONSENSUS: "true"
  REPMGR_RECONNECT_ATTEMPTS: 10
  REPMGR_RECONNECT_INTERVAL: 10
  REPMGR_USE_PGREWIND: "yes"
  REPMGR_USE_REPLICATION_SLOTS: 1

services:
  postgres-0:
    <<: *service-common
    environment:
      <<: *common-env
      REPMGR_NODE_NAME: postgres-0
      REPMGR_NODE_NETWORK_NAME: postgres-0
      REPMGR_NODE_PRIORITY: 100
    deploy:
      placement:
        constraints: [node.labels.postgres-0 == true]

  postgres-1:
    <<: *service-common
    environment:
      <<: *common-env
      REPMGR_NODE_NAME: postgres-1
      REPMGR_NODE_NETWORK_NAME: postgres-1
      REPMGR_NODE_PRIORITY: 90
    deploy:
      placement:
        constraints: [node.labels.postgres-1 == true]

  postgres-2:
    <<: *service-common
    environment:
      <<: *common-env
      REPMGR_NODE_NAME: postgres-2
      REPMGR_NODE_NETWORK_NAME: postgres-2
      REPMGR_NODE_PRIORITY: 80
    deploy:
      placement:
        constraints: [node.labels.postgres-2 == true]

What is the expected behavior?

  1. pg_rewind should resynchronize a diverged standby node without wiping its existing data
  2. When a resync is needed, repmgr should try pg_rewind first
  3. pg_basebackup should be used only as a fallback if the rewind fails, or on explicit user command

What do you see instead?

Current Error:

pg_rewind: error: could not open file "/bitnami/postgresql/data/global/pg_control" for reading: No such file or directory

Previous error (fixed by setting POSTGRESQL_CONF_DIR):

postgres: could not access the server configuration file "/bitnami/postgresql/data/postgresql.conf": No such file or directory

Fallback behavior:

After pg_rewind fails, repmgr automatically runs pg_basebackup, which deletes and overwrites the entire data directory. This risks data loss or an unnecessary full reinitialization.
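
To make the behavior I am asking for concrete, here is a rough sketch of the policy. It is purely hypothetical and is not how the Bitnami image or repmgr works today: resync_standby, REWIND_CMD, CLONE_CMD and ALLOW_BASEBACKUP are names invented for illustration, with the two *_CMD variables standing in for the real rewind and clone commands.

```shell
#!/bin/sh
# Hypothetical policy sketch: try the rewind first and only run a full
# clone when the operator has explicitly opted in.
# REWIND_CMD, CLONE_CMD and ALLOW_BASEBACKUP are illustrative names.
resync_standby() {
    if $REWIND_CMD; then
        echo "rewind succeeded"
        return 0
    fi
    if [ "${ALLOW_BASEBACKUP:-no}" = "yes" ]; then
        echo "rewind failed; running full clone"
        $CLONE_CMD
        return $?
    fi
    # Default: leave the data directory untouched and report the failure.
    echo "rewind failed; refusing to overwrite data directory" >&2
    return 1
}
```

With ALLOW_BASEBACKUP unset, a failed rewind leaves the data directory as it is and returns a non-zero status, so the operator can investigate before anything is overwritten.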


Additional information

  • Environment: Docker Swarm mode, 3 nodes

  • Volume mount: ${POSTGRES_VOLUME_PATH}/postgres:${POSTGRESQL_DATA_DIR:-/bitnami/postgresql}:Z

  • PostgreSQL config:

    • wal_log_hints = on (required for pg_rewind)
    • data_checksums = off
  • Replication works normally otherwise

  • Manual pg_rewind run returns "no rewind required" when nodes are synchronized
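
The two settings above can also be read from the control file instead of postgresql.conf. A minimal sketch, assuming pg_controldata prints its usual English-language "wal_log_hints setting:" and "Data page checksum version:" lines; rewind_prereqs_ok is a hypothetical helper name. It succeeds when wal_log_hints is on or data checksums are enabled, which is the prerequisite pg_rewind documents.

```shell
#!/bin/sh
# Sketch: read pg_controldata output on stdin and succeed if the cluster
# meets pg_rewind's prerequisite (wal_log_hints on, or data checksums
# enabled). rewind_prereqs_ok is a hypothetical helper name.
rewind_prereqs_ok() {
    awk -F': *' '
        /^wal_log_hints setting:/      { hints = $2 }
        /^Data page checksum version:/ { sums = $2 }
        END { exit !(hints == "on" || sums + 0 > 0) }
    '
}
```

Usage inside the container: pg_controldata /bitnami/postgresql/data | rewind_prereqs_ok. Note that pg_controldata itself reads global/pg_control, so it fails against the same path that pg_rewind rejects.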


Questions

  1. How should the data directories be structured or configured so that pg_rewind can locate the pg_control file?
  2. Can the automatic fallback to pg_basebackup be disabled to avoid data loss after a failed rewind?
  3. Is POSTGRESQL_CONF_DIR=/bitnami/postgresql/data the correct setting, or should it point elsewhere?

Root Cause Analysis

The core issue is a mismatch between where Bitnami stores the PostgreSQL configuration files and where pg_rewind expects to find them.

The Problem:
pg_rewind is executed with these parameters:

pg_rewind -D "$POSTGRESQL_DATA_DIR" --source-server "host=${REPMGR_CURRENT_PRIMARY_HOST} port=${REPMGR_CURRENT_PRIMARY_PORT} user=${REPMGR_USERNAME} dbname=${REPMGR_DATABASE}"

Where POSTGRESQL_DATA_DIR = /bitnami/postgresql/data

Why it fails:

  1. pg_rewind internally launches PostgreSQL in single-user mode to perform crash recovery when the target was not shut down cleanly
  2. Unless told otherwise, PostgreSQL in single-user mode searches for postgresql.conf in the directory specified by the -D parameter
  3. In Bitnami's structure:
    • Data directory: /bitnami/postgresql/data
    • Config files: /opt/bitnami/postgresql/conf/postgresql.conf
  4. When PostgreSQL (launched by pg_rewind) looks for /bitnami/postgresql/data/postgresql.conf, it doesn't exist
  5. This causes the single-user PostgreSQL process to fail, preventing pg_rewind from completing

The sequence:

pg_rewind -D /bitnami/postgresql/data
  └── Launches PostgreSQL in single-user mode
      └── PostgreSQL looks for /bitnami/postgresql/data/postgresql.conf
          └── File not found → Process fails
              └── pg_rewind fails
                  └── repmgr falls back to pg_basebackup (data loss risk)
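
One possible way around the sequence above, sketched under assumptions: recent pg_rewind releases (PostgreSQL 15 and later, as far as I can tell) accept a --config-file option that tells the internally launched postgres process where the main configuration file is. build_rewind_cmd is a hypothetical helper that only prints the command so it can be reviewed before running; the connection parameters are the ones from the reproduction steps, and --dry-run keeps the first attempt non-destructive.

```shell
#!/bin/sh
# Sketch: print a pg_rewind command line that points at a configuration
# file outside the data directory. build_rewind_cmd is a hypothetical
# helper; nothing is executed here.
build_rewind_cmd() {
    pgdata=$1    # target data directory
    conf=$2      # postgresql.conf outside the data directory
    primary=$3   # host name of the current primary
    printf 'pg_rewind --target-pgdata=%s --config-file=%s --source-server="host=%s port=5432 user=repmgr dbname=repmgr" --progress --dry-run\n' \
        "$pgdata" "$conf" "$primary"
}
```

For example, build_rewind_cmd /bitnami/postgresql/data /opt/bitnami/postgresql/conf/postgresql.conf postgres-0 prints a command that can be pasted into the standby container. Whether the image's own rewind call can be made to pass this option is exactly what I would like to know.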
