April 12, 2019

Why is my PostgreSQL replication lagging?

PostgreSQL replication delays and bottlenecks can be difficult to troubleshoot especially after the fact. Sometimes all you've got are a bunch of these errors in the log:

ERROR: canceling statement due to conflict with recovery

Happily we can extract some helpful statistics from the binary replication logs as long as the relevant logs are still on the disk.

For replication, the slave reads in segments of the master's Write-Ahead Log (WAL) as if the slave were recovering from a crash. The WAL segments describe the changes that the slave should make to database objects in order to sync them with the master. In PostgreSQL 9.3 and later we can use pg_xlogdump to parse those WAL change records to figure out which tables and indexes are receiving the most traffic for a given time period.

pg_xlogdump writes out database objects by their OIDs eg 1668/16200/20275 so unless you've memorised the OIDs for your tables (and let's face it, who wouldn't want to do that?) we'll need to translate those OIDs to human-readable names. You'll also need to know the names of the WAL segment files for the relevant time period so we'll use ls -l and grep [time] to get the appropriate set of WAL segment files.

Here's the bash that sketches out the technique. Tested on CentOS 7. You'll probably need to change the location of the WAL files, the location of pg_xlogdump and of course the date & time you care about.

my_pg_xlogdump=/usr/pgsql-9.6/bin/pg_xlogdump                     
my_wals=/var/lib/pgsql/pg_wal 
my_time='Apr 30 11:' 
psql_raw='psql --no-align --tuples-only --user postgres'
ls -lh $my_wals | grep -P '[0-9A-F]{24}'                          `# list all the files that look like WAL segments` \
| awk '{print $6,$7,$8,$9 }' | grep -P "$my_time"                 `# filter, including only files whose dates look interesting` \
| (while read month day time file; do
    echo "$file $month $day $time";                               `# Write out the segment file name and the time` \
    $my_pg_xlogdump --path=$my_wals $file $file                   `# Ask pg_xlogdump to parse only one segment file at a time` \
    | egrep ' rel [0-9]+/[0-9]+/[0-9]+ '                          `# Filter for lines that contain a database object reference` \
    | perl -pe 's/.*? rel ([0-9]+)\/([0-9]+)\/([0-9]+) .*/$2 $3/' `# $2 and $3 are the database OID and the relation id respectively` \
    | sort | uniq -c | sort -n | tail -n 5                        `# Get the top 5 most common relations` \
    | (while read count db_oid relfilenode; do                    `# Get the database name, and use it to get the relation name`
        db=$($psql_raw -c "select d.datname from pg_database d where d.oid = $db_oid")
        rel=$($psql_raw $db -c "select n.nspname || '.' || c.relname from pg_namespace n join pg_class c on n.oid = c.relnamespace where c.relfilenode = $relfilenode")
        echo "  $count $db.$rel"                                  `# Write out the count for this relation.`
      done) 
  done)

Sample output for a single segment file:

000000010000001F000000EF Apr 30 11:22
  10360 employeedb.user_schema.address_sdx_streetname
  10370 employeedb.user_schema.address_street_address_id_idx
  10391 employeedb.user_schema.address_primary
  10494 employeedb.user_schema.address_idx_streetname
  30887 employeedb.user_schema.address

From this we see that the "address" table and its indicies were the most-replicated relations at 11:22am on April 30.

Note that VACUUM can change the file name(s) associated with tables. If this happens before you read the WAL, you might see output like this:

000000010000001F000000EF Apr 30 11:22
  10360 employeedb.
  10370 employeedb.user_schema.address_street_address_id_idx
  10391 employeedb.
  10494 employeedb.user_schema.address_idx_streetname
  30887 employeedb.user_schema.address

For this kind of troubleshooting it helps to keep WAL files hanging around on the disk for a while. Some helpful postgresql.conf configuration items are:

archive_mode: when enabled, saves a copy of WAL segments using archive_command
archive_command: the local shell command that archives a completed WAL segment
wal_keep_segments: specifies the minimum number of past WAL segments to keep in the pg_wal directory.

References for PostgreSQL 9.6:

Happy troubleshooting!