HA rscync error on active/passive HA Cluster

Question

hi guys, 
 ugh, with 9.5 sophos made a bad job, once again. 
 Since upgrading SG330 active/passive cluster to 9.5 (now 9.502-4) we had several problems. SSO didn&acute;t update the group entries from AD. (i.e. group members in "internet allow" AD group won&acute;t be updated, so that users new to this group can&acute;t use internet, also users removed from group are still able to use) 
 Second, the db syncing of HA cluster seems to fail. Even the Status Tab in HA shows Master als Active and Slave as Ready, the HA log produces errors every 10 seconds 
 2017:08:14-09:01:11 xxxx-01-2 repctl[3738]: [i] execute(1768): rsync: failed to connect to 198.19.250.1: Connection refused (111) 2017:08:14-09:01:11 xxxx-01-2 repctl[3738]: [i] execute(1768): rsync error: error in socket IO (code 10) at clientserver.c(122) [receiver=3.0.4] 2017:08:14-09:01:11 xxxx-01-2 repctl[3738]: [c] standby_clone(936): rsync failed on $VAR1 = { 2017:08:14-09:01:11 xxxx-01-2 repctl[3738]: [c] standby_clone(936): 'path' => '/postgres.default', 2017:08:14-09:01:11 xxxx-01-2 repctl[3738]: [c] standby_clone(936): 'dst' => '/var/storage/pgsql92/', 2017:08:14-09:01:11 xxxx-01-2 repctl[3738]: [c] standby_clone(936): 'module' => 'postgres-default' 
 After several retries it stops 
 repctl[3738]: [w] master_connection(2015): check_dbh: -1 2017:08:14-08:55:40 xxxx-01-2 repctl[3738]: [i] stop_backup_mode(765): stopped backup mode at 0000000100000020000000D3 2017:08:14-08:55:40 xxxx-01-2 repctl[3738]: [c] standby_clone(950): standby_clone failed: sync aborted (never executed successfully) 2017:08:14-08:55:40 xxxx-01-2 repctl[3738]: [e] prepare_secondary(346): prepare_secondary: clone failed 2017:08:14-08:55:40 xxxx-01-2 repctl[3738]: [c] prepare_secondary(360): failed to get database up, waiting for retry 2017:08:14-08:55:40 xxxx-01-2 repctl[3738]: [c] setup_replication(274): setup_replication was not properly executed 2017:08:14-08:55:40 xxxx-01-2 repctl[3738]: [i] setup_replication(278): checkinterval 300 2017:08:14-09:00:39 xxxx-01-2 repctl[3738]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 0 2017:08:14-09:00:39 xxxx-01-2 repctl[3738]: [i] execute(1768): pg_ctl: no server running 2017:08:14-09:00:39 xxxx-01-2 ha_daemon[4346]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 462 39.097" name="HA control: cmd = 'sync start 1 database'" 2017:08:14-09:00:39 xxxx-01-2 ha_daemon[4346]: id="38A1" severity="info" sys="System" sub="ha" seq="S: 463 39.097" name="control_sync(): we are not in state SYNCING, ignoring sync for database/1" 2017:08:14-09:00:39 xxxx-01-2 repctl[3738]: [i] execute(1768): pg_ctl: PID file "/var/storage/pgsql92/data/postmaster.pid" does not exist 2017:08:14-09:00:39 xxxx-01-2 repctl[3738]: [i] execute(1768): Is server running? 
 
 and every 55 minutes we got message from alerting system: 
 HA SELFMON WARN: Restarting repctl for SLAVE(ACTIVE) 
 
 Any ideas? Exept of rebuild postgresql db&acute;s or releasing HA status and firmware reset of Slave and rebuild HA. 
 Cu 
 Thomas

sachingurung · Answer

Hi Thomas, 
 Daniel has a potential suggestion to your question. It could be that the slave's partition filled up. Also, check if you see any log lines similar to, 
 "[user:notice] cluster_sync[1590]: rsync: write failed on "/var/up2date/aptp/u2d-aptp-9.23906.tgz.gpg" (in cluster_sync): No space left on device (28) " 
 You can check in the fallback.log, execute: 
 fallback.log | grep no space 
 If that doesn't help, rebuild postgres and that should resolve the issue. 
 Thanks