This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

SG135 HA pair, slave stuck in 'Syncing'

We've got a pair of SG135s running 9.502-4 where the slave is stuck in 'Syncing'.  Both master and slave have been rebooted, and I'm unable to glean anything useful from the HA log.

They're properly cabled (all ports match).  I've pasted the logs below, any suggestions are welcome.

 

2017:11:17-01:05:46 fw-1 repctl[15999]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2017:11:17-01:05:46 fw-1 repctl[15999]: [i] daemonize_check(1497): trying to signal daemon and exit
2017:11:17-01:05:46 fw-1 repctl[4200]: [i] recheck(1057): got HUP: replication recheck triggered Setup_replication_done = 1
2017:11:17-01:05:46 fw-1 ha_mode[15984]: topology_changed done (started at 01:05:45)
2017:11:17-01:05:46 fw-1 ha_daemon[4134]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 780 46.246" name="Reading cluster configuration"
2017:11:17-01:05:51 fw-1 ha_daemon[4134]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 781 51.339" name="Set syncing.files for node 2"
2017:11:17-01:05:53 fw-1 ha_daemon[4134]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 782 53.229" name="filesync_end(): initial sync failed, status = 0x500"
2017:11:17-01:05:53 fw-1 ha_daemon[4134]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 783 53.229" name="Clear syncing.files for node 2"
2017:11:17-01:06:01 fw-1 ha_daemon[4134]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 784 01.437" name="Monitoring interfaces for link beat: eth2 eth1 eth0"
2017:11:17-01:06:53 fw-2 repctl[4063]: [i] stop_backup_mode(765): stopped backup mode at 00000001000000D200000039<30>Nov 17 01:06:53 repctl[4063]: [i] execute(1768): waiting for server to start....
2017:11:17-01:06:54 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:06:55 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:06:56 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:06:57 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:06:58 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:06:59 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:00 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:01 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:02 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:03 fw-2 repctl[4063]: [i] execute(1768): done
2017:11:17-01:07:03 fw-2 repctl[4063]: [i] execute(1768): server started
2017:11:17-01:07:03 fw-2 ha_daemon[4002]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 42 03.389" name="HA control: cmd = 'sync stop 1 database'"
2017:11:17-01:07:03 fw-2 ha_daemon[4002]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 43 03.389" name="Deactivating sync process for database on node 1"
2017:11:17-01:07:03 fw-2 repctl[4063]: [i] recheck(1057): got HUP: replication recheck triggered Setup_replication_done = 0
2017:11:17-01:07:03 fw-2 repctl[4063]: [i] execute(1768): pg_ctl: server is running (PID: 5436)
2017:11:17-01:07:03 fw-2 repctl[4063]: [i] execute(1768): /usr/pgsql92/bin/postgres "-D" "/var/storage/pgsql92/data"
2017:11:17-01:07:03 fw-2 ha_daemon[4002]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 44 03.567" name="HA control: cmd = 'sync start 1 database'"
2017:11:17-01:07:03 fw-2 ha_daemon[4002]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 45 03.567" name="Activating sync process for database on node 1"
2017:11:17-01:07:03 fw-2 repctl[4063]: [i] execute(1768): waiting for server to shut down...
2017:11:17-01:07:03 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:04 fw-2 repctl[4063]: [i] execute(1768): done
2017:11:17-01:07:04 fw-2 repctl[4063]: [i] execute(1768): server stopped
2017:11:17-01:07:05 fw-2 repctl[4063]: [i] start_backup_mode(744): starting backup mode at 00000001000000D20000003B
2017:11:17-01:07:05 fw-2 ha_daemon[4002]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 46 05.359" name="HA control: cmd = 'sync start 1 database'"
2017:11:17-01:07:05 fw-2 ha_daemon[4002]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 47 05.359" name="Activating sync process for database on node 1"
2017:11:17-01:07:38 fw-2 repctl[4063]: [i] stop_backup_mode(765): stopped backup mode at 00000001000000D20000003B
2017:11:17-01:07:38 fw-2 repctl[4063]: [i] execute(1768): waiting for server to start...
2017:11:17-01:07:38 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:39 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:40 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:41 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:42 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:43 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:44 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:45 fw-2 repctl[4063]: [i] execute(1768): .
2017:11:17-01:07:46 fw-2 repctl[4063]: [i] execute(1768): done
2017:11:17-01:07:46 fw-2 repctl[4063]: [i] execute(1768): server started
2017:11:17-01:07:46 fw-2 ha_daemon[4002]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 48 46.446" name="HA control: cmd = 'sync stop 1 database'"
2017:11:17-01:07:46 fw-2 ha_daemon[4002]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 49 46.446" name="Deactivating sync process for database on node 1"
2017:11:17-01:07:46 fw-2 repctl[4063]: [i] setup_replication(278): checkinterval 300
2017:11:17-01:07:46 fw-2 repctl[4063]: [i] setup_replication(278): checkinterval 300
2017:11:17-01:10:45 fw-2 ha_daemon[4002]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 50 45.183" name="Initial synchronization finished!"
2017:11:17-01:10:45 fw-2 ha_daemon[4002]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 51 45.183" name="state change SYNCING(2) -> ACTIVE(0)"
2017:11:17-01:10:45 fw-1 ha_daemon[4134]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 785 45.753" name="Node 2 changed state: SYNCING(2) -> ACTIVE(0)"
2017:11:17-01:10:46 fw-1 repctl[4200]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2017:11:17-01:12:46 fw-2 repctl[4063]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1



This thread was automatically locked due to age.
Parents
  • What happens if you:

    1. Disable HA causing the Slave to do a Factory Rest and shutdown
    2. Enable HA on the Master
    3. Power up the Slave.

    Does it then sync and become READY?  If not, I'd be tempted to re-image the Slave.  Please share your results.

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
Reply
  • What happens if you:

    1. Disable HA causing the Slave to do a Factory Rest and shutdown
    2. Enable HA on the Master
    3. Power up the Slave.

    Does it then sync and become READY?  If not, I'd be tempted to re-image the Slave.  Please share your results.

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
Children
  • I ended up having to reimage - however, we'd reimaged this pair only a couple of months ago for a similar issue.  In that instance, the sync failure happened after a firmware upgrade.  In this instance, no firmware upgrade happened - it was just a routine ISP outage.  When the multipathing failed to kick over to the alternate routes properly, I rebooted the Master, which ended up 'Syncing' for hours.

    Since I can't recreate the condition, I'll assume this is closed for now, but HA breaking down over something this routine is sort of worrying.

  • I've got many clients on HA and this is not a problem I've seen more than twice since Hot-Standby was made available.  It would have been nice for Sophos Support to have gotten the High Availability logs before the re-image as that might have uncovered a hardware glitch.

    Just out of paranoia, set fixed speed/duplex for the sync NIC to 1000/Full on the 'Hardware' tab.  There shouldn't be an auto-negotiation problem between two identical devices, but paranoia doesn't seem uncalled for since you've already been zapped twice. [;)]  I assume you have these two NICs connected directly by an Ethernet cable, and I'd be tempted to replace it also.

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA