How to ensure ha slave is fully synced with a manual check

Question

I am wanting to update my Sophos UTM. It is in HA configuration. Both nodes are Virtual Machines. 
 This particular install is running UTM 9.355-1. 
 
 I want to ensure the slave node is properly synced. Is there a way to run a manual check to ensure this? 
 The HA WebUI page says the slave is in sync and ready. 
 However, the high-availability logs continually log the following every hour, for node 2, which makes be doubt what is being reported: 
 2018:11:12-16:08:46 sophos-ha-1 ha_daemon[4173]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 519 46.895" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip 198.19.250.2" 
 2018:11:12-16:08:47 sophos-ha-1 ha_daemon[4173]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 520 47.053" name="Executing (nowait) /etc/init.d/ha_mode check" 
 2018:11:12-16:08:47 sophos-ha-1 ha_mode[31969]: calling check 
 2018:11:12-16:08:47 sophos-ha-1 ha_mode[31969]: check: waiting for last ha_mode done 
 2018:11:12-16:08:47 sophos-ha-1 ha_mode[31969]: check_ha() role=MASTER, status=ACTIVE 
 2018:11:12-16:08:47 sophos-ha-1 ha_mode[31969]: check done (started at 16:08:47) 
 2018:11:12-16:18:29 sophos-ha-2 ha_daemon[3660]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 539 29.889" name="Executing (wait) /usr/local/bin/confd-setha mode slave" 
 2018:11:12-16:18:29 sophos-ha-2 ha_daemon[3660]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 540 29.986" name="Executing (nowait) /etc/init.d/ha_mode check" 
 2018:11:12-16:18:29 sophos-ha-2 ha_mode[27985]: calling check 
 2018:11:12-16:18:29 sophos-ha-2 ha_mode[27985]: check: waiting for last ha_mode done 
 2018:11:12-16:18:29 sophos-ha-2 ha_mode[27985]: check_ha() role=SLAVE, status=ACTIVE 
 2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: 2018-11-12 16:18:33 [28000] [e] [db_connect(2058)] error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.1): could not connect to server: Connection refused 
 2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: Is the server running on host "198.19.250.1" and accepting 
 2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: TCP/IP connections on port 5432? 
 2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: 2018-11-12 16:18:33 [28000] [e] [master_connection(1904)] could not connect to server: Connection refused 
 2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: Is the server running on host "198.19.250.1" and accepting 
 2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: TCP/IP connections on port 5432? 
 2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: parse at repctl.pl line 156. 
 2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: HA SELFMON WARN: Restarting repctl for SLAVE(ACTIVE) 
 2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: check done (started at 16:18:29)

rglaue · Accepted Answer

The solution was to fix the Sophos UTM replication. 
 
 The Slave was running okay, but was not receiving a full replication because the Master's PostgreSQL server was not running. 
 It is possible that an upgrade from a previous version may have invited certain changes that caused PostgreSQL to stop working. The first version was 9.3.10, and final version 9.3.55. However there are some know issues with replication in this version. 
 The procedure in resolving this was to: 
 
 stop repctl on both Slave and Master 
 stop PostgreSQL on the Slave, and ensure it was not running on the Master too (it is not because it is broken) 
 rebuild the PostgreSQL database on both Slave and Master 
 start PostgreSQL on the Master, and then the Slave 
 start repctl 
 check to make sure replication is running fully 
 in replication status, ensure the PostgreSQL database server is running properly. There should be a primary and secondary with minimal lag time. 
 
 PostgreSQL stores reports (e.g. graphing data) and mail transactions. Rebuilding the PostgreSQL database may cause these to be lost - however, none of the data in this Sophos UTM was lost. I am not using the mail services in this UTM, so that is not a factor that was evaluated. 
 Replication is running, and there are no errors in the high-availability log.

rglaue · Answer

This ended up not being the end of the story. The HA pair, nodes A and B, were not completely in sync. I called Sophos Tech support and this is what we did. 
 When I updated the A node (up2date) from 9.355-1 to 9.411, and the A node rebooted, the B node did not route traffic. The B node switched two interfaces around in its configuration. It took one interface offline to "unassigned" and moved configuration of another interface to the network card previously used by the first interface with the "unassigned" network card. We changed those back, and routing worked again. 
 We then performed a ha_daemon -c takeover back to the A node. Routing still worked, but the A node had trouble too. The eth0 from the B node showed up on the A node. Now the A node had two eth0 interfaces with MAC addresses from the A and B nodes' eth0 devices. We also saw MAC addresses from both A and B nodes assigned to different hardware cards. So we rebooted the A node, which made B take over. 
 After B took over, we saw the same issue with MAC addresses on the B node. We performed a takeover again back to A, and had similar issues on the A node. So we rebooted the A node again. 
 The B node took over, and it was okay this time. The A node came up and we performed the takeover back to it. It had one hardware network card with the B node MAC Address. So we rebooted the A node again. 
 After the final reboot of the A node, it seems the interfaces are correct on the A node. 
 
 So with replication broken as it was, I hypothesized (not the technician) that confd some how did not get the interface configuration in sync properly. 
 We checked the NIC and MAC assignments in the `/etc/udev/rules.d/70-persistent-net.rules` file, and it was always correct. But was showing up incorrectly in WebAdmin, and subsequently the Sophos Hardware information and interface configuration.

rglaue · Answer

I did not fully explain what may be an important detail 
 I had two interfaces, each assigned to a different NIC. 
 Interface1 assigned to eth0 
 Interface2 assigned to eth2 
 
 In the previous written case when first failed over to node B, Interface2 was offline and "unassigned" (was eth2). Interface 1 was assigned to eth2. 
 So when we performed the ha_daemon -c takeover back to node A, that issue on node B must have been related to having two eth0 network cards on node A. 
 So it would seem that the replication/synchronization assigned node-A-eth0 and node-B-eth0 to node A. 
 But, like I wrote, after several reboots the nodes got back in sync with the correct NIC assignments. 
 The interfaces always seems to be synchronized correctly. But the NICs available on each node were not synchronized properly. I wrote that the B node MAC addresses showed up on the A node NICs. However, as I give it some thought to why I had two eth0 NICs at the beginning, I now believe that the synchronization process was assigning NICs in the confd configuration on the other node in the pair. 
 At the command line interface it looked correct. But it would seem the confd/WebAdmin had configured A-node NICs on the B-node, and vice versa.