
How to ensure the HA slave is fully synced with a manual check

I want to update my Sophos UTM. It is in an HA configuration, and both nodes are virtual machines.

This particular install is running UTM 9.355-1.

 

I want to ensure the slave node is properly synced. Is there a way to run a manual check to ensure this?

The HA WebUI page says the slave is in sync and ready.

However, the high-availability log continually records the following every hour for node 2, which makes me doubt what is being reported:

2018:11:12-16:08:46 sophos-ha-1 ha_daemon[4173]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 519 46.895" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip 198.19.250.2"
2018:11:12-16:08:47 sophos-ha-1 ha_daemon[4173]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 520 47.053" name="Executing (nowait) /etc/init.d/ha_mode check"
2018:11:12-16:08:47 sophos-ha-1 ha_mode[31969]: calling check
2018:11:12-16:08:47 sophos-ha-1 ha_mode[31969]: check: waiting for last ha_mode done
2018:11:12-16:08:47 sophos-ha-1 ha_mode[31969]: check_ha() role=MASTER, status=ACTIVE
2018:11:12-16:08:47 sophos-ha-1 ha_mode[31969]: check done (started at 16:08:47)
2018:11:12-16:18:29 sophos-ha-2 ha_daemon[3660]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 539 29.889" name="Executing (wait) /usr/local/bin/confd-setha mode slave"
2018:11:12-16:18:29 sophos-ha-2 ha_daemon[3660]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 540 29.986" name="Executing (nowait) /etc/init.d/ha_mode check"
2018:11:12-16:18:29 sophos-ha-2 ha_mode[27985]: calling check
2018:11:12-16:18:29 sophos-ha-2 ha_mode[27985]: check: waiting for last ha_mode done
2018:11:12-16:18:29 sophos-ha-2 ha_mode[27985]: check_ha() role=SLAVE, status=ACTIVE
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: 2018-11-12 16:18:33 [28000] [e] [db_connect(2058)] error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.1): could not connect to server: Connection refused
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: Is the server running on host "198.19.250.1" and accepting
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: TCP/IP connections on port 5432?
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: 2018-11-12 16:18:33 [28000] [e] [master_connection(1904)] could not connect to server: Connection refused
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: Is the server running on host "198.19.250.1" and accepting
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: TCP/IP connections on port 5432?
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: parse at repctl.pl line 156.
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: HA SELFMON WARN: Restarting repctl for SLAVE(ACTIVE)
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: check done (started at 16:18:29)
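
For reference, here is the kind of manual check I had in mind, pieced together only from the commands and paths visible in the log above. It is just a sketch: the log file location and the netstat check are my own assumptions about a stock UTM 9 install, not anything from Sophos documentation, and everything would be run as root on the master node.

# Re-run the same check the ha_daemon schedules every hour (path taken from the log):
/etc/init.d/ha_mode check

# The slave's errors say it cannot reach PostgreSQL on the master's sync address,
# so confirm something is actually listening on 198.19.250.1:5432:
netstat -ltn | grep 5432

# Watch the HA log while the check runs (assumed log file location):
tail -f /var/log/high-availability.log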
 


  • This ended up not being the end of the story. The HA pair, nodes A and B, were not completely in sync. I called Sophos Tech support and this is what we did.

    When I updated the A node (up2date) from 9.355-1 to 9.411 and the A node rebooted, the B node did not route traffic. The B node had swapped two interfaces around in its configuration: it took one interface offline by setting its NIC to "unassigned", and it moved another interface's configuration onto the network card the first interface had been using. We changed those back, and routing worked again.

    We then performed a `ha_daemon -c takeover` back to the A node. Routing still worked, but the A node had trouble too: the B node's eth0 showed up on the A node, so the A node now had two eth0 interfaces carrying the MAC addresses of both nodes' eth0 devices. We also saw MAC addresses from both the A and B nodes assigned to different hardware cards (a shell check for spotting this is sketched at the end of this reply). So we rebooted the A node, which made B take over.

    After B took over, we saw the same issue with MAC addresses on the B node. We performed a takeover back to A again and had similar issues on the A node, so we rebooted the A node again.

    The B node took over, and it was okay this time. The A node came up, and we performed the takeover back to it. It had one hardware network card with the B node's MAC address, so we rebooted the A node again.

    After the final reboot, the interfaces on the A node seem to be correct.

     

    So, with replication broken as it was, I hypothesized (my idea, not the technician's) that confd somehow did not get the interface configuration in sync properly.

    We checked the NIC and MAC assignments in the `/etc/udev/rules.d/70-persistent-net.rules` file, and that file was always correct. But the assignments showed up incorrectly in WebAdmin, and consequently in the Sophos hardware information and interface configuration.
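
    Since most of this came down to which MAC address ended up on which ethX, here is roughly what that comparison looks like from the shell. It is only a sketch using standard Linux tools plus the rules file mentioned above, not a Sophos-documented procedure.

    # List every ethX the kernel currently sees, with its MAC address:
    for nic in /sys/class/net/eth*; do
        echo "$(basename "$nic")  $(cat "$nic/address")"
    done

    # Flag any MAC address that shows up on more than one interface
    # (this is what the duplicated eth0 situation looked like for us):
    ip -o link show | grep -o 'link/ether [0-9a-f:]*' | sort | uniq -d

    # Compare against the persistent mapping udev has written for this node:
    grep 'NAME=' /etc/udev/rules.d/70-persistent-net.rules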

     

  • I did not fully explain what may be an important detail.

    I had two interfaces, each assigned to a different NIC.

    Interface1 assigned to eth0

    Interface2 assigned to eth2

     

    In the case I described above, when we first failed over to node B, Interface2 was offline and "unassigned" (it had been on eth2), and Interface1 was assigned to eth2.

    So when we performed the `ha_daemon -c takeover` back to node A, that issue on node B must have been related to node A having two eth0 network cards.

    So it would seem that the replication/synchronization assigned node-A-eth0 and node-B-eth0 to node A.

    But, like I wrote, after several reboots the nodes got back in sync with the correct NIC assignments.

    The interfaces always seemed to be synchronized correctly, but the NICs available on each node were not. I wrote that the B node's MAC addresses showed up on the A node's NICs. However, having given some thought to why I had two eth0 NICs at the beginning, I now believe the synchronization process was assigning each node's NICs in the confd configuration of the other node in the pair.

    At the command line it looked correct, but it would seem that confd/WebAdmin had configured A-node NICs on the B node, and vice versa. A way to compare the two nodes' views from the shell is sketched below.
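
    For completeness, this is roughly how the two nodes can be compared side by side. The only UTM-specific piece is `ha_utils ssh`, which, as I understand the UTM 9 HA tooling, opens a shell on the passive node over the sync link; treat that as my assumption rather than something support spelled out.

    # 1) Note the ethX/MAC pairs on the active node (same loop as in my earlier reply).
    # 2) Hop to the passive node over the HA sync link (assumed UTM 9 helper):
    ha_utils ssh
    # 3) Inside that shell, run the same loop again and compare the two lists,
    #    and compare both against the hardware list WebAdmin shows:
    for nic in /sys/class/net/eth*; do
        echo "passive  $(basename "$nic")  $(cat "$nic/address")"
    done

    In our case the kernel view on each node was right; it was the synced confd/WebAdmin view that had the nodes' NICs crossed.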