
How to ensure the HA slave is fully synced with a manual check

I want to update my Sophos UTM. It is in an HA configuration, and both nodes are virtual machines.

This particular install is running UTM 9.355-1.

 

I want to ensure the slave node is properly synced. Is there a way to run a manual check to ensure this?

The HA WebUI page says the slave is in sync and ready.

However, the high-availability logs continually log the following every hour for node 2, which makes me doubt what is being reported:

2018:11:12-16:08:46 sophos-ha-1 ha_daemon[4173]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 519 46.895" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip 198.19.250.2"
2018:11:12-16:08:47 sophos-ha-1 ha_daemon[4173]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 520 47.053" name="Executing (nowait) /etc/init.d/ha_mode check"
2018:11:12-16:08:47 sophos-ha-1 ha_mode[31969]: calling check
2018:11:12-16:08:47 sophos-ha-1 ha_mode[31969]: check: waiting for last ha_mode done
2018:11:12-16:08:47 sophos-ha-1 ha_mode[31969]: check_ha() role=MASTER, status=ACTIVE
2018:11:12-16:08:47 sophos-ha-1 ha_mode[31969]: check done (started at 16:08:47)
2018:11:12-16:18:29 sophos-ha-2 ha_daemon[3660]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 539 29.889" name="Executing (wait) /usr/local/bin/confd-setha mode slave"
2018:11:12-16:18:29 sophos-ha-2 ha_daemon[3660]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 540 29.986" name="Executing (nowait) /etc/init.d/ha_mode check"
2018:11:12-16:18:29 sophos-ha-2 ha_mode[27985]: calling check
2018:11:12-16:18:29 sophos-ha-2 ha_mode[27985]: check: waiting for last ha_mode done
2018:11:12-16:18:29 sophos-ha-2 ha_mode[27985]: check_ha() role=SLAVE, status=ACTIVE
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: 2018-11-12 16:18:33 [28000] [e] [db_connect(2058)] error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.1): could not connect to server: Connection refused
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: Is the server running on host "198.19.250.1" and accepting
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: TCP/IP connections on port 5432?
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: 2018-11-12 16:18:33 [28000] [e] [master_connection(1904)] could not connect to server: Connection refused
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: Is the server running on host "198.19.250.1" and accepting
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: TCP/IP connections on port 5432?
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: parse at repctl.pl line 156.
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: HA SELFMON WARN: Restarting repctl for SLAVE(ACTIVE)
2018:11:12-16:18:33 sophos-ha-2 ha_mode[27985]: check done (started at 16:18:29)
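
One way to confirm what the log is complaining about would be to check whether PostgreSQL is actually listening on the master's HA address. This is only a sketch using standard tools; the database name and host are copied from the error message above, and the psql invocation may need adjusting on the UTM:

# On the master node: is anything listening on the PostgreSQL port?
netstat -ntlp | grep 5432

# From the slave node: attempt the same connection the ha_mode check makes
psql -h 198.19.250.1 -p 5432 -d repmgr -c "select 1"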
 


  • Read the first two threads found by Googling:

    site:community.sophos.com/products/unified-threat-management/f "error while connecting to database(DBI:Pg:dbname=repmgr;host" "could not connect to server: Connection refused"

    And then get a case open with Support.  Problem solved?

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • I already read those articles. Rebuilding the PostgreSQL database is not the solution, because this error is caused by a bug in this specific UTM version.

    This cannot be resolved, other than by upgrading the firmware.

     

     

    I am asking on this thread for a command I can run to compare the master and slave configurations, to double-check they are in sync.

    Both `ha_utils status` and the WebAdmin HA page say both nodes are ACTIVE and the slave node is "Ready".

    But, because of the errors I am seeing in the /var/log/high-availability.log, I do not have any confidence in that.

     

    I have contacted Support and have an open case. I am waiting for them to tell me how I can manually compare the configurations. I would guess this should not be difficult: perform a configuration dump on the master and on the slave, then eyeball or diff them. But it seems there is no command-line dump. Does anyone know of one?

    How do `ha_utils status` and the WebAdmin HA page determine that the slave node is "ACTIVE" and "Ready"? What is the mechanism?

     

    This is the procedure I will likely follow (a rough command sketch follows the list).

    1. Verify the slave node is fully in sync with the master.

    2. Perform a `ha_daemon -c takeover` to switch the master role to the slave, without rebooting either node.

    3. If the new master node is not good, shut it off, which will cause the old master to take over again.

    4. If the new master is good, immediately run the firmware Up2Dates on the old master, which will update and reboot it, and then do the same for the second node.

     

    But I want to ensure the current slave is fully in sync with the master before I proceed.

    Step 2 will incur downtime if the second node does not work properly as the new master.
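
    Roughly, at the shell, I picture steps 2 through 4 like this (mostly commands already mentioned in this thread; this is a sketch of my plan, not a tested procedure):

    # 1. Confirm both nodes report ACTIVE / Ready before touching anything
    ha_utils status

    # 2. Hand the master role to the current slave without rebooting either node
    ha_daemon -c takeover

    # 3. Watch the HA log on the new master to confirm it settles as MASTER/ACTIVE
    tail -f /var/log/high-availability.log

    # 4. If the new master behaves, apply the Up2Dates on the old master (now the
    #    slave), let it reboot, then repeat for the remaining node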

  • Escalation should be requested for this Support case.

    I wonder if you first wouldn't want to disable HA, thus forcing the Slave to do a Factory Reset, then re-enable HA, wait a few minutes and then power up the Slave.  If you're still seeing the same worrisome entries in the logs, I think I would re-image the Slave from ISO.

    Your other problem is that you are so far behind in applying Up2Dates.  The implication is that the partition for / is likely so full that you cannot apply any Up2Dates.

    Please keep us apprised of your situation.

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • I already did these things before I posted here. And no, / (root) is not full.

     

     

    I already broke HA, and reestablished it. No change.

    I then broke HA, deleted the virtual disk, recreated the virtual disk, reinstalled from an ISO. No change.

    I disabled automatic Up2Date, setting it to manual, and manually loaded the updates needed to get each node to version 9.4; the Up2Date WebUI now shows them. The root (/) partition is at 73% utilization, and Up2Date did not complain about a lack of disk space when I loaded those updates from the command line.
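
    For the record, the manual loading went roughly like this (a sketch from memory; the package directory and the installer name are how I recall the UTM command line working, so treat them as assumptions rather than a reference):

    # Copy the downloaded Up2Date packages onto the node
    scp u2d-sys-9.*.tgz.gpg root@<utm>:/var/up2date/sys/

    # Then, on the node, run the system Up2Date installer to apply what is staged
    auisys.plx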

     

    I have a case already open, and their feedback so far was to install a new UTM in isolation, load a backup onto it, and then switch the UTMs out. I don't have a way to do this.

    But I was only asking for a way to check the config on the master and the slave to ensure they are in sync, for example a way to dump the config on each node. Perhaps I could create a backup of each node and then extract the backups to compare the configs (I have not found a way to open the newest version of the .abf format used by Sophos backups).

    If I could read the .abf, that might be a route. However, if I run a backup on the slave node, I do not know whether it backs up what is stored on the slave node or what is stored on the master node.

     

    Any ideas?

  • I believe that the following command will create a configuration backup of whatever node you're on:

    backup.plx -o yourfilename.abf

    I've never tried to compare two such files, so I'll be interested to learn what you find.

    Kudos - the manual Up2Date process you describe is the recommended approach when one is so far behind the current version.

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • Any ideas or pointers on how to open the .abf file and extract the content into something readable?

    The older format was bzip2, but I could not determine the current format.
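
    For anyone else trying, this is the kind of check I mean (standard tools, nothing UTM-specific; the filename is just an example):

    # Ask file(1) what the .abf container actually is
    file cfg_master.abf

    # If it reports bzip2 (the old format), it can be unpacked directly
    bzcat cfg_master.abf | less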

  • No idea.  I'd probably use cksum to see if they were identical or maybe cmp.  You can use rsync to transfer a file to the /home directory on the other node.
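
    Something like this, perhaps (untested; the peer's HA address and the filenames are just examples - check the address with ha_utils status):

    # On the master: make a backup and push it to the peer's /home over the HA link
    backup.plx -o /home/master.abf
    rsync /home/master.abf root@198.19.250.2:/home/

    # On the peer: make its own backup and compare the two
    backup.plx -o /home/slave.abf
    cksum /home/master.abf /home/slave.abf
    cmp /home/master.abf /home/slave.abf && echo identical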

    Cheers - Bob
    PS Hmm - sounds like you've been around for a while - an administrator can merge your old and new identities.

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • It looks like a backup is not possible on the slave node:

    <S> sophos-1:/ # backup.plx -o cfg_sophos-2-20181119-151200
    Running on slave or worker node. Exiting.

    Seems like a simple thing I want to do, but I guess not so easy.

     

    This is my only identity.

    I had another in the Astaro days, but I never posted.

  • Thanks for documenting that about backup.plx.  I can confirm.

    I also checked three more HA installations and did not find the "error while connecting to database" anywhere in 2018 in any of them.  Please let us know what Support comes up with after this gets escalated.

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • The solution was to fix the Sophos UTM replication.

     

    The Slave was running okay, but was not receiving a full replication because the Master's PostgreSQL server was not running.

    It is possible that an upgrade from a previous version introduced changes that caused PostgreSQL to stop working. The original install was 9.310, and the final version was 9.355. However, there are known issues with replication in this version.

    The procedure for resolving this was:

    1. Stop repctl on both the Slave and the Master.
    2. Stop PostgreSQL on the Slave, and ensure it is not running on the Master either (it was not, because it was broken).
    3. Rebuild the PostgreSQL database on both the Slave and the Master.
    4. Start PostgreSQL on the Master, and then on the Slave.
    5. Start repctl.
    6. Check to make sure replication is running fully.
    7. In the replication status, ensure the PostgreSQL database server is running properly. There should be a primary and a secondary with minimal lag (see the sketch below).
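
    For step 7, the check I mean looks like this on the Master (pg_stat_replication is a standard PostgreSQL view; the database user is an assumption, and the host and port are the ones from the HA log):

    # On the Master: confirm the Slave is attached as a streaming-replication
    # client and see how far behind it is
    psql -h 198.19.250.1 -p 5432 -U postgres -d repmgr \
      -c "select client_addr, state, sent_location, replay_location from pg_stat_replication;"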

    PostgreSQL stores reports (e.g. graphing data) and mail transactions. Rebuilding the PostgreSQL database may cause these to be lost - however, none of the data in this Sophos UTM was lost. I am not using the mail services in this UTM, so that is not a factor that was evaluated.

    Replication is running, and there are no errors in the high-availability log.

  • This ended up not being the end of the story. The HA pair, nodes A and B, were not completely in sync. I called Sophos Tech support and this is what we did.

    When I updated the A node (via Up2Date) from 9.355-1 to 9.411 and it rebooted, the B node did not route traffic. The B node had swapped two interfaces around in its configuration: it took one interface offline as "unassigned" and moved another interface's configuration onto the network card previously used by the now-unassigned interface. We changed those back, and routing worked again.

    We then performed a ha_daemon -c takeover back to the A node. Routing still worked, but the A node had trouble too. The eth0 from the B node showed up on the A node. Now the A node had two eth0 interfaces with MAC addresses from the A and B nodes' eth0 devices. We also saw MAC addresses from both A and B nodes assigned to different hardware cards. So we rebooted the A node, which made B take over.

    After B took over, we saw the same issue with MAC addresses on the B node. We performed a takeover again back to A, and had similar issues on the A node. So we rebooted the A node again.

    The B node took over, and it was okay this time. The A node came up and we performed the takeover back to it. It had one hardware network card with the B node MAC Address. So we rebooted the A node again.

    After the final reboot of the A node, it seems the interfaces are correct on the A node.

     

    So, with replication broken as it was, my hypothesis (mine, not the technician's) is that confd somehow did not get the interface configuration properly in sync.

    We checked the NIC and MAC assignments in the `/etc/udev/rules.d/70-persistent-net.rules` file, and they were always correct. But they showed up incorrectly in WebAdmin, and consequently in the Sophos hardware information and interface configuration.
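
    For anyone who wants to do the same cross-check, this is roughly what we compared on each node (standard Linux tools, nothing UTM-specific):

    # What udev thinks the ethX-to-MAC mapping should be on this node
    cat /etc/udev/rules.d/70-persistent-net.rules

    # What the kernel actually brought up, with the real MAC addresses
    ip link show | grep -A1 ': eth'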

     

  • I did not fully explain what may be an important detail.

    I had two interfaces, each assigned to a different NIC.

    Interface1 assigned to eth0

    Interface2 assigned to eth2

     

    In the case I described previously, when we first failed over to node B, Interface2 was offline and "unassigned" (it had been on eth2), and Interface1 was assigned to eth2.

    So when we performed the ha_daemon -c takeover back to node A, that issue on node B must have been related to having two eth0 network cards on node A.

    So it would seem that the replication/synchronization assigned node-A-eth0 and node-B-eth0 to node A.

    But, like I wrote, after several reboots the nodes got back in sync with the correct NIC assignments.

    The interfaces always seemed to be synchronized correctly, but the NICs available on each node were not. I wrote that the B node's MAC addresses showed up on the A node's NICs. However, thinking about why I had two eth0 NICs at the start, I now believe the synchronization process was assigning each node's NICs to the confd configuration of the other node in the pair.

    At the command-line interface it looked correct, but it would seem confd/WebAdmin had configured the A node's NICs on the B node, and vice versa.
