HA: Cannot initialize Active/Passive

Question

Hello folks 
 I encountered a strange situation which I cannot resolve using backup/restore and/or clean re-install. 
 The Setup 
 I have a 2 node virtualized setup like in the attached simplified image. I'm using the home edition of Sophos UTM 9.405-5 and I wanted to setup an active/passive redundancy with near-identical VMs. In this setup, I would like FRW-1 to be the preferred master, because it runs on slightly more reliable hardware. 
 
 The VMs are completely unaware of any VLANs. The packets are being tagged by the host-machines. Also, all network interfaces are bonded via LACP using 2 physical NICs on each node. 
 As I understand it, the heartbeat/sync traffic should go over a dedicated interface on both Sophos appliances. No problem for me, they have even their very own VLAN tag number. Well, the dedicated interface is just a virtual NIC provided from the KVM host. Both VMs get the same number of interfaces with same VLAN tags and order, same disk sizes (20G), same amount of RAM (2.5G). 
 What did I do? 
 Installed UTM with 9.405-5 iso on FRW-2. Performed basic setup without license, assigned static IP address 10.20.20.12 
 On the running and active node (FRW-1, 10.20.20.1, UTM 9.405-5) in the HA section, I set the operation mode to "Automatic configuration" in the config tab, selected eth3 as the sync interface and applied it. Waited a few seconds. On the secondary node FRW-2 I did the same: automatic, eth3. I opened the live log of FRW-1 and the slave could be found. 
 What did I expect? 
 I expected FRW-2 to become a Slave node and being displayed on the master node in the HA section of WebAdmin. FRW-1 should become Master because of its longer uptime, and the WebAdmin connection to FRW-2 should terminate and disappear. 
 What was the result? 
 On FRW-1, the WebAdmin showed FRW-1 to be the Master node in the status tab. But there was no other node displayed, like not at all. Connection to FRW-2 remained on the same IP (10.20.20.12) and the status tab in its WebAdmin interface displayed both Nodes as Masters. The logs of FRW-2 had a lot of "Another master around!" entries. 
 So, FRW-1 sees FRW-2 for a brief moment, but then not anymore after connecting. FRW-2 sees FRW-1, but switches to Master role anyway. 
 Additional Information 
 The strange thing is, this setup worked before . 2 weeks ago I successfully set up a working HA configuration using same environmental conditions. However, VMM-2 encountered a hardware failure and had not been powered on for several days. I replaced its motherboard after the replacement arrived. For FRW-1 this didn't matter because it remained master anyway (as expected). It's just that I cannot re-initialize this kind of HA setup for some reason. 
 I tried to initialize manually, but with same results. I also reinstalled FRW-1 and FRW-2 using the older firmware 9.404-5, still same results. 
 General connection over the bonding works. Other VMs can happily ping and connect to each other without any issues. 
 I hope I'm missing something obvious and this is not some weird bug :P 
 Logs 
 So I rebooted the hypervisor just to be sure and tried again. These are the logs when configuring HA (automatic) 
 FRW-1: 
 
 2016:08:21-12:32:04 frw-1 conntrack-tools[4092]: no dedicated links available! 
 2016:08:21-12:32:04 frw-1 conntrack-tools[4092]: no dedicated links available! 
 2016:08:21-12:32:04 frw-1 conntrack-tools[4092]: no dedicated links available! 
 2016:08:21-12:40:04 frw-1 ha_zeroconf[14619]: id="3800" severity="info" sys="System" sub="ha" name="set zeroconf interface to eth3" 
 2016:08:21-12:40:04 frw-1 conntrack-tools[4092]: no dedicated links available! 
 2016:08:21-12:40:04 frw-1 ha_zeroconf[14620]: id="3800" severity="info" sys="System" sub="ha" name="Interface eth3 is up! Searching for active HA system!" 
 2016:08:21-12:40:04 frw-1 ha_zeroconf[14620]: id="3800" severity="info" sys="System" sub="ha" name="Using temporary ip 198.19.250.114" 
 2016:08:21-12:40:11 frw-1 ha_zeroconf[14620]: id="3800" severity="info" sys="System" sub="ha" name="No HA active, trying asg zeroconf!" 
 2016:08:21-12:40:36 frw-1 ha_zeroconf[14620]: id="3800" severity="info" sys="System" sub="ha" name="No other asg zeroconf node found!" 
 2016:08:21-12:40:36 frw-1 ha_zeroconf[14620]: id="3800" severity="info" sys="System" sub="ha" name="Interface eth3 is up! Searching for active HA system!" 
 2016:08:21-12:40:36 frw-1 ha_zeroconf[14620]: id="3800" severity="info" sys="System" sub="ha" name="Using temporary ip 198.19.250.105" 
 2016:08:21-12:40:44 frw-1 ha_zeroconf[14620]: id="3800" severity="info" sys="System" sub="ha" name="No HA active, trying asg zeroconf!" 
 2016:08:21-12:40:49 frw-1 ha_zeroconf[14620]: id="3800" severity="info" sys="System" sub="ha" name="Found other node!" 
 2016:08:21-12:40:49 frw-1 ha_zeroconf[14620]: id="3800" severity="info" sys="System" sub="ha" name="Declaring local node as MASTER! Local uptime 314760 higher than 190" 
 2016:08:21-12:40:49 frw-1 conntrack-tools[4092]: no dedicated links available! 
 2016:08:21-12:40:49 frw-1 conntrack-tools[4092]: no dedicated links available! 
 2016:08:21-12:40:54 frw-1 conntrack-tools[4092]: no dedicated links available! 
 2016:08:21-12:41:57 frw-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="W: 0 57.549" name="------------------------------" 
 2016:08:21-12:41:57 frw-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="W: 1 57.549" name="Starting ASG HA daemon v2.0.0" 
 2016:08:21-12:41:57 frw-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="W: 2 57.549" name="Reading cluster configuration" 
 2016:08:21-12:41:57 frw-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="W: 3 57.550" name="Set UTM version to 9.405005 
 2016:08:21-12:41:57 frw-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="W: 4 57.551" name="Using interface eth3 with name Node1" 
 2016:08:21-12:41:57 frw-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="W: 5 57.553" name="Start of discovery" 
 2016:08:21-12:41:57 frw-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 6 57.553" name="Executing (nowait) /etc/init.d/ha_mode disable" 
 2016:08:21-12:41:57 frw-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 7 57.553" name="--- Node is disabled ---" 
 2016:08:21-12:41:57 frw-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 8 57.559" name="Monitoring interfaces for link beat: eth4 eth2 eth1 eth0" 
 2016:08:21-12:41:57 frw-1 conntrack-tools[4092]: ---- shutdown received ---- 
 2016:08:21-12:41:57 frw-1 ha_proxy[4176]: Shutting down. 
 2016:08:21-12:42:00 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 9 00.552" name="Initdead: Found no master node, taking over." 
 2016:08:21-12:42:00 frw-1-1 ha_daemon[15454]: id="38B0" severity="info" sys="System" sub="ha" seq="M: 10 00.552" name="Switching to Master mode" 
 2016:08:21-12:42:00 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 11 00.552" name="start/reset initial synchronization timer = 0" 
 2016:08:21-12:42:00 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 12 00.552" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip ''" 
 2016:08:21-12:42:01 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 13 01.181" name="Executing (nowait) /etc/init.d/ha_mode enable" 
 2016:08:21-12:42:01 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 14 01.181" name="--- Node is enabled ---" 
 2016:08:21-12:42:01 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 15 01.181" name="Executing (nowait) /etc/init.d/ha_mode master" 
 2016:08:21-12:42:01 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 16 01.182" name="Initdead time over, operate normal" 
 2016:08:21-12:42:01 frw-1-1 ha_mode[15459]: calling disable 
 2016:08:21-12:42:01 frw-1-1 ha_mode[15948]: calling enable 
 2016:08:21-12:42:01 frw-1-1 ha_mode[15459]: disable: waiting for last ha_mode done 
 2016:08:21-12:42:01 frw-1-1 ha_mode[15459]: Switching disable mode 
 2016:08:21-12:42:01 frw-1-1 ha_mode[15950]: calling master 
 2016:08:21-12:42:01 frw-1-1 ha_mode[15459]: disable done (started at 12:42:01) 
 2016:08:21-12:42:01 frw-1-1 ha_mode[15948]: enable: waiting for last ha_mode done 
 2016:08:21-12:42:01 frw-1-1 ha_mode[15948]: Switching enable mode 
 2016:08:21-12:42:01 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 17 01.620" name="Reading cluster configuration" 
 2016:08:21-12:42:01 frw-1-1 ha_mode[15948]: daemonized... 
 2016:08:21-12:42:01 frw-1-1 ha_mode[15948]: enable done (started at 12:42:01) 
 2016:08:21-12:42:01 frw-1-1 repctl[16013]: [i] execute(1647): pg_ctl: server is running (PID: 3879) 
 2016:08:21-12:42:01 frw-1-1 repctl[16013]: [i] execute(1647): pg_ctl: server is running (PID: 3879) 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16017]: reliable ctnetlink event delivery is ENABLED. 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16017]: using user-space event filtering 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16017]: netlink event socket buffer size has been set to 4194304 bytes 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16017]: initialization completed 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16029]: -- starting in daemon mode -- 
 2016:08:21-12:42:01 frw-1-1 ha_mode[15950]: master: waiting for last ha_mode done 
 2016:08:21-12:42:01 frw-1-1 ha_mode[15950]: Going to Master mode 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16029]: committing all external caches 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16029]: Committed 0 new entries 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16029]: commit has taken 0.001679 seconds 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16029]: flushing caches 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16029]: resync with master conntrack table 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16029]: committing all external caches 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16029]: Committed 0 new entries 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16029]: commit has taken 0.001499 seconds 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16029]: flushing caches 
 2016:08:21-12:42:01 frw-1-1 conntrack-tools[16029]: resync with master conntrack table 
 2016:08:21-12:42:02 frw-1-1 repctl[16013]: [i] setup_replication(242): checkinterval 300 
 2016:08:21-12:42:02 frw-1-1 ha_mode[15950]: /var/mdw/scripts/confd-sync: /usr/local/bin/confd-sync started 
 2016:08:21-12:42:02 frw-1-1 ha_mode[15950]: daemonized... 
 2016:08:21-12:42:02 frw-1-1 repctl[16088]: [i] daemonize_check(1382): trying to signal daemon 
 2016:08:21-12:42:02 frw-1-1 ha_mode[15950]: [33m[1m:: Restarting SMTP Proxy[m 
 2016:08:21-12:42:02 frw-1-1 ha_mode[15950]: :: Stopping SMTP Proxy 
 2016:08:21-12:42:02 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 18 02.758" name="Monitoring interfaces for link beat: eth4 eth2 eth1 eth0" 
 2016:08:21-12:42:02 frw-1-1 ha_daemon[15454]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 19 02.758" name="Netlink: Lost link beat on eth0!" 
 2016:08:21-12:42:02 frw-1-1 ha_daemon[15454]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 20 02.758" name="Netlink: Found link beat on eth4 again!" 
 2016:08:21-12:42:02 frw-1-1 ha_daemon[15454]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 21 02.758" name="Netlink: Found link beat on eth2 again!" 
 2016:08:21-12:42:02 frw-1-1 ha_daemon[15454]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 22 02.758" name="Netlink: Found link beat on eth1 again!" 
 2016:08:21-12:42:02 frw-1-1 ha_daemon[15454]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 23 02.759" name="Netlink: Lost link beat on eth4!" 
 2016:08:21-12:42:02 frw-1-1 ha_daemon[15454]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 24 02.759" name="Netlink: Lost link beat on eth2!" 
 2016:08:21-12:42:02 frw-1-1 ha_daemon[15454]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 25 02.759" name="Netlink: Lost link beat on eth1!" 
 2016:08:21-12:42:02 frw-1-1 ha_daemon[15454]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 26 02.759" name="Netlink: Found link beat on eth0 again!" 
 2016:08:21-12:42:02 frw-1-1 ha_daemon[15454]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 27 02.759" name="Netlink: Found link beat on eth4 again!" 
 2016:08:21-12:42:02 frw-1-1 ha_daemon[15454]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 28 02.760" name="Netlink: Found link beat on eth2 again!" 
 2016:08:21-12:42:02 frw-1-1 ha_daemon[15454]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 29 02.760" name="Netlink: Found link beat on eth1 again!" 
 2016:08:21-12:42:02 frw-1-1 ha_proxy[16124]: Initializing tinyproxy ... 
 2016:08:21-12:42:02 frw-1-1 ha_proxy[16124]: Reloading config file 
 2016:08:21-12:42:02 frw-1-1 ha_proxy[16124]: Listening on IP 0.0.0.0 
 2016:08:21-12:42:02 frw-1-1 ha_proxy[16124]: Setting "Via" header to 'Astaro HA Proxy' 
 2016:08:21-12:42:02 frw-1-1 ha_proxy[16124]: Adding Port [443] to the list allowed by CONNECT 
 2016:08:21-12:42:02 frw-1-1 ha_proxy[16124]: Now running as group "nogroup". 
 2016:08:21-12:42:02 frw-1-1 ha_proxy[16124]: Now running as user "nobody". 
 2016:08:21-12:42:02 frw-1-1 ha_proxy[16126]: Creating child number 1 of 1 ... 
 2016:08:21-12:42:02 frw-1-1 ha_proxy[16126]: Finished creating all children. 
 2016:08:21-12:42:02 frw-1-1 ha_proxy[16126]: Setting the various signals. 
 2016:08:21-12:42:02 frw-1-1 ha_proxy[16126]: Starting main loop. Accepting connections. 
 2016:08:21-12:42:03 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 30 03.552" name="End of discovery" 
 2016:08:21-12:42:04 frw-1-1 ha_mode[15950]: [ ok ] 
 2016:08:21-12:42:04 frw-1-1 ha_mode[15950]: :: Starting SMTP Proxy 
 2016:08:21-12:42:04 frw-1-1 ha_mode[15950]: [ ok ] 
 2016:08:21-12:42:04 frw-1-1 ha_mode[15950]: [ ok ] 
 2016:08:21-12:42:05 frw-1-1 ha_mode[15950]: master done (started at 12:42:01) 
 2016:08:21-12:42:18 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 31 18.070" name="Monitoring interfaces for link beat: eth4 eth2 eth1 eth0" 
 2016:08:21-12:42:31 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 32 31.084" name="Reading cluster configuration" 
 2016:08:21-12:42:46 frw-1-1 ha_daemon[15454]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 33 46.141" name="Monitoring interfaces for link beat: eth4 eth2 eth1 eth0" 
 These are the logs for FRW-2 
 
 2016:08:21-12:40:28 frw-2 ha_zeroconf[6311]: id="3800" severity="info" sys="System" sub="ha" name="set zeroconf interface to eth3" 
 2016:08:21-12:40:28 frw-2 ha_zeroconf[6312]: id="3800" severity="info" sys="System" sub="ha" name="Interface eth3 is up! Searching for active HA system!" 
 2016:08:21-12:40:28 frw-2 ha_zeroconf[6312]: id="3800" severity="info" sys="System" sub="ha" name="Using temporary ip 198.19.250.117" 
 2016:08:21-12:40:35 frw-2 ha_zeroconf[6312]: id="3800" severity="info" sys="System" sub="ha" name="No HA active, trying asg zeroconf!" 
 2016:08:21-12:40:45 frw-2 ha_zeroconf[6312]: id="3800" severity="info" sys="System" sub="ha" name="Found other node!" 
 2016:08:21-12:40:45 frw-2 ha_zeroconf[6312]: id="3800" severity="info" sys="System" sub="ha" name="Declaring local node as SLAVE! Local uptime 190 lower than 314755" 
 
I am a bit worried about the "no dedicated links available!" bits of FRW-1. Or is it just normal for a VM?

metalchrigel · Accepted Answer

I figured it out by myself. The error lied not in Sophos UTM, but rather in my setup. This is what I did: 
 
 Disabled any HA configuration on both FRWs 
 Updated VMM-2 (apt-get) 
 Updated VMM-1 (apt-get) 
 Stopped all VMs on VMM-2 and rebooted the host and the VMs (with the exception of FRW-2) 
 Stopped all VMs on VMM-1 and rebooted the host (with FRW-1, so that it would become master because of its longer uptime) 
 Once FRW-1 was up and ready, I started FRW-2. 
 Log into FRW-2 and setup HA (automatic/eth3) 
 Log into FRW-1 and setup HA (automatic/eth3) 
 ??? 
 Profit 
 
 Yes it worked after the reboots of the hypervisors and I tested the failover/recovery. If I had to guess, the error was somehow network and/or hardware related, since this occured after a motherboard failure of VMM-2. Why FRW-1 would not behave correctly, I cannot tell. However, for me this issue is now resolved. Just leaving this for future reference. 
 Thank you and best regards.