HA UNLINKED on Sophos UTM after nothing

Question

Hello, 
 First of all, I wish you a happy new year! 
 I'm writing to you because our customer has been running a Sophos UTM firewall on version 9.7 for almost 2 years. 
 Our customer claims to have tested the Slave firewall in the meantime and it was working fine. 
 However, the HA SLAVE firewall has now been stuck in "UNLINKED" status for some time. I cannot determine how long it has been like this, because the logs only go back 30 days: the log storage had filled up, which was a separate problem. (We worked on this firewall 3 months ago when we tried to rebuild the DB; maybe that is related?) 
 In any case, the MASTER and SLAVE firewalls have already been restarted several times. Here are the references I can give you: 
 - Firmware of both UTMs: 9.707-5 
 - Last big update: more than 1.5 years ago 
 Here are the actions that have been taken: 
 - Restart of the two firewalls -> NOK 
 - Checked the cabling -> They have not touched anything for a long time 
 - Verification of the HA link, it goes up physically in the CLI (ETH3) 
 customername:/home/login # ip link show
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2000 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 7c:5a:1c:48:cb:53 brd ff:ff:ff:ff:ff 
 - Verification that I am pinging the other firewall on the HA link: 
 customername:/home/login # ping 198.19.250.2
PING 198.19.250.2 (198.19.250.2) 56(84) bytes of data.
64 bytes from 198.19.250.2: icmp_seq=1 ttl=64 time=0.155 ms
64 bytes from 198.19.250.2: icmp_seq=2 ttl=64 time=0.156 ms
64 bytes from 198.19.250.2: icmp_seq=3 ttl=64 time=0.246 ms
^C
--- 198.19.250.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.155/0.185/0.246/0.045 ms 
 - Checking the HA logs: 
 2022:01:25-14:24:12 customername-1 ha_daemon[4979]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 251 12.413" name="Executing (nowait) /etc/init.d/ha_mode check"
2022:01:25-14:24:12 customername ha_mode[3002]: calling check
2022:01:25-14:24:12 customername ha_mode[3002]: check: waiting for last ha_mode done
2022:01:25-14:24:12 customername ha_mode[3002]: check_ha() role=MASTER, status=ACTIVE
2022:01:25-14:24:12 customername ha_mode[3002]: check done (started at 14:24:12)
2022:01:25-14:24:16 customername repctl[5075]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2022:01:25-14:27:25 customername2 repctl[4464]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2022:01:25-14:29:16 customername repctl[5075]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2022:01:25-14:32:25 customername2 repctl[4464]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2022:01:25-14:34:12 customername2 ha_daemon[4416]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 158 12.023" name="Executing (wait) /usr/local/bin/confd-setha mode slave"
2022:01:25-14:34:12 customername2 ha_daemon[4416]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 159 12.117" name="Executing (nowait) /etc/init.d/ha_mode check"
2022:01:25-14:34:12 customername2 ha_mode[23854]: calling check
2022:01:25-14:34:12 customername2 ha_mode[23854]: check: waiting for last ha_mode done
2022:01:25-14:34:12 customername2 ha_mode[23854]: check_ha() role=SLAVE, status=UNLINKED 
 - Checking the status of the HA service : 
 customername:/home/login # service ha_mode status
ha_mode[12752]: calling status
ha_mode[12752]: Missing HA_* variables, not called by ha_daemon? (exit 2) 
 - Failover test: disconnected the HA on 25.01.2022 between 12:40 and 12:44 -> Nothing came up: no VPN, no access to the portal via the public IP... 
 Can you help me solve this very worrying problem? It means that if there is a failure, the customer has no more production. 
 Best regards, Raphaëlle

Holger Gran · Accepted Answer

It seems eth4 is the interface in question. If this interface is not connected to a device, please disable "HA link monitoring" for this interface in the WebAdmin.
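
If you want to confirm which interfaces have no link before changing the setting, here is a minimal sketch that reads the standard Linux sysfs `carrier` file. It assumes Python 3 is available on the machine where you run it (this may not be the case on the UTM itself), and the interface names in the loop are simply the ones mentioned in this thread. The helper name `link_state` is made up for illustration.

```python
from pathlib import Path


def link_state(iface, sysfs_root="/sys/class/net"):
    """Return 'up', 'down', or 'unknown' for a Linux network interface."""
    carrier = Path(sysfs_root) / iface / "carrier"
    try:
        value = carrier.read_text().strip()
    except OSError:
        # The interface does not exist, or it is administratively down
        # (the kernel refuses the read in that case).
        return "unknown"
    # The kernel reports "1" when a physical link is detected, "0" otherwise.
    return "up" if value == "1" else "down"


if __name__ == "__main__":
    # Interface names taken from this thread; adjust to your own setup.
    for iface in ("eth0", "eth1", "eth2", "eth4"):
        print(iface, link_state(iface))
```

Any interface that reports "down" or "unknown" while HA link monitoring is enabled for it is a candidate for the behaviour described above.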

RaphaelleB · Answer

Oh, I didn't understand that! The "UNLINKED" status means that one of the interfaces with HA link monitoring enabled has no link on the SLAVE, not just the HA cable (eth3)! 
 I unchecked HA monitoring on eth4 and the SLAVE went to "READY". 
 
 2022:01:26-13:37:34 customername-2 ha_daemon[4416]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 211 34.504" name="Monitoring interfaces for link beat: eth2 eth1 eth0"
2022:01:26-13:37:34 customername-2 ha_daemon[4416]: id="38A1" severity="warn" sys="System" sub="ha" seq="S: 212 34.504" name="All monitored interfaces with link again!"
2022:01:26-13:37:34 customername-2 ha_daemon[4416]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 213 34.504" name="state change UNLINKED(1) -> ACTIVE(0)"
2022:01:26-13:37:34 customername-1 ha_daemon[4979]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 307 34.837" name="Node 2 changed state: UNLINKED(1) -> ACTIVE(0)"
2022:01:26-13:37:44 customername-1 ha_daemon[4979]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 308 44.380" name="Monitoring interfaces for link beat: eth2 eth1 eth0" 
 
 Thank you! 
 Best regards, Raphaelle
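
For anyone who lands here with the same symptom, the two useful things in the ha_daemon log are the list of monitored interfaces and the state transitions. Below is a rough sketch of pulling both out of log lines with Python. The regular expressions are only based on the lines pasted in this thread, so treat them as an assumption about the log format rather than a specification; `summarize` and `SAMPLE` are names made up for this example.

```python
import re

# Patterns inferred from the ha_daemon lines quoted in this thread.
MONITORED = re.compile(r'Monitoring interfaces for link beat: ([^"]+)"')
TRANSITION = re.compile(r'(\w+)\(\d+\) -> (\w+)\(\d+\)')


def summarize(lines):
    """Return (last list of monitored interfaces, list of state transitions)."""
    monitored, transitions = [], []
    for line in lines:
        m = MONITORED.search(line)
        if m:
            monitored = m.group(1).split()
            continue
        t = TRANSITION.search(line)
        if t:
            transitions.append((t.group(1), t.group(2)))
    return monitored, transitions


# Two lines copied from the log excerpt in this thread.
SAMPLE = [
    '2022:01:26-13:37:34 customername-2 ha_daemon[4416]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 211 34.504" name="Monitoring interfaces for link beat: eth2 eth1 eth0"',
    '2022:01:26-13:37:34 customername-2 ha_daemon[4416]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 213 34.504" name="state change UNLINKED(1) -> ACTIVE(0)"',
]

if __name__ == "__main__":
    monitored, transitions = summarize(SAMPLE)
    print("monitored:", monitored)
    print("transitions:", transitions)
```

If an interface appears in the monitored list but is not cabled on one node, that is the first thing to check.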
