High Availability - Master is DEAD. What is my next step to troubleshoot?

Hi-

Last night I got an automated email from the Sophos UTM (9.601-5), which runs on a pair of SG125s in HA mode, saying that HA had recovered. Just one email. That's unusual, as the only time I get these emails is during a manual firmware Up2Date, and then I get several as the HA nodes update, sync, and go back to production status. I checked the HA status and it was fine, so I chalked it up to a glitch or momentary hiccup.

This morning, the Primary device (which is physically 192.168.0.1 and the WebAdmin address) is in SLAVE mode with a DEAD status, and I have all kinds of automated HA emails about one UTM being down and the other taking over for it.

The BackupSophos (which is physically 192.168.0.20 but is now reached at the LAN address 192.168.0.1, since it's now in MASTER state) is running the show. I checked the boxes and there are lights on both UTMs, so the PrimarySophos is not without juice. Its status indicator lights are flashing just like those on the backup box (which is now operating in primary mode).

I don't want to screw something up, so what are my first steps to troubleshoot and see if I can get the problematic UTM synced back up properly and in MASTER mode again?

Possible options as I see it:

1. Reboot the BackupSophos from the HA panel in WebAdmin, but I'm not sure this will have any effect on the dead UTM.

2. Physically reboot the problematic box with the power button (it has power to it and appears to have good status lights)

3. Replace the network cable connecting the problematic box's eth0 to the LAN (I see good lights, however)

4. Replace the network cable linking the Primary and Backup boxes via their DMZ ports (I see good lights, however)

 

Not sure what my chess moves should be or in what order. Any advice? I have not done any firmware Up2Dates recently, so I don't believe that would be the cause. Not sure if an automated Pattern Up2Date could cause this or if I legitimately have a malfunctioning Primary UTM.

 

  • If you have physical access to the dead device, try powering it off and rebooting it.

    If the device is under warranty, start the RMA process. If not, try to find the fault.

    Try connecting a console cable (not sure which type the SG125 supports) and tell us about the output. You could also try reinstalling the dead UTM from the ISO.

    Don’t shut down the active device!

    Best regards

    Alex

  • In reply to Alexander Busch:

    Welp. A forced unplug and reboot of what was the primary SG125 seems to have done the trick. Once it started back up, the SYNCH status showed in the HA area of WebAdmin. It was syncing while the BackupSophos was still running as MASTER. Once they synced up, they switched positions: PrimarySophos (the problematic one that was rebooted) changed from SLAVE to MASTER, and BackupSophos slid back into its customary SLAVE role. So, Humpty Dumpty is back together again. But I'm not sure why the Primary SG125 went dead to begin with. I ruled out cabling. Maybe a Pattern Up2Date didn't load properly and caused it to FUBAR, forcing a rollover to the backup? I had not made any changes in WebAdmin for weeks, so it was not my doing. Either the hardware failed or some automatic update failed. Let's see if it happens again.
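
The failover and failback sequence described above can be sketched as a toy state model. This is only an illustration, not Sophos code or any Sophos API: the `Node` class, the function names, and the `READY`/`SYNCING` status labels are made up for the example, and the rule that roles swap back only when a preferred master is configured is an assumption inferred from the behavior seen in this thread.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str    # "MASTER" or "SLAVE"
    status: str  # "READY", "SYNCING", or "DEAD" (illustrative labels)


def fail(node, peer):
    """Node stops responding: it shows as a DEAD slave, the peer takes over."""
    node.role, node.status = "SLAVE", "DEAD"
    peer.role, peer.status = "MASTER", "READY"


def reboot(node):
    """A rebooted node rejoins as a slave and starts syncing from the master."""
    node.status = "SYNCING"


def finish_sync(node, peer, preferred_master=None):
    """Sync completes; roles swap only if the synced node is the preferred master."""
    node.status = "READY"
    if preferred_master == node.name:
        node.role, peer.role = "MASTER", "SLAVE"


def run_scenario(preferred_master=None):
    """Replay the incident and return a snapshot of both nodes after each event."""
    primary = Node("PrimarySophos", "MASTER", "READY")
    backup = Node("BackupSophos", "SLAVE", "READY")
    history = []

    def snapshot(event):
        history.append(
            (event, primary.role, primary.status, backup.role, backup.status)
        )

    fail(primary, backup)
    snapshot("primary fails")
    reboot(primary)
    snapshot("primary rebooted")
    finish_sync(primary, backup, preferred_master)
    snapshot("sync complete")
    return history


if __name__ == "__main__":
    for row in run_scenario(preferred_master="PrimarySophos"):
        print(row)
```

Calling `run_scenario(preferred_master=None)` instead leaves BackupSophos as MASTER after the sync, which avoids the second role switch.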

     

  • In reply to UDPride:

    Glad to hear it. So you can see your HA is working perfectly. Keep this in mind: if this node fails again, you should examine it more closely.

    Best regards

    Alex

  • In reply to UDPride:

    With two identical pieces of hardware, I normally recommend not choosing a preferred master.

    Cheers - Bob