How to replace an XG in HA or maybe not

After one year of operation with ups and downs, mainly regarding firmware versions, 2019 started with a hardware issue. The primary appliance stopped working due to a dying hard disk. Unfortunately, this was detected only because internet and email stopped working. The XG UI was not available at that point, so I initiated a reboot through the console, which was still available. As expected, the auxiliary appliance took over, but the former primary appliance did not boot completely. So I connected a display and keyboard and saw that there were hundreds of errors during boot.

I would have expected the auxiliary device to take over much earlier, and not only after a manual reboot, but this might be something Sophos could work on in the future.

I raised a ticket with our reseller, who forwarded the issue to Sophos. A spare appliance was shipped the same day and arrived two days later.

Although I had to report the firmware installed on the device, the spare came with an older firmware version. So the first step after initial setup was to install the latest firmware, 17.5 GA, which was also installed on our two HA devices. Unfortunately, we still had the first version of 17.5 GA installed, whereas we could only download the later version of 17.5 GA for the spare.

The second problem was that we disconnected HA on the former auxiliary device, as we thought we had to. Unfortunately, the licences were bound to the appliance that had stopped working, so the active device lost all its licences and spam immediately came through to our mailboxes unfiltered.

So what to do? The running device had no licences and an older firmware, and HA was not available because of the different firmware versions. We decided to install the latest firmware on the active device as well, which worked without problems. Then we set up HA, which also worked. But the licences were still missing. After searching online I found the way to transfer the licences to the spare device, but they still did not become active. The issue here was that we had to switch operation to the spare device, which now held the licences.

Finally we managed to get everything working again as expected, but two things remain:

- The auxiliary device should have taken over operation earlier and automatically

- Breaking HA while the primary device is not the one the licences are bound to should not immediately end up in an unprotected situation

  • Hi, just trying to understand this for my own knowledge, not trying to criticize.

    Regarding the aux taking over, I found the following here: https://community.sophos.com/kb/en-us/123174

    • The device failover detection time (peer timeout) is 4 seconds. When the primary device stops sending heartbeat packets, it is declared dead at the end of 4 seconds (250 milliseconds x 16 timeouts). The peer is considered active if a Heartbeat is received within 14 timeouts. Failover is triggered at the end of 7 seconds (3-second link uptime + 4-second device failover detection time) from the time cluster has come up. You can’t change the failover threshold.
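
    The arithmetic in the quoted passage can be written out as a small sketch. The numbers come from the quote; the constant and function names are made up for illustration and are not Sophos settings or APIs.

```python
# Values taken from the quoted KB text; all names here are illustrative only.
HEARTBEAT_INTERVAL_MS = 250        # one heartbeat timeout
MISSED_HEARTBEATS_UNTIL_DEAD = 16  # timeouts before the peer is declared dead
LINK_UPTIME_MS = 3000              # link uptime after the cluster comes up


def peer_timeout_ms() -> int:
    """Device failover detection time: 250 ms x 16 timeouts."""
    return HEARTBEAT_INTERVAL_MS * MISSED_HEARTBEATS_UNTIL_DEAD


def earliest_failover_ms() -> int:
    """Link uptime plus detection time, counted from cluster start."""
    return LINK_UPTIME_MS + peer_timeout_ms()


def peer_is_dead(last_heartbeat_ms: int, now_ms: int) -> bool:
    """The peer counts as dead only once no heartbeat has arrived for the full timeout."""
    return now_ms - last_heartbeat_ms >= peer_timeout_ms()


if __name__ == "__main__":
    print(peer_timeout_ms())       # 4000
    print(earliest_failover_ms())  # 7000
```

    Under this reading, a peer that keeps sending heartbeats is never declared dead, no matter what else on it has failed.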

    Was this not the case?

    Also, as for the licenses being bound to one device, I agree: in the absence of a primary device after a failover there should be a grace period, even if it is only 24 or 48 hours.

  • In reply to Badrobot:

    Internet and email were failing, and so was the UI. But the console still worked, so maybe the heartbeat did too. But there was no failover. The auxiliary took over when I rebooted the primary device.

  • In reply to Jelle:

    Actually, there should be a way for the primary to run a self-test and, if there is any hardware error, stop sending the heartbeat to the auxiliary to trigger the takeover. This would make sure that the whole HA takeover process is automatic.

  • In reply to Nidal Malla:

    That would be a good thing. In this case, as described, it didn't work: it seems the heartbeat was still being sent and that it does not rely on any self-test mechanism.
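
    The self-test idea from this thread could look roughly like the sketch below. It is purely hypothetical and does not describe how the XG firmware works: `heartbeat_tick`, the self-test callables, and `send_heartbeat` are all invented names.

```python
from typing import Callable, Iterable


def heartbeat_tick(
    self_tests: Iterable[Callable[[], bool]],
    send_heartbeat: Callable[[], None],
) -> bool:
    """Send one heartbeat only if every self-test passes.

    Returns True if the heartbeat was sent. When any test fails, the
    heartbeat is withheld, so the peer's normal timeout would trigger
    the takeover without manual intervention.
    """
    if all(test() for test in self_tests):
        send_heartbeat()
        return True
    return False


if __name__ == "__main__":
    sent = []
    disk_ok = lambda: True       # hypothetical check, e.g. a disk health probe
    disk_failing = lambda: False

    heartbeat_tick([disk_ok], lambda: sent.append("hb"))       # heartbeat sent
    heartbeat_tick([disk_failing], lambda: sent.append("hb"))  # heartbeat withheld
    print(sent)
```

    The design choice here is that a failed check is signalled by silence, so the auxiliary needs no extra protocol beyond the existing heartbeat timeout.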