

XGS SSD Firmware - is anyone else having issues with HA nodes not coming back up?

I started the SSD firmware update from KB-000045380 on an XGS136 HA A/P cluster.

First I applied the update to the AUX node (node 2). It was successful: the machine re-entered the cluster and the A/P cluster was all green in the end.

Then I switched the PRI HA role from node 1 to node 2 and waited until the A/P cluster was all green again, so node 2 is now PRI.

The AUX node (node 1) has now been down for 25 minutes since the SSD update command was issued.

I'll wait until tomorrow and then power cycle it.

Is anyone else running into this?

XGS136_XN01_SFOS 19.5.3 MR-3-Build652 HA-Standalone# cish
console> system ha show details
 HA details
 HA status                           |   Enabled
 HA mode                             |   Active-passive
 Cluster ID                          |   0
 Initial primary                     |   X1310xxxxxBQ44 (Node1)
 Preferred primary                   |   No preference
 Load balancing                      |   Not applicable
 Dedicated port                      |   Port10
 Monitoring port                     |   -
 Keepalive request interval          |   250
 Keepalive attempts                  |   16
 Hypervisor-assigned MAC addresses   |   Disabled

 Local node
 Serial number (nodename)            |   X1310xxxxx8X84 (Node2)
 Current HA role                     |   Standalone
 Dedicated link's IP address         |   10.1.178.6
 Last status change                  |   09:41:15 PM, Jan 24, 2024

 Peer node
 Serial number (nodename)            |   X1310xxxxxBQ44 (Node1)
 Current HA role                     |   Fault
 Dedicated link's IP address         |   10.1.178.5
 Last status change                  |   09:41:15 PM, Jan 24, 2024
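
For anyone who needs to watch several of these clusters during the update, here is a rough Python sketch that parses the "system ha show details" output shown above and reports each node's HA role. The section and field names are taken from the output in this thread; the input file name is only an example, and the script is just a convenience sketch, nothing official from Sophos.

# Parse the output of "system ha show details" (saved to a text file)
# and report the HA role of the local and peer nodes.
import sys

def parse_ha_details(text):
    sections = {}
    current = "HA details"
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if "|" in line:
            # "Key   |   Value" rows belong to the most recent section header
            key, _, value = line.partition("|")
            sections.setdefault(current, {})[key.strip()] = value.strip()
        else:
            # Section headers such as "Local node" or "Peer node"
            current = line
    return sections

if __name__ == "__main__":
    details = parse_ha_details(open(sys.argv[1]).read())
    for node in ("Local node", "Peer node"):
        role = details.get(node, {}).get("Current HA role", "unknown")
        print(f"{node}: {role}")
        if role in ("Fault", "Standalone"):
            print("  -> not a healthy A/P state; if this persists well past the")
            print("     expected update window, a manual power cycle may be needed.")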



  • LHerzog, 

    If you already waited 25 mins, you can power cycle the device now, no need to wait until tomorrow. 

    Yes, in some cases a power cycle is required, which is why we put it in the notice & KBA so customers are prepared for it.


  • Thanks, yes, I noticed that warning and was prepared. It was described as "This applies to the rare cases where manual power-on may be necessary" in the mail we received from Sophos, and I was a bit surprised that we're having issues with just the second machine we're updating.

    The failed node was powered off and needed to be turned on manually. Does that make sense in a cluster? Both machines are identical - what is the logic that one reboots after the update and the other stays off?

    At least the cluster is fine now and both have applied the SSD firmware.

    For our other remote clusters we now need to plan for more costly on-site maintenance.

  • Hi LHerzog,

    The KBA reasons this out at a high level -

    I'll give a bit more technical insight into your question - "Both machines are identical - what is the logic that one reboots after the update and the other stays off?"

    The specific SSDs addressed by this firmware update require power to the storage controller to be switched off and then switched back on after the new firmware is installed. To eliminate the manual power cycle in most instances, we have implemented it so that the power module of the Sophos firewall cuts the power to the storage controller internally for the limited allowed fraction of time. In most cases such an internally simulated power cycle works and eliminates manual power cycle efforts for our customers.

    However, such an internally simulated power cycle for the limited allowed fraction of time may not work in some isolated cases.

    Hence, it is advised in the KBA - "As a precautionary measure, ensure you have access to power cycle the appliance, should that be required."
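
To make that sequence a bit more concrete, here is a purely illustrative Python sketch of the flow described above. It is not Sophos's actual implementation - the class, function names, and timing value are invented - it only shows why a manual power cycle is still needed when the internal, time-limited one does not succeed.

import time

# Hypothetical bound on how long the power module may cut power internally
# (the "limited allowed fraction of time" mentioned above).
MAX_INTERNAL_OFF_SECONDS = 2

class FakeStorageController:
    """Stand-in for the SSD's storage controller, only so the sketch runs."""
    def __init__(self, comes_back=True):
        self.powered = True
        self.comes_back = comes_back

    def flash(self, image):
        pass                     # new SSD firmware is staged

    def power_off(self):
        self.powered = False     # power module cuts power to the controller

    def power_on(self):
        self.powered = True

    def responds(self):
        return self.powered and self.comes_back

def simulated_power_cycle(controller):
    # The firewall cuts and restores power internally, but only for a short,
    # bounded off-time.
    controller.power_off()
    time.sleep(MAX_INTERNAL_OFF_SECONDS)
    controller.power_on()
    return controller.responds()

def apply_ssd_firmware(controller, image):
    controller.flash(image)
    if simulated_power_cycle(controller):
        return "update complete - node reboots and rejoins the cluster"
    # In some isolated cases the bounded internal power cycle is not enough,
    # which is where the KBA's manual power cycle comes in.
    return "node stays down - manual power cycle required"

if __name__ == "__main__":
    print(apply_ssd_firmware(FakeStorageController(comes_back=True), "ssd_fw.bin"))
    print(apply_ssd_firmware(FakeStorageController(comes_back=False), "ssd_fw.bin"))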

  • Thank you for taking your time to share some technical details here. I really appreciate that!

    Our XGS136 setups include two original PSUs per appliance. Could that increase the chance of a node staying down after the firmware update?

  • "If you already waited 25 mins, you can power cycle the device now, no need to wait until tomorrow."

    First thing I would say is be careful about the timing. We had one of these and the XGS116 took 25 minutes to come back up after re-powering.

    On a more general point, yes, I read the KBA in detail and was aware of the "rare cases" clause! When I read that and looked at the four sites that would be affected, I thought about what we would do if the worst happened, but didn't expect it with only four units to upgrade.

    As it happened, after upgrading the units overnight, waiting for a successful recovery before starting the next one, the last unit failed to come back up. Having done this in the middle of the night, I wasn't best pleased to find out I would have to get up early (they start at 07:00) to talk them through re-powering the XGS. It was doubly unfortunate that, unknown to me, they had all the directors on site for a board meeting, and as their servers are isolated from the workstations via the XGS, nothing was working.

    Anyway, having had my whinge, I do wonder if instances of this are as "rare" as the KBA describes. I know my results are not statistically significant as the sample size is small, but I hope someone is considering whether "rare" was the correct adjective to use in the KBA. The fact that this was even mentioned in the KBA makes me suspect that they had enough instances of it in testing to mention it at all. LHerzog's experience with two identical units also makes me suspect that the firmware update process is not as reliable as suggested in the KBA.

    I understand the need for this update, and it was good that the KBA explained how to recover if it went wrong, but if there had been a more realistic description of the risk level, we would have gone about this a different way.

  • Having just been dumped with a list of 230 units that need this, if the manual process is as bad as this I'm going to throw myself down some stairs. Having said that, we've had Sophos GES manually applying the patch and there haven't been any issues (though that's a small sample size of 10+ units).