XGS SSD firmware: are others also having issues with HA nodes not coming up?

I started the SSD firmware update (KB-000045380) on an XGS136 HA A/P cluster.

First I applied the update to the AUX node (node 2). It succeeded, the machine re-entered the cluster, and the A/P cluster was all green in the end.

I then switched the PRI HA role from node 1 to node 2 and waited until the A/P cluster was all green again, so node 2 is now PRI.

The AUX node (node 1) has now been down for 25 minutes since the SSD update command was issued.

I'll wait until tomorrow and then power cycle it.

Is anyone else seeing the same behaviour?

XGS136_XN01_SFOS 19.5.3 MR-3-Build652 HA-Standalone# cish
console> system ha show details
 HA details
 HA status                           |   Enabled
 HA mode                             |   Active-passive
 Cluster ID                          |   0
 Initial primary                     |   X1310xxxxxBQ44 (Node1)
 Preferred primary                   |   No preference
 Load balancing                      |   Not applicable
 Dedicated port                      |   Port10
 Monitoring port                     |   -
 Keepalive request interval          |   250
 Keepalive attempts                  |   16
 Hypervisor-assigned MAC addresses   |   Disabled

 Local node
 Serial number (nodename)            |   X1310xxxxx8X84 (Node2)
 Current HA role                     |   Standalone
 Dedicated link's IP address         |   10.1.178.6
 Last status change                  |   09:41:15 PM, Jan 24, 2024

 Peer node
 Serial number (nodename)            |   X1310xxxxxBQ44 (Node1)
 Current HA role                     |   Fault
 Dedicated link's IP address         |   10.1.178.5
 Last status change                  |   09:41:15 PM, Jan 24, 2024
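
For anyone who wants to check this state from a script rather than by eye, here is a minimal sketch that parses the text of `system ha show details` and flags a peer in Fault. It assumes the output keeps the "field | value" layout shown above; the function names (`parse_ha_details`, `peer_in_fault`, `keepalive_window`) are made up for illustration and are not part of SFOS. The keepalive calculation also assumes the interval is in milliseconds, which the output above does not state.

```python
# Sketch only: parses pasted `system ha show details` output.
# Assumes the "field | value" layout shown in this post.

SAMPLE = """
 HA details
 HA status                           |   Enabled
 HA mode                             |   Active-passive
 Keepalive request interval          |   250
 Keepalive attempts                  |   16

 Local node
 Serial number (nodename)            |   X1310xxxxx8X84 (Node2)
 Current HA role                     |   Standalone

 Peer node
 Serial number (nodename)            |   X1310xxxxxBQ44 (Node1)
 Current HA role                     |   Fault
"""


def parse_ha_details(text):
    """Return {section name: {field: value}} from the HA details text."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "|" in line:
            # "field | value" row belonging to the current section
            key, _, value = line.partition("|")
            if current is None:
                current = "HA details"
            sections.setdefault(current, {})[key.strip()] = value.strip()
        else:
            # A line without "|" is treated as a section heading
            current = line
            sections.setdefault(current, {})
    return sections


def peer_in_fault(sections):
    """True if the peer node reports the Fault role."""
    role = sections.get("Peer node", {}).get("Current HA role", "")
    return role.lower() == "fault"


def keepalive_window(sections):
    """Interval x attempts; milliseconds if the interval is in ms (assumption)."""
    details = sections.get("HA details", {})
    interval = int(details["Keepalive request interval"])
    attempts = int(details["Keepalive attempts"])
    return interval * attempts


ha = parse_ha_details(SAMPLE)
print("Peer in fault:", peer_in_fault(ha))
print("Keepalive window:", keepalive_window(ha))
```

This only reads output that has already been captured; it does not log in to the firewall or run any command itself.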



Edited TAGs
[edited by: Erick Jan at 1:16 AM (GMT -8) on 25 Jan 2024]
  • LHerzog, 

    If you have already waited 25 minutes, you can power cycle the device now; there is no need to wait until tomorrow.

    Yes, in some cases a power cycle is required, which is why we put it in the notice and the KBA so that customers are prepared for it.

  • Thanks. Yes, I noticed that warning and was prepared. It was described as "This applies to the rare cases where manual power-on may be necessary." in the mail we received from Sophos, and I was a bit surprised that we ran into the issue on only the second machine we updated.

    The failed node was powered off and had to be turned on manually. Does that make sense in a cluster? Both machines are identical, so what is the logic behind one rebooting after the update and the other staying off?

    At least the cluster is fine now and both nodes have the SSD firmware applied.

    For our other, remote clusters we now need to plan on-site maintenance for this, at increased cost.

  • Hi LHerzog,

    The KBA explains this at a high level.

    I'll give a bit more technical insight into your question: "Both machines are identically - what is the logic that one reboots after the update and one is off?"

    The specific SSDs covered by this firmware update require power to the storage controller to be switched off and then switched back on after the new firmware is installed. To eliminate a manual power cycle in most instances, we implemented the update so that the power module of the Sophos Firewall cuts power to the storage controller internally for the limited fraction of time that is allowed. In most cases, this internally simulated power cycle works and spares our customers a manual power cycle.

    However, an internally simulated power cycle of such limited duration may not work in some isolated cases.

    Hence the advice in the KBA: "As a precautionary measure, ensure you have access to power cycle the appliance, should that be required."

  • Thank you for taking the time to share some technical details here. I really appreciate that!

    Our XGS136 setups include two original PSUs per appliance. Could that increase the chance of a node staying down after the firmware update?