XGS SSD firmware update: is anyone else seeing HA nodes not come back up?

I started the SSD firmware update (KB-000045380) on an XGS136 HA A/P cluster.

First I applied the update to the AUX node (node 2). It succeeded, the machine rejoined the cluster, and the A/P cluster was all green in the end.

I then switched the PRI role from node 1 to node 2 and waited until the A/P cluster was all green again, so node 2 is now PRI.

The AUX node (node 1) has now been down for 25 minutes since I issued the SSD update command.

I'll wait until tomorrow and then power cycle it.

Has anyone else run into this?

XGS136_XN01_SFOS 19.5.3 MR-3-Build652 HA-Standalone# cish
console> system ha show details
 HA details
 HA status                           |   Enabled
 HA mode                             |   Active-passive
 Cluster ID                          |   0
 Initial primary                     |   X1310xxxxxBQ44 (Node1)
 Preferred primary                   |   No preference
 Load balancing                      |   Not applicable
 Dedicated port                      |   Port10
 Monitoring port                     |   -
 Keepalive request interval          |   250
 Keepalive attempts                  |   16
 Hypervisor-assigned MAC addresses   |   Disabled

 Local node
 Serial number (nodename)            |   X1310xxxxx8X84 (Node2)
 Current HA role                     |   Standalone
 Dedicated link's IP address         |   10.1.178.6
 Last status change                  |   09:41:15 PM, Jan 24, 2024

 Peer node
 Serial number (nodename)            |   X1310xxxxxBQ44 (Node1)
 Current HA role                     |   Fault
 Dedicated link's IP address         |   10.1.178.5
 Last status change                  |   09:41:15 PM, Jan 24, 2024
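
For anyone who wants to check the peer state from a script rather than by eye, here is a minimal sketch that parses the "key | value" layout of the output above. It assumes the output has already been captured as text; the function names `parse_ha_details` and `peer_in_fault` are hypothetical helpers for illustration, not part of any Sophos tool or API, and the field names are taken only from the output shown in this post.

```python
# Excerpt of the `system ha show details` output from this post.
SAMPLE = """
 HA details
 HA status                           |   Enabled
 HA mode                             |   Active-passive

 Local node
 Serial number (nodename)            |   X1310xxxxx8X84 (Node2)
 Current HA role                     |   Standalone

 Peer node
 Serial number (nodename)            |   X1310xxxxxBQ44 (Node1)
 Current HA role                     |   Fault
"""


def parse_ha_details(text):
    """Group 'key | value' lines under the section heading that precedes them."""
    sections = {}
    current = "HA details"  # assumed default if no heading is seen first
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "|" in line:
            key, value = (part.strip() for part in line.split("|", 1))
            sections.setdefault(current, {})[key] = value
        else:
            # A line without '|' is treated as a section heading.
            current = line
    return sections


def peer_in_fault(sections):
    """Return True if the peer node reports the HA role 'Fault'."""
    return sections.get("Peer node", {}).get("Current HA role") == "Fault"


if __name__ == "__main__":
    details = parse_ha_details(SAMPLE)
    print("Local role:", details["Local node"]["Current HA role"])
    print("Peer in fault:", peer_in_fault(details))
```

The parser only relies on the pipe separator and the unindented section headings, so it makes no assumption about which fields a given firmware version prints.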



Edited TAGs
[edited by: Erick Jan at 1:16 AM (GMT -8) on 25 Jan 2024]
  • LHerzog, 

    If you already waited 25 mins, you can power cycle the device now, no need to wait until tomorrow. 

    Yes, in some cases a power cycle is required, which is why we put it in the notice and KBA so customers are prepared for it.

  • If you already waited 25 mins, you can power cycle the device now, no need to wait until tomorrow. 

    First thing I would say is be careful about the timing. We had one of these and the XGS116 took 25 minutes to come back up after re-powering.

    On a more general point, yes, I read the KBA in detail and was aware of the "rare cases" clause! When I read that and looked at the four sites that would be affected, I thought about what we would do if the worst happened, but didn't expect it with only four units to upgrade.

    As it happened, after upgrading the units overnight, waiting for a successful recovery before starting the next one, the last unit failed to come back up. Having done this in the middle of the night, I wasn't best pleased to find out I would have to get up early (they start at 07:00) to talk them through re-powering the XGS. It was doubly unfortunate that unknown to me they had all the directors on site for a board meeting and as their servers are isolated from the workstations via the XGS, nothing was working.

    Anyway, having had my whinge, I do wonder if the instances of this are as "rare" as the KBA describes. I know my results are not statistically significant as the sample size is small, but I hope someone is considering whether "rare" was the correct adjective to use in the KBA. The fact that this was mentioned in the KBA at all makes me suspect that they saw enough instances of it in testing to warrant it. 's experience with two identical units also makes me suspect that the firmware update process is not as reliable as suggested in the KBA.

    I understand the need for this update and it was good that the KBA explained how to recover from the update if it went wrong but if there was a more realistic description of the risk level, we would have gone about this a different way.
