
v21 HA active-passive: Aux node fails with "system startup failed" and goes into fault state

I have an HA cluster where, for the second time now, the Aux node was unable to join the cluster, or went into fault state, after the Primary node had been down for a while.

The first time was during the initial HA setup.

  fixed it via remote access. I don't know what was done to fix it.

This time the Primary node was off for one day. When it came back, it became Primary again, but in standalone mode. The Aux node now has the same IP address on Port1 (management) as the Primary node.

I connected to the Aux node from the Primary via SSH, using the HA IP address:

This may be a known issue, but how can I fix it permanently?

Subject and text changed because the problem turned out to be different: it is not related to the management IP address.
[edited by LHerzog at 4:05 PM (GMT -7) on 18 Oct 2024]
  • Yes, one node was replaced.

    Everything is local, with no switches involved. It's just the basic setup: Port10 connected directly, node1 to node2.

    From node 2:

    XGS136_XN02_SFOS 21.0.0 GA-Build169 HA-Auxiliary# tail /log/ha.log
    Oct 18 15:12:35Z [INFO] ha_state_notifier: opcode ha_http_sync_cb execution successfully for HA state transitions from Ready to Auxiliary.
    Oct 18 15:12:35Z [INFO] ha_state_notifier: Initiating opcode ha_iview_sync_cb for HA state transition from Ready to Auxiliary.
    Oct 18 15:12:36Z [INFO] ha_state_notifier: opcode ha_iview_sync_cb execution successfully for HA state transitions from Ready to Auxiliary.
    Oct 18 15:12:36Z [INFO] Execution of PRE-hook for HA state transition from Ready to Auxiliary has been completed.
    Oct 18 15:12:36Z [INFO] ha: starting system services
    Oct 18 15:12:48Z [INFO] ha: preempt: prim originialrole = 3 , aux originalrole = 0
    Oct 18 15:12:48Z [INFO] ha: preempt: setting originalrole to HA_AUX
    Oct 18 15:16:11Z [ERROR] ha: system startup failed !!!
    Oct 18 15:16:11Z [INFO] ha: unfreezing the peer csc
    Oct 18 15:16:11Z [INFO] ha: failsafe mode in aux state so treating it as fault state !!!

    XGS136_XN02_SFOS 21.0.0 GA-Build169 HA-Auxiliary# tail /log/msync.log
    Fri Oct 18 15:12:47 2024:526280Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/sh /scripts/licensing/lic_ge', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:47 2024:531018Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/syncfile /tmp/peer_lic_ha ', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:47 2024:779781Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/rm -r -f /tmp/peer_lic_ha ', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:47 2024:785065Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/syncfile /conf/sysfiles/hots', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:49 2024:812420Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/sbin/pg_dump -U pgroot signature', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:53 2024:050807Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/sbin/pg_dump -p 5433 -U pgrouser', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:53 2024:056495Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/syncfile /tmp/tblspxdetails ', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:53 2024:068306Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/rm -f /tmp/tblspxdetails ', cli_fd 10, serv_fd 11
    Fri Oct 18 15:13:06 2024:888194Z:2050:GTM:BACK:INFO:sync_entity.c:584sesid 24: msg MSG_SET_DLNOTRACK:1
    Fri Oct 18 15:16:11 2024:111152Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/csc custom unfreeze ', cli_fd 10, serv_fd 11

  • Is it just a licensing issue because one node was replaced? There are a lot of licensing-related lines in the log.
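One way to check how much of the msync.log output is actually about licensing is to bucket the failed commands. The sketch below is a hypothetical helper, not a Sophos tool. It assumes the line format shown in the msync.log excerpt above, and the keyword buckets are my own heuristic. Note that the commands are already truncated in the log itself (e.g. `lic_ge`, `hots`), so it only matches on fragments. On the excerpt above it counts 3 licensing-related failures, 4 database-related ones (`pg_dump`, `tblspxdetails`), 1 file sync and 1 service-control failure. In this excerpt, then, licensing accounts for only part of the failed sync steps.

```python
import re
from collections import Counter

# Matches the msync.log error lines from the excerpt, e.g.
# "...:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/rm -f /tmp/x ', cli_fd 10, serv_fd 11"
ERROR_RE = re.compile(r":ERROR:.*?error found for cmd '([^']*)'")

# Hypothetical keyword buckets (my own heuristic, not a Sophos classification).
# First match wins, so order matters: "/bin/syncfile /tmp/peer_lic_ha" should
# count as licensing, not as a generic file sync.
CATEGORIES = [
    ("licensing", ("licens", "lic_ha")),
    ("database", ("pg_dump", "tblspxdetails")),
    ("service control", ("/bin/csc",)),
    ("file sync", ("syncfile",)),
]


def categorize(cmd: str) -> str:
    """Return the first bucket whose keyword appears in the command."""
    for name, keywords in CATEGORIES:
        if any(k in cmd for k in keywords):
            return name
    return "other"


def summarize_sync_errors(lines) -> Counter:
    """Count failed msync commands per bucket; non-ERROR lines are ignored."""
    counts = Counter()
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            counts[categorize(m.group(1).strip())] += 1
    return counts
```

To run it on a copy of the full log taken off the appliance: `with open("msync.log") as f: print(summarize_sync_errors(f).most_common())`.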