
v21 HA Active passive - Aux node fails - system startup failed - fault state

I have an HA cluster where, for the second time now, the Aux node was unable to join the cluster or went into a fault state after the Primary node had been down for a while.

The first time, it happened during the initial HA setup.

It was fixed via remote access; I don't know what was done to fix it.

This time, the Primary node was off for one day. After it came back, it became primary again, but as standalone. The Aux node now has the same IP on Port1 (management) as the Primary node.

I went to the Aux node from the Primary using SSH and the HA IP:

This may be a known issue. But how can I fix it for the long term?

Edit: subject and text changed because the issue is not related to the management IP address.
[edited by: LHerzog at 4:05 PM (GMT -7) on 18 Oct 2024]
  • This is not an issue. The peer administration IP is always an "alias" and not the IP of the interface itself.

    The interface will naturally see traffic, but it will not respond to the packets.

    RX shows received packets, while TX is very low.


  • So true. Sorry, I was completely wrong here. I will reinstall that cluster once more. I have no idea what the HA issue is and cannot spend more time debugging it.

  • Can you double-check the SSMK? Can you set up the SSMK on both appliances before configuring HA?


  • Again, I cannot set up a reliable HA with these two machines. I don't know how I would ever use them in production.

    1. Deleted HA on the primary.

    2. Turned off node 2 and reimaged it with 21 GA.

    3. Updated the primary (node 1) from EAP 21 to 21 GA.

    Port1: 172.16.16.16/24; Port2: WAN DHCP; Port10: 10.1.178.5/30

    4. Prepared node 2 with the initial setup, the same admin password, and the same SSMK:

    Port1: 172.16.16.17/24; Port2: WAN DHCP; Port10: 10.1.178.6/30

    Set the NTP server to default/enabled.

    5. Initiated HA in interactive mode on node 2.

    6. Initiated HA in interactive mode on node 1.

    7. HA seems to be enabled; I could see node 2 rebooting and traffic on the HA link. There were also hauser logins on node 1.
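
To rule out a simple addressing mistake in the steps above, the listed addresses can be checked offline. The following is a minimal sketch in Python, meant to run on a workstation and not on the firewall. The `nodes` dictionary and the `check_port` helper are names made up for this illustration; the only data used is the Port1 and Port10 addresses quoted in the steps.

```python
import ipaddress

# Addresses exactly as listed in the setup steps (Port2 is WAN DHCP, so it is skipped).
nodes = {
    "node1": {"Port1": "172.16.16.16/24", "Port10": "10.1.178.5/30"},
    "node2": {"Port1": "172.16.16.17/24", "Port10": "10.1.178.6/30"},
}

def check_port(port):
    """Return (same_subnet, distinct_addresses, both_usable_hosts) for one port."""
    a = ipaddress.ip_interface(nodes["node1"][port])
    b = ipaddress.ip_interface(nodes["node2"][port])
    same_subnet = a.network == b.network
    distinct = a.ip != b.ip
    # hosts() excludes the network and broadcast addresses of the subnet.
    usable = set(a.network.hosts())
    both_usable = a.ip in usable and b.ip in usable
    return same_subnet, distinct, both_usable

for port in ("Port1", "Port10"):
    print(port, check_port(port))
# Port1 (True, True, True)
# Port10 (True, True, True)
```

Both ports pass, so the addressing shown in the steps is at least self-consistent: the two Port10 addresses are the two usable hosts of 10.1.178.4/30. This says nothing about cabling or about what the appliances actually have configured.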

  • One appliance was already replaced, correct?
    About the network:
    How are they connected?
    Is there a cable directly between the two, and what is on the other end of each appliance's remaining ports?


  • Yes, one was replaced.

    Everything is local, with no switches involved. It's just the basic setup, with Port10 directly connected from node 1 to node 2.

    From node 2:

    XGS136_XN02_SFOS 21.0.0 GA-Build169 HA-Auxiliary# tail /log/ha.log
    Oct 18 15:12:35Z [INFO] ha_state_notifier: opcode ha_http_sync_cb execution successfully for HA state transitions from Ready to Auxiliary.
    Oct 18 15:12:35Z [INFO] ha_state_notifier: Initiating opcode ha_iview_sync_cb for HA state transition from Ready to Auxiliary.
    Oct 18 15:12:36Z [INFO] ha_state_notifier: opcode ha_iview_sync_cb execution successfully for HA state transitions from Ready to Auxiliary.
    Oct 18 15:12:36Z [INFO] Execution of PRE-hook for HA state transition from Ready to Auxiliary has been completed.
    Oct 18 15:12:36Z [INFO] ha: starting system services
    Oct 18 15:12:48Z [INFO] ha: preempt: prim originialrole = 3 , aux originalrole = 0
    Oct 18 15:12:48Z [INFO] ha: preempt: setting originalrole to HA_AUX
    Oct 18 15:16:11Z [ERROR] ha: system startup failed !!!
    Oct 18 15:16:11Z [INFO] ha: unfreezing the peer csc
    Oct 18 15:16:11Z [INFO] ha: failsafe mode in aux state so treating it as fault state !!!

    XGS136_XN02_SFOS 21.0.0 GA-Build169 HA-Auxiliary# tail /log/msync.log
    Fri Oct 18 15:12:47 2024:526280Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/sh /scripts/licensing/lic_ge', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:47 2024:531018Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/syncfile /tmp/peer_lic_ha ', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:47 2024:779781Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/rm -r -f /tmp/peer_lic_ha ', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:47 2024:785065Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/syncfile /conf/sysfiles/hots', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:49 2024:812420Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/sbin/pg_dump -U pgroot signature', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:53 2024:050807Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/sbin/pg_dump -p 5433 -U pgrouser', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:53 2024:056495Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/syncfile /tmp/tblspxdetails ', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:53 2024:068306Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/rm -f /tmp/tblspxdetails ', cli_fd 10, serv_fd 11
    Fri Oct 18 15:13:06 2024:888194Z:2050:GTM:BACK:INFO:sync_entity.c:584sesid 24: msg MSG_SET_DLNOTRACK:1
    Fri Oct 18 15:16:11 2024:111152Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/csc custom unfreeze ', cli_fd 10, serv_fd 11
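
To see how long the Aux node waited before giving up, the ha.log timestamps above can be compared. This is a rough sketch in Python that runs on a copy of the log lines, not on the appliance. `HA_LOG`, `LINE_RE` and `parse` are names invented for the example, the line format is inferred only from the excerpt above, and the year 2024 is taken from the msync.log timestamps because ha.log does not print one.

```python
import re
from datetime import datetime

# A few lines copied verbatim from the ha.log excerpt above.
HA_LOG = """\
Oct 18 15:12:36Z [INFO] ha: starting system services
Oct 18 15:12:48Z [INFO] ha: preempt: setting originalrole to HA_AUX
Oct 18 15:16:11Z [ERROR] ha: system startup failed !!!
Oct 18 15:16:11Z [INFO] ha: failsafe mode in aux state so treating it as fault state !!!
"""

# Assumed format: "<Mon> <day> <HH:MM:SS>Z [<LEVEL>] <message>"
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2})Z \[(\w+)\] (.*)$")

def parse(text, year=2024):
    """Return a list of (timestamp, level, message) tuples for matching lines."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like an ha.log line
        stamp, level, message = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        entries.append((when, level, message))
    return entries

entries = parse(HA_LOG)
start = next(t for t, _, msg in entries if "starting system services" in msg)
failed = next(t for t, lvl, msg in entries
              if lvl == "ERROR" and "system startup failed" in msg)
elapsed = (failed - start).total_seconds()
print(f"startup failed {elapsed:.0f} s after services were started")
# startup failed 215 s after services were started
```

The gap is about three and a half minutes. Whether that corresponds to a fixed startup timeout is only a guess; the log excerpt itself does not say.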

  • Is it just a license issue because one node was replaced? There are a lot of lines about licensing.
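
One way to put a number on that impression is to list the commands that msync.log reports errors for and see how many look license-related. The sketch below is Python run against a copy of the log, with `MSYNC_LOG`, `CMD_RE`, `commands` and `license_related` being names made up for the example. Matching on the substring "lic" is a crude heuristic and an assumption, and the command strings are kept truncated exactly as the log prints them.

```python
import re

# ERROR lines copied verbatim from the msync.log excerpt above.
MSYNC_LOG = """\
Fri Oct 18 15:12:47 2024:526280Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/sh /scripts/licensing/lic_ge', cli_fd 10, serv_fd 11
Fri Oct 18 15:12:47 2024:531018Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/syncfile /tmp/peer_lic_ha ', cli_fd 10, serv_fd 11
Fri Oct 18 15:12:47 2024:779781Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/rm -r -f /tmp/peer_lic_ha ', cli_fd 10, serv_fd 11
Fri Oct 18 15:12:47 2024:785065Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/syncfile /conf/sysfiles/hots', cli_fd 10, serv_fd 11
Fri Oct 18 15:12:49 2024:812420Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/sbin/pg_dump -U pgroot signature', cli_fd 10, serv_fd 11
Fri Oct 18 15:12:53 2024:050807Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/sbin/pg_dump -p 5433 -U pgrouser', cli_fd 10, serv_fd 11
Fri Oct 18 15:12:53 2024:056495Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/syncfile /tmp/tblspxdetails ', cli_fd 10, serv_fd 11
Fri Oct 18 15:12:53 2024:068306Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/rm -f /tmp/tblspxdetails ', cli_fd 10, serv_fd 11
Fri Oct 18 15:16:11 2024:111152Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/csc custom unfreeze ', cli_fd 10, serv_fd 11
"""

# Pull out the quoted command from each "error found for cmd '...'" line.
CMD_RE = re.compile(r"error found for cmd '([^']*)'")

commands = [c.strip() for c in CMD_RE.findall(MSYNC_LOG)]
# Heuristic (assumption): a command is license-related if it contains "lic".
license_related = [c for c in commands if "lic" in c.lower()]

print(f"{len(license_related)} of {len(commands)} failing commands mention licensing")
for cmd in commands:
    marker = "LIC" if cmd in license_related else "   "
    print(marker, cmd)
# first line printed: 3 of 9 failing commands mention licensing
```

Only three of the nine failing commands touch licensing; the rest are file syncs, database dumps, and the final unfreeze. So in this excerpt the errors are not limited to licensing, which would fit a more general sync problem, though the log alone does not prove either reading.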