Important note about SSL VPN compatibility for 20.0 MR1 with EoL SFOS versions and UTM9 OS. Learn more in the release notes.

v21 HA Active passive - Aux node fails - system startup failed - fault state

I have a HA cluster where it happens now second time that the Aux node was not able to join the cluster or went into fault state after a time of Primary node being down.

First it happened while HA initial setup.

  fixed it via remote access. I don't know what was done to fix it.

Now the primary node was off 1 day. After it came back, it went primary again but standalone. The Aux node now has the same IP on Port1 (management) as the Primary node.

I went to the Aux node from the Primary using SSH and the HA IP:

This may be a known issue. But how can I fix it for the long term?

Text and topic changed because the problem is different



Added TAGs
[edited by: Erick Jan at 12:24 AM (GMT -7) on 21 Oct 2024]
Parents
  • This is not an issue. The peer administration IP is always an "alias" and not the IP of the interface itself.

    The interface will see naturally traffic but not respond to the packets. 

    RX: Received Packets. 

    TX: is very low. 

    __________________________________________________________________________________________________________________

  • so true. sorry, I was completely wrong here. I will reinstall that cluster once more. No idea what the HA issue is and cannot spend more time for debug.

  • Can you double check the SSMK, if you can setup the SSMk on both appliance, before doing the HA? 

    __________________________________________________________________________________________________________________

  • Again I cannot set up a reliable HA with these two machines. I don't know how I would ever use them in production.

    1. deleted HA on primary

    2. turned off node 2 and reimaged it with 21GA

    3. updated the Primary node 1 from EAP 21 to 21 GA

    Port1: 172.16.16.16/24; Port2: WAN DHCP; Port10 10.1.178.5/30

    4. prepared node 2 with initial setup, same admin pw and same SSMK:

    Port1: 172.16.16.17/24; Port2: WAN DHCP; Port10 10.1.178.6/30

    set the NTP server to default/enabled

    5. initiated HA interactive on node2

    6. initiated HA interactive on node1

    7. HA seems to be enabled and I could see the node 2 rebooting, and traffic on HA. There were also hauser logins on node1

  • One appliance was already replaced - correct? 
    About the network: 
    How are they connected ? 
    Are you using a cable between each other and what is on the other end of both appliances? 

    __________________________________________________________________________________________________________________

  • yes, one replaced.

    all locally, no switches involved. it's just the basic setup. Port10 directly connected. node1-node2

    from node 2:

    XGS136_XN02_SFOS 21.0.0 GA-Build169 HA-Auxiliary# tail /log/ha.log
    Oct 18 15:12:35Z [INFO] ha_state_notifier: opcode ha_http_sync_cb execution successfully for HA state transitions from Ready to Auxiliary.
    Oct 18 15:12:35Z [INFO] ha_state_notifier: Initiating opcode ha_iview_sync_cb for HA state transition from Ready to Auxiliary.
    Oct 18 15:12:36Z [INFO] ha_state_notifier: opcode ha_iview_sync_cb execution successfully for HA state transitions from Ready to Auxiliary.
    Oct 18 15:12:36Z [INFO] Execution of PRE-hook for HA state transition from Ready to Auxiliary has been completed.
    Oct 18 15:12:36Z [INFO] ha: starting system services
    Oct 18 15:12:48Z [INFO] ha: preempt: prim originialrole = 3 , aux originalrole = 0
    Oct 18 15:12:48Z [INFO] ha: preempt: setting originalrole to HA_AUX
    Oct 18 15:16:11Z [ERROR] ha: system startup failed !!!
    Oct 18 15:16:11Z [INFO] ha: unfreezing the peer csc
    Oct 18 15:16:11Z [INFO] ha: failsafe mode in aux state so treating it as fault state !!!

    XGS136_XN02_SFOS 21.0.0 GA-Build169 HA-Auxiliary# tail /log/msync.log
    Fri Oct 18 15:12:47 2024:526280Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/sh /scripts/licensing/lic_ge', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:47 2024:531018Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/syncfile /tmp/peer_lic_ha ', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:47 2024:779781Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/rm -r -f /tmp/peer_lic_ha ', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:47 2024:785065Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/syncfile /conf/sysfiles/hots', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:49 2024:812420Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/sbin/pg_dump -U pgroot signature', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:53 2024:050807Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/sbin/pg_dump -p 5433 -U pgrouser', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:53 2024:056495Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/syncfile /tmp/tblspxdetails ', cli_fd 10, serv_fd 11
    Fri Oct 18 15:12:53 2024:068306Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/rm -f /tmp/tblspxdetails ', cli_fd 10, serv_fd 11
    Fri Oct 18 15:13:06 2024:888194Z:2050:GTM:BACK:INFO:sync_entity.c:584sesid 24: msg MSG_SET_DLNOTRACK:1
    Fri Oct 18 15:16:11 2024:111152Z:2050:GTM:BACK:ERROR:event.c:572: error found for cmd '/bin/csc custom unfreeze ', cli_fd 10, serv_fd 11

  • is it just a license issue due to one node replaced? because there are lots of lines about licensing.

  • Aux node:

    garner is stopped

    reboot firewall

    ha.log

    XGS136_XN02_SFOS 21.0.0 GA-Build169 HA-Auxiliary# tail /log/ha.log -n 100       
    Oct 18 15:12:48Z [INFO] ha: preempt: setting originalrole to HA_AUX             
    Oct 18 15:16:11Z [ERROR] ha: system startup failed !!!                          
    Oct 18 15:16:11Z [INFO] ha: unfreezing the peer csc                             
    Oct 18 15:16:11Z [INFO] ha: failsafe mode in aux state so treating it as fault state !!!  
    
    reboot                                                                      
    
    Oct 21 07:10:58Z[INFO] ha: sysstop called                                       
    Jan 01 00:00:49Z [INFO] ha: Booting up in XGS136_XN02_SFOS 21.0.0 GA-Build169   
                                                                                    
    Jan 01 00:00:50Z [INFO] msync:before_start iteration '1' to check halink up, retry till '30'                                                                    
    Jan 01 00:00:50Z [INFO] msync:before_start Port10:                              
    Port10    Link encap:Ethernet  HWaddr C8:4F:86:A8:B1:29                         
              inet addr:10.1.178.6  Bcast:10.1.178.7  Mask:255.255.255.252          
              inet6 addr: fe80::ca4f:86ff:fea8:b129/64 Scope:Link                   
              UP BROADCAST MULTICAST  MTU:1500  Metric:1                            
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0                    
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0                  
              collisions:0 txqueuelen:1000                                          
              RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                                
                                                                                    
                                                                                    
    Jan 01 00:00:51Z [INFO] msync:before_start iteration '2' to check halink up, retry till '30'                                                                    
    Jan 01 00:00:51Z [INFO] msync:before_start Port10:                              
    Port10    Link encap:Ethernet  HWaddr C8:4F:86:A8:B1:29                         
              inet addr:10.1.178.6  Bcast:10.1.178.7  Mask:255.255.255.252          
              inet6 addr: fe80::ca4f:86ff:fea8:b129/64 Scope:Link                   
              UP BROADCAST MULTICAST  MTU:1500  Metric:1                            
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0                    
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0                  
              collisions:0 txqueuelen:1000                                          
              RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                                
                                                                                    
                                                                                    
    Jan 01 00:00:52Z [INFO] msync:before_start iteration '3' to check halink up, retry till '30'                                                                    
    Jan 01 00:00:52Z [INFO] msync:before_start Port10:                              
    Port10    Link encap:Ethernet  HWaddr C8:4F:86:A8:B1:29                         
              inet addr:10.1.178.6  Bcast:10.1.178.7  Mask:255.255.255.252          
              inet6 addr: fe80::ca4f:86ff:fea8:b129/64 Scope:Link                   
              UP BROADCAST MULTICAST  MTU:1500  Metric:1                            
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0                    
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0                  
              collisions:0 txqueuelen:1000                                          
              RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                                
                                                                                    
                                                                                    
    Jan 01 00:00:53Z [INFO] msync:before_start iteration '4' to check halink up, retry till '30'                                                                    
    Jan 01 00:00:53Z [INFO] msync:before_start Port10:                              
    Port10    Link encap:Ethernet  HWaddr C8:4F:86:A8:B1:29                         
              inet addr:10.1.178.6  Bcast:10.1.178.7  Mask:255.255.255.252          
              inet6 addr: fe80::ca4f:86ff:fea8:b129/64 Scope:Link                   
              UP BROADCAST MULTICAST  MTU:1500  Metric:1                            
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0                    
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0                  
              collisions:0 txqueuelen:1000                                          
              RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                                
                                                                                    
                                                                                    
    Jan 01 00:00:55Z [INFO] msync:before_start iteration '5' to check halink up, retry till '30'                                                                    
    Jan 01 00:00:55Z [INFO] msync:before_start Port10:                              
    Port10    Link encap:Ethernet  HWaddr C8:4F:86:A8:B1:29                         
              inet addr:10.1.178.6  Bcast:10.1.178.7  Mask:255.255.255.252          
              inet6 addr: fe80::ca4f:86ff:fea8:b129/64 Scope:Link                   
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1                    
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0                    
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0                  
              collisions:0 txqueuelen:1000                                          
              RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                                
                                                                                    
                                                                                    
    Jan 01 00:00:59Z [INFO] ha: handle_stat_change: 0:5 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]                                                   
    Jan 01 00:00:59Z [INFO] ha: handle_stat_change: g_ha_hsc=1 is set.              
    Jan 01 00:00:59Z [INFO] ha: handle_stat_change: 0:5 done.                       
    Jan 01 00:00:59Z [INFO] ha: handle_stat_change: g_ha_hsc=0 is set.              
    Jan 01 00:01:01Z [INFO] ha: handle_stat_change: 5:1 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]                                                   
    Jan 01 00:01:01Z [INFO] ha: handle_stat_change: g_ha_hsc=1 is set.              
    Jan 01 00:01:01Z [INFO] ha: g_ha_transmode=2 [ CONFIG=1 INIT=2 EVENT=0 ]        
    Jan 01 00:01:01Z [INFO] ha: peer is running same firmware version (own=21_0_0_169 and peer=21_0_0_169), so syncing from peer                                    
    Jan 01 00:01:01Z [INFO] Execution of PRE-hook for HA state transition from Ready to Auxiliary has begun.                                                        
    Jan 01 00:01:01Z [INFO] ha_state_notifier: Initiating opcode ha_ssmk_enc_check_cb for HA state transition from Ready to Auxiliary.                              
    Jan 01 00:01:01Z [INFO] ha_state_notifier: opcode ha_ssmk_enc_check_cb execution successfully for HA state transitions from Ready to Auxiliary.                 
    Jan 01 00:01:01Z [INFO] ha_state_notifier: Initiating opcode ha_freeze_cb for HA state transition from Ready to Auxiliary.                                      
    Jan 01 00:01:01Z [INFO] ha: freezing the peer csc done in retry 0               
    Jan 01 00:01:01Z [INFO] ha_state_notifier: opcode ha_freeze_cb execution successfully for HA state transitions from Ready to Auxiliary.                         
    Jan 01 00:01:01Z [INFO] ha_state_notifier: Initiating opcode ha_atr_file_sync_cb for HA state transition from Ready to Auxiliary.                               
    Jan 01 00:01:01Z [INFO] ha_state_notifier: opcode ha_atr_file_sync_cb execution successfully for HA state transitions from Ready to Auxiliary.                  
    Jan 01 00:01:01Z [INFO] ha_state_notifier: Initiating opcode ha_date_time_cb for HA state transition from Ready to Auxiliary.                                   
    Jan 01 00:01:02Z [ERROR] ha_atr_file_sync_cb: HA state notifier opcode gets failed.                                                                             
    Oct 21 07:12:34Z [INFO] ha_state_notifier: opcode ha_date_time_cb execution successfully for HA state transitions from Ready to Auxiliary.                      
    Oct 21 07:12:34Z [INFO] ha_state_notifier: Initiating opcode ha_admin_secret_cb for HA state transition from Ready to Auxiliary.                                
    Oct 21 07:12:34Z [INFO] ha_state_notifier: opcode ha_admin_secret_cb execution successfully for HA state transitions from Ready to Auxiliary.                   
    Oct 21 07:12:34Z [INFO] ha_state_notifier: Initiating opcode ha_certificate_sync_cb for HA state transition from Ready to Auxiliary.                            
    Oct 21 07:12:35Z [INFO] ha_state_notifier: opcode ha_certificate_sync_cb execution successfully for HA state transitions from Ready to Auxiliary.               
    Oct 21 07:12:35Z [INFO] ha_state_notifier: Initiating opcode ha_http_sync_cb for HA state transition from Ready to Auxiliary.                                   
    Oct 21 07:12:35Z [INFO] ha_state_notifier: opcode ha_http_sync_cb execution successfully for HA state transitions from Ready to Auxiliary.                      
    Oct 21 07:12:35Z [INFO] ha_state_notifier: Initiating opcode ha_iview_sync_cb for HA state transition from Ready to Auxiliary.                                  
    Oct 21 07:12:36Z [INFO] ha_state_notifier: opcode ha_iview_sync_cb execution successfully for HA state transitions from Ready to Auxiliary.                     
    Oct 21 07:12:36Z [INFO] Execution of PRE-hook for HA state transition from Ready to Auxiliary has been completed.                                               
    Oct 21 07:12:36Z [INFO] ha: starting system services                            
    Oct 21 07:12:47Z [INFO] ha: preempt: prim originialrole = 3 , aux originalrole = 1                                                                              
    Oct 21 07:16:10Z [ERROR] ha: system startup failed !!!                          
    Oct 21 07:16:10Z [INFO] ha: unfreezing the peer csc                             
    Oct 21 07:16:10Z [INFO] ha: failsafe mode in aux state so treating it as fault state !!!                                                                        
    XGS136_XN02_SFOS 21.0.0 GA-Build169 HA-Auxiliary#                               
    

    support suggests to disable HA on the aux node, check if garner is running then and then to re-initiate HA.

    I'm afraid this will not help but try that.

  • just following up here in case someone has similar issues:

    after disabling HA on the node1 and rebooting both nodes, node1 was expectedly without HA config while node2 has then switched from HA Auxiliary mode to HA Standalone. BUT: garner was now starting no node2.

    Webadmin was now re-accessible and also the machine was showing as online in Sophos Central again and we could disable HA also on node2 this way.

    Support put csc debugs and log persistance on it and we re-enabled HA in exactly the same config mode as I did before.

    We did that for both appliances via Sophos Central access.

    After establishing HA on both nodes in interactive mode the exact same issue happened again.

    node1: status Standalone showing node2 as "not available"

    node2: status Auxiliary not able to start garner (Starting for long time) and not able to connect to node1

    Jan 01 00:01:01Z [INFO] ha_state_notifier: opcode ha_atr_file_sync_cb execution 
    successfully for HA state transitions from Ready to Auxiliary.                  
    Jan 01 00:01:01Z [INFO] ha_state_notifier: Initiating opcode ha_date_time_cb for
     HA state transition from Ready to Auxiliary.                                   
    Jan 01 00:01:02Z [ERROR] ha_atr_file_sync_cb: HA state notifier opcode gets fail
    ed.                                                                             
    Oct 21 10:13:49Z [INFO] ha_state_notifier: opcode ha_date_time_cb execution succ
    essfully for HA state transitions from Ready to Auxiliary.                      
    Oct 21 10:13:49Z [INFO] ha_state_notifier: Initiating opcode ha_admin_secret_cb 
    for HA state transition from Ready to Auxiliary.                                
    Oct 21 10:13:49Z [INFO] ha_state_notifier: opcode ha_admin_secret_cb execution s
    uccessfully for HA state transitions from Ready to Auxiliary.                   
    Oct 21 10:13:49Z [INFO] ha_state_notifier: Initiating opcode ha_certificate_sync
    _cb for HA state transition from Ready to Auxiliary.                            
    Oct 21 10:13:50Z [INFO] ha_state_notifier: opcode ha_certificate_sync_cb executi
    on successfully for HA state transitions from Ready to Auxiliary.               
    Oct 21 10:13:50Z [INFO] ha_state_notifier: Initiating opcode ha_http_sync_cb for
     HA state transition from Ready to Auxiliary.                                   
    Oct 21 10:13:50Z [INFO] ha_state_notifier: opcode ha_http_sync_cb execution succ
    essfully for HA state transitions from Ready to Auxiliary.                      
    Oct 21 10:13:50Z [INFO] ha_state_notifier: Initiating opcode ha_iview_sync_cb fo
    r HA state transition from Ready to Auxiliary.                                  
    Oct 21 10:13:51Z [INFO] ha_state_notifier: opcode ha_iview_sync_cb execution suc
    cessfully for HA state transitions from Ready to Auxiliary.                     
    Oct 21 10:13:51Z [INFO] Execution of PRE-hook for HA state transition from Ready
     to Auxiliary has been completed.                                               
    Oct 21 10:13:51Z [INFO] ha: starting system services                            
    Oct 21 10:14:02Z [INFO] ha: preempt: prim originialrole = 3 , aux originalrole =
     0                                                                              
    Oct 21 10:14:02Z [INFO] ha: preempt: setting originalrole to HA_AUX             
    Oct 21 10:17:27Z [ERROR] ha: system startup failed !!!                          
    Oct 21 10:17:27Z [INFO] ha: unfreezing the peer csc                             
    Oct 21 10:17:27Z [INFO] ha: failsafe mode in aux state so treating it as fault s
    tate !!!                                                                        
    XGS136_XN02_SFOS 21.0.0 GA-Build169 HA-Auxiliary# 

    node1: Port1: 172.16.16.16/24; Port2: WAN DHCP; Port10 10.1.178.5/30

    node2: Port1: 172.16.16.16/24; Port2: WAN DHCP; Port10 10.1.178.6/30

    Support will now try to find issues in the logs a Support level up.

  • Current solution:

    node2 garner service revived after support deleted old garner database reports (or logs?) on node2.

    garner was able to start now.

    node2 was rebooted and HA was successfully established.

    _

    before the database modifications, there were also a lot of database related errors in csc.log like this:

    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: execute_prepare_query:DB handle returned from perl is not OK.                                                         
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status: DB has returned error code: 1                                                                       
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status:Query Error: connection to server at "127.0.0.1", port 5434 failed: Connection refused               
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: csc_prep_query: execute_prepare_query failed for SELECT txid_current().                                               
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: execute_prepare_query:DB handle returned from perl is not OK.                                                         
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status: DB has returned error code: 1                                                                       
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status:Query Error: connection to server at "127.0.0.1", port 5434 failed: Connection refused               
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: do_prep_query: Failed PREPSTMT: 'SELECT CURRENT_TIMESTAMP(3) AT TIME ZONE 'GMT' || ' GMT''                            
    ERROR     Oct 21 07:12:49Z  [sigdb:1807]: csc_execve: Child exited with status 1

Reply
  • Current solution:

    node2 garner service revived after support deleted old garner database reports (or logs?) on node2.

    garner was able to start now.

    node2 was rebooted and HA was successfully established.

    _

    before the database modifications, there were also a lot of database related errors in csc.log like this:

    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: execute_prepare_query:DB handle returned from perl is not OK.                                                         
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status: DB has returned error code: 1                                                                       
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status:Query Error: connection to server at "127.0.0.1", port 5434 failed: Connection refused               
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: csc_prep_query: execute_prepare_query failed for SELECT txid_current().                                               
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: execute_prepare_query:DB handle returned from perl is not OK.                                                         
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status: DB has returned error code: 1                                                                       
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status:Query Error: connection to server at "127.0.0.1", port 5434 failed: Connection refused               
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: do_prep_query: Failed PREPSTMT: 'SELECT CURRENT_TIMESTAMP(3) AT TIME ZONE 'GMT' || ' GMT''                            
    ERROR     Oct 21 07:12:49Z  [sigdb:1807]: csc_execve: Child exited with status 1

Children
No Data