v21 HA Active passive - Aux node fails - system startup failed - fault state

I have an HA cluster where, for the second time now, the Aux node either could not join the cluster or went into a fault state after the Primary node had been down for a while.

The first time, it happened during the initial HA setup.

It was fixed via remote access. I don't know what was done to fix it.

This time, the Primary node was off for one day. After it came back up, it became primary again, but as standalone. The Aux node now has the same IP on Port1 (management) as the Primary node.

I went to the Aux node from the Primary using SSH and the HA IP:

This may be a known issue, but how can I fix it for the long term?

Text and topic changed because the problem is different



Added TAGs
[edited by: Erick Jan at 12:24 AM (GMT -7) on 21 Oct 2024]
  • Current solution:

    The garner service on node2 came back after support deleted old garner database reports (or logs?) on node2.

    garner was then able to start.

    node2 was rebooted and HA was successfully established.

    Before the database modifications, there were also many database-related errors in csc.log, like these:

    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: execute_prepare_query:DB handle returned from perl is not OK.                                                         
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status: DB has returned error code: 1                                                                       
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status:Query Error: connection to server at "127.0.0.1", port 5434 failed: Connection refused               
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: csc_prep_query: execute_prepare_query failed for SELECT txid_current().                                               
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: execute_prepare_query:DB handle returned from perl is not OK.                                                         
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status: DB has returned error code: 1                                                                       
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status:Query Error: connection to server at "127.0.0.1", port 5434 failed: Connection refused               
    ERROR     Oct 21 07:12:48Z  [sigdb:1807]: do_prep_query: Failed PREPSTMT: 'SELECT CURRENT_TIMESTAMP(3) AT TIME ZONE 'GMT' || ' GMT''                            
    ERROR     Oct 21 07:12:49Z  [sigdb:1807]: csc_execve: Child exited with status 1
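
For anyone who wants to check whether their own csc.log shows the same pattern, here is a minimal Python sketch that counts ERROR lines per process and tallies "Connection refused" targets. It assumes the line layout seen in the excerpt above (level, timestamp, `[process:pid]:`, message). The function name `summarize` and the `SAMPLE` list are made up for illustration, and the script is meant to run on a copy of the log on a workstation, not on the firewall itself.

```python
import re
from collections import Counter

# Line layout assumed from the excerpt above:
#   ERROR     Oct 21 07:12:48Z  [sigdb:1807]: <message>
LINE_RE = re.compile(
    r'^\s*(?P<level>[A-Z]+)\s+'
    r'(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}Z)\s+'
    r'\[(?P<proc>[^:\]]+):(?P<pid>\d+)\]:\s*'
    r'(?P<msg>.*?)\s*$'
)

# The database client error text as it appears in the excerpt.
REFUSED_RE = re.compile(
    r'connection to server at "(?P<host>[^"]+)", '
    r'port (?P<port>\d+) failed: Connection refused'
)


def summarize(lines):
    """Return (ERROR count per process, refused-connection count per (host, port))."""
    per_process = Counter()
    refused = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("level") != "ERROR":
            continue  # skip lines that do not match the assumed layout
        per_process[m.group("proc")] += 1
        r = REFUSED_RE.search(m.group("msg"))
        if r:
            refused[(r.group("host"), int(r.group("port")))] += 1
    return per_process, refused


# Three lines copied from the excerpt above (trailing padding removed).
SAMPLE = [
    'ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status: DB has returned error code: 1',
    'ERROR     Oct 21 07:12:48Z  [sigdb:1807]: get_query_status:Query Error: '
    'connection to server at "127.0.0.1", port 5434 failed: Connection refused',
    'ERROR     Oct 21 07:12:49Z  [sigdb:1807]: csc_execve: Child exited with status 1',
]

if __name__ == "__main__":
    per_process, refused = summarize(SAMPLE)
    print("ERROR lines per process:", dict(per_process))
    print("Refused connections:", dict(refused))
```

To run it against a real file, replace `SAMPLE` with the lines of a copied log, for example `open("csc.log", encoding="utf-8", errors="replace")`. A high refused count for a single host and port only shows that the local database was unreachable at that time; it does not by itself say why.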