HA_pair service stopped

I tried to use Quick HA to pair a set of XG210s, using port 6 and then port 3, but they would not pair.

I was able to manually set the HA on port 6, but the HA_pair service continues to warn it is stopped.

The units are paired.

I see in the log that it warns SSH could not be configured:

ha_pair.log

sleeping...
...done
{ "status": "544", "statusmessage": "configure SSH failed" }
sleeping...
...done
{ "status": "544", "statusmessage": "configure SSH failed" }
sleeping...
...done
{ "statusmessage": "configure SSH failed", "status": "544" }
sleeping...
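
For anyone who wants to tally these entries rather than eyeball them, here is a minimal Python sketch. It assumes ha_pair.log mixes plain-text lines with one JSON object per line, as in the excerpt above. The function name `summarize_ha_pair_log` is made up for illustration and is not part of SFOS.

```python
import json
from collections import Counter


def summarize_ha_pair_log(text):
    """Count (status, statusmessage) pairs among the JSON lines of a log excerpt."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip the "sleeping..." and "...done" lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that only look like JSON
        counts[(entry.get("status"), entry.get("statusmessage"))] += 1
    return counts


# Sample copied from the excerpt above (key order differs between entries).
SAMPLE = """sleeping...
...done
{ "status": "544", "statusmessage": "configure SSH failed" }
sleeping...
...done
{ "statusmessage": "configure SSH failed", "status": "544" }
sleeping...
"""

if __name__ == "__main__":
    for (status, message), n in summarize_ha_pair_log(SAMPLE).items():
        print(f"{n}x status={status}: {message}")
```

Parsing each line as JSON instead of matching on raw text means the entries are counted together even though the key order changes between them.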

Anyone know a way to clear the warning?

 

Thanks,

Paul 

  • I wouldn't hold your breath: HA has not worked for me since EAP1, and Sophos shows no interest in fixing what they admitted was a bug, even though they keep enabling more HA features with each EAP release. It all makes no sense to me. Kind of like building a house on a broken foundation. I am about ready to give up and go back to 17.

  • Hi Paul, 

    The ha_pair service is responsible for setting up QuickHA between two nodes. It configures the DMZ port and assigns the IP address, enables SSH on the DMZ zone, sets up the SSH OTP (passphrase), generates the SSH key pair, connects to the auxiliary node and exchanges the SSH key pair with it, and then enables HA.

    From the logs, it seems the primary node is unable to configure SSH with the auxiliary node. To narrow down the issue, we would need more details about the deployment steps you followed to set up QuickHA. Please contact us via private message so we can assist you further.

     

    Thanks,

    Vishal

  • Thanks Paul for reporting the issue.

    The ha_pair service is shown as STOPPED in the UI in two cases: 1) after an HA pair is successfully created using QuickHA, and 2) after clicking "Stop Discovery". In EAP3 there is currently no way to remove that warning from the UI other than manually rebooting the device. We have a Jira item, reference ID NC-52884, to fix this issue.

    Thanks,

    Vishal

  • Revisiting this, as the manual reboot brought to light additional issues.

    System is now operating on Auxiliary as Standalone, and Primary is in Fault.

    Original:
    Primary
    Port 1 - 192.168.10.1
    Port 2 - public IP
    Port 4 - 172.30.255.10
    Port 6 - 10.40.1.1
    Auxiliary
    Port 1 - 192.168.10.1
    Port 2 - public IP
    Port 4 - 172.30.255.10
    Port 6 - 10.40.1.2

    Current:
    Fault (old primary)
    Port 1 - 172.16.16.16
    Port 2 - 192.168.2.1 Bcast 192.168.15.255 Mask 255.255.240.0
    Port 6 - 10.40.1.1
    Standalone
    Port 1 - 192.168.10.1
    Port 2 - public IP
    Port 4 - 172.30.255.10
    Port 6 - 10.40.1.2
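
    To make the change on the old primary easier to see, here is a small Python sketch that diffs the two port tables above. The dictionaries are just the values listed in this post ("public IP" is kept as a placeholder string), and `diff_ports` is a hypothetical helper name.

```python
def diff_ports(before, after):
    """Return {port: (before_ip, after_ip)} for ports whose address changed or vanished."""
    changes = {}
    for port in sorted(set(before) | set(after)):
        old, new = before.get(port), after.get(port)
        if old != new:
            changes[port] = (old, new)  # None means the port is not listed on that side
    return changes


# Values transcribed from the "Original: Primary" and "Current: Fault" lists.
ORIGINAL_PRIMARY = {
    "Port 1": "192.168.10.1",
    "Port 2": "public IP",
    "Port 4": "172.30.255.10",
    "Port 6": "10.40.1.1",
}
CURRENT_FAULT = {
    "Port 1": "172.16.16.16",
    "Port 2": "192.168.2.1",
    "Port 6": "10.40.1.1",
}

if __name__ == "__main__":
    for port, (old, new) in diff_ports(ORIGINAL_PRIMARY, CURRENT_FAULT).items():
        print(f"{port}: {old} -> {new}")
```

    Only Port 6 (the HA link) keeps its address; Ports 1 and 2 changed and Port 4 is no longer listed.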

    Hard disk and memory tests both show success.

    Output from msync.log is empty on both devices.

    Output from applog.log on original Primary

    XG210_WP03_SFOS 18.0.0 EAP3# cat /log/applog.log | grep "ha:"
    Tue Jan 14 06:52:14 EST 2020 ha: sysstop called
    Jan 14 06:52:14 ha: Booting up in XG210_WP03_SFOS 18.0.0 EAP3
    Jan 14 06:52:14 ha: msync: before_start: lspci Port1 Port2 Port3 Port4 Port5 Port6 Port7 Port8
    Jan 14 06:52:14 ha: msync: before_start: Port6 is static interface
    Jan 14 06:52:20 ha: handle_stat_change: 0:5 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 14 06:52:20 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 14 06:52:20 ha: handle_stat_change: 0:5 done.
    Jan 14 06:52:20 ha: handle_stat_change: g_ha_hsc=0 is set.
    Jan 14 06:53:47 ha: handle_stat_change: 5:1 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 14 06:53:47 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 14 06:53:47 ha: g_ha_transmode=2 [ CONFIG=1 INIT=2 EVENT=0 ]
    Jan 14 06:53:49 ha: peer is running same firmware version (own=18_0_0_255 and peer=18_0_0_255), so syncing from peer
    Jan 14 06:53:59 ha: freezing the peer csc failed in retry 1 !!!
    Jan 14 06:54:09 ha: freezing the peer csc failed in retry 2 !!!
    Jan 14 06:54:19 ha: freezing the peer csc failed in retry 3 !!!
    Jan 14 06:54:19 ha: setting date 202001140655.08
    Jan 14 06:55:10 ha: starting system services
    Jan 14 06:55:19 ha: corporate database restore failed !!!
    Jan 14 06:55:19 ha: preempt: prim originialrole = 0 , aux originalrole = HASH(0x961dac0)
    Jan 14 06:55:19 ha: preempt: setting originalrole to HA_NA
    Jan 14 06:55:22 ha_port_down_notification: message_id : log_data : Interface Port1 went down. Appliance HA state BACK
    Jan 14 06:55:23 ha_port_down_notification: message_id : log_data : Interface Port2 went down. Appliance HA state BACK
    Jan 14 06:55:24 sysinit: Service sigdb failed
    Jan 14 06:55:24 ha_port_down_notification: message_id : log_data : Interface Port4 went down. Appliance HA state BACK
    Jan 14 06:55:24 ha: system startup failed !!!
    Jan 14 06:56:25 ha: unfreezing the peer csc failed!!!
    Jan 14 06:56:25 ha: failsafe mode in aux state so treating it as fault state !!!
    Jan 14 07:13:05 ha: Booting up in XG210_WP03_SFOS 18.0.0 EAP3
    Jan 14 07:13:05 ha: msync: before_start: lspci Port1 Port2 Port3 Port4 Port5 Port6 Port7 Port8
    Jan 14 07:13:05 ha: msync: before_start: Port6 is static interface
    Jan 14 07:13:11 ha: handle_stat_change: 0:5 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 14 07:13:11 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 14 07:13:11 ha: handle_stat_change: 0:5 done.
    Jan 14 07:13:11 ha: handle_stat_change: g_ha_hsc=0 is set.
    Jan 14 07:13:12 ha: handle_stat_change: 5:1 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 14 07:13:12 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 14 07:13:12 ha: g_ha_transmode=2 [ CONFIG=1 INIT=2 EVENT=0 ]
    Jan 14 07:13:12 ha: peer is running same firmware version (own=18_0_0_255 and peer=18_0_0_255), so syncing from peer
    Jan 14 07:13:22 ha: freezing the peer csc done in retry 1
    Jan 14 07:13:22 ha: setting date 202001140713.22
    Jan 14 07:13:23 ha: starting system services
    Jan 14 07:13:25 sysinit: Service postgres failed. Binding factory default IPs
    Jan 14 07:13:26 ha_port_down_notification: message_id : log_data : Interface Port1 went down. Appliance HA state BACK
    Jan 14 07:13:28 ha_port_down_notification: message_id : log_data : Interface Port2 went down. Appliance HA state BACK
    Jan 14 07:13:28 ha: system startup failed !!!
    Jan 14 07:13:28 ha: failsafe mode in aux state so treating it as fault state !!!
    Jan 14 07:25:28 ha: Booting up in XG210_WP03_SFOS 18.0.0 EAP3
    Jan 14 07:25:28 ha: msync: before_start: lspci Port1 Port2 Port3 Port4 Port5 Port6 Port7 Port8
    Jan 14 07:25:28 ha: msync: before_start: Port6 is static interface
    Jan 14 07:25:34 ha: handle_stat_change: 0:5 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 14 07:25:34 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 14 07:25:34 ha: handle_stat_change: 0:5 done.
    Jan 14 07:25:34 ha: handle_stat_change: g_ha_hsc=0 is set.
    Jan 14 07:25:35 ha: handle_stat_change: 5:1 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 14 07:25:35 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 14 07:25:35 ha: g_ha_transmode=2 [ CONFIG=1 INIT=2 EVENT=0 ]
    Jan 14 07:25:35 ha: peer is running same firmware version (own=18_0_0_255 and peer=18_0_0_255), so syncing from peer
    Jan 14 07:25:45 ha: freezing the peer csc done in retry 1
    Jan 14 07:25:45 ha: setting date 202001140725.47
    Jan 14 07:25:48 ha: starting system services
    Jan 14 07:25:50 sysinit: Service postgres failed. Binding factory default IPs
    Jan 14 07:25:51 ha_port_down_notification: message_id : log_data : Interface Port1 went down. Appliance HA state BACK
    Jan 14 07:25:53 ha: system startup failed !!!
    Jan 14 07:25:53 ha_port_down_notification: message_id : log_data : Interface Port2 went down. Appliance HA state BACK
    Jan 14 07:25:53 ha: failsafe mode in aux state so treating it as fault state !!!
    Jan 14 08:04:57 ha: Booting up in XG210_WP03_SFOS 18.0.0 EAP3
    Jan 14 08:04:57 ha: msync: before_start: lspci Port1 Port2 Port3 Port4 Port5 Port6 Port7 Port8
    Jan 14 08:04:57 ha: msync: before_start: Port6 is static interface
    Jan 14 08:05:06 ha: handle_stat_change: 0:5 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 14 08:05:06 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 14 08:05:06 ha: handle_stat_change: 0:5 done.
    Jan 14 08:05:06 ha: handle_stat_change: g_ha_hsc=0 is set.
    Jan 14 08:05:10 ha: handle_stat_change: 5:2 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 14 08:05:10 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 14 08:05:10 ha: g_ha_transmode=2 [ CONFIG=1 INIT=2 EVENT=0 ]
    Jan 14 08:05:10 ha: starting system services
    Jan 14 08:05:12 sysinit: Service postgres failed. Binding factory default IPs
    Jan 14 08:05:13 ha_port_down_notification: message_id : log_data : Interface Port1 went down. Appliance HA state STAND
    Jan 14 08:05:14 ha: system startup failed !!!
    Jan 14 08:05:14 ha_port_down_notification: message_id : log_data : Interface Port2 went down. Appliance HA state STAND
    Tue Jan 14 08:05:15 EST 2020 ha: poststartupwait.sh: init sleeping
    Tue Jan 14 08:05:25 EST 2020 ha: poststartupwait.sh: init sleeping over
    Jan 14 08:05:25 ha: msync:garpha: gratituous arp failed !!!
    Jan 14 08:05:25 ha: msync:garphav6: gratituous Neighbour Advertisement failed !!!
    Jan 14 08:05:25 ha: initcomp failed !!!
    Tue Jan 14 08:05:25 EST 2020 ha: poststartupwait.sh: init state transition completion failed !!!
    Jan 14 08:14:27 ha: Booting up in XG210_WP03_SFOS 18.0.0 EAP3
    Jan 14 08:14:27 ha: msync: before_start: lspci Port1 Port2 Port3 Port4 Port5 Port6 Port7 Port8
    Jan 14 08:14:27 ha: msync: before_start: Port6 is static interface
    Jan 14 08:14:33 ha: handle_stat_change: 0:5 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 14 08:14:33 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 14 08:14:33 ha: handle_stat_change: 0:5 done.
    Jan 14 08:14:33 ha: handle_stat_change: g_ha_hsc=0 is set.
    Jan 14 08:14:35 ha: handle_stat_change: 5:1 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 14 08:14:35 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 14 08:14:35 ha: g_ha_transmode=2 [ CONFIG=1 INIT=2 EVENT=0 ]
    Jan 14 08:14:35 ha: peer is running same firmware version (own=18_0_0_255 and peer=18_0_0_255), so syncing from peer
    Jan 14 08:14:45 ha: freezing the peer csc done in retry 1
    Jan 14 08:14:45 ha: setting date 202001140814.46
    Jan 14 08:14:47 ha: starting system services
    Jan 14 08:14:49 sysinit: Service postgres failed. Binding factory default IPs
    Jan 14 08:14:50 ha_port_down_notification: message_id : log_data : Interface Port1 went down. Appliance HA state BACK
    Jan 14 08:14:52 ha: system startup failed !!!
    Jan 14 08:14:52 ha_port_down_notification: message_id : log_data : Interface Port2 went down. Appliance HA state BACK
    Jan 14 08:14:52 ha: failsafe mode in aux state so treating it as fault state !!!
    Jan 15 02:59:30 ha: Booting up in XG210_WP03_SFOS 18.0.0 EAP3
    Jan 15 02:59:30 ha: msync: before_start: lspci Port1 Port2 Port3 Port4 Port5 Port6 Port7 Port8
    Jan 15 02:59:30 ha: msync: before_start: Port6 is static interface
    Jan 15 02:59:36 ha: handle_stat_change: 0:5 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 15 02:59:36 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 15 02:59:36 ha: handle_stat_change: 0:5 done.
    Jan 15 02:59:36 ha: handle_stat_change: g_ha_hsc=0 is set.
    Jan 15 02:59:38 ha: handle_stat_change: 5:1 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 15 02:59:38 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 15 02:59:38 ha: g_ha_transmode=2 [ CONFIG=1 INIT=2 EVENT=0 ]
    Jan 15 02:59:38 ha: peer is running same firmware version (own=18_0_0_255 and peer=18_0_0_255), so syncing from peer
    Jan 15 02:59:48 ha: freezing the peer csc done in retry 1
    Jan 15 02:59:48 ha: setting date 202001150259.51
    Jan 15 02:59:52 ha: starting system services
    Jan 15 02:59:54 sysinit: Service postgres failed. Binding factory default IPs
    Jan 15 02:59:55 ha_port_down_notification: message_id : log_data : Interface Port1 went down. Appliance HA state BACK
    Jan 15 02:59:57 ha: system startup failed !!!
    Jan 15 02:59:57 ha_port_down_notification: message_id : log_data : Interface Port2 went down. Appliance HA state BACK
    Jan 15 02:59:57 ha: failsafe mode in aux state so treating it as fault state !!!
    XG210_WP03_SFOS 18.0.0 EAP3#


    Output from applog.log on original Auxiliary
    XG210_WP03_SFOS 18.0.0 EAP3# cat /log/applog.log | grep "ha:"
    Jan 14 06:52:29 ha_port_down_notification: message_id : log_data : Interface Port6 went down. Appliance HA state BACK
    Jan 14 06:52:30 ha: handle_stat_change: 1:2 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
    Jan 14 06:52:30 ha: handle_stat_change: g_ha_hsc=1 is set.
    Jan 14 06:52:30 ha: g_ha_transmode=0 [ CONFIG=1 INIT=2 EVENT=0 ]
    Jan 14 06:52:30 ha: fwm:disablearpha successfully done
    Jan 14 06:52:30 ha: ctsyncd commited
    Jan 14 06:52:30 ha: ctsyncd external cache flushed
    Jan 14 06:52:30 ha: msync:applyha: stop tracking the monitoring interfaces
    Jan 14 06:52:31 ha: msync:applyha: virtual macs are assigned
    Jan 14 06:52:31 ha: msync:applyha: going to start tracking the monitoring interfaces
    Tue Jan 14 06:52:31 EST 2020 ha: trackdevicewait.sh: device sleeping
    Jan 14 06:52:32 ha: Restart DHCP / PPP client on event of enable ha
    Jan 14 06:52:32 ha: networkd:dynaiface_client_start called
    Jan 14 06:52:32 ha: msync:applyha: network part done
    Jan 14 06:52:32 ha: fwm:applyha successfully done
    Jan 14 06:52:32 ha: msync:garpha: send_arp 192.168.10.1 C8:4F:86:FC:00:01 192.168.10.255 ff:ff:ff:ff:ff:ff Port1
    Jan 14 06:52:32 ha: msync:garpha: send_arp 172.30.255.10 C8:4F:86:FC:00:04 172.30.255.255 ff:ff:ff:ff:ff:ff Port4
    Jan 14 06:52:33 ha: fwm:enablearpha successfully done
    Tue Jan 14 06:52:41 EST 2020 ha: trackdevicewait.sh: device sleeping over
    Tue Jan 14 06:52:41 EST 2020 ha: trackdevicewait.sh: start tracking the device done
    Jan 14 06:54:34 ha: mail sent successfully
    Jan 14 06:54:34 ha: networkd:dynaiface_client_start called
    Jan 14 06:54:34 ha: syncing conntracks
    Jan 14 06:54:34 ha: handle_stat_change: 1:2 done.
    Jan 14 06:54:34 ha: handle_stat_change: g_ha_hsc=0 is set.
    Jan 14 07:18:28 ha: appcached_ha_sync function is called...!!!!
    Jan 14 07:18:29 ha: redis DB dump file sync is done !!!
    Jan 14 07:30:53 ha: appcached_ha_sync function is called...!!!!
    Jan 14 07:30:54 ha: redis DB dump file sync is done !!!
    Jan 14 08:19:53 ha: appcached_ha_sync function is called...!!!!
    Jan 14 08:19:53 ha: redis DB dump file sync is done !!!
    Jan 15 03:04:58 ha: appcached_ha_sync function is called...!!!!
    Jan 15 03:04:59 ha: redis DB dump file sync is done !!!
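
    Reading the `handle_stat_change` numbers against the bracketed legend gets tedious, so here is a minimal Python sketch that decodes them. The state names come straight from the legend in the log lines. Treating the pair as old state:new state is my assumption, not something Sophos has confirmed. `decode_transitions` is a made-up helper name.

```python
import re

# Legend copied from the bracketed list in the log lines themselves.
HA_STATES = {0: "NA", 1: "AUX", 2: "STAND", 3: "PRIM", 4: "FAULT", 5: "READY", 6: "GOTO_PRIM"}

# Matches e.g. "handle_stat_change: 5:1 [" but not the "... 0:5 done." or "g_ha_hsc" lines.
TRANSITION = re.compile(r"handle_stat_change: (\d+):(\d+) \[")


def decode_transitions(text):
    """Return (from_name, to_name) pairs, ASSUMING each numeric pair is old:new."""
    pairs = []
    for match in TRANSITION.finditer(text):
        old, new = int(match.group(1)), int(match.group(2))
        pairs.append((HA_STATES.get(old, "?"), HA_STATES.get(new, "?")))
    return pairs


# Three lines taken from the old primary's applog.log output above.
SAMPLE = """Jan 14 06:52:20 ha: handle_stat_change: 0:5 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
Jan 14 06:52:20 ha: handle_stat_change: 0:5 done.
Jan 14 06:53:47 ha: handle_stat_change: 5:1 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
"""

if __name__ == "__main__":
    for old, new in decode_transitions(SAMPLE):
        print(f"{old} -> {new}")
```

    Under that reading, the old primary comes up as READY and then moves to AUX on most boots (5:1), with a single READY to STAND (5:2) at 08:05, while the original auxiliary shows one AUX to STAND change (1:2) at 06:52.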

  • I apologize for any confusion.

    I just noticed there is another HA post going on, but my original post was to just clear a flag.

    https://community.sophos.com/products/xg-firewall/sfos-eap/sfos-v18-early-access-program/f/feedback-and-issues/117346/eap-3-quick-ha-problems

    One difference for me is that I tried Quick HA and cancelled it, as it had been hanging for quite a while.

    I then proceeded to do a manual pairing.

  • Just to complete the loop.

    After cancelling the HA_pair (due to a hang of over 30 minutes), manually pairing, and having the primary enter a failed state (after rebooting to clear the warning), I found the primary was using a two-month-old configuration (judging by the IP scheme and available ports, I assume it was my original v18 EAP1 config).

    Drive and memory tests were done successfully.

    Did a factory reset on faulting primary with v18 EAP3 firmware in USB slot 1.

    Restored auxiliary backup (last primary backup [post HA_pair attempt] was corrupt) to unit, but did not enable HA on the primary.

    Leaving old primary as a standalone until time allows a reset on the auxiliary (to ensure its HA doesn't putz with things). Then will try HA again (most likely with the GA release, unless time allows rebuilding an EAP release).

  • BTW, if anyone knows how I can get this deregistered from Central, I'd love to get my heartbeat back.

    Tried deregistering from firewall (didn't work), removed from Central, and tried firewall again (still no go).

    Please don't tell me to reboot. :)

  • Hi Aeron,

    Thanks for your feedback. I will send you a PM to get more details.

  • Prashil,

    Thank you for clearing the registration issue.

    It is now working as expected.

    Paul