VM with HA goes to failsafe mode on Synology Rackstation but works on HP Proliant with Unraid.

Question

I am trying to get HA up between two Sophos Firewall VM's. Both VMs are run using QEMU on the servers. 
 The first VM is running on a HP Proliant Gen8 with Unraid . (I have been running a Sophos Firewall on this machine for several years) 
 The second VM is running on a Synology Rackstation . 
 When HA is not activated both VM's work fine on both servers. 
 When HA is activated with the Proliant VM as the primary and the Rackstation VM as the auxiliary the Rackstation VM ends up in Failsafe Mode. 
 If I run both VM's on the Proliant then HA works perfectly. 
 I have tested many different settings on the Rackstation, both using Synologys VMM interface to control the VM and also running in directly using virsh and my own XML. 
 
 The problem it seems is a part of the syncing process taking to long or crashing which results in a timeout after 600 seconds. 
 I have included applog.log and csc.log. 
 
 applog.log 
 Feb 22 13:36:44Z apiInterface:entityjson::::::::dashboard::dashboardshutdownappliance=HASH(0x9fcf1a0)
Feb 22 13:36:44Z Info:: Transaction will not be rolled back for opcode shutdown_appliance. If any operation fails, request is part of multiple request : 
Feb 22 13:36:45Z sysstop with '%s' called
Feb 22 13:36:45Z ha: sysstop called
Feb 22 13:41:41Z ha: Booting up in SF01V_KV01_SFOS 20.0.0 GA-Build222

Feb 22 13:41:42Z msync:before_start iteration '1' to check halink up, retry till '10'
Feb 22 13:41:42Z msync:before_start PortD:
PortD Link encap:Ethernet HWaddr 02:86:47:33:43:B4 
 inet addr:10.1.0.2 Bcast:10.1.0.3 Mask:255.255.255.252
 inet6 addr: fe80::86:47ff:fe33:43b4/64 Scope:Link
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 RX packets:140 errors:0 dropped:0 overruns:0 frame:0
 TX packets:11 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000 
 RX bytes:10295 (10.0 KiB) TX bytes:1452 (1.4 KiB)

Feb 22 13:41:47Z ha: handle_stat_change: 0:5	[ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
Feb 22 13:41:47Z ha: handle_stat_change: g_ha_hsc=1 is set.
Feb 22 13:41:47Z 
Feb 22 13:41:47Z ha: handle_stat_change: 0:5 done.
Feb 22 13:41:47Z ha: handle_stat_change: g_ha_hsc=0 is set.
Feb 22 13:41:48Z ha: handle_stat_change: 5:1	[ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
Feb 22 13:41:48Z ha: handle_stat_change: g_ha_hsc=1 is set.
Feb 22 13:41:48Z ha: g_ha_transmode=2	[ CONFIG=1 INIT=2 EVENT=0 ]
Feb 22 13:41:48Z ha: peer is running same firmware version (own=20_0_0_222 and peer=20_0_0_222), so syncing from peer
Feb 22 13:41:49Z ha: prime enc system sanity check success!
Feb 22 13:41:49Z ha: freezing the peer csc done in retry 0
Feb 22 13:41:50Z ha: setting date 202402221441.56
Feb 22 13:41:58Z ha: dhcpv6 leases sync failed. Expected if dhcpv6 is not enabled.
Feb 22 13:41:58Z ha: starting system services
Feb 22 13:42:02Z testpass encryption: testpass encryption is not proceeded. testpass status: encrypted and decryptable, encryption status(0-insecure, 1-secure): 1
Feb 22 13:42:02Z prepare_virtual_app_setup: Syncing network interface.
Feb 22 13:42:03Z ha: start to sync sophos tmp1
Feb 22 13:42:03Z ha: tmp1 sync succeeded!
Feb 22 13:42:03Z ha: start to sync sophos tmp2
Feb 22 13:42:03Z ha: tmp2 sync succeeded!
Feb 22 13:52:01Z sysinit: Service postgres failed. Binding factory default IPs
Feb 22 13:52:01Z ha: system startup failed !!!
Feb 22 13:52:01Z ha: unfreezing the peer csc
Feb 22 13:52:01Z ha: failsafe mode in aux state so treating it as fault state !!!
Feb 22 13:54:23Z ha: preempt: prim originialrole = 0 , aux originalrole = 0
Feb 22 13:54:23Z continue_otp called 
 csc.log 
 ERROR Feb 22 13:41:47Z [dbh:1186]: get_new_handle_id: PQconnectdb failed 'could not connect to server: Connection refused
	Is the server running on host "127.0.0.1" and accepting
	TCP/IP connections on port 5432?
'
ERROR Feb 22 13:41:47Z [worker:1254]: worker_proc: do_init_atomic_dbhandle failed
MESSAGE Feb 22 13:41:47Z [sync_message:1254]: opcode 'sync_message': time taken: 0.103359417 seconds with return status: '200'
MESSAGE Feb 22 13:41:47Z [handle_stat_change:1267]: opcode 'handle_stat_change': time taken: 0.216285772 seconds with return status: '200'
MESSAGE Feb 22 13:41:47Z [ha_tunnel:1142]: {"ha_tunnel":{"method":"service","name":"ha_tunnel:start","version":"1.6","type":"text","length":0}}
MESSAGE Feb 22 13:41:47Z [ha_tunnel:1142]: service command 'start': time taken: 0.001227753 seconds with return status: '200'
MESSAGE Feb 22 13:41:47Z [msync:1144]: service command 'astart': time taken: 6.236089261 seconds with return status: '200'
MESSAGE Feb 22 13:41:48Z [worker:1298]: {"request":{"method":"opcode","name":"handle_stat_change","version":"1.2","type":"text","length":3,"data":5:1}}
ERROR Feb 22 13:41:48Z [dbh:1186]: get_new_handle_id: PQconnectdb failed 'could not connect to server: Connection refused
	Is the server running on host "127.0.0.1" and accepting
	TCP/IP connections on port 5432?
'
ERROR Feb 22 13:41:48Z [worker:1298]: worker_proc: do_init_atomic_dbhandle failed
ERROR Feb 22 13:41:49Z [handle_stat_change:1298]: csc_execve: Child exited with status 1
ERROR Feb 22 13:41:49Z [handle_stat_change:1298]: log_exec: Failed Command: /bin/scp -r -F /static/ha/hauser.conf 10.1.0.1:/conf/atr/* /conf/atr/
ERROR Feb 22 13:41:56Z [handle_stat_change:1298]: csc_execve: Child exited with status 240
ERROR Feb 22 13:41:56Z [handle_stat_change:1298]: log_exec: Failed Command: /bin/ssh -F /static/ha/hauser.conf 10.1.0.1 nvram get '#admin.secret'
WARNING Feb 22 13:41:56Z [handle_stat_change:1298]: Action with NOFAIL Failed.
ERROR Feb 22 13:41:58Z [handle_stat_change:1298]: csc_execve: Child exited with status 1
ERROR Feb 22 13:41:58Z [handle_stat_change:1298]: log_exec: Failed Command: /bin/scp -r -F /static/ha/hauser.conf 10.1.0.1:/tmp/dhcpd6.leases /tmp/dhcpd6.leases
MESSAGE Feb 22 13:41:58Z [worker:1297]: {"request":{"method":"opcode","name":"system_startup","version":"1.2","type":"text","length":0}}
ERROR Feb 22 13:41:58Z [dbh:1186]: get_new_handle_id: PQconnectdb failed 'could not connect to server: Connection refused
	Is the server running on host "127.0.0.1" and accepting
	TCP/IP connections on port 5432?
'
ERROR Feb 22 13:41:58Z [worker:1297]: worker_proc: do_init_atomic_dbhandle failed
MESSAGE Feb 22 13:41:58Z [worker:1085]: {"request":{"method":"opcode","name":"log_rotate","version":"1.6","type":"text","length":0}}
ERROR Feb 22 13:41:58Z [dbh:1186]: get_new_handle_id: PQconnectdb failed 'could not connect to server: Connection refused
	Is the server running on host "127.0.0.1" and accepting
	TCP/IP connections on port 5432?
'
ERROR Feb 22 13:41:58Z [worker:1085]: worker_proc: do_init_atomic_dbhandle failed
MESSAGE Feb 22 13:41:58Z [log_rotate:1085]: opcode 'log_rotate': time taken: 0.131864676 seconds with return status: '200'
MESSAGE Feb 22 13:41:58Z [syslog-ng:1097]: {"syslog-ng":{"method":"service","name":"syslog-ng:start","version":"1.6","type":"text","length":0}}
ERROR Feb 22 13:41:59Z [syslog-ng:1097]: csc_execve: Child exited with status 1
ERROR Feb 22 13:41:59Z [syslog-ng:1097]: log_exec: Failed Command: /bin/syslog-ng-ctl stop
WARNING Feb 22 13:41:59Z [syslog-ng:1097]: Action with NOFAIL Failed.
MESSAGE Feb 22 13:41:59Z [syslog-ng:1097]: service command 'astart': time taken: 0.859065874 seconds with return status: '200'
MESSAGE Feb 22 13:41:59Z [postgres:1089]: {"postgres":{"method":"service","name":"postgres:start","version":"1.6","type":"text","length":0}}
WARNING Feb 22 13:41:59Z [postgres:1089]: Action with NOFAIL Failed.
WARNING Feb 22 13:41:59Z [postgres:1089]: Action with NOFAIL Failed.
ERROR Feb 22 13:42:03Z [postgres:1089]: csc_execve: Child exited with status 2
ERROR Feb 22 13:42:03Z [postgres:1089]: log_exec: Failed Command: /bin/nsgenc status
pid(1297) :send_data_to_listener: sent time 'Thu Feb 22 14:41:59 2024'
pid(1297) :receive_data_from_listener: recv time 'Thu Feb 22 14:52:01 2024': timeout period '600'
ERROR Feb 22 13:52:01Z [system_startup:1297]: csc_execve: Child exited with status 1
ERROR Feb 22 13:52:01Z [system_startup:1297]: log_exec: Failed Command: /bin/killall -SIGHUP csh
WARNING Feb 22 13:52:01Z [system_startup:1297]: Action with NOFAIL Failed.
IFNAME_PREFIX: Port
IFNAME_SUFFIX: ALPHA
IFLIST: PORT_A-Z
MESSAGE Feb 22 13:52:01Z [system_startup:1297]: opcode 'system_startup': time taken: 602.672713502 seconds with return status: '500'
ERROR Feb 22 13:52:01Z [handle_stat_change:1298]: csc_execve: Child exited with status 255
ERROR Feb 22 13:52:01Z [handle_stat_change:1298]: log_exec: Failed Command: /bin/opcode -ds nosync -T 10800 system_startup
MESSAGE Feb 22 13:52:01Z [smtpd:1093]: {"smtpd":{"method":"nservice","name":"smtpd:reconfig","version":"1.6","type":"text","length":0}}
MESSAGE Feb 22 13:52:01Z [smtpd:1093]: service command 'reconfig': time taken: 0.000001063 seconds with return status: '200'
MESSAGE Feb 22 13:52:01Z [awarrensmtp:1092]: {"awarrensmtp":{"method":"nservice","name":"awarrensmtp:enable_ha","version":"1.6","type":"text","length":0}}
MESSAGE Feb 22 13:52:01Z [appcached:1177]: {"appcached":{"method":"nservice","name":"appcached:reconfig","version":"1.6","type":"text","length":0}}
MESSAGE Feb 22 13:52:01Z [appcached:1177]: service command 'reconfig': time taken: 0.000000880 seconds with return status: '200'
MESSAGE Feb 22 13:52:01Z [awarrensmtp:1092]: service command 'enable_ha': time taken: 0.104944928 seconds with return status: '200'
MESSAGE Feb 22 13:52:01Z [handle_stat_change:1298]: opcode 'handle_stat_change': time taken: 608.243427103 seconds with return status: '200'
MESSAGE Feb 22 13:54:22Z [worker:1686]: {"request":{"method":"opcode","name":"lic_ha_set","version":"1.6","type":"text","length":0}}
ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: csc_execve: Child exited with status 224
ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: log_exec: Failed Command: /bin/nvram del #li.epsup
WARNING Feb 22 13:54:22Z [lic_ha_set:1686]: Action with NOFAIL Failed.
ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: csc_execve: Child exited with status 224
ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: log_exec: Failed Command: /bin/nvram del #li.initdate
WARNING Feb 22 13:54:22Z [lic_ha_set:1686]: Action with NOFAIL Failed.
ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: csc_execve: Child exited with status 224
ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: log_exec: Failed Command: /bin/nvram del #li.failuredate
WARNING Feb 22 13:54:22Z [lic_ha_set:1686]: Action with NOFAIL Failed.
ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: csc_execve: Child exited with status 224
ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: log_exec: Failed Command: /bin/nvram del #li.failurecount
WARNING Feb 22 13:54:22Z [lic_ha_set:1686]: Action with NOFAIL Failed.
ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: csc_execve: Child exited with status 224
ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: log_exec: Failed Command: /bin/nvram del #li.icdays
WARNING Feb 22 13:54:22Z [lic_ha_set:1686]: Action with NOFAIL Failed.
MESSAGE Feb 22 13:54:22Z [lic_ha_set:1686]: opcode 'lic_ha_set': time taken: 0.435435294 seconds with return status: '200'
MESSAGE Feb 22 13:54:23Z [worker:1768]: {"request":{"method":"opcode","name":"continue_otp","version":"1.6","type":"text","length":0}}
MESSAGE Feb 22 13:54:23Z [continue_otp:1768]: opcode 'continue_otp': time taken: 0.116839358 seconds with return status: '200'
MESSAGE Feb 22 13:54:23Z [postgres:1089]: service command 'astart': time taken: 744.004483092 seconds with return status: '200'
MESSAGE Feb 22 13:57:01Z [redis-appcache:1174]: {"redis-appcache":{"method":"service","name":"redis-appcache:stop","version":"1.2","type":"text","length":0}}
MESSAGE Feb 22 13:57:01Z [redis-appcache:1174]: do_stop(): status = UNTOUCHED
MESSAGE Feb 22 13:57:01Z [redis-appcache:1174]: service command 'stop': time taken: 0.000005565 seconds with return status: '200'
MESSAGE Feb 22 13:57:01Z [redis-appcache:1174]: {"redis-appcache":{"method":"service","name":"redis-appcache:start","version":"1.2","type":"text","length":0}}
MESSAGE Feb 22 13:57:01Z [redis-appcache:1174]: service command 'bstart': time taken: 0.102568393 seconds with return status: '200'
MESSAGE Feb 22 13:57:01Z [appcached:1177]: {"appcached":{"method":"service","name":"appcached:restart","version":"1.2","type":"text","length":0}}
MESSAGE Feb 22 13:57:01Z [appcached:1177]: do_stop(): status = UNTOUCHED
MESSAGE Feb 22 13:57:02Z [appcached:1177]: service command 'restart': time taken: 0.504840620 seconds with return status: '200'

In applog.log when we reach Feb 22 13:42:03Z ha: tmp2 sync succeeded! the startup process seems to hang and nothing happens until Feb 22 13:52:01Z sysinit: Service postgres failed. Binding factory default IPs where it seems the sync or init of the system fails. 
 The logs in csc.log shows us that something seems to have timeout 
 pid(1297) :send_data_to_listener: sent time 'Thu Feb 22 14:41:59 2024' pid(1297) :receive_data_from_listener: recv time 'Thu Feb 22 14:52:01 2024': timeout period '600' 
 I have not been able to gather much more information. I know we are able to turn on more in-depth debugging, but it seems that debugging does not stay on across a reboot. 
 Does anyone know how to start debugging of say CSC from boot or any ideas what might be wrong.

Tanny · Answer

Well even more testing and I found the problem. 
 Slow disk write speed and high iowait on the Rackstation. They are so slow that the sync process does not complete in time and therefore fails due to timeout. 
 So if you see the above problem check iostat from the terminal and see if the disks are slow while HA syncing (its easier if you have a VM where it works as well, so you can compare the output from iostat on them).

VM with HA goes to failsafe mode on Synology Rackstation but works on HP Proliant with Unraid.

Top Replies