I am trying to get HA up between two Sophos Firewall VM's. Both VMs are run using QEMU on the servers.
The first VM is running on a HP Proliant Gen8 with Unraid. (I have been running a Sophos Firewall on this machine for several years)
The second VM is running on a Synology Rackstation.
When HA is not activated both VM's work fine on both servers.
When HA is activated with the Proliant VM as the primary and the Rackstation VM as the auxiliary the Rackstation VM ends up in Failsafe Mode.
If I run both VM's on the Proliant then HA works perfectly.
I have tested many different settings on the Rackstation, both using Synologys VMM interface to control the VM and also running in directly using virsh and my own XML.
The problem it seems is a part of the syncing process taking to long or crashing which results in a timeout after 600 seconds.
I have included applog.log and csc.log.
applog.log
Feb 22 13:36:44Z apiInterface:entityjson::::::::dashboard::dashboardshutdownappliance=HASH(0x9fcf1a0) Feb 22 13:36:44Z Info:: Transaction will not be rolled back for opcode shutdown_appliance. If any operation fails, request is part of multiple request : Feb 22 13:36:45Z sysstop with '%s' called Feb 22 13:36:45Z ha: sysstop called Feb 22 13:41:41Z ha: Booting up in SF01V_KV01_SFOS 20.0.0 GA-Build222 Feb 22 13:41:42Z msync:before_start iteration '1' to check halink up, retry till '10' Feb 22 13:41:42Z msync:before_start PortD: PortD Link encap:Ethernet HWaddr 02:86:47:33:43:B4 inet addr:10.1.0.2 Bcast:10.1.0.3 Mask:255.255.255.252 inet6 addr: fe80::86:47ff:fe33:43b4/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:140 errors:0 dropped:0 overruns:0 frame:0 TX packets:11 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:10295 (10.0 KiB) TX bytes:1452 (1.4 KiB) Feb 22 13:41:47Z ha: handle_stat_change: 0:5 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ] Feb 22 13:41:47Z ha: handle_stat_change: g_ha_hsc=1 is set. Feb 22 13:41:47Z Feb 22 13:41:47Z ha: handle_stat_change: 0:5 done. Feb 22 13:41:47Z ha: handle_stat_change: g_ha_hsc=0 is set. Feb 22 13:41:48Z ha: handle_stat_change: 5:1 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ] Feb 22 13:41:48Z ha: handle_stat_change: g_ha_hsc=1 is set. Feb 22 13:41:48Z ha: g_ha_transmode=2 [ CONFIG=1 INIT=2 EVENT=0 ] Feb 22 13:41:48Z ha: peer is running same firmware version (own=20_0_0_222 and peer=20_0_0_222), so syncing from peer Feb 22 13:41:49Z ha: prime enc system sanity check success! Feb 22 13:41:49Z ha: freezing the peer csc done in retry 0 Feb 22 13:41:50Z ha: setting date 202402221441.56 Feb 22 13:41:58Z ha: dhcpv6 leases sync failed. Expected if dhcpv6 is not enabled. Feb 22 13:41:58Z ha: starting system services Feb 22 13:42:02Z testpass encryption: testpass encryption is not proceeded. testpass status: encrypted and decryptable, encryption status(0-insecure, 1-secure): 1 Feb 22 13:42:02Z prepare_virtual_app_setup: Syncing network interface. Feb 22 13:42:03Z ha: start to sync sophos tmp1 Feb 22 13:42:03Z ha: tmp1 sync succeeded! Feb 22 13:42:03Z ha: start to sync sophos tmp2 Feb 22 13:42:03Z ha: tmp2 sync succeeded! Feb 22 13:52:01Z sysinit: Service postgres failed. Binding factory default IPs Feb 22 13:52:01Z ha: system startup failed !!! Feb 22 13:52:01Z ha: unfreezing the peer csc Feb 22 13:52:01Z ha: failsafe mode in aux state so treating it as fault state !!! Feb 22 13:54:23Z ha: preempt: prim originialrole = 0 , aux originalrole = 0 Feb 22 13:54:23Z continue_otp called
csc.log
ERROR Feb 22 13:41:47Z [dbh:1186]: get_new_handle_id: PQconnectdb failed 'could not connect to server: Connection refused Is the server running on host "127.0.0.1" and accepting TCP/IP connections on port 5432? ' ERROR Feb 22 13:41:47Z [worker:1254]: worker_proc: do_init_atomic_dbhandle failed MESSAGE Feb 22 13:41:47Z [sync_message:1254]: opcode 'sync_message': time taken: 0.103359417 seconds with return status: '200' MESSAGE Feb 22 13:41:47Z [handle_stat_change:1267]: opcode 'handle_stat_change': time taken: 0.216285772 seconds with return status: '200' MESSAGE Feb 22 13:41:47Z [ha_tunnel:1142]: {"ha_tunnel":{"method":"service","name":"ha_tunnel:start","version":"1.6","type":"text","length":0}} MESSAGE Feb 22 13:41:47Z [ha_tunnel:1142]: service command 'start': time taken: 0.001227753 seconds with return status: '200' MESSAGE Feb 22 13:41:47Z [msync:1144]: service command 'astart': time taken: 6.236089261 seconds with return status: '200' MESSAGE Feb 22 13:41:48Z [worker:1298]: {"request":{"method":"opcode","name":"handle_stat_change","version":"1.2","type":"text","length":3,"data":5:1}} ERROR Feb 22 13:41:48Z [dbh:1186]: get_new_handle_id: PQconnectdb failed 'could not connect to server: Connection refused Is the server running on host "127.0.0.1" and accepting TCP/IP connections on port 5432? ' ERROR Feb 22 13:41:48Z [worker:1298]: worker_proc: do_init_atomic_dbhandle failed ERROR Feb 22 13:41:49Z [handle_stat_change:1298]: csc_execve: Child exited with status 1 ERROR Feb 22 13:41:49Z [handle_stat_change:1298]: log_exec: Failed Command: /bin/scp -r -F /static/ha/hauser.conf 10.1.0.1:/conf/atr/* /conf/atr/ ERROR Feb 22 13:41:56Z [handle_stat_change:1298]: csc_execve: Child exited with status 240 ERROR Feb 22 13:41:56Z [handle_stat_change:1298]: log_exec: Failed Command: /bin/ssh -F /static/ha/hauser.conf 10.1.0.1 nvram get '#admin.secret' WARNING Feb 22 13:41:56Z [handle_stat_change:1298]: Action with NOFAIL Failed. ERROR Feb 22 13:41:58Z [handle_stat_change:1298]: csc_execve: Child exited with status 1 ERROR Feb 22 13:41:58Z [handle_stat_change:1298]: log_exec: Failed Command: /bin/scp -r -F /static/ha/hauser.conf 10.1.0.1:/tmp/dhcpd6.leases /tmp/dhcpd6.leases MESSAGE Feb 22 13:41:58Z [worker:1297]: {"request":{"method":"opcode","name":"system_startup","version":"1.2","type":"text","length":0}} ERROR Feb 22 13:41:58Z [dbh:1186]: get_new_handle_id: PQconnectdb failed 'could not connect to server: Connection refused Is the server running on host "127.0.0.1" and accepting TCP/IP connections on port 5432? ' ERROR Feb 22 13:41:58Z [worker:1297]: worker_proc: do_init_atomic_dbhandle failed MESSAGE Feb 22 13:41:58Z [worker:1085]: {"request":{"method":"opcode","name":"log_rotate","version":"1.6","type":"text","length":0}} ERROR Feb 22 13:41:58Z [dbh:1186]: get_new_handle_id: PQconnectdb failed 'could not connect to server: Connection refused Is the server running on host "127.0.0.1" and accepting TCP/IP connections on port 5432? ' ERROR Feb 22 13:41:58Z [worker:1085]: worker_proc: do_init_atomic_dbhandle failed MESSAGE Feb 22 13:41:58Z [log_rotate:1085]: opcode 'log_rotate': time taken: 0.131864676 seconds with return status: '200' MESSAGE Feb 22 13:41:58Z [syslog-ng:1097]: {"syslog-ng":{"method":"service","name":"syslog-ng:start","version":"1.6","type":"text","length":0}} ERROR Feb 22 13:41:59Z [syslog-ng:1097]: csc_execve: Child exited with status 1 ERROR Feb 22 13:41:59Z [syslog-ng:1097]: log_exec: Failed Command: /bin/syslog-ng-ctl stop WARNING Feb 22 13:41:59Z [syslog-ng:1097]: Action with NOFAIL Failed. MESSAGE Feb 22 13:41:59Z [syslog-ng:1097]: service command 'astart': time taken: 0.859065874 seconds with return status: '200' MESSAGE Feb 22 13:41:59Z [postgres:1089]: {"postgres":{"method":"service","name":"postgres:start","version":"1.6","type":"text","length":0}} WARNING Feb 22 13:41:59Z [postgres:1089]: Action with NOFAIL Failed. WARNING Feb 22 13:41:59Z [postgres:1089]: Action with NOFAIL Failed. ERROR Feb 22 13:42:03Z [postgres:1089]: csc_execve: Child exited with status 2 ERROR Feb 22 13:42:03Z [postgres:1089]: log_exec: Failed Command: /bin/nsgenc status pid(1297) :send_data_to_listener: sent time 'Thu Feb 22 14:41:59 2024' pid(1297) :receive_data_from_listener: recv time 'Thu Feb 22 14:52:01 2024': timeout period '600' ERROR Feb 22 13:52:01Z [system_startup:1297]: csc_execve: Child exited with status 1 ERROR Feb 22 13:52:01Z [system_startup:1297]: log_exec: Failed Command: /bin/killall -SIGHUP csh WARNING Feb 22 13:52:01Z [system_startup:1297]: Action with NOFAIL Failed. IFNAME_PREFIX: Port IFNAME_SUFFIX: ALPHA IFLIST: PORT_A-Z MESSAGE Feb 22 13:52:01Z [system_startup:1297]: opcode 'system_startup': time taken: 602.672713502 seconds with return status: '500' ERROR Feb 22 13:52:01Z [handle_stat_change:1298]: csc_execve: Child exited with status 255 ERROR Feb 22 13:52:01Z [handle_stat_change:1298]: log_exec: Failed Command: /bin/opcode -ds nosync -T 10800 system_startup MESSAGE Feb 22 13:52:01Z [smtpd:1093]: {"smtpd":{"method":"nservice","name":"smtpd:reconfig","version":"1.6","type":"text","length":0}} MESSAGE Feb 22 13:52:01Z [smtpd:1093]: service command 'reconfig': time taken: 0.000001063 seconds with return status: '200' MESSAGE Feb 22 13:52:01Z [awarrensmtp:1092]: {"awarrensmtp":{"method":"nservice","name":"awarrensmtp:enable_ha","version":"1.6","type":"text","length":0}} MESSAGE Feb 22 13:52:01Z [appcached:1177]: {"appcached":{"method":"nservice","name":"appcached:reconfig","version":"1.6","type":"text","length":0}} MESSAGE Feb 22 13:52:01Z [appcached:1177]: service command 'reconfig': time taken: 0.000000880 seconds with return status: '200' MESSAGE Feb 22 13:52:01Z [awarrensmtp:1092]: service command 'enable_ha': time taken: 0.104944928 seconds with return status: '200' MESSAGE Feb 22 13:52:01Z [handle_stat_change:1298]: opcode 'handle_stat_change': time taken: 608.243427103 seconds with return status: '200' MESSAGE Feb 22 13:54:22Z [worker:1686]: {"request":{"method":"opcode","name":"lic_ha_set","version":"1.6","type":"text","length":0}} ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: csc_execve: Child exited with status 224 ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: log_exec: Failed Command: /bin/nvram del #li.epsup WARNING Feb 22 13:54:22Z [lic_ha_set:1686]: Action with NOFAIL Failed. ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: csc_execve: Child exited with status 224 ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: log_exec: Failed Command: /bin/nvram del #li.initdate WARNING Feb 22 13:54:22Z [lic_ha_set:1686]: Action with NOFAIL Failed. ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: csc_execve: Child exited with status 224 ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: log_exec: Failed Command: /bin/nvram del #li.failuredate WARNING Feb 22 13:54:22Z [lic_ha_set:1686]: Action with NOFAIL Failed. ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: csc_execve: Child exited with status 224 ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: log_exec: Failed Command: /bin/nvram del #li.failurecount WARNING Feb 22 13:54:22Z [lic_ha_set:1686]: Action with NOFAIL Failed. ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: csc_execve: Child exited with status 224 ERROR Feb 22 13:54:22Z [lic_ha_set:1686]: log_exec: Failed Command: /bin/nvram del #li.icdays WARNING Feb 22 13:54:22Z [lic_ha_set:1686]: Action with NOFAIL Failed. MESSAGE Feb 22 13:54:22Z [lic_ha_set:1686]: opcode 'lic_ha_set': time taken: 0.435435294 seconds with return status: '200' MESSAGE Feb 22 13:54:23Z [worker:1768]: {"request":{"method":"opcode","name":"continue_otp","version":"1.6","type":"text","length":0}} MESSAGE Feb 22 13:54:23Z [continue_otp:1768]: opcode 'continue_otp': time taken: 0.116839358 seconds with return status: '200' MESSAGE Feb 22 13:54:23Z [postgres:1089]: service command 'astart': time taken: 744.004483092 seconds with return status: '200' MESSAGE Feb 22 13:57:01Z [redis-appcache:1174]: {"redis-appcache":{"method":"service","name":"redis-appcache:stop","version":"1.2","type":"text","length":0}} MESSAGE Feb 22 13:57:01Z [redis-appcache:1174]: do_stop(): status = UNTOUCHED MESSAGE Feb 22 13:57:01Z [redis-appcache:1174]: service command 'stop': time taken: 0.000005565 seconds with return status: '200' MESSAGE Feb 22 13:57:01Z [redis-appcache:1174]: {"redis-appcache":{"method":"service","name":"redis-appcache:start","version":"1.2","type":"text","length":0}} MESSAGE Feb 22 13:57:01Z [redis-appcache:1174]: service command 'bstart': time taken: 0.102568393 seconds with return status: '200' MESSAGE Feb 22 13:57:01Z [appcached:1177]: {"appcached":{"method":"service","name":"appcached:restart","version":"1.2","type":"text","length":0}} MESSAGE Feb 22 13:57:01Z [appcached:1177]: do_stop(): status = UNTOUCHED MESSAGE Feb 22 13:57:02Z [appcached:1177]: service command 'restart': time taken: 0.504840620 seconds with return status: '200'
In applog.log when we reach Feb 22 13:42:03Z ha: tmp2 sync succeeded! the startup process seems to hang and nothing happens until Feb 22 13:52:01Z sysinit: Service postgres failed. Binding factory default IPs where it seems the sync or init of the system fails.
The logs in csc.log shows us that something seems to have timeout
pid(1297) :send_data_to_listener: sent time 'Thu Feb 22 14:41:59 2024'
pid(1297) :receive_data_from_listener: recv time 'Thu Feb 22 14:52:01 2024': timeout period '600'
I have not been able to gather much more information. I know we are able to turn on more in-depth debugging, but it seems that debugging does not stay on across a reboot.
Does anyone know how to start debugging of say CSC from boot or any ideas what might be wrong.
This thread was automatically locked due to age.