Finding the root cause of crashes / sudden reboots

Question

Hello, 
 we recently reinstalled our Sophos XG 550 cluster on temporary hardware and reimaged our old hardware, which had gone through x updates. We transferred the configuration through the import/export function. The cluster ran the latest SFOS 18.5.2 MR-2-Build380. After 5 days the auxiliary node crashed, and one day later the primary rebooted. 
 
 In ha_tunnel.log I found the likely time of the crash of the second node: 
 Feb 17 19:46:43 ssh: connect to host hapeer port 22: Connection timed out
 Feb 17 19:46:48 ssh: connect to host hapeer port 22: Connection timed out
 XG550_RL02_SFOS 18.5.2 MR-2-Build380#
 Feb 22 15:37:33 Timeout, server 10.255.254.2 not responding.
 Feb 22 15:37:38 ssh: connect to host hapeer port 22: Connection timed out
 Feb 22 15:37:43 ssh: connect to host hapeer port 22: Connection timed out
 
 In syslog.log I found the reboot: 
 Feb 23 08:52:23 localhost heartbeat: [SEND-TLV] No response from autherntication server expected.
 Feb 23 08:52:25 localhost kernel: [480245.123349] packet dropped in ipsec0 device
 Feb 23 08:52:28 localhost heartbeat: [SEND-TLV] No response from autherntication server expected.
 Feb 23 08:52:35 localhost : System will reboot
 Feb 23 08:52:35 localhost : The system is going down NOW!
 Feb 23 08:52:35 localhost kernel: klogd: exiting
 Feb 23 08:53:49 (none) syslog.info syslogd started: BusyBox v1.31.1
 Feb 23 08:53:49 (none) user.notice kernel: klogd started: BusyBox v1.31.1 (2021-11-11 16:22:38 UTC)
 Feb 23 08:53:49 (none) user.notice kernel: [ 0.000000] Linux version 4.14.38 (jenkins@ci-44) (gcc version 7.3.0 (OpenWrt GCC 7.3.0 10364-gca67746e8)) #2 SMP Thu Nov 11 18:56:52 UTC 2021
 Feb 23 08:53:49 (none) user.info kernel: [ 0.000000] Command line: BOOT_IMAGE=/18_5_2_380 quiet console=tty0 console=ttyS0,38400n8 pcie_aspm.policy=performance
 (pages and pages more ...) 
 Is there an easy way to find the root cause of the auxiliary freezing and the primary rebooting? Do you have any troubleshooting manuals that cover these cases? 
 Are there any known issues that explain these two behaviours? I also noticed that the access log starts right after the reboot, so I could not see any entries from before it. Regards, BeEf

emmosophos · Accepted Answer

Hello BeEf, 
 Thank you for contacting the Sophos Community. 
 You would need to check csc.log, applog.log, syslog.log, msync.log and networkd.log. 
 Additionally, check for any core dump from the date of the failure under /var/cores. 
 Checking the XG's memory and CPU graphs might also give you a clue as to whether the device stopped responding. 
 Also check the output of this command: grep 'NMI\|backtrace' /log/syslog.log 
 Regards,
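
 The checks above could be bundled into one pass from the advanced shell. This is only a rough sketch, not an official Sophos tool: scan_crash_logs is a made-up helper name, it assumes the other log files sit in /log next to syslog.log, and the two reboot strings it searches for are taken from the syslog excerpt in the question.

```shell
# Hypothetical helper: scan the logs named above for crash indicators.
# Arguments are optional; defaults are the paths mentioned in this thread.
# Assumption: csc.log, applog.log, msync.log and networkd.log live in /log too.
scan_crash_logs() {
    log_dir="${1:-/log}"
    core_dir="${2:-/var/cores}"
    for name in csc applog syslog msync networkd; do
        f="$log_dir/$name.log"
        [ -f "$f" ] || continue
        # NMI/backtrace come from the answer; the reboot strings come from
        # the syslog excerpt in the question.
        grep -H -n 'NMI\|backtrace\|System will reboot\|going down NOW' "$f" || true
    done
    # List core dumps with timestamps so they can be matched to the failure date.
    if [ -d "$core_dir" ]; then
        ls -l "$core_dir"
    fi
    return 0
}
```

 Calling scan_crash_logs with no arguments uses /log and /var/cores. Each hit is printed as file:line:text, which makes it easier to line up events across the different logs by timestamp.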

LuCar Toni · Answer

Another approach would be: connect the appliance to a client with a serial cable (USB?) and open a PuTTY session. Log everything to a file and wait. 
 If the appliance freezes, you can see the actual serial output, and this could indicate the next steps for this issue, for example whether there is a kernel panic or something else.
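
 If the client is a Linux machine rather than a Windows one with PuTTY, the same capture could be sketched as follows. Treat it as an illustration only: timestamp_lines is a made-up helper, /dev/ttyUSB0 and the log file name are assumptions about the client, and the 38400 8N1 settings are read from the console=ttyS0,38400n8 boot parameter in the syslog excerpt above.

```shell
# Hypothetical helper: prefix each line read from stdin with the local
# time it was received, so a freeze can be placed on a timeline later.
timestamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Usage on the client (device and file names are assumptions):
#   stty -F /dev/ttyUSB0 38400 cs8 -cstopb -parenb raw
#   timestamp_lines < /dev/ttyUSB0 >> xg550-serial.log
```

 The timestamps come from the client's clock, so the last lines before a freeze are still dated even if the appliance stops writing its own logs.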

LuCar Toni · Answer

See: https://support.sophos.com/support/s/article/KB-000040418?language=en_US
