Finding the root cause of crashes / sudden reboots

Hello,

we recently reinstalled our Sophos XG 550 cluster on temporary hardware and reimaged our old hardware, which had gone through x updates. We transferred the configuration through the import/export function. The cluster ran the latest SFOS 18.5.2 MR-2-Build380.

After 5 days the slave crashed, and one day later the primary rebooted.

In ha_tunnel.log I found the likely time of the crash of the second node:

Feb 17 19:46:43 ssh: connect to host hapeer port 22: Connection timed out
Feb 17 19:46:48 ssh: connect to host hapeer port 22: Connection timed out
XG550_RL02_SFOS 18.5.2 MR-2-Build380# Feb 22 15:37:33 Timeout, server 10.255.254.2 not responding.
Feb 22 15:37:38 ssh: connect to host hapeer port 22: Connection timed out
Feb 22 15:37:43 ssh: connect to host hapeer port 22: Connection timed out
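To avoid scrolling through the whole file by hand, a small script can pull out the first message of each burst of peer timeouts. This is only a rough sketch: the function name, the `quiet_minutes` threshold, and the matched phrases ("timed out", "not responding") are my own assumptions based on the excerpt above, and the year has to be supplied because the log lines do not carry one.

```python
import re
from datetime import datetime, timedelta

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Syslog-style timestamp followed by the message. search() is used rather
# than match() because a console prompt can be glued in front of a line.
STAMP_RE = re.compile(r"([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (.*)")

# Phrases taken from the excerpt; other firmware versions may word them differently.
PEER_DOWN_RE = re.compile(r"timed out|not responding")


def outage_starts(lines, year, quiet_minutes=10):
    """Return (timestamp, message) for the first line of each timeout burst.

    A new burst begins when the previous timeout message is more than
    quiet_minutes old. The year is an assumption supplied by the caller.
    """
    quiet = timedelta(minutes=quiet_minutes)
    starts = []
    last_seen = None
    for line in lines:
        m = STAMP_RE.search(line)
        if not m or m.group(1) not in MONTHS:
            continue
        message = m.group(6).rstrip()
        if not PEER_DOWN_RE.search(message):
            continue
        stamp = datetime(year, MONTHS[m.group(1)], int(m.group(2)),
                         int(m.group(3)), int(m.group(4)), int(m.group(5)))
        if last_seen is None or stamp - last_seen > quiet:
            starts.append((stamp, message))
        last_seen = stamp
    return starts
```

Calling it with the lines of ha_tunnel.log and the year the log was written returns one entry per burst, which narrows down when the peer stopped answering.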

 

In syslog.log I found the reboot:

Feb 23 08:52:23 localhost heartbeat: [SEND-TLV] No response from autherntication server expected.
Feb 23 08:52:25 localhost kernel: [480245.123349] packet dropped in ipsec0 device
Feb 23 08:52:28 localhost heartbeat: [SEND-TLV] No response from autherntication server expected.
Feb 23 08:52:35 localhost : System will reboot
Feb 23 08:52:35 localhost : The system is going down NOW!
Feb 23 08:52:35 localhost kernel: klogd: exiting
Feb 23 08:53:49 (none) syslog.info syslogd started: BusyBox v1.31.1
Feb 23 08:53:49 (none) user.notice kernel: klogd started: BusyBox v1.31.1 (2021-11-11 16:22:38 UTC)
Feb 23 08:53:49 (none) user.notice kernel: [ 0.000000] Linux version 4.14.38 (jenkins@ci-44) (gcc version 7.3.0 (OpenWrt GCC 7.3.0 10364-gca67746e8)) #2 SMP Thu Nov 11 18:56:52 UTC 2021
Feb 23 08:53:49 (none) user.info kernel: [ 0.000000] Command line: BOOT_IMAGE=/18_5_2_380 quiet console=tty0 console=ttyS0,38400n8 pcie_aspm.policy=performance
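Since the interesting part is what was logged just before the reboot, the lines leading up to each "System will reboot" message can be extracted with a few lines of Python. This is a sketch only: the function name and the default of 20 context lines are my own choices, and the marker string is simply copied from the excerpt above.

```python
from collections import deque

# Marker text as it appears in the excerpt; adjust if your log words it differently.
REBOOT_MARKER = "System will reboot"


def lines_before_reboot(lines, before=20, marker=REBOOT_MARKER):
    """Return a list of (context, marker_line) pairs, one per reboot marker.

    context holds up to `before` lines that were logged immediately
    ahead of the marker line, oldest first.
    """
    recent = deque(maxlen=before)  # rolling window of the latest lines
    found = []
    for line in lines:
        line = line.rstrip("\n")
        if marker in line:
            found.append((list(recent), line))
        recent.append(line)
    return found
```

Passing the open syslog.log file object to the function gives one pair per reboot, so several reboots can be compared side by side for a common pattern.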

(Pages upon pages ...)

Is there an easy way to find the root cause of the auxiliary freezing and the primary rebooting?

Do you have any troubleshooting manuals that cover the above cases?

Are there any known issues that explain these two behaviours?

I also noticed that the access log started right after the reboot, so I could not see any entries from before the reboot.

Regards,
BeEf



  • Another approach would be: connect a client to the appliance with a serial cable (USB?) and open a PuTTY session. Log everything to a file and wait.

    If the appliance freezes, you can see the actual serial output. That could point to the next steps, for example whether there was a kernel panic or something else.
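    If the client is a Linux or macOS machine without PuTTY, the same capture can be sketched in Python with the standard library. Everything here is an assumption for illustration: the function names are made up, the device path (something like /dev/ttyUSB0) depends on your adapter, and the 38400 8N1 setting is taken from the `console=ttyS0,38400n8` kernel command line in the original post. Each line is timestamped and flushed immediately, so the last output before a freeze ends up in the file.

```python
import os
import time


def open_serial(path, baud_name="B38400"):
    """Open a serial device read-only in raw 8N1 mode (POSIX only)."""
    import termios  # imported here because it does not exist on Windows

    fd = os.open(path, os.O_RDONLY | os.O_NOCTTY)
    # attrs = [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]
    attrs = termios.tcgetattr(fd)
    attrs[0] = 0                                             # no input processing
    attrs[1] = 0                                             # no output processing
    attrs[2] = termios.CS8 | termios.CREAD | termios.CLOCAL  # 8 data bits, no parity
    attrs[3] = 0                                             # raw mode, no echo
    attrs[4] = attrs[5] = getattr(termios, baud_name)        # input and output speed
    attrs[6][termios.VMIN] = 1                               # block until 1 byte arrives
    attrs[6][termios.VTIME] = 0
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
    return os.fdopen(fd, "rb", buffering=0)


def log_console(source, sink, now=lambda: time.strftime("%Y-%m-%d %H:%M:%S")):
    """Copy lines from a binary source to a text sink, prefixing a timestamp."""
    for raw in source:
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        sink.write(f"{now()} {text}\n")
        sink.flush()  # make sure the line is on disk before the next one arrives
```

    A run would then look like `log_console(open_serial("/dev/ttyUSB0"), open("console.log", "a"))`, left running until the appliance freezes or reboots.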
