This discussion has been locked.
Finding the root cause of crashes / sudden reboots

Hello,

We recently reinstalled our Sophos XG 550 cluster on temporary hardware and reimaged our old hardware, which had gone through x updates. We transferred the configuration through the import/export function. The cluster ran the latest SFOS 18.5.2 MR-2-Build380.

After 5 days the auxiliary (slave) node crashed, and one day later the primary rebooted.

In ha_tunnel.log I found the likely time of the crash of the second node:

Feb 17 19:46:43 ssh: connect to host hapeer port 22: Connection timed out
Feb 17 19:46:48 ssh: connect to host hapeer port 22: Connection timed out
XG550_RL02_SFOS 18.5.2 MR-2-Build380# Feb 22 15:37:33 Timeout, server 10.255.254.2 not responding.
Feb 22 15:37:38 ssh: connect to host hapeer port 22: Connection timed out
Feb 22 15:37:43 ssh: connect to host hapeer port 22: Connection timed out
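One way to narrow down the moment the peer became unreachable is to look for long silences in ha_tunnel.log: the first entry after a quiet period is a candidate for the time of the failure. The following is a minimal sketch, not a Sophos tool. It assumes the log has been copied to a workstation with Python, that each entry carries a syslog-style timestamp without a year (so the year must be supplied; 2022 here is an assumption), and the helper names `parse_stamp` and `find_quiet_gaps` are made up for illustration.

```python
import re
from datetime import datetime, timedelta

MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}
# Syslog-style stamp such as "Feb 22 15:37:33". It is searched anywhere in
# the line because a pasted shell prompt may precede it.
STAMP = re.compile(r"\b([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}):(\d{2}):(\d{2})\b")

def parse_stamp(line, year):
    """Return the first timestamp found in the line, or None."""
    m = STAMP.search(line)
    if not m or m.group(1) not in MONTHS:
        return None
    return datetime(year, MONTHS[m.group(1)], int(m.group(2)),
                    int(m.group(3)), int(m.group(4)), int(m.group(5)))

def find_quiet_gaps(lines, year, min_gap=timedelta(minutes=10)):
    """Return (last_before, first_after, gap) for each silence longer than min_gap."""
    gaps, prev = [], None
    for line in lines:
        ts = parse_stamp(line, year)
        if ts is None:
            continue  # skip lines without a timestamp
        if prev is not None and ts - prev > min_gap:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps

# The excerpt above, used as sample input.
sample = [
    "Feb 17 19:46:43 ssh: connect to host hapeer port 22: Connection timed out",
    "Feb 17 19:46:48 ssh: connect to host hapeer port 22: Connection timed out",
    "XG550_RL02_SFOS 18.5.2 MR-2-Build380# Feb 22 15:37:33 Timeout, server 10.255.254.2 not responding.",
    "Feb 22 15:37:38 ssh: connect to host hapeer port 22: Connection timed out",
    "Feb 22 15:37:43 ssh: connect to host hapeer port 22: Connection timed out",
]
for before, after, gap in find_quiet_gaps(sample, year=2022):
    print(f"quiet from {before} to {after} ({gap})")
```

On the excerpt this reports a single gap that ends at Feb 22 15:37:33, which matches the "not responding" entry. Whether that entry really marks the freeze still has to be confirmed against the other logs.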

 

In syslog.log I found the reboot:

Feb 23 08:52:23 localhost heartbeat: [SEND-TLV] No response from autherntication server expected.
Feb 23 08:52:25 localhost kernel: [480245.123349] packet dropped in ipsec0 device
Feb 23 08:52:28 localhost heartbeat: [SEND-TLV] No response from autherntication server expected.
Feb 23 08:52:35 localhost : System will reboot
Feb 23 08:52:35 localhost : The system is going down NOW!
Feb 23 08:52:35 localhost kernel: klogd: exiting
Feb 23 08:53:49 (none) syslog.info syslogd started: BusyBox v1.31.1
Feb 23 08:53:49 (none) user.notice kernel: klogd started: BusyBox v1.31.1 (2021-11-11 16:22:38 UTC)
Feb 23 08:53:49 (none) user.notice kernel: [ 0.000000] Linux version 4.14.38 (jenkins@ci-44) (gcc version 7.3.0 (OpenWrt GCC 7.3.0 10364-gca67746e8)) #2 SMP Thu Nov 11 18:56:52 UTC 2021
Feb 23 08:53:49 (none) user.info kernel: [ 0.000000] Command line: BOOT_IMAGE=/18_5_2_380 quiet console=tty0 console=ttyS0,38400n8 pcie_aspm.policy=performance

(Pages and pages of further output ...)
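Because the boot messages bury the interesting part, it can help to pull out only the lines that immediately precede each reboot announcement. The sketch below is an illustration under assumptions: the marker text "System will reboot" is taken from the excerpt above and may differ on other builds, the function name `lines_before_marker` is hypothetical, and the script is meant to run on a workstation against a copy of syslog.log.

```python
from collections import deque

def lines_before_marker(lines, marker="System will reboot", before=5):
    """For each line containing marker, return it with the `before` lines preceding it."""
    recent = deque(maxlen=before)  # rolling window of the most recent lines
    hits = []
    for line in lines:
        if marker in line:
            hits.append((line, list(recent)))
        recent.append(line)
    return hits

# Lines from the excerpt above, used as sample input.
sample = [
    "Feb 23 08:52:23 localhost heartbeat: [SEND-TLV] No response from autherntication server expected.",
    "Feb 23 08:52:25 localhost kernel: [480245.123349] packet dropped in ipsec0 device",
    "Feb 23 08:52:28 localhost heartbeat: [SEND-TLV] No response from autherntication server expected.",
    "Feb 23 08:52:35 localhost : System will reboot",
    "Feb 23 08:52:35 localhost : The system is going down NOW!",
]
for hit, context in lines_before_marker(sample, before=2):
    print("\n".join(context))
    print(">>", hit)
```

For a real file, an open file object can be passed instead of the list, for example `open("syslog.log", errors="replace")`; the lines then keep their trailing newlines.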

Is there an easy way to find out the root cause for the freezing of the auxiliary and the reboot of the primary?

Do you have any troubleshooting guides that cover the above cases?

Are there any known issues that explain these two behaviours?

I also noticed that the access log starts right after the reboot, so it contains no entries from before the reboot.

Regards,
BeEf



  • Hello BeEf,

    Thank you for contacting the Sophos Community.

    You would need to check csc.log, applog.log, syslog.log, msync.log and networkd.log.

    Additionally, check for any coredump from the date of the failure under /var/cores.

    Checking the memory and CPU graphs of the XG might also give you a clue as to whether the device stopped responding.

    Also the output of this command: grep 'NMI\|backtrace' /log/syslog.log
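To go through the listed logs in one pass, a small script can keep only the lines from the failure window plus any line that matches the same keywords as the grep above. This is a rough sketch under assumptions, not an official procedure: it assumes the files were copied from the appliance into a single local directory, that each of them starts its lines with the same "Mon DD HH:MM:SS" stamp seen in syslog.log (not confirmed for every file), and the names `filter_lines` and `scan_logs` are hypothetical.

```python
import re
from pathlib import Path

# Log files named in the reply above.
LOG_NAMES = ["csc.log", "applog.log", "syslog.log", "msync.log", "networkd.log"]

def filter_lines(lines, stamp_prefix, pattern=r"NMI|backtrace"):
    """Keep lines that start with stamp_prefix or match the keyword pattern."""
    rx = re.compile(pattern)
    return [ln.rstrip("\n") for ln in lines
            if ln.startswith(stamp_prefix) or rx.search(ln)]

def scan_logs(log_dir, stamp_prefix, names=LOG_NAMES):
    """Apply filter_lines to each named log under log_dir; missing files are skipped."""
    results = {}
    for name in names:
        path = Path(log_dir) / name
        if path.is_file():
            with path.open(errors="replace") as fh:
                results[name] = filter_lines(fh, stamp_prefix)
    return results
```

For example, `scan_logs("./xg-logs", "Feb 23 08:5")` would collect everything logged between 08:50 and 08:59 on Feb 23, where `./xg-logs` is a placeholder for wherever the copies were saved.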

    Regards,


     
    Emmanuel (EmmoSophos)
    Technical Team Lead, Global Community Support
  • Over the weekend the auxiliary froze again, and the primary firewall rebooted again.
    So it is the same situation as last time.

    grep 'NMI\|backtrace' /log/syslog.log shows nothing before the reboot.
    There are no coredumps under /var/cores on either the primary or the auxiliary firewall.
    After the reboot the primary showed the normal login screen (normal behaviour).

    On the auxiliary I saw the normal login screen (firmware version and login prompt), but I could not log in. All network interfaces and the COM port were unresponsive.
    I disconnected it from power and switched it back on; the firewall synced and everything was up and running again.

  • Hello BeEf,

    Thank you for the update.

    What is the output of the following commands:

    # cish -c "system firewall-acceleration show"

    console> system auto-reboot-on-hang show

    # cat /proc/iomem | grep -i crash

    And do you see anything on the Graphs of the XG?

    Regards,


     
    Emmanuel (EmmoSophos)
    Technical Team Lead, Global Community Support
  • After the first two crashes (primary) and the first two freezes (auxiliary), I disabled firewall acceleration.

    Primary/Auxiliary:
    XG550_RL02_SFOS 18.5.2 MR-2-Build380# cish -c "system firewall-acceleration show"
    Firewall Acceleration is Disabled in Configuration.
    Firewall Acceleration is Unloaded.

    What is the impact of this setting with respect to stability of the appliance?


    Primary/Auxiliary:
    XG550_RL02_SFOS 18.5.2 MR-2-Build380# cish -c "system auto-reboot-on-hang show"
    Auto reboot system when kernel gets into a hang state is enabled

    Primary/Auxiliary:
    cat /proc/iomem | grep -i crash returned nothing.

    Graphs:

    I did not see anything suspicious in the graphs or in our monitoring program.

  • Hello BeEf,

    Thank you for the follow-up.

    Disabling firewall acceleration does not affect the stability of the appliance; it only stops traffic from using the fast path.

    I would recommend opening a case so Support can start tracking what has been happening. Most likely this will need to be escalated to see whether something else is affecting the appliance. Feel free to reference this Community post when opening the case. Also, as Luca suggested, connect a console cable so that Support has more information the next time this happens.

    Regards,


     
    Emmanuel (EmmoSophos)
    Technical Team Lead, Global Community Support