XG Firewall freezes up completely every month or so, nothing in logs so I can't determine the cause

At one of our client sites, an XG firewall works flawlessly most of the time, but every month or so it just stops working. You can't ping it, it stops routing traffic, just nothing. When you physically look at the firewall it looks fine and the activity lights blink. Every time I've tried to track down a cause, I haven't been able to find one. For example, the system logs are normal up to the point where it stops functioning, and they don't resume until after a power cycle. I have most of the logging enabled, so it should catch at least something. Any ideas as to the cause of the freezing, or why nothing is caught in the logs?
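
Since the logs simply stop and only resume after the power cycle, one rough way to pin down when a freeze began is to look for the widest gap between consecutive log timestamps in an exported log file. The sketch below assumes syslog-style lines that start with a timestamp such as `Mar 29 10:50:01`; the function name `largest_gap`, the `year` parameter, and the sample lines are made up for illustration and are not taken from a real device.

```python
from datetime import datetime, timedelta


def largest_gap(lines, year=2020):
    """Return (last_before, first_after, gap) for the widest gap between
    consecutive timestamped lines, or None if fewer than two lines parse."""
    stamps = []
    for line in lines:
        try:
            # Assumed layout: the first 15 characters look like "Mar 29 10:50:01".
            # Syslog timestamps carry no year, so one is supplied by the caller.
            stamps.append(
                datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
            )
        except ValueError:
            continue  # skip lines without a leading timestamp

    best = None
    for prev, cur in zip(stamps, stamps[1:]):
        gap = cur - prev
        if best is None or gap > best[2]:
            best = (prev, cur, gap)
    return best


# Made-up sample lines, for illustration only.
sample = [
    "Mar 29 10:49:58 example message",
    "Mar 29 10:50:01 example message",
    "Mar 29 14:12:40 example message after power cycle",
]
before, after, gap = largest_gap(sample)
print(f"Last entry before the gap: {before}, first entry after: {after}, gap: {gap}")
```

The last timestamp before the gap is a reasonable starting point for comparing against the system graphs or any external monitoring.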

  • Hi  

    I would ask you to connect to the device with a serial console cable and enable session output logging in PuTTY. Keep it connected until the issue occurs, so that the captured logs can help provide an RCA for this. - https://community.sophos.com/kb/en-us/123197

    You may also run fsck.

    Check file system integrity of all the partitions. Turning this option ON forces a file system integrity check on the
    next device reboot. By default, the check is OFF, but whenever the device goes into failsafe for one of the following
    reasons, the check is automatically turned ON:

    • Unable to start Config/Report/Signature Database
    • Unable to Apply migration
    • Unable to find the deployment mode

    fsck-on-nextboot[ off | on | show ]

    Once the check is turned ON, all the partitions will be checked at boot. The check will be turned OFF again on
    the next boot.

    If the option is ON and the device boots up for one of the following reasons, the file system check will not be
    enforced and the option will be disabled after boot:

    • Factory reset
    • Flush Device Report

    Please also check system graphs for CPU usage.

  • In reply to Keyur:

    Hi Keyur, thanks for the reply.  I'm not onsite so the serial cable connection will have to wait.  I did check the system graphs though, take a look.

    Notice how it just drops off completely.  All the graphs show this for that time slot.  I'll connect with Putty and turn on fsck for next boot.

  • In reply to AdminofClouds:

    Hi  

    As per the graphs, it seems the device was having issues.

    I am sharing an article with you on capturing logs, so that we can analyze them and see if we find anything.

    You will need Advanced Shell access. Try to capture dmesg, sysinit, and syslog.

    Connecting to the advanced shell

    1. To connect using SSH, you may use any SSH client to connect to port 22 of the SFOS device.
    2. Select option 5 Device Management.
    3. Select option 3 Advanced Shell.

    https://community.sophos.com/kb/en-us/132211


    https://community.sophos.com/kb/en-us/123185

  • Interesting, we manage approx. 60 XG Firewalls, most running 17.10, and have similar symptoms on approx. 4-5 firewalls. We have a case open with GES at present. So far they seem to be focusing their attention on what's occurring between the firewall & Sophos Firewall Manager.

  • In reply to Adam Rippon:

    Below is what we have received from Sophos so far, & it appears to be the same issue across our affected firewalls:

    "Hello


    Development team found that 'DB is In deadlock due to the following command.'
    DEBUG     Mar 29 10:50:01  [apiExport:4847]: exec: argv[2] = '/bin/sh /scripts/API/cleanupdb.sh'

    To gather more cause analysis and then to proceed with resolution, we will need output of requested command when issue occurs. 

    The command should be executed when / during you face the issue and not after rebooting / restarting the appliance.

    Also please let us know immediately once you face the issue as we will have to collect /log/corpvaccum.log

    Let me know in case you have any query. 


    Regards,"

  • In reply to Adam Rippon:

    Hi  

    Thank you for sharing the details. Could you please PM us the service request number so that we can keep an eye on the progress and details?

  • In reply to Keyur:

    Thanks for reaching out Keyur.

    PM has been sent.

    Many thanks

    Adam

  • In reply to Adam Rippon:

    Hi  

    Thank you for sharing it.

  • In reply to Keyur:

    Hi,

    Below is the update we received last evening.

    “Development team have found following 2 lines from psql which indicates 2 parallel vacuum process running.

    23531 | vacuum full | active | 2020-05-11 19:19:58.115152+10 | 2020-05-11 19:19:58.116417+10 | 2020-05-11 19:19:58.116417+10 | 2020-05-11 19:19:58.116418+10

    9580 | vacuum full | active | 2020-05-11 20:31:47.524481+10 | 2020-05-11 20:31:47.52581+10 | 2020-05-11 20:31:47.52581+10 | 2020-05-11 20:31:47.525812+10

    To solve we need to change content of script /scripts/API/cleanupdb.sh and monitor it.”
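
To make the quoted finding concrete, here is a minimal sketch that takes the two psql rows above and reports which sessions are running `vacuum full` at the same time. The column meanings (PID, query, state, then timestamps) are an assumption based on how the rows look; the function name `active_vacuum_pids` is made up for illustration and is not part of any Sophos tooling.

```python
def active_vacuum_pids(rows):
    """Return the PIDs of rows whose query is 'vacuum full' and state is 'active'.

    Assumed column order: pid | query | state | timestamps...
    """
    pids = []
    for row in rows:
        fields = [field.strip() for field in row.split("|")]
        if len(fields) < 3:
            continue  # not a data row
        pid, query, state = fields[0], fields[1], fields[2]
        if query.lower() == "vacuum full" and state == "active":
            pids.append(int(pid))
    return pids


# The two rows quoted in the update from Sophos.
rows = [
    "23531 | vacuum full | active | 2020-05-11 19:19:58.115152+10 | 2020-05-11 19:19:58.116417+10 | 2020-05-11 19:19:58.116417+10 | 2020-05-11 19:19:58.116418+10",
    "9580 | vacuum full | active | 2020-05-11 20:31:47.524481+10 | 2020-05-11 20:31:47.52581+10 | 2020-05-11 20:31:47.52581+10 | 2020-05-11 20:31:47.525812+10",
]

pids = active_vacuum_pids(rows)
if len(pids) > 1:
    print(f"Parallel vacuum full sessions detected: {pids}")
```

More than one PID in the result matches what the development team described: two vacuum processes active in the same snapshot.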

  • In reply to Adam Rippon:

    So after all of this, we received this reply:

    "Hello All,
    The solution for the reported issue is to disable 'CCL' for each firewall from SFM.

    Please note that 'End of Life' for SFM is June 30, 2021

    Central Management which will be used instead of SFM does not have 'CCL' feature.
    www.sophos.com/.../xg-firewall-in-central.aspx

    For finding another workaround/solution we need to put CSC service in debug mode and will need output of some more Postgres queries. In case you are looking for another solution please provide me support Access ID of 2-3 different firewalls on which issue had occurred in the past.

    I will put the services in debugging and provide you steps on how to collect the required information when issues occur."


    Really disappointed with the response.