At one of our client sites, an XG firewall works flawlessly most of the time, but every month or so it just stops working. You can't ping it and it stops routing traffic; it does nothing at all. When you physically look at the firewall it looks fine and the activity lights blink. Every time I've tried to track down a cause, I haven't been able to find one. For example, the system logs are normal up to the point where it stops functioning, and they don't resume until after a power cycle. I have most of the logging enabled, so it should catch at least something. Any ideas as to the cause of the freezing, or why nothing is caught in the logs?
Hi AdminofClouds, I would suggest connecting to the device over a serial console cable and enabling session output logging in PuTTY. You have to keep it connected until the issue occurs in order to capture logs that can provide an RCA: https://community.sophos.com/kb/en-us/123197
You may also run fsck.
Check file system integrity of all the partitions. Turning ON this option forcefully checks the file system integrity on the next device reboot. By default, the check is OFF, but whenever the device goes into failsafe due to the following reasons, this check is automatically turned ON:
• Unable to start Config/Report/Signature Database
• Unable to Apply migration
• Unable to find the deployment mode
fsck-on-nextboot [ off | on | show ]
Once the check is turned ON, all the partitions will be checked on boot. The check will be turned OFF again on the next boot.
If the option is ON and the device boots up due to the following reasons, then the file system check will not be enforced and the option will be disabled after boot:
• Factory reset
• Flush Device Report
Please also check system graphs for CPU usage.
In reply to Keyur:
Hi Keyur, thanks for the reply. I'm not onsite so the serial cable connection will have to wait. I did check the system graphs though, take a look.
Notice how it just drops off completely. All the graphs show this for that time slot. I'll connect with PuTTY and turn on fsck for the next boot.
In reply to AdminofClouds:
Hi AdminofClouds, as per the graphs, it seems the device was having issues. I am sharing an article with you so that you can try to capture logs for analysis. You will need Advanced Shell access to capture dmesg, sysinit, and syslog.
Interesting. We manage approx. 60 XG Firewalls, most running 17.10, and have similar symptoms on approx. 4-5 of them. We have a case open with GES at present. So far they seem to be focusing their attention on what's occurring between the firewall and Sophos Firewall Manager.
In reply to Adam Rippon:
Below is what we have received from Sophos so far, and it appears to be the same issue across our affected firewalls:
"Hello
Development team found that 'DB is In deadlock due to the following command.'
DEBUG Mar 29 10:50:01 [apiExport:4847]: exec: argv = '/bin/sh /scripts/API/cleanupdb.sh'
To gather more cause analysis and then to proceed with resolution, we will need output of requested command when issue occurs. The command should be executed when / during you face the issue and not after rebooting / restarting the appliance.
Also please let us know immediately once you face the issue as we will have to collect /log/corpvaccum.log. Let me know in case you have any query.
Regards,"
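For anyone wanting to line up freeze times against runs of that cleanup script, here is a rough Python sketch that pulls matching entries out of saved log lines. It assumes the lines look exactly like the DEBUG line quoted above; the function name `find_cleanup_runs` and the idea of feeding it lines copied off the appliance are my own assumptions, not anything Sophos provided.

```python
import re

# Pattern modelled on the single DEBUG line quoted by support.
# The exact log format on the appliance is an assumption.
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s+"
    r"(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"\[(?P<proc>[^:\]]+):(?P<pid>\d+)\]:\s+"
    r"exec:\s+argv\s+=\s+'(?P<argv>[^']*)'"
)


def find_cleanup_runs(lines, script="/scripts/API/cleanupdb.sh"):
    """Return (timestamp, process, pid) for every exec of `script`."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and script in match.group("argv"):
            hits.append(
                (match.group("ts"), match.group("proc"), int(match.group("pid")))
            )
    return hits


# The one line we actually have from the support reply.
sample = [
    "DEBUG Mar 29 10:50:01 [apiExport:4847]: exec: argv = "
    "'/bin/sh /scripts/API/cleanupdb.sh'"
]

for ts, proc, pid in find_cleanup_runs(sample):
    print(ts, proc, pid)
```

If the last cleanup run before each freeze always lands shortly before the graphs drop off, that would at least be consistent with what the development team described.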
Hi Adam Rippon
Thank you for sharing the details. Could you please PM us the service request number so that we can keep an eye on the progress and details?
Thanks for reaching out Keyur.
PM has been sent.
Hi Adam Rippon Thank you for sharing it.
Below is the update we received last evening.
“Development team have found following 2 lines from psql which indicates 2 parallel vacuum process running.
23531 | vacuum full | active | 2020-05-11 19:19:58.115152+10 | 2020-05-11 19:19:58.116417+10 | 2020-05-11 19:19:58.116417+10 | 2020-05-11 19:19:58.116418+10
9580 | vacuum full | active | 2020-05-11 20:31:47.524481+10 | 2020-05-11 20:31:47.52581+10 | 2020-05-11 20:31:47.52581+10 | 2020-05-11 20:31:47.525812+10
To solve we need to change content of script /scripts/API/cleanupdb.sh and monitor it.”
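If you end up collecting the same kind of psql output yourself, a small Python sketch like the one below can flag when more than one vacuum shows as active. The rows are copied from the reply above; treating the columns as pid, query, state, and then timestamps is my assumption, since support did not include the column headers, and `active_vacuums` is just a name I made up.

```python
# Rows copied from the support reply; column meaning is assumed to be
# pid | query | state | timestamps...
ROWS = """\
23531 | vacuum full | active | 2020-05-11 19:19:58.115152+10 | 2020-05-11 19:19:58.116417+10 | 2020-05-11 19:19:58.116417+10 | 2020-05-11 19:19:58.116418+10
9580 | vacuum full | active | 2020-05-11 20:31:47.524481+10 | 2020-05-11 20:31:47.52581+10 | 2020-05-11 20:31:47.52581+10 | 2020-05-11 20:31:47.525812+10
"""


def parse_row(line):
    """Split one pipe-delimited row; return None for anything unexpected."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 4 or not fields[0].isdigit():
        return None
    return {
        "pid": int(fields[0]),
        "query": fields[1].lower(),
        "state": fields[2].lower(),
        "first_timestamp": fields[3],  # kept as text, not parsed
    }


def active_vacuums(text):
    """Return the rows whose query is a vacuum and whose state is active."""
    rows = (parse_row(line) for line in text.splitlines())
    return [
        row
        for row in rows
        if row and row["state"] == "active" and row["query"].startswith("vacuum")
    ]


vacuums = active_vacuums(ROWS)
if len(vacuums) > 1:
    pids = ", ".join(str(row["pid"]) for row in vacuums)
    print(f"{len(vacuums)} vacuum processes active at once (pids {pids})")
```

The timestamps are left as plain strings on purpose, because the short "+10" offset is not accepted by every date parser.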
So after all of this we received this reply
"Hello All,
The solution for the reported issue is to disable 'CCL' for each firewall from SFM.
Please note that 'End of Life' for SFM is June 30, 2021
Central Management which will be used instead of SFM does not have 'CCL' feature. www.sophos.com/.../xg-firewall-in-central.aspx
For finding another workaround/solution we need to put CSC service in debug mode and will need output of some more Postgres queries. In case you are looking for another solution please provide me support Access ID of 2-3 different firewalls on which issue had occurred in the past.
I will put the services in debugging and provide you steps on how to collect the required information when issues occur."
Really disappointed with the response.