We are using SG-115 firewalls in an HA active-passive setup at our remote locations.
The setup has been working well so far.
There is only one problem: from time to time, one node in the cluster crashes. When this happens, the firewall does not respond to anything. Even with a keyboard and a monitor connected directly to the firewall, the screen stays black and nothing happens when hitting keys. This occurs sporadically and at different locations.
I have already discussed this issue with Sophos Support twice without a final solution. Sometimes the problem calms down with a new firmware release, but with another firmware version it comes back and then happens more often.
In case of failure, we instruct a person on site to power cycle the faulty node; it then comes up normally and the cluster is healthy again.
My question, as a workaround for the problem described above: is there any kind of watchdog solution implemented in the UTM or in the hardware's firmware, so that a faulty node resets itself and boots again?
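For context, here is roughly what such a watchdog looks like on a generic Linux system. This is only a sketch of the standard Linux watchdog character-device convention; whether the SG-115 hardware or the UTM firmware exposes such a device is not confirmed anywhere in this thread, and the path `/dev/watchdog`, the function name `pet_watchdog` and its parameters are assumptions chosen for illustration.

```python
import time


def pet_watchdog(device_path="/dev/watchdog", interval_s=10.0, rounds=3):
    """Keep a Linux-style watchdog device alive for a fixed number of rounds.

    Hypothetical helper: on a real watchdog driver, opening the device arms
    a hardware timer, and the machine is reset if the writes stop arriving.
    Returns the number of keep-alive writes that were issued.
    """
    writes = 0
    with open(device_path, "wb", buffering=0) as wd:
        for _ in range(rounds):
            wd.write(b"\0")  # any write restarts the countdown
            writes += 1
            time.sleep(interval_s)
        # "Magic close": writing 'V' just before closing asks the driver
        # to disarm the timer, if the driver supports that feature.
        wd.write(b"V")
    return writes
```

The point of the design is that a process keeps writing to the device while the system is healthy. If the node hangs as described above, the writes stop and the hardware timer triggers a reset without anyone on site.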
Hallo Jens and welcome to the UTM Community!
I can't recall ever seeing this issue here.
If Sophos Support hasn't been able to help you, you should request that your case be escalated. Are your SGs rev.1, .2 or .3?
It sounds like the lockup is such that your only solution may be the one you're using. Can you SSH into a node with a black screen? Have you tried PuTTY for this? See "Sophos UTM: How to access the UTM shell via SSH using PuTTY".
How often does this happen in a given location? Is it always the active UTM that locks up, the passive or random?
Please insert a picture of the 'Configuration' tab in 'High Availability'.
Cheers - Bob
Hi BAlfson, thanks for your reply.
All devices are revision 3. When this happens, the affected device is completely dead. No access is possible via TCP/IP, for example via SSH. I also tried to reach the node via SSH from the active node (which I was connected to via SSH), but it is completely hung.
This happens in different locations and at random. Mostly it's the secondary that goes down, but sometimes the primary does too.
Here's a screenshot:
I already set node 1 as preferred master, but that doesn't improve the situation.
I also configured a backup interface at other sites; this does not improve the situation either.
I already collected logs several times from the firewall but there is nothing noticeable in the logs.
My assumption is that some kind of pattern update causes this issue, but I cannot prove it right now...
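One way to test this assumption would be to line up the lockup times against the pattern update times. The following is a minimal sketch, assuming both lists of timestamps have been collected by hand from the logs; the function name `updates_before_lockups` and the one-hour window are illustrative choices, not anything provided by the UTM.

```python
from datetime import datetime, timedelta


def updates_before_lockups(lockups, updates, window=timedelta(hours=1)):
    """Pair each lockup with the latest update seen within `window` before it.

    Both arguments are lists of datetime objects. Returns a list of
    (lockup, update_or_None) tuples, sorted by lockup time.
    """
    updates = sorted(updates)
    pairs = []
    for lockup in sorted(lockups):
        # Keep only updates that happened at or before the lockup and
        # no longer than `window` ahead of it.
        preceding = [u for u in updates if timedelta(0) <= lockup - u <= window]
        pairs.append((lockup, preceding[-1] if preceding else None))
    return pairs
```

If most lockups turn out to have an update shortly before them, that supports the suspicion; if most pair with `None`, the cause is probably elsewhere. Since a hung node cannot report when it died, the last log timestamp written by that node is the closest available estimate of the lockup time.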
I have already raised another ticket with Sophos Support (the third one) and am trying to escalate it directly...
Last time they replaced the hardware of a device at a site where it happened. After that it did not happen again for about two weeks, and then the issue came back...
It sounds like you're doing all of the right things, Jens. I usually recommend 'Preferred Master: None' and 'Backup interface: Internal'. Does that make any difference?
Cheers - Bob
Yes, in general the default config is "Preferred Master: None". We also switched the backup interface for test purposes, but without success. I don't think it has anything to do with which node is configured as master or where the HA traffic is processed when the primary (ETH3) connection does not work. The firewalls are of course directly connected on ETH3.
The only thing I can say is that the root cause of this problem must be somewhere in the "firmware" of the UTM. We also have an SG-430 HA active-passive setup, which works well. Maybe there is some kind of "overflow" that the small SG-115 firewalls cannot handle...
The ticket with Sophos has been opened, and I insisted on working with a 2nd-level person. Wish me luck...