I'm attempting to spin up a new HA cluster of two XG VMs for a new environment, hosted in Hyper-V.
I followed the basic setup here for active-passive mode: https://community.sophos.com/kb/en-us/123174
I confirmed that the devices can see each other on the HA link interface, and both will respond properly if I assign addresses to various interfaces they have.
I enable the HA settings on the aux device and it saves properly; however, as soon as I enable HA on the primary device, both devices become unreachable, either by ping or on the admin console. The VMs then seem to go into an endless loop of rebooting every 2-3 minutes. If I log in via the console to either device and run "system ha show details", they always show HA status enabled, with the current HA state of Standalone and the peer HA state of Fault.
I left the VMs for several hours to see if they would sort themselves out, but nothing changed. I also tried powering off the secondary device entirely and rebooting the primary to determine if the primary would at least come up in that state, but it showed the same behavior.
If I run "system ha disable" from the console on a device, it immediately starts to respond again, so it doesn't appear that anything is corrupt or fully broken in the config itself; the problem only occurs when HA is enabled.
If I disable HA and look at the log viewer, the only relevant thing I see is "Appliance with appliance key XXXX becomes standalone at appliance startup".
Are there additional steps that need to be taken in order to deploy XG in an HA config on Hyper-V? Or is there a way to view detailed information as to why both devices report the other in fault mode, or why the primary won't even respond if the secondary is off/unplugged?
HA in a virtual environment is a little bit tricky because XG (like UTM HA) spoofs the MAC address in case of an HA takeover, and the Hyper-V vSwitch basically does not like this.
Check the Hyper-V forum for a workaround on MAC spoofing and how to enable it.
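For reference, Hyper-V controls MAC address spoofing per virtual network adapter, and the `Set-VMNetworkAdapter` PowerShell cmdlet has a `-MacAddressSpoofing` parameter for it. Below is a minimal sketch that only prints the command for each node. The VM names `XG-Primary` and `XG-Aux` are hypothetical placeholders, not taken from this thread; the printed lines would need to be run in an elevated PowerShell session on the Hyper-V host.

```python
def spoofing_commands(vm_names):
    """Build one Set-VMNetworkAdapter command line per VM name.

    This only formats strings; it does not talk to Hyper-V.
    """
    return [
        f'Set-VMNetworkAdapter -VMName "{name}" -MacAddressSpoofing On'
        for name in vm_names
    ]


# Hypothetical VM names; substitute the names of your own XG VMs.
for command in spoofing_commands(["XG-Primary", "XG-Aux"]):
    print(command)
```

The same setting is also exposed in Hyper-V Manager under each network adapter's advanced features, so the command line is optional.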
In reply to LuCar Toni:
Thanks, that got me a little further :)
I enabled MAC spoofing on the NICs for both VMs, configured the secondary/aux device, and then enabled HA on the primary. I receive an error message that says "HA has been enabled successfully, but it is recommended to check the physical connectivity of peer monitoring ports", and the secondary device seems to go into an endless reboot loop, although the primary device seems to work/respond properly.
Interestingly enough the secondary device seems to stop rebooting if I disconnect the peer port in Hyper-V (make it not connected to a network).
Also, if I disable HA, I can confirm that both devices are accessible via SSH on the monitoring port, so it doesn't look like an actual link issue.
Are there any other issues that might cause that error? Or additional steps I can take in troubleshooting?
In reply to Sean Patterson:
I have exactly the same issue on Hyper-V. I cannot figure this out. I can SSH into each appliance.
In reply to Kerry Barnes:
Has anyone figured this out yet? HA deployment in VMware appears to work fine, but Hyper-V is a no-go. Interestingly enough, an HA deployment in Azure doesn't appear to use the built-in HA functionality; it uses external load balancers (community.sophos.com/.../127934).
@SophosSupport: does this mean HA in Hyper-V is not supported? What's the official position?
Thx in advance! M.
In reply to Marcus Bauer:
I ended up giving up without solving it, figuring I could schedule the downtime for things behind this device when I needed to reboot for updates/etc.
I thought it may have had something to do with the MAC spoofing tripping up the physical layer (either host or switch) in this environment, because we had a different issue with the ARP cache not properly updating in some cases. It was around that time I determined that the headaches weren't worth it, and I stopped trying as I had other projects I needed to move to.
If someone can determine an easy way to configure this on Hyper-V (especially in a replicable manner that doesn't tie up a real license key, so I can play with it in a test environment), I'd love to have it solved.
Multicast is not supported in public cloud environments like Azure and AWS, which is why the Azure/AWS load balancers are used to achieve HA in those environments.
In reply to DavidOkeyode:
That would explain Azure/AWS, but not on-prem Hyper-V environments.
The lack of any kind of statement in this regard (HA on Hyper-V) is a bit disappointing...
Can you tell me more about your Hyper-V and switch configuration? Are you using anything like Spanning Tree Protocol?
In reply to EmileBelcourt:
I don't control the physical layer under the hypervisors I was attempting to install on, so I'm not entirely sure how they're configured.
One step I had taken to try to eliminate that was to install both the primary and secondary devices on the same hypervisor (running Windows Server 2016, Hyper-V), connected via a private network to hopefully isolate the physical layer.
The devices were still not functioning properly and exhibited the same odd rebooting behavior.
That was around the time I had to move to other projects and made the decision that downtime could be scheduled, or the device could be restored from a backup if we ran into issues, as opposed to having an HA setup.
At this point, I'm not planning on putting much more time into the configuration attempt unless someone can provide a confirmed walkthrough to configure XG VMs in an active/passive cluster that I can test using trial license keys. That will let me confirm the setup in my test bench and then configure a test instance on the same hardware/configuration I have the current production instance so I can be sure that everything will work before I schedule downtime for production.
The deployment method in the KB article, plus the little extra bits we discussed in this thread, is all that is required for the configuration and deployment. The reason I asked about STP is that I have seen this behaviour of the aux node going into a reboot loop when the "interface up" of the NIC the XG was connected to for the HA heartbeat took longer than the 6 seconds required to confirm that the remote HA master is available.
If you are having issues like this and have a licensed box, I would recommend involving Support. The other thing I would recommend is reaching out to your account manager to speak to a Sales Engineer to further understand any nuances with HA in Hyper-V.
I have not seen the behaviour you have described in Hyper-V, and my HA builds have normally gone through fine (even cross-host), but I have seen the reboot loop before when Spanning Tree Protocol was enabled and delayed the interface coming up.
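To make the timing argument concrete, here is a rough sketch comparing the 6-second window quoted above with how long a switch port can take to start forwarding after link-up. It assumes classic 802.1D spanning tree with the default 15-second forward delay, where a non-edge port spends one forward-delay interval in listening and one in learning. The function names are made up for illustration, and treating the HA check as a simple "delay under timeout" comparison is a simplification, not a description of how XG implements it.

```python
HA_PEER_TIMEOUT_S = 6  # window quoted in this thread for confirming the peer


def link_ready_delay(forward_delay_s=15, edge_port=False):
    """Seconds before a classic 802.1D STP port forwards traffic after link-up.

    A non-edge port waits one forward-delay interval in the listening state
    and another in the learning state. An edge (PortFast-style) port is
    assumed to skip both and forward immediately.
    """
    return 0 if edge_port else 2 * forward_delay_s


def heartbeat_arrives_in_time(delay_s, timeout_s=HA_PEER_TIMEOUT_S):
    """True if the port is forwarding before the peer-check window closes."""
    return delay_s < timeout_s


for edge in (False, True):
    delay = link_ready_delay(edge_port=edge)
    print(f"edge_port={edge}: delay={delay}s, "
          f"in time={heartbeat_arrives_in_time(delay)}")
```

Under these assumptions a default STP port needs 30 seconds, well past the 6-second window, while an edge port does not add any delay. That is consistent with the reboot loop described above, but it does not prove STP is the cause in any particular environment.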
Is this something I can set up with a trial license?
That would allow me to configure it in my test environment and ensure there are no issues. Then, if it works, I could spin up new test VMs in the production environment to determine if that environment has some issue preventing it from working properly (like STP as you mentioned).
That will also let me attempt possible resolutions so I can be confident that when I schedule downtime for the production VM I’m not simply shooting in the dark and hoping.