
Sophos Firewall: v19.5 MR2: Feedback and experiences

Release Post: Sophos Firewall OS v19.5 MR2 is Now Available

The old V19.5 MR1 Post: Sophos Firewall: v19.5 MR1: Feedback and experiences 

To make tracking issues and feedback easier, please include your Sophos Support case ID, if you have one, in your initial post so we can follow up on it.



This thread was automatically locked due to age.
  • The day before this update came out we updated a customer from 19.0.x (19.0.2, I think) to 19.5.1, and it failed spectacularly. It seems to start up just fine, but as soon as the HA Auxiliary device joins it all comes crashing down. The Primary can ping maybe 1 in 3 devices on the network. We can connect remotely sometimes. Forcing an HA failover doesn't fix anything.

    In the release notes of 19.5.2 I see that NC-115019 refers to "Primary device in HA becomes unresponsive.", which vaguely matches our issue but could also refer to a completely different issue (e.g. that the primary freezes after some number of days). How can I find out when that bug started? That NC number is not mentioned in the known issues list, maybe because it is now resolved? If the bug was introduced in 19.5.0 or 19.5.1 then that could be the cause. If not, I guess I'll raise a case with Sophos support, but doing that is such a pain.

  • Hi James,

    I'm sorry to hear that, and apologies for the experience. Please do raise a support case and share your case number here.

    Erick Jan
    Community Support Engineer | Sophos Technical Support
    Sophos Support Videos Product Documentation  |  @SophosSupport  | Sign up for SMS Alerts
    If a post solves your question use the 'Verify Answer' link.

  • I have logged a case. I would typically use the support case number as proof that the Sophos support caller is who they say they are, so I'm reluctant to post it here. I'll PM you.

  • Hi James,

    I've received the case number. I'll also put a note on your case and keep monitoring it. Thank you.

    Erick Jan

  • By the way, your issue sounds like a general ARP problem.

    https://doc.sophos.com/nsg/sophos-firewall/19.5/Help/en-us/webhelp/onlinehelp/HighAvailablityStartupGuide/AboutHA/HAArchitecture/index.html

    Check if the HA Cluster can "survive" this MAC Spoofing by checking the Switches. 
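
One way to look for this kind of ARP trouble is to compare repeated ARP table dumps and flag any IP address that shows up with more than one MAC over time. The following is only a rough sketch, not a Sophos tool: the function name `find_mac_flaps` and all addresses in the sample data are made up for illustration, and how you collect the (IP, MAC) pairs from your switches or hosts is left open.

```python
from collections import defaultdict


def find_mac_flaps(observations):
    """Return {ip: [MACs in the order they changed]} for IPs seen with more than one MAC.

    `observations` is an iterable of (ip, mac) pairs in capture order,
    for example taken from repeated ARP table dumps (hypothetical input format).
    """
    history = defaultdict(list)
    for ip, mac in observations:
        mac = mac.lower()  # normalise case so AA:BB and aa:bb compare equal
        # Record a MAC only when it differs from the last one seen for this IP.
        if not history[ip] or history[ip][-1] != mac:
            history[ip].append(mac)
    # Keep only the IPs whose MAC changed at least once.
    return {ip: macs for ip, macs in history.items() if len(macs) > 1}


# Made-up sample data: 192.0.2.1 alternates between two MACs, 192.0.2.10 is stable.
sample = [
    ("192.0.2.1", "00:11:22:33:44:55"),
    ("192.0.2.10", "AA:BB:CC:00:00:01"),
    ("192.0.2.1", "00:11:22:33:44:66"),
    ("192.0.2.1", "00:11:22:33:44:55"),
    ("192.0.2.10", "aa:bb:cc:00:00:01"),
]
flaps = find_mac_flaps(sample)
print(flaps)
```

An address that keeps switching between MACs in the result is worth checking on the switch side. A single change around the time of a failover would not be surprising by itself.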

  • Yes it definitely does have that feel to it. Has there been a change between the previous working-perfectly version 19.0.1 and the new everything-is-broken version 19.5.1?

    The LAN port is a LAG across 4 ports, so if there was a change with LACP or something then that could maybe trip it up.

  • There are no changes, and this behavior has been the same since day one. We use a virtual MAC, and if the interface fails, a failover is triggered.

    So if something does not work after an update, you could fail over to the other appliance to check whether that appliance accepts the traffic.

  • You can be sure I've tried failing over, failing back, and rebuilding HA. Something has changed between 19.0.1 and 19.5.1 and it's broken our cluster badly.

  • I had that happen too during one of our HA upgrades. We had to remove and re-create the HA connection from scratch to get it to work. The secondary upgraded fine and took over, but the primary was unresponsive and had to be physically restarted to even get into its GUI or SSH. Once we did that, we were able to go into the GUI, remove the HA completely from both, upgrade the primary, then re-create the HA.

    It was a frustrating endeavor because it was an active-backup HA, and the secondary didn't take over the license in the re-creation process. We had to remotely add an evaluation license for the necessary firewall features so it would function properly while we did the work. Hopefully it doesn't happen again, since that serial number has now burned its evaluations, but we do also have the option of adding a 1-month MSP license to it through Sophos Central Partner in that case.

    I think that the primary firewall should have another management IP address in case of these kinds of issues, separate from the regular address that they both take together when in HA, much like the secondary has. But it wouldn't have mattered in this particular issue as the only access available was console until it was rebooted. I'll need to set up OOB console access for these kinds of deployments going forward, or at least OOB power outlet access to force reboots. But I do hate shutting down non-gracefully, as that occasionally corrupts the PostgreSQL database.

    I wonder if the hardware could include a small battery (similar to RAID controller battery size) that does a graceful shutdown when no AC power is found? But that's a huge ask, as it's not an easy thing to implement without possible false-alarm shutdown events, and would have to monitor at the hardware level to be reliable, not the software level. Maybe the higher end versions have this? I was using an XGS 2100 if I remember correctly, so I wouldn't be surprised if bigger models have it, but I haven't looked.

  • Just to follow this up: we updated to 19.5.2 and still had the same problems. We then engaged Sophos Support to join us on a support session the next morning, and everything worked perfectly. It turned out that 19.5.2 did actually fix the issues with 19.5.1, but on the morning we tested the upgrade the ISP had a major outage, which fairly closely mimicked the issue we were having with 19.5.1, so it appeared that the issue was still occurring.