Dropped Connections during Pattern Updates

Since installing multiple XG Firewalls in a multi-site environment, we have been plagued with "random" outages that last between 30 and 90 seconds.

I have finally correlated this with Pattern updates for either ATP, AV or IPS.  During the time of the definition updates all connectivity to the XG firewall is lost.  This actually brings down our Wide Area network and causes VoIP phones to restart looking for the phone server.

I have an open support ticket with Sophos but I'm awaiting their response.

I have changed the updates to happen less frequently (Daily), however when there are updates it still brings down the connection (albeit less often now).

Is there a way to still have automatic updates turned on but do them on a time schedule?  I find it utterly ridiculous that the system cannot do pattern updates without bringing down the entire network.

If this is "expected" behavior what have others done as workarounds?  I cannot have 30-90 seconds of downtime every other day for pattern updates. 



Added TAGs
[edited by: emmosophos at 9:10 PM (GMT -7) on 28 Jun 2021]
  • Thanks Bill.  I agree and have seen this article as well.

    But there is currently no fix and no workaround other than to turn off automatic pattern updates?  How can we have a firewall device that drops all connections during pattern updates?  How can I recommend this to an enterprise?  How do I get more visibility for this?  I've also seen the Sophos Idea to give more control over scheduling these updates, which I have upvoted, but frankly, I don't want to lose connection, EVER.

    I'm awaiting Sophos support to get back to me on my questions above as well, but I just can't fathom how this is acceptable on any level.

    I feel like now I am forced to choose between consistent connectivity by turning off automatic pattern updates and security.

  • Some interesting facts are coming up here. Is there any reason VFP is disabled by default in MR4? Is this only for fresh installations on MR4? What about migrations via MR4 to MR5? We went from 17.5 MR12 via 18 MR1 -> MR4 -> MR5, re-imaging our appliances when going to MR4 and then importing the config.

    VFP was enabled when we checked it recently, but it has now been disabled because support asked us to do so for another issue, and disabling it did not fix that issue.

    Can you provide some steps for how you measured the duration of the connection loss?

    I'd like to review this with our XG430s HA.

    I know we lost traffic for some seconds when disabling VFP.

  • Sophos does not enable most settings during a firmware upgrade, to avoid issues within the network after the update. V18.0 MR4 enabled the VFP option on HA setups. Customers coming from an older version have it disabled and can enable it if they want. This option will likely be enabled in a future release.

    A new installation without backup/restore will have VFP enabled by default.

    __________________________________________________________________________________________________________________

  • Is there a command to show if it is enabled (rather than enable/disable it)?

  • console> system firewall-acceleration show
    Firewall Acceleration is Enabled in Configuration.

    __________________________________________________________________________________________________________________

  • Just from seeing the issue a few times, we'd typically notice that an MS Teams call would stop responding, then I'd try a web browser and see a generic 'page cannot be displayed'. Give it around 20 seconds and then it works again as expected. But normally the delay is long enough that you'll get dropped from your Teams call and need to dial back in again. Super frustrating.

    I may try the workaround and reboot our firewall in the early hours, hoping that the pattern updates will then take place again 24 hours after that (i.e. out of hours).

  • We use a program called PingPlotter. We run pings to several external addresses (to avoid false positives) and to the XG IP as well. It maintains logs of all the connections, and we can check those to see that, at the time of an update, the ping to the XG is fine but all the other pings are blocked.

    You can use PingPlotter for free, but if you want to run it as a service, you can either run a 14-day trial or buy it.
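For anyone who wants to pull outage durations out of their own probe logs, the check described above (the XG answers, but every external target fails) could be sketched roughly as follows. This is not PingPlotter's log format or API; the sample structure, the function name `find_outages`, and the addresses (taken from the reserved documentation ranges) are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def find_outages(samples, external_targets, firewall_target):
    """Return (start, end) windows where the firewall answered but
    every external target failed.

    samples: list of (timestamp, {target: bool}) tuples sorted by time.
    The end of a window is the last failing sample, so the real outage
    may be up to one probe interval longer than reported.
    """
    outages = []
    start = None
    last = None
    for ts, results in samples:
        firewall_ok = results.get(firewall_target, False)
        all_external_down = not any(
            results.get(target, False) for target in external_targets
        )
        if firewall_ok and all_external_down:
            if start is None:
                start = ts  # first failing sample opens a window
            last = ts
        elif start is not None:
            outages.append((start, last))  # connectivity is back, close the window
            start = None
    if start is not None:
        outages.append((start, last))  # log ended while still down
    return outages


if __name__ == "__main__":
    # Hypothetical addresses and hand-written samples, one probe every 5 seconds.
    firewall = "192.0.2.1"
    externals = ["198.51.100.10", "203.0.113.20"]
    t0 = datetime(2021, 6, 28, 12, 0, 0)
    pattern = [(True, True), (False, False), (False, False), (True, True)]
    samples = [
        (t0 + timedelta(seconds=5 * i), {firewall: True, externals[0]: a, externals[1]: b})
        for i, (a, b) in enumerate(pattern)
    ]
    for start, end in find_outages(samples, externals, firewall):
        print(start, end, end - start)
```

Requiring all external targets to fail at once is what filters out a single unreachable host, which is the same reason for pinging several addresses in the first place.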

  • PS: Keep in mind that ping is not a TCP/UDP connection. There is another bug ID related to pings in virtual FastPath, as ICMP seems to behave differently. Since ICMP has no real notion of a session, the firewall cannot keep a session for it. Therefore, if a ping packet is lost, it's lost. TCP/UDP can work with retransmission and thereby pick up the same session. So a lost ping does not necessarily mean a lost session within the network.

    __________________________________________________________________________________________________________________

  • Thanks for your replies. Seems hard enough to even create a valid test scenario...

  • I appreciate that ping isn't the same as TCP/UDP, but it was a useful tool to get some insight into why users were complaining of lost internet connections. Should have known that Sophos would have a separate bug for ICMP in virtual FastPath!

    The one thing I can confirm: on the two sites where I have just tested it, pings don't stop when there is an update and Firewall Acceleration is disabled.

  • From a network perspective, ping is always a bad tool for troubleshooting anything beyond "is a connection even possible?". Looking at ping (ICMP) is like looking at a street with jammed traffic. Using ICMP could mean you are riding a motorcycle through the traffic and still reaching the destination, while your "real" traffic (cars) cannot. In some cases it simply does not reflect the real world. I have seen a lot of administrators struggle with this, especially in the move towards cloud (SD-Networks) or SD-WAN. You ping, and the ping reaches the destination, but not at the same speed as your VoIP. And this leads you to: nothing. No conclusion, because there could be multiple issues at the same time (wrong rule, wrong traffic selector, wrong traffic classification, etc.). Ping (ICMP) sometimes shortcuts everything and uses different routes. Traceroute and other tools do the same. I cannot remember how often I have had to discuss customers' traceroute outputs and explain that what they show is not an issue. But it's an easy tool to use, and it gives you something.

    To recap:

    NC-69286: ICMP times out when Firewall Acceleration is enabled

    NC-70896: Internet traffic stops every time XG has an IPS or ATP update

    Those are the two affected bug IDs. The issue seems to be related to Firewall Acceleration and needs to be checked.

    __________________________________________________________________________________________________________________
