
Sluggish performance after 15 days of uptime with SFOS 20.0.0 GA-Build222

This is the first time I've observed anything like this in the 3+ years I've been a Sophos XG home user. I upgraded to the latest build (222) about 15 days ago, so my uptime has been about 15 days. Over the past several days I've noticed some weird issues on my network, such as:

- Websites are slow to load initially (it's almost like the DNS lookup is slow)

- My Apple HomePods will randomly pause while playing music (almost like they're buffering)

- When I attempted to update some docker images, it would just keep timing out

- I couldn’t connect to my home network using OpenVPN; it would just time out.

I tried restarting a bunch of other network-related devices, such as my DNS server (PiHole) and networking equipment (UniFi switches and wireless access points), but I was still seeing the same behavior. CPU and memory usage were normal.
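
One way to check the slow-DNS hunch from the list above would be to time a few lookups from a LAN client while the problem is happening and again after a reboot. This is only a minimal sketch, assuming Python is available on that client; `time_lookup` is a made-up helper name and the hostname is a placeholder. OS-level DNS caching can hide slow lookups, so repeated runs against the same name may look faster than the first one.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Return the seconds taken to resolve hostname, or None if resolution fails."""
    start = time.perf_counter()
    try:
        resolver(hostname, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder name; replace with the sites that felt slow to load.
    for name in ("localhost",):
        elapsed = time_lookup(name)
        if elapsed is None:
            print(f"{name}: lookup failed")
        else:
            print(f"{name}: {elapsed * 1000:.1f} ms")
```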

The next step I tried was switching Sophos X-Ops threat feeds from "Inspect all content" to "Inspect untrusted content", thinking that might have been causing it since this was a fairly recent change I made (about two months ago). Still no change.

Finally, I just restarted my Sophos XG box and now everything is working normally.

Not sure what was causing it but I'll continue to keep an eye on it to see if it happens again. I'd be curious if any other users are experiencing similar behavior.



Added some more information.
[edited by: shred at 3:31 AM (GMT -8) on 16 Feb 2024]
  • So I'm experiencing this issue again after what is being reported as 2 days of uptime. I haven't restarted it myself since my last post, so I'm not sure whether Sophos XG rebooted on its own recently. I haven't had any power outages and all of my networking equipment runs on a UPS backup, so I'm assuming something happened with Sophos XG itself. Regardless, the issue I'm seeing is identical to what happened in my original post.

    I have a primary ISP and a backup cellular ISP service. The backup cellular ISP service should only be used when the primary ISP is down, and only for a limited number of devices. For some reason, my backup cellular ISP is being used for some of those devices even though the primary ISP is still up. This was the same symptom I saw last time as well. You can see my SD-WAN route rule below. I reset the data transfer counter yesterday and I can see the backup link is definitely being used.

    I really need to reboot Sophos XG because this is almost an unusable state for me, but if there's any information I can pull to help troubleshoot this, I'll wait on rebooting until I get some feedback on what information is needed. 

    ---

    Sophos XG guides for home users: https://shred086.wordpress.com/

  • There could be several causes here.

    First of all: by default, an SD-WAN route for WAN connections does not move existing connections back to the original link. Sophos does not want to interrupt established connections when the ISP comes back up.
    You can turn this on if you want, which essentially means the existing connections are torn down and rebuilt on the original link. https://docs.sophos.com/nsg/sophos-firewall/20.0/help/en-us/webhelp/onlinehelp/AdministratorHelp/Routing/SDWANRoutes/RoutingSDWANRoutesBehavior/index.html#reroute-snat-connections

    But in your setup, DNS is the likely cause. SFOS uses the TTL returned by the DNS server, and Pi-hole hands out a TTL of 5 seconds for everything. That likely causes a constant flood of requests, which can slow down the entire network. See: Constant DNS lookups to google domains and others in FQDN hosts - FIX
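
To put rough numbers on the TTL point: if every FQDN host is re-resolved once per TTL, the lookup rate scales inversely with the TTL. The sketch below is plain arithmetic, not a measurement of SFOS behavior; the count of 50 FQDN hosts is an assumed figure for illustration, and `lookups_per_hour` is a made-up helper name.

```python
def lookups_per_hour(fqdn_hosts: int, ttl_seconds: int) -> float:
    """Approximate refresh lookups per hour if each FQDN host is re-resolved once per TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return fqdn_hosts * 3600 / ttl_seconds


hosts = 50  # assumed number of FQDN host objects, for illustration only
for ttl in (5, 3600):
    print(f"TTL {ttl}s -> {lookups_per_hour(hosts, ttl):,.0f} lookups/hour")
# TTL 5s -> 36,000 lookups/hour
# TTL 3600s -> 50 lookups/hour
```

Under these assumptions, a 5-second TTL produces 720 times as many refresh lookups as a 3600-second TTL.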

    You can check the system graphs in WebAdmin to see whether interface throughput has increased for some reason.

    __________________________________________________________________________________________________________________

  • I am aware of the Pihole/DNS issue so I already have fqdn-host cache-ttl set to 3600 (I replied in that thread 2 years ago as well because I was having the issue).

    As for the first statement, is this behavior something that changed with the most recent version of Sophos XG? I ask because I’ve been running this setup for 2+ years without any issues. My primary ISP has gone down during that time, the SD-WAN rules appeared to work fine, and when the primary ISP came back up, everything appeared to go back to it. When I read the description at the link you provided, it mentions that this only applies to MASQ connections and when using an IP pool, neither of which applies in my case. Edit: I read it incorrectly; those two rules always apply. So you’re saying that Reroute SNAT Connections is disabled by default, which causes connections that fail over to the backup ISP to stay there when the primary ISP comes back up? I’ll enable it and see if it helps, but it’s odd that I’m only seeing this issue recently.


  • The point here is that you cannot fix this from the firewall side. The (web) server on the internet sees your connection coming from the LTE IP (MASQ). If SFOS moved that connection to the fiber link, the source IP would change, the web server could not tell which connection it belongs to, and the connection would ultimately drop. That is why this setting is disabled by default: most customers do not want to "fall back" and kill their working connections. However, new connections will use the original fiber link, so traffic naturally falls back over time, just not the existing connections.

    In other words, it is a trade-off between "no impact for users" and "impact, but a better connection".

    If you have IoT devices here, they sometimes "hold" a connection forever and never reconnect. That leaves the connection stuck on the old LTE link.
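
The behavior described in this reply can be illustrated with a toy model. It is only a sketch of the idea, not how SFOS is implemented: `ToyFirewall`, its fields, and the connection names are all made up, and the reroute option is simplified to "existing connections end up on the primary link again" rather than modeling the teardown and rebuild.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFirewall:
    """Toy model of gateway selection per connection; not SFOS internals."""
    primary_up: bool = True
    reroute_snat: bool = False  # stands in for the reroute option, off by default per the thread
    connections: dict = field(default_factory=dict)  # connection name -> gateway

    def open(self, name: str) -> None:
        # New connections take the primary link whenever it is up.
        self.connections[name] = "primary" if self.primary_up else "backup"

    def set_primary(self, up: bool) -> None:
        self.primary_up = up
        if not up:
            # Connections on the dead link are lost; clients must reopen them.
            self.connections = {n: g for n, g in self.connections.items() if g != "primary"}
        elif self.reroute_snat:
            # Simplification: torn-down connections are shown as rebuilt on primary.
            self.connections = {n: "primary" for n in self.connections}


fw = ToyFirewall()
fw.set_primary(False)   # primary ISP outage
fw.open("homepod")      # long-lived IoT connection starts on the backup link
fw.set_primary(True)    # primary ISP recovers
fw.open("laptop")       # new connection after recovery
print(fw.connections)   # {'homepod': 'backup', 'laptop': 'primary'}
```

With `reroute_snat=True` in the same sequence, the model puts `homepod` back on the primary link, at the cost of interrupting it.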

    __________________________________________________________________________________________________________________

  • I think I understand now. However, I find it strange that this alone would cause some of the issues I'm seeing. For example, why would I be unable to connect to my home network from the outside using OpenVPN when Sophos XG is in this "state" (referencing the issue I'm having)? I also still think it's strange that I hadn't seen this issue in the 2+ years I've been running this setup until fairly recently, so I'm questioning whether this reroute-snat-connection setting is actually the issue.

    I've gone ahead and enabled it for now. I ended up restarting Sophos XG and everything is working normally again. I did export all of my logs before restarting in case there's something worth looking at.


  • If it fails again, a tcpdump of the failed connections would be interesting. 

    __________________________________________________________________________________________________________________