IPSEC between multiple sites very slow

We are running several Sophos UTMs on firmware 9.7: SG330 (HA), SG230 (HA), 4 x SG115 and about 8 x RED15. All sites have internet connections ranging from 25/100 to 1000/1000 Mbit/s (main office).

They are all connected via IPSEC or RED interfaces in a hub-and-spoke topology (any site must be able to reach any other site).

As all sites need to reach all sites, we have only one tunnel configured per site, but to reach across the hub we had to add the other sites' networks as local networks, which created around 50 IPSEC connections on the hub.

 

PROBLEM:

For several weeks we have been seeing very slow IPSEC traffic between these sites: only 3-5 Mbit/s instead of the 25-300 Mbit/s we had before and expect.
Reviewing the firewall change history, we couldn't identify a suspicious change. Users report that the issue started in December.

We have tried to isolate the problem but failed. We have tested the following changes:

IPS, ATP, Web filter, Country blocking
Switching these modules on or off has no effect on tunnel performance. IPS had all local and remote subnets in its exception lists.

Firewall
All RED and all IPSEC tunnels create an exception rule via automatic firewall rules.

MTU
Set to 1500 on all WAN interfaces. All direct internet connections are fast according to real data transfers and iperf; only the IPSEC tunnel is slow.
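A WAN MTU of 1500 says nothing about how large a packet can actually cross the tunnel. One way to check is to probe the path MTU from a Linux client at a branch towards a host in the main office LAN with the Don't Fragment bit set. A minimal sketch, assuming iputils `ping` (which provides `-M do`); the target address 10.0.0.10 and the search bounds are hypothetical:

```shell
# Sketch: find the largest IPv4 packet that crosses the path without fragmentation.
# Assumes iputils ping: -M do sets Don't Fragment, -s sets the ICMP payload size.

# ICMP payload for a given total IPv4 packet size: 20 B IP header + 8 B ICMP header.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Binary search between a lower and an upper packet size (defaults 1200..1500).
# Returns the lower bound unchanged if nothing in the range gets through.
probe_path_mtu() {
  local host=$1 lo=${2:-1200} hi=${3:-1500} mid
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))
    if ping -c 1 -W 2 -M do -s "$(icmp_payload_for_mtu "$mid")" "$host" >/dev/null 2>&1; then
      lo=$mid            # packet of size $mid got through: search higher
    else
      hi=$(( mid - 1 ))  # dropped or "message too long": search lower
    fi
  done
  echo "$lo"
}

# Example (hypothetical host in the main office LAN, reached through the tunnel):
# probe_path_mtu 10.0.0.10
```

Running it once against a public host via the WAN and once against a LAN host across the tunnel shows how much the tunnel reduces the usable packet size.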

CPU Load
The SG115s are rather weak, but we see no CPU spikes or lag. About a year ago we had 300/300 Mbit/s over IPSEC with the same hardware.

Routing
We could rebuild the hub/spoke with routing instead of many IPSEC tunnels (currently over 50 tunnels).

UTM RED
We haven't tried this yet, but we could replace the IPSEC connections with UTM RED connections, with manual routing and firewall rules.

IPSEC VPNID
We use each site's public DNS name (public IP) as the VPN ID.

IPSEC Compression and Policies
Enabling tunnel compression and trying different policies (AES-128/256 etc.) has no effect.

IPSEC MTU path discovery
We have enabled path MTU discovery for all IPSEC connections on both sides.

IPSEC initiate/respond
According to Sophos, the branch offices should initiate.

Dedicated WAN
The main office has two WAN connections, so for availability we use the uplink interface and the bind-to-interface option in multipathing, as recommended. We could simply set this to one interface.

Firmware up/down
We took a spare SG115 and downgraded it from 9.7 to 9.6, with no effect.

Sophos Case
We have escalated the case to Sophos. They were very helpful, but support was unable to identify a root cause for the slow tunnels.

Any other ideas?

  • Hi  

    Would you please DM me the case number? 

  • Short update on the big checklist above:

    We have not changed the IPSEC many-to-many mayhem yet. We have connected one branch office using a UTM RED tunnel (not to be confused with a RED appliance; the RED tunnel is more a gateway solution than a device). Our first results are a lot better downstream. We will try other branch offices this week and should then be able to draw a conclusion.

  • In reply to mjpmotw:

    The IPSEC connections were set up to connect EACH network of EACH location to EACH location to enable internal calling. We are aware that this is overkill, and a security risk, for one feature.

    After successfully building one UTM-to-UTM RED tunnel, we built another UTM-to-UTM RED tunnel that was even slower than IPSEC. It's frustrating.

  • In reply to mjpmotw:

    Hallo,

    Strange situation...

    I also see your thread about issues with RED tunnels.  Have you tried packet captures to see if you have a lot of resends?  Have you checked ifconfig in each UTM to confirm that there's no "unhappy" NIC?  Have you checked the Intrusion Prevention logs?

    Cheers - Bob

  • In reply to BAlfson:

    Hi Bob

    I appreciate your feedback. I am quite familiar with the "Rules of Analysis", but most of the time the issue was easily resolved before hitting the "hard targets". We have created IPS exceptions and did a test with ATP/IPS set to OFF. No change.

    I have read about physical ports that might fail, but since one location (Germany) gets 100 Mbit/s through its tunnel while France and Poland often get only 5 Mbit/s towards the same 1000 Mbit/s datacenter, we never looked deeper into the NICs. When direct internet traffic is fast and only the tunnel is slow, we assume that the NIC is fine. How would I check "happiness" using ifconfig? Do you mean the full-duplex/half-duplex issue?

    We are not familiar with packet capturing of IPSEC traffic. How would you do that? Wireshark on one of the computers behind the tunnel, comparing tunnel traffic to internet traffic?

    Best and kind regards, Matthias

    BTW, the any-to-any tunnel mayhem creates 36 IPSEC connections (network-to-network) on the central UTM.

  • In reply to mjpmotw:

    Matthias, here are two suggestions:

    secure:/root # ifconfig eth1
    eth1      Link encap:Ethernet  HWaddr 00:08:02:A4:99:5F
              inet addr:68.227.97.51  Bcast:68.227.103.255  Mask:255.255.248.0
              UP BROADCAST RUNNING MULTICAST  MTU:1480  Metric:1
              RX packets:76353108 errors:0 dropped:0 overruns:0 frame:0
              TX packets:43730609 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:53780834389 (51289.4 Mb)  TX bytes:18119541359 (17280.1 Mb)
              Interrupt:17 Memory:fd6c0000-fd6e0000
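On what an "unhappy" NIC looks like: the counters to watch in that output are errors, dropped, overruns, frame, carrier and collisions, which should stay at or near zero on a healthy full-duplex link. A small sketch that flags any non-zero counter, assuming the classic net-tools ifconfig format shown above (newer ifconfig versions lay these lines out differently); the function name is hypothetical:

```shell
# Sketch: report non-zero error counters from classic net-tools "ifconfig <iface>" output.
# Reads the ifconfig text on stdin; prints "OK" if all watched counters are zero.
check_nic_counters() {
  local bad
  # Put every "name:value" token on its own line, keep the counters of interest,
  # then drop the ones that are exactly zero.
  bad=$(tr ' ' '\n' | grep -E '^(errors|dropped|overruns|frame|carrier|collisions):[0-9]+$' \
        | grep -v ':0$')
  if [ -z "$bad" ]; then
    echo "OK"
  else
    echo "$bad"
  fi
}

# Usage on the UTM shell:
# ifconfig eth1 | check_nic_counters
```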

    To watch traffic inside an IPsec tunnel, you must first know the ref of the IPsec Connection object.  Assume the name of the Connection is "Paris Office."  As root at the command line:

    cc get_object_by_name ipsec_connection site_to_site 'Paris Office'|grep \'ref

    Say that returns 'ref' => 'REF_IpsSitParisOffice', then you can do a packet capture with:

    espdump -n --conn REF_IpsSitParisOffice -vv

    Cheers - Bob
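The two steps can be chained. A minimal sketch, assuming the `cc` output contains a line of the form `'ref' => 'REF_...'` as quoted above; the function names are hypothetical, and the `cc` and `espdump` invocations are exactly the ones given in the post:

```shell
# Sketch: pull the object reference out of cc output such as
#   'ref' => 'REF_IpsSitParisOffice',
# Reads cc output on stdin and prints the bare reference (first match only).
extract_ref() {
  sed -n "s/.*'ref' *=> *'\([^']*\)'.*/\1/p" | head -n 1
}

# Capture traffic inside the IPsec connection named in $1 (run as root on the UTM).
capture_tunnel() {
  local ref
  ref=$(cc get_object_by_name ipsec_connection site_to_site "$1" | extract_ref)
  if [ -z "$ref" ]; then
    echo "no IPsec connection named '$1' found" >&2
    return 1
  fi
  espdump -n --conn "$ref" -vv
}

# Usage: capture_tunnel 'Paris Office'
```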

  • In reply to BAlfson:

    The RX packets / TX packets look fine, I will work on the capture later.

    FYI, we have replaced all IPSEC tunnels and SAs with UTM RED tunnels, with manual routing entries and manual firewall rules.

    Unfortunately, there was no change in throughput. The mystery remains.

  • In reply to mjpmotw:

    It seems to be an issue with the encapsulation and payload headers resulting in fragmentation of packets. Can you reduce the MTU to 1450 and retest?
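To put numbers on the encapsulation overhead: in ESP tunnel mode the original IP packet is encrypted and wrapped in a new outer IP header, so the largest inner packet is smaller than the WAN MTU. A back-of-the-envelope sketch, assuming IPv4, AES-CBC (16-byte IV and block size), HMAC-SHA1-96 (12-byte ICV) unless another ICV length is passed, and optional NAT traversal (8-byte UDP header); other cipher suites change the numbers, and the function names are hypothetical:

```shell
# Sketch: largest inner IPv4 packet (and TCP MSS) that fits in one ESP tunnel-mode packet.
# Assumptions: outer IPv4 header (20 B), ESP header (SPI + sequence number, 8 B),
# AES-CBC IV (16 B), 16-byte cipher block, ICV of $2 bytes (default 12 = HMAC-SHA1-96),
# NAT-T UDP encapsulation (8 B) when $3 is 1.
esp_inner_mtu() {
  local mtu=$1 icv=${2:-12} natt=${3:-0}
  local udp=0
  if [ "$natt" -eq 1 ]; then udp=8; fi
  # Bytes left for the encrypted part (inner packet + padding + 2-byte trailer).
  local avail=$(( mtu - 20 - udp - 8 - 16 - icv ))
  # The encrypted part must be a whole number of 16-byte blocks.
  local blocks=$(( avail / 16 * 16 ))
  # Subtract the pad-length and next-header bytes of the ESP trailer.
  echo $(( blocks - 2 ))
}

# TCP MSS for that inner MTU: inner IPv4 header (20 B) + TCP header without options (20 B).
esp_inner_mss() {
  echo $(( $(esp_inner_mtu "$@") - 40 ))
}

# esp_inner_mtu 1500      -> 1438
# esp_inner_mss 1500      -> 1398
# esp_inner_mss 1500 12 1 -> 1382 (with NAT-T)
```

Under these assumptions the MSS values of 1300 and 1320 tried in this thread leave some margin, which helps if extra overhead on a remote link (for example PPPoE) is not accounted for.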

  • In reply to Ehigbai Iyemifokhae:

    Which MTU? The IPSEC MTU (the tickbox, see above)? The UTM RED tunnel MTU? The WAN MTU of all WAN interfaces?

    We already did an MTU test with 1350 as per Bob's "Rules of Analysis", but I realize we neither rebooted the firewall nor reinitialized the interface afterwards.

  • In reply to mjpmotw:

    A few questions and steps:

    1. Is IPSEC path MTU discovery enabled across all sites?

    2. Have you tried IPSEC tunnel compression and different policies (AES-128/256 etc.)?

    3. Set the WAN MTU to 1450 on all WAN interfaces.

    4. Create an exception for the IPSEC tunnels via the automatic firewall rules.

     

    Let me know the results.

  • In reply to Ehigbai Iyemifokhae:

    Thank you for your help; your questions have all been answered in my initial posting above. MTU discovery, blocking, logs, exceptions: all done. Different tunnel initiation/response settings and different ciphers (DES/3DES etc.) have a minor effect but do not resolve the issue. The WAN links are mostly fiber or coax, not PPPoE, so the MTU should be fine; it is still being tested. Now testing MSS.

  • In reply to mjpmotw:

    Matthias, what happens if you set the MSS of the tunnel to something like 1320?

    iptables -I FORWARD 1 -o <outgoing interface> -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1320

    That's a trick posted by Sascha Paris on 3-Mar-2014.

    Cheers - Bob

  • In reply to BAlfson:

    Sophos Support says that, as of today, this is the iptables command for MSS:

    iptables -t mangle -I POSTROUTING -s 192.168.0.0/24 -d 172.16.16.0/24 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1300;
    iptables -t mangle -I POSTROUTING -s 172.16.16.0/24 -d 192.168.0.0/24 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1300;

    Change the source and destination networks for your networks.

    You can confirm the entries are added by listing the POSTROUTING chain in the mangle table of iptables:
    iptables -t mangle -L POSTROUTING;


    You can remove the custom entries with the following commands if you encounter any problems:

    iptables -t mangle -D POSTROUTING -s 192.168.0.0/24 -d 172.16.16.0/24 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1300;
    iptables -t mangle -D POSTROUTING -s 172.16.16.0/24 -d 192.168.0.0/24 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1300;

    These entries will not be retained through a reboot, but a speed test after applying them may tell you whether the issue is related to MSS.

    -> We need to check this and, if still valid, the "iptables -I FORWARD 1 -o <outgoing interface> -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1320" rule as well.
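With many network pairs (36 connections on the hub in this setup), typing these rule pairs by hand is error-prone. A small sketch that only prints the two mangle rules for a subnet pair, one per direction, so they can be reviewed before running; the function name is hypothetical and the rule shape is the one Sophos Support gave above:

```shell
# Sketch: print (not execute) the pair of TCPMSS clamping rules for two subnets.
# $1 = -I to insert or -D to delete, $2/$3 = the two subnets, $4 = MSS (default 1300).
mss_rule_pair() {
  local action=$1 a=$2 b=$3 mss=${4:-1300}
  case "$action" in
    -I|-D) ;;
    *) echo "action must be -I or -D" >&2; return 1 ;;
  esac
  local opts="-p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $mss"
  # One rule per direction, matching the source/destination pattern above.
  echo "iptables -t mangle $action POSTROUTING -s $a -d $b $opts"
  echo "iptables -t mangle $action POSTROUTING -s $b -d $a $opts"
}

# Review first, then apply by piping into a shell:
# mss_rule_pair -I 192.168.0.0/24 172.16.16.0/24 1300
# mss_rule_pair -I 192.168.0.0/24 172.16.16.0/24 1300 | sh
```

Calling it with `-D` and the same arguments prints the matching delete commands, so the test rules can be removed exactly as they were added.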

  • In reply to mjpmotw:

    None of those iptables commands had any effect; we applied and verified them. Whatever the underlying cause is...

    We also did more UTM RED tunnel testing and found that the UTM RED tunnel is much faster than IPSEC. We simply don't know why.

  • In reply to mjpmotw:

    I bet you're feeling like you're beating a dead horse, Matthias...

    If you don't want to just stay with the RED tunnels, I still think it would be worth trying fixed speed/duplex with reboots of the devices to force them to renegotiate.

    Cheers - Bob
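Before forcing speed/duplex, it may be worth reading what each WAN NIC actually negotiated. A minimal sketch that parses `ethtool <interface>` output, assuming ethtool is available on the UTM shell and prints its usual "Speed:" and "Duplex:" lines; the function name and the expected-speed threshold are hypothetical:

```shell
# Sketch: flag a half-duplex or slower-than-expected link from "ethtool <iface>" output.
# Reads ethtool output on stdin; $1 = expected minimum speed in Mb/s (default 1000).
check_link() {
  local min=${1:-1000} text speed duplex
  text=$(cat)
  # Extract the numeric speed ("Speed: 1000Mb/s") and the duplex word ("Duplex: Full").
  speed=$(printf '%s\n' "$text" | sed -n 's/^[[:space:]]*Speed: \([0-9]*\)Mb\/s.*/\1/p')
  duplex=$(printf '%s\n' "$text" | sed -n 's/^[[:space:]]*Duplex: \(.*\)$/\1/p')
  if [ "$duplex" != "Full" ]; then
    echo "WARN duplex=${duplex:-unknown}"
  elif [ -z "$speed" ] || [ "$speed" -lt "$min" ]; then
    echo "WARN speed=${speed:-unknown}Mb/s"
  else
    echo "OK ${speed}Mb/s $duplex"
  fi
}

# Usage: ethtool eth1 | check_link 1000
```

A half-duplex result usually points to a mismatch with the device on the other end of the cable (for example an ISP router with forced settings); both ends need to be configured the same way.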