Failover for route-based VPN with BGP

Hello all -- 

This is likely an easy question that I'm overthinking.  We have two sites, each with dual ISP links and Sophos XG v18.  Currently, there are four site-to-site tunnels between them, with a failover group on the branch/initiator side (A1-B1, A1-B2, A2-B1, A2-B2).  I'm wondering if the same idea translates to a route-based VPN using BGP, but with the benefit of not needing a failover group. 

All four route-based tunnels can be up and connected at the same time without creating a routing mess, right?  If one link goes down, how long should we expect it to take before traffic re-routes through a live tunnel?



  • No, I hadn't seen that article yet (it's just 7 days old!), but it shows only half of what I'm considering, and it inspired a new question.

    The article shows only two tunnels; I'm wondering about four tunnels, because one of the ISP links is limited to 5 Mbps/5 Mbps, so it should be used only in an emergency, as the last entry in the preference list.

    I'm wondering if the same 2-tunnel approach would work with 4 tunnels. If so, is there a way to dictate a preference order, or a weight, for each tunnel?  So if all tunnels are up, prefer A1-B1; if A1 goes down, try A2-B1.

    Maybe this is more complicated than I orig thought.....  :-/ 

  • Did you ever figure this out? I've got a client who is having numerous issues with the current VPN failover groups where each of their sites has 2x WAN connections. We're trying to achieve exactly what you are, however the documentation of such advanced functionality on the XG platform is less than ideal from what I've found so far.

  • I did, with a little help from support (getting thru to someone competent enough to help, tho, that's a story for another day).

    The four-tunnel config, with BGP sessions, is documented and works (in LuCar's link above), tho I prefer to use link-local (169.254.x.x) IPs for the xfrm interfaces.  The article leaves out the BGP weighting, which was pretty vital for us, since one of the ISP links is basically dial-up.. lol

    Once you have the four tunnels connected and routes advertised, you'll want to set the weights per tunnel:

    1. PuTTY into the XG
    2. Select option #3 (Route Configuration)
    3. Select option #1 (Configure Unicast Routing)
    4. Select option #3 (Configure BGP)
    5. Type "ena", then "conf t", to enter config-terminal mode (Cisco IOS-style CLI)
    6. bgp(config)# router bgp 1  (use your local AS)
    7. bgp(config-router)# neighbor <neighbor IP> weight xx  (the highest value is preferred)
    8. ... do the same for all neighbors, with unique weights for each
    9. after you're done, use "write" to commit changes to startup-config, then exit

    After that, bounce the tunnels to activate the weights, then go to Network, Routing, BGP Routes to confirm the settings.
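
    As a rough sketch, the CLI session from the steps above might look like this for four neighbors. The AS number (65001), the link-local neighbor addresses, the tunnel-to-neighbor mapping, and the weight values are all made-up placeholders, not values from our config, so substitute your own. On Cisco IOS the weight can range from 0 to 65535; I'm assuming the XG accepts the same range.

    ```
    bgp> enable
    bgp# configure terminal
    bgp(config)# router bgp 65001
    ! Placeholder neighbors, one per xfrm tunnel. The highest weight is preferred.
    ! A1-B1: primary path
    bgp(config-router)# neighbor 169.254.0.2 weight 400
    ! A2-B1: first fallback if A1 goes down
    bgp(config-router)# neighbor 169.254.0.6 weight 300
    ! A1-B2
    bgp(config-router)# neighbor 169.254.0.10 weight 200
    ! A2-B2: slowest combination, last resort
    bgp(config-router)# neighbor 169.254.0.14 weight 100
    bgp(config-router)# end
    ! Commit to startup-config
    bgp# write
    ```

    Keep in mind that weight only influences which path this XG picks for outbound traffic, so the XG at the other site needs its own matching weights if you want return traffic to follow the same tunnel.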

    If this is wildly confusing, I'd recommend calling support and immediately requesting an escalation; I wasted many hours with the first-line support nonsense ("can you try this kb article and report back?"  "uhh.. that article uses policy-based routing."  "ah..  and what are you trying to do?"  *eyeroll*)

    On the plus side, with route-based tunnels and BGP, there really isn't a need for "failover" plans anymore, as all tunnels are up at once, and weights control which route is preferred.

    Good luck.

  • This is what I am looking to do as well to connect to our AWS VPC. We have two WAN interfaces and each has two tunnel interfaces.  I have set up the tunnel interfaces using this article:

    I have set this up for both WAN interfaces, but since no weight is set on any tunnel, all four send traffic out by default. 

    I just have a question on the weights used for the routes. Essentially I want the two tunnels on our main WAN connection to have the highest weight and the two on our backup to have lower weights.  Is the weight based on 100 max or what is the highest value I can assign?

  • We also used the route-based "HA VPN" option to connect to a Google Cloud Platform project (in addition to our site-to-site VPNs between home and branch sites). While the weighting I mentioned above works fine for XG-to-XG VPN, we were seeing the ECMP issue you mentioned with our connections to GCP. After weeks of testing with Google Cloud support, we figured out that Sophos uses a Cisco-style subsystem (maybe not the right word) for the BGP feature, and its weight is a Cisco-specific setting that stays local to the router and is never advertised to the peer, so non-Cisco systems like GCP never see it.  We ended up needing to implement MED values on the GCP side, with matching route-map metrics on the Sophos side (via command line).

    All that said, our current config uses weighting for XG-to-XG VPNs, MED values for the GCP VPN, and route-maps to prevent inadvertent subnet routing through other sites. It's not an easy config, by any means, so I would strongly recommend getting higher-tier AWS and Sophos support involved.

  • I was able to get the weighting set up on the tunnels but it does seem that the ECMP issue is still happening. Thanks for the additional info.  I'll need to follow up with AWS support to see if they are compatible with the Cisco weight system for BGP or if we need to do something similar by adding MED values on the AWS side.

    I am now more familiar with how Sophos implements BGP and currently have a prefix-list to only accept the subnet route in AWS. Would we still need to set up route-maps on the Sophos side for the MED values? 
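
    For anyone following along, an inbound prefix-list like the one described here might look roughly like the sketch below. The list name (AWS-IN), the VPC subnet, the AS number, and the neighbor address are hypothetical placeholders, and I'm assuming the XG's BGP console accepts the usual Cisco-style prefix-list syntax.

    ```
    bgp# configure terminal
    ! Hypothetical VPC subnet. Anything not matched is implicitly denied.
    bgp(config)# ip prefix-list AWS-IN seq 5 permit 10.50.0.0/16
    bgp(config)# router bgp 65001
    ! Hypothetical AWS tunnel neighbor. Repeat for each AWS neighbor.
    bgp(config-router)# neighbor 169.254.10.1 prefix-list AWS-IN in
    bgp(config-router)# end
    bgp# write
    ```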

  • For the two tunnels for our primary ISP, we are seeing MED values of 100 and 200.  The MED values of the two tunnels for the backup ISP are also showing 100 and 200.  I just created two separate route-maps: one that sets metric to 300 and one that sets metric to 400. I applied the metric 300 one to the first backup ISP neighbor for outgoing advertisements and did the same with the metric 400 one on the other backup ISP neighbor.

    Now when I ping a server that is in AWS I am getting consistent pings so this appears to fix the issue. Thanks again!
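
    In case it helps someone else, here is a sketch of what those two route-maps might look like in the BGP console. The route-map names, the AS number, and the neighbor addresses are placeholders I made up for illustration; only the metric values (300 and 400) and the outbound direction come from what I described above.

    ```
    bgp# configure terminal
    ! One route-map per backup-ISP tunnel. The metric is sent to the peer as MED.
    bgp(config)# route-map BACKUP-MED-300 permit 10
    bgp(config-route-map)# set metric 300
    bgp(config-route-map)# exit
    bgp(config)# route-map BACKUP-MED-400 permit 10
    bgp(config-route-map)# set metric 400
    bgp(config-route-map)# exit
    bgp(config)# router bgp 65001
    ! Placeholder addresses for the two backup-ISP neighbors
    bgp(config-router)# neighbor 169.254.20.1 route-map BACKUP-MED-300 out
    bgp(config-router)# neighbor 169.254.21.1 route-map BACKUP-MED-400 out
    bgp(config-router)# end
    bgp# write
    ```

    Because the route-maps are applied outbound, they change the MED that AWS sees on our advertisements, so AWS should prefer the primary-ISP tunnels for return traffic as long as those are up.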

  • Sounds about right. MED prefers lower values first, which is the opposite of the Cisco weighting system, where the highest value wins.

    Glad I could help.

  • We are looking to configure this in a similar fashion.  How did you end up setting MED values in AWS?  Or did you?
