Uplink balancing vs Active Standby, dual ISP questions

I've got a large number of UTM devices at sites with dual ISPs and we're trying to resolve a 'best practices' question.

We typically have both ISPs active with multipathing / weights set up to put our 'priority' traffic (VOIP and RED Tunnels) on the better ISP, and everything else on the secondary.  This works great until the primary fails, at which point the tunnels fail over to the secondary.  That's not a problem, except that when the primary comes back up, the tunnels never fail back to the primary interface on their own. They can sit on the secondary (weaker) connection for hours, days, or weeks until we manually deactivate and reactivate them.

We're considering going to an Active / Standby setup with dual ISPs to address this issue; however, in that configuration, our PRTG service can't properly monitor the backup connection (since it's essentially off).

For those of you on dual ISP setups:

1) How do you make sure RED tunnels (or whatever tunnels) fail back to a primary interface when an outage is resolved?

2) If you're running Active / Standby instead of multipathing, how do you monitor your standby ISP?

 

Thanks for the guidance.

  • I think you want to stay with Active-Active and your Multipath rules.  See the second exception in #3 in Rulz (last updated 2019-04-17).  Any better luck now?

    Cheers - Bob

  • In reply to BAlfson:

    This doesn't do exactly what we want - the problem is the persistence of the tunnels on the 'lesser' interface after the primary comes back up.  So say we have two connections, Fiber and Cable.

     

    We use multipathing to set the tunnels to go out Fiber (by specifying that all traffic to the RED destination uses the Fiber interface).  This works fine.

    Fiber goes down, the tunnel fails over to Cable.  This works fine.

    Fiber comes back up, but due to the persistence of the connection, the tunnel stays on Cable for days, weeks, or months, until either Cable goes down, we restart the RED interface, or we restart the entire device.

    There has to be a way to force the tunnels to re-initialize once a day or something, right? 

  • In reply to TG1:

    Click on the wrench beside 'Active Interfaces' and show us a picture of those settings.  Also, show picture(s) of the Edits of the relevant Multipath rule(s).

    Cheers - Bob

  • In reply to BAlfson:

    This shows the uplink balancing (we use our primary for ONLY multipath-specific traffic, everything else goes out eth2)

     

     

    This shows the multipath rule that forces tunnels onto the primary (the group shown, Colos, includes the IP of our RED tunnel endpoint).

     

    Again, this piece is working, it's the failing-back-over that doesn't.

  • In reply to TG1:

    Add a fourth Multipath rule at the bottom binding 'Any -> Any -> Any' to 'eth2 - monkeybrains'.

    For testing purposes, in 'Edit scheduler', set 'Persistence timeout' to 1 minute.  After testing, set it back to 15 minutes.

    Any better luck with that?

    Cheers - Bob

  • In reply to BAlfson:

    I'll make the change and test it, but can you explain to me how this is supposed to effect the change we want? If the issue is connection persistence, and the tunnel doesn't reinitialize unless it's downed and brought back up or otherwise interrupted, how does adding this at the base change the current setup?

     

    Thanks for the info.

  • In reply to TG1:

    This multipathing rule had no effect, as far as I can tell.  I tested by failing the primary ISP, and the tunnels came back up on the secondary as expected.  I then reactivated the primary ISP, and other routing came back as normal.  However, since the tunnel has not been reinitialized, it's still on the secondary ISP, 2+ hours and counting (I tried setting the persistence to both 1 min and 15 min).

     

    What's my next option?

  • In reply to TG1:

    Maybe a bug.  What does Sophos Support have to say about this?

    Cheers - Bob

  • In reply to BAlfson:

    I've opened multiple tickets on the issue and they never provide a viable response.  I've been linked the FAQ about 'Actions' in uplink monitoring, but to my knowledge there's no way to have an action that restarts a tunnel when the ISP comes back up.

     

    This seems more like a blindingly obvious design flaw than a bug - both of the other firewall brands I work with regularly (Juniper and SonicWall) handle this with no special config - the tunnel just reinitializes after a given period and is then on the right ISP.

  • In reply to TG1:

    Please PM me the most recent case # so I can get this issue seen by someone that can get it in front of the right people.

    Cheers - Bob

  • In reply to BAlfson:

    Done, thanks.

  • Hi Folks,

    I think this has reached a limitation. As of now, it's not possible to fail RED tunnels back to a primary interface once an outage is resolved. This is a really good feature request and I would recommend that you raise it here. I am going to discuss the possibility of this with someone from the Development team.

    Sophos XG now has a Failback feature, but that still does not satisfy your requirement for failback on RED devices. I'll post again once I have more details or news.

  • In reply to Jaydeep:

    This is a glaring oversight considering we have decade-old Junipers with this functionality...

    So our options are either:

    1) Risk having our tunnels be tied to the incorrect interface for long periods of time in an Active/Active scenario

    2) Have our ISPs configured in Active / Standby, which fixes the tunnel issue but prevents us from properly monitoring the standby interface or doing any proper load balancing.

    I hope this sort of basic failover scenario gets considered for future revisions, as this seems like a very common setup to have been overlooked.

  • Sounds like you have a pretty sophisticated operation, so I think you can finesse the problem by having your monitoring system trigger the REST API:

    • Primary network down
      • External monitoring detects the outage and triggers an alert. 
      • Internally, UTM switches all traffic to secondary interface.

    • Primary network comes up
      • External monitoring detects the "up" event, and starts a timer to see if it is going to stay up.
      • After the timer expires, the Primary network is assumed healthy, and the monitoring system launches a script.
      • The script uses the REST API over a tunnel (on the secondary interface at this point).
        • First it verifies that the primary interface is truly up.
        • Then it disables the secondary interface.
        • This will break the VPN tunnel being used by the REST API session, so the script cannot re-enable the secondary interface immediately.
      • A delay timer is needed to allow the control tunnel to reconnect.
      • After the timer expires, the script resumes, and enables the secondary interface.
      • Error handling re-attempts the interface-enable operation, to handle the possibility that the timer interval was too short.
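
    The sequence above could be sketched roughly as follows. This is only an outline of the control flow, not a tested integration: `primary_is_up` and `set_secondary_enabled` are hypothetical callables that you would back with your own monitoring check and your own REST API client, and the timer values are placeholders.

```python
import time
from typing import Callable


def stable_for(is_up: Callable[[], bool], hold_seconds: float,
               poll_seconds: float,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True only if is_up() stays True for the whole hold-down window."""
    waited = 0.0
    while waited < hold_seconds:
        if not is_up():
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return is_up()


def fail_back(primary_is_up: Callable[[], bool],
              set_secondary_enabled: Callable[[bool], None],
              hold_seconds: float = 300.0,
              poll_seconds: float = 10.0,
              reconnect_seconds: float = 60.0,
              max_attempts: int = 5,
              sleep: Callable[[float], None] = time.sleep) -> bool:
    """Bounce the secondary uplink so tunnels re-establish on the primary.

    Both callables are hypothetical hooks supplied by the caller.
    Returns False if the primary did not stay up (nothing is changed).
    """
    # Hold-down timer, then a final check that the primary is truly up.
    if not stable_for(primary_is_up, hold_seconds, poll_seconds, sleep):
        return False

    # Disabling the secondary cuts the tunnel this call travels over,
    # so a lost response is expected here and not treated as a failure.
    try:
        set_secondary_enabled(False)
    except ConnectionError:
        pass

    # Wait for the control tunnel to reconnect, then re-enable the
    # secondary; retry in case the delay was too short.
    for _ in range(max_attempts):
        sleep(reconnect_seconds)
        try:
            set_secondary_enabled(True)
            return True
        except ConnectionError:
            continue
    raise RuntimeError("could not re-enable the secondary interface")
```

    Passing `sleep` in as a parameter is just a convenience so the logic can be dry-run against fake hooks before it is pointed at a real site.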

    You could (should?) add a monitoring script on the inside of the site.  It would wake up at frequent intervals, check whether the secondary interface is disabled, and attempt to enable it.  This reduces the risk of being locked out in a situation where the primary fails again while the secondary is disabled.
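
    A minimal sketch of that inside watchdog, under the same caveats: `secondary_is_enabled` and `enable_secondary` are hypothetical hooks for whatever local access you have to the UTM. The requirement to see the interface disabled on several consecutive checks before acting is my own assumption, added so the watchdog does not undo a deliberate failback bounce the moment it starts.

```python
import time
from typing import Callable, Optional


def run_watchdog(secondary_is_enabled: Callable[[], bool],
                 enable_secondary: Callable[[], None],
                 interval_seconds: float = 60.0,
                 disabled_checks_before_enable: int = 3,
                 max_passes: Optional[int] = None,
                 sleep: Callable[[float], None] = time.sleep) -> int:
    """Periodically re-enable the secondary uplink if it is left disabled.

    Both callables are hypothetical hooks supplied by the caller.
    Returns how many times the watchdog had to re-enable the interface.
    max_passes=None means run forever.
    """
    seen_disabled = 0   # consecutive checks that found it disabled
    re_enabled = 0
    passes = 0
    while max_passes is None or passes < max_passes:
        passes += 1
        try:
            if secondary_is_enabled():
                seen_disabled = 0
            else:
                seen_disabled += 1
                if seen_disabled >= disabled_checks_before_enable:
                    enable_secondary()
                    seen_disabled = 0
                    re_enabled += 1
        except ConnectionError:
            # UTM unreachable on this pass; try again on the next one.
            pass
        sleep(interval_seconds)
    return re_enabled
```

    With the defaults shown, an interface left disabled would be re-enabled on the third check, about two minutes after the watchdog first sees it down.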

    Full disclosure:  I have not done anything close to this myself. 

    Hope you can make it work.  Good luck!