This discussion has been locked.

Uplink balancing vs Active Standby, dual ISP questions

I've got a large number of UTM devices at sites with dual ISPs and we're trying to resolve a 'best practices' question.

We typically have both ISPs active, with multipathing / weights set up to put our 'priority' traffic (VoIP and RED tunnels) on the better ISP and everything else on the secondary. This works great until the primary fails, at which point the tunnels fail over to the secondary. That's not a problem, except that when the primary comes back up, the tunnels never fail back to the primary interface on their own. They can sit on the secondary (weaker) connection for hours, days, or weeks until we manually deactivate and reactivate them.

We're considering going to an Active / Standby setup with dual ISPs to address this issue. However, in that configuration our PRTG service can't properly monitor the backup connection (since it's essentially off).

For those of you on dual ISP setups:

1) How do you make sure RED tunnels (or whatever tunnels) fail back to a primary interface when an outage is resolved?

2) If you're running Active / Standby instead of multipathing, how do you monitor your standby ISP?

 

Thanks for the guidance.



  • Hi Folks,

    I think you've hit a limitation. As of now, it's not possible to fail RED tunnels back to a primary interface once an outage is resolved. This is a really good feature request, and I recommend raising it here. I am going to discuss the possibility with someone from the Development team.

    A Failback feature is now available in Sophos XG, but it still does not cover your requirement of failback for RED devices. I'll post again once I have more details or news.

    Regards

    Jaydeep

  • This is a glaring oversight considering we have decade-old Junipers with this functionality...

    So our options are either:

    1) Risk having our tunnels be tied to the incorrect interface for long periods of time in an Active/Active scenario

    2) Have our ISPs configured in Active / Standby, which fixes the tunnel issue but prevents us from properly monitoring the standby interface or doing any proper load balancing.

    I hope this sort of basic failover scenario gets considered for future revisions, as this seems like a very common setup to have been overlooked.

  • Sounds like you have a pretty sophisticated operation, so I think you can finesse the problem by having your monitoring system trigger the REST API:

    • Primary network down
      • External monitoring detects the outage and triggers an alert. 
      • Internally, UTM switches all traffic to secondary interface.

    • Primary network comes up
      • External monitoring detects the "up" event, and starts a timer to see if it is going to stay up.
      • After the timer expires, the Primary network is assumed healthy, and the monitoring system launches a script.
      • The script uses the REST API over a tunnel (on the secondary interface at this point).
        • First it verifies that the primary interface is truly up.
        • Then it disables the secondary interface.
        • This breaks the VPN tunnel that the REST API session is using, so the script cannot re-enable the secondary interface immediately.
      • A delay timer is needed to allow the control tunnel to reconnect.
      • After the timer expires, the script resumes, and enables the secondary interface.
      • Error handling re-attempts the interface-enable operation, to handle the possibility that the timer interval was too short.
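
The steps above can be sketched roughly as follows. This is a minimal, untested-in-production outline, not a known-working UTM integration: `primary_is_up` and `set_secondary_enabled` are hypothetical callables that you would implement yourself on top of your monitoring system and the UTM's REST API, and all timer values are placeholders.

```python
import time
from typing import Callable


def wait_until_stable(is_up: Callable[[], bool], hold_seconds: float,
                      poll_seconds: float = 10.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> None:
    """Block until is_up() has stayed True for hold_seconds in a row."""
    stable_since = None
    while True:
        now = clock()
        if is_up():
            if stable_since is None:
                stable_since = now
            if now - stable_since >= hold_seconds:
                return
        else:
            stable_since = None  # link flapped, restart the timer
        sleep(poll_seconds)


def fail_back(primary_is_up: Callable[[], bool],
              set_secondary_enabled: Callable[[bool], None],
              reconnect_delay: float = 60.0,
              retries: int = 5,
              retry_delay: float = 30.0,
              sleep: Callable[[float], None] = time.sleep) -> bool:
    """Bounce the secondary interface so tunnels re-form on the primary.

    Returns True once the secondary is enabled again, False if the
    primary was not up or every re-enable attempt failed.
    """
    if not primary_is_up():
        return False  # never disable the secondary while the primary is down
    try:
        set_secondary_enabled(False)
    except ConnectionError:
        # Assumption: the control tunnel may drop before a reply arrives.
        pass
    sleep(reconnect_delay)  # let the control tunnel reconnect via the primary
    for _ in range(retries):
        try:
            set_secondary_enabled(True)
            return True
        except ConnectionError:
            sleep(retry_delay)  # delay was too short, try again later
    return False
```

The monitoring system would call `wait_until_stable(...)` after the "up" event and then `fail_back(...)`. Passing the interface operations in as callables keeps the sequencing logic testable without touching a live firewall.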

    You could (should?) add a monitoring script inside the site. It wakes up at frequent intervals, checks whether the secondary interface is disabled, and attempts to enable it if so. This reduces the risk of being locked out if the primary fails again while the secondary is disabled.
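
A minimal sketch of such an on-site watchdog is below. `secondary_is_enabled` and `enable_secondary` are hypothetical callables you would back with whatever local access to the UTM you have. The `grace_seconds` hold-off is my own assumption, added so the watchdog does not immediately undo the deliberate disable performed by the failback script.

```python
import time
from typing import Callable, Optional


class SecondaryWatchdog:
    """Re-enable the secondary interface if it stays disabled too long."""

    def __init__(self,
                 secondary_is_enabled: Callable[[], bool],
                 enable_secondary: Callable[[], None],
                 grace_seconds: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._is_enabled = secondary_is_enabled
        self._enable = enable_secondary
        self._grace = grace_seconds
        self._clock = clock
        self._disabled_since: Optional[float] = None

    def tick(self) -> bool:
        """Run one check; return True if a re-enable was attempted."""
        if self._is_enabled():
            self._disabled_since = None
            return False
        now = self._clock()
        if self._disabled_since is None:
            self._disabled_since = now  # first time we saw it disabled
        if now - self._disabled_since < self._grace:
            return False  # still inside the planned failback window
        self._enable()
        return True
```

Calling `tick()` from a scheduled job or a simple loop with a short sleep gives the "wakes up at frequent intervals" behaviour. The grace period should be longer than the failback script's reconnect delay.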

    Full disclosure:  I have not done anything close to this myself. 

    Hope you can make it work.  Good luck!