
Uplink balancing vs Active Standby, dual ISP questions

I've got a large number of UTM devices at sites with dual ISPs and we're trying to resolve a 'best practices' question.

We typically have both ISPs active, with multipathing / weights set up to put our 'priority' traffic (VoIP and RED tunnels) on the better ISP and everything else on the secondary. This works great until the primary fails, at which point the tunnels fail over to the secondary. That's not a problem in itself, but when the primary comes back up, the tunnels never fail back to the primary interface on their own. They can sit on the secondary (weaker) connection for hours, days, or weeks until we manually deactivate and reactivate them.

We're considering moving to an Active / Standby setup with dual ISPs to address this issue. In that configuration, however, our PRTG service can't properly monitor the backup connection (since it's essentially off).

For those of you on dual ISP setups:

1) How do you make sure RED tunnels (or whatever tunnels) fail back to a primary interface when an outage is resolved?

2) If you're running Active / Standby instead of multipathing, how do you monitor your standby ISP?

Thanks for the guidance.



This thread was automatically locked due to age.
  • Sounds like you have a pretty sophisticated operation, so I think you can finesse the problem by having your monitoring system call the UTM's REST API:

    • Primary network down
      • External monitoring detects the outage and triggers an alert. 
      • Internally, UTM switches all traffic to secondary interface.

    • Primary network comes up
      • External monitoring detects the "up" event, and starts a timer to see if it is going to stay up.
      • After the timer expires, the Primary network is assumed healthy, and the monitoring system launches a script.
      • The script uses the REST API over a tunnel (which is on the secondary interface at this point).
        • First it verifies that the primary interface is truly up.
        • Then it disables the secondary interface.
        • This breaks the VPN tunnel that the REST API session is using, so the script cannot re-enable the secondary interface immediately.
      • A delay timer is needed to allow the control tunnel to reconnect, this time over the primary interface.
      • After the timer expires, the script resumes, and enables the secondary interface.
      • Error handling re-attempts the interface-enable operation, to handle the possibility that the timer interval was too short.
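
    The sequence above might be sketched in Python roughly as follows. This is an untested outline, not a working integration: `is_primary_up` and `set_secondary_enabled` are hypothetical callables that you would write against the UTM's REST API (I am not showing real endpoints because I have not verified them), and the timer values are placeholders to tune per site.

```python
import time
from typing import Callable


def wait_until_stable(is_up: Callable[[], bool], hold_seconds: float,
                      poll_seconds: float,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True only if is_up() stays True for the whole hold-down period."""
    waited = 0.0
    while waited < hold_seconds:
        if not is_up():
            return False  # link flapped; restart the timer on the next "up" event
        sleep(poll_seconds)
        waited += poll_seconds
    return is_up()


def fail_back(is_primary_up: Callable[[], bool],
              set_secondary_enabled: Callable[[bool], None],
              reconnect_delay: float, max_attempts: int, retry_delay: float,
              sleep: Callable[[float], None] = time.sleep) -> bool:
    """Bounce the secondary interface so tunnels re-establish on the primary.

    Both callables are hypothetical wrappers around the firewall's API.
    Returns False (and changes nothing) if the primary is not up.
    Raises RuntimeError if the secondary could not be re-enabled.
    """
    if not is_primary_up():
        return False  # never disable the secondary while the primary is down

    try:
        set_secondary_enabled(False)
    except ConnectionError:
        # The control tunnel rides on the secondary, so the reply to this
        # request may never arrive. Carry on; re-enabling is safe either way.
        pass

    sleep(reconnect_delay)  # give the control tunnel time to reconnect

    for _ in range(max_attempts):
        try:
            set_secondary_enabled(True)
            return True
        except ConnectionError:
            sleep(retry_delay)  # delay was too short; wait and try again
    raise RuntimeError("secondary interface is still disabled; intervene manually")
```

    The monitoring system would call `wait_until_stable` when it sees the "up" event and, if that returns True, call `fail_back`. Passing `sleep` in as a parameter is only there so the logic can be exercised without real delays.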

    You could (should?) also run a monitoring script inside the site. It wakes up at frequent intervals, checks whether the secondary interface is disabled, and tries to enable it if so. This reduces the risk of being locked out if the primary fails again while the secondary is disabled.
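
    A single pass of that inside watchdog could look something like the sketch below. Again untested, and `is_secondary_enabled` and `enable_secondary` are hypothetical callables standing in for whatever local access to the UTM you have.

```python
from typing import Callable


def watchdog_pass(is_secondary_enabled: Callable[[], bool],
                  enable_secondary: Callable[[], None]) -> str:
    """Run one check. Returns 'ok', 'reenabled', or 'failed'."""
    try:
        if is_secondary_enabled():
            return "ok"          # nothing to do
        enable_secondary()       # found it disabled, so turn it back on
        return "reenabled"
    except ConnectionError:
        return "failed"          # could not reach the firewall; retry next pass
```

    You would run this from cron or a small loop. The interval probably wants to be longer than the reconnect delay in the failback script, so the watchdog does not cut the bounce short.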

    Full disclosure:  I have not done anything close to this myself. 

    Hope you can make it work.  Good luck!
