This discussion has been locked.

High Availability (Sanity Check)

Hi all,

We recently purchased two SG450 Hardware Appliances to run in Active/Passive (Hot Standby) mode and I'm wondering if anyone could cast their eye over our set-up to confirm if I have configured HA correctly.

Node 1 - Master. Networked, joined to our Domain (to authenticate users), firewall and web profiles configured and operational. HA Operation Mode - Hot Standby (active-passive), Sync NIC = eth8, Device Node ID = 1, Encryption Key configured. Preferred Master = Node 1, No Backup Interface set.

Node 2 - Slave. Networked, HA Operation Mode - Automatic Configuration. Sync NIC = eth8.

The Sync NICs are connected over our switching infrastructure on a non-routed VLAN as the two nodes are in separate rooms.

On the face of it, this appeared to work. My only problem is when I reboot the Master unit and start a continuous ping, there is a discernible loss of connectivity (in excess of 1 minute) in both the pings and general Internet connectivity. Is this not a bit excessive?

Any recommendations/suggestions (and indeed criticisms) would be most welcome.

Many thanks,

John P



This thread was automatically locked due to age.
  • Hi All,

    Just a few more things to add to the mix here (apologies for muddying the waters further!!).

    I have just shut Node 1 (Master) down completely. After a period of approximately 70 seconds or so, ping connectivity to the internal interface was restored, as was Internet connectivity. However, the live Web Filtering log is showing no traffic and Web Filtering policies are not being imposed at all.

    The GUI is saying that Node 2 is now Master and Active.

    Curious, to say the least.

    Needless to say, I'm lost and any suggestions would be most welcome.

    John P

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive

  • Hi All,

    Just a quick update.

    I found out the reason why the Web Filtering log was showing no traffic and why the Web Filtering Policies were not being applied. I had previously created an Exception List on Node 1 to permit all traffic from a specific IP address on our internal network without logging it. I assume this synced across to Node 2. However, when I later deleted this rule from Node 1, it apparently did not sync its deletion to Node 2. This is despite the fact that the rule did not appear in the GUI when I shut down Node 1 completely and Node 2 became Master.

    I turned off HA on Node 2 and reset it to Automatic Configuration once again and synced the two appliances. Normal service is once again resumed.

    However, when rebooting the Active appliance to test HA, I still get a break in continuous pings and Internet connectivity for a period in excess of 70 seconds. I think this is rather excessive, particularly when documentation indicates that the interruption should only be for a few seconds (one ping).

    Our HA Connection runs over our network switching infrastructure on a non-routed VLAN. Could this possibly contribute to the excessive loss of connectivity and would it be better to have a direct link between the two appliances?

    Many thanks for your time and patience.

    John P

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive

  • Hey John.

    This is way excessive. I have a few HA setups and none of them take more than one ping to switch over. But I'm using a direct connection between the HA interfaces; I've never done it through a switch, and, unless I'm mistaken, Sophos support recommends against using a switch between the HA interfaces in a cluster environment.

    If I had to guess from the time the switchover takes to complete, I would look for something in your switching infrastructure. Maybe setting spanning tree to portfast on the ports used by the UTM HA interfaces (assuming it's a Cisco switch) might mitigate this issue. A valuable test would be to connect the HA interfaces directly, somehow.

    Regards - Giovani

  • Hi Giovani,

    Many thanks for your prompt and very helpful reply.

    I know that Sophos don't recommend using a switching infrastructure between the appliances, but an engineer told me that whilst it isn't the preferred method, it should work anyway.

    On my return to work next week, I'll look into reconfiguring the switches as per your recommendation. Failing that I will link the appliances directly over our fibre infrastructure using media converters. I used the switching infrastructure as I didn't have the necessary fibre optic patch leads to hand.

    Thanks again for your help.

    I'll post the results next week.

    John P

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive

  • Hi Giovani,

    I managed to scrape up the required patch leads and now have the two appliances connected directly. I'm using eth8 as the HA interface on both appliances.

    Unfortunately, I'm seeing the exact same behaviour as before. When failing over from the Master (Node 1) to Slave (Node 2) there is a loss of ping connectivity and Internet connectivity for 70+ seconds.

    However, once connectivity has been restored and the appliances are once more in sync and I fail back over from Node 2 to Node 1, I only drop one ping and Internet connectivity is lost for only a second.

    Any suggestions would be most welcome.

    Many thanks,

    John P

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive

  • Hey John.

    Could you lay out a little bit of your switching infrastructure? Since your UTMs are in different rooms, I suspect that your configuration is a bit different from what I'm used to. And I still bet this behavior has more to do with your switching infrastructure than with the UTM itself.

    Regards - Giovani

  • Hi Giovani,

    I have received a recommendation from Sophos Support to use eth3 (I'm currently using eth8) as the HA Failover interface.

    I do recall seeing this mentioned in some documentation, but I was unsure if it applied to the particular hardware appliances we are using (SG450). I will give their recommendation a try and post back any updates.

    Many thanks for your continued assistance.

    Best regards,

    John

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive

  • Hi Giovani,

    Further to my last post.

    I changed the HA port to eth3 as recommended by Sophos. This made no difference at all. Rebooting Node 1 results in downtime of 70+ seconds until Node 2 takes over. However, when I fail back from Node 2 to Node 1, the downtime is negligible (maybe 1-2 seconds). Attached (hopefully) is an overview of how our UTMs have been deployed. Maybe this will help in figuring out what the issue may be.

    Many thanks,

    John P

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive

  • Hi John,

    Are the two Cisco 2960s stacked and working as one switch, or do they each have their own ARP tables?

    Have you checked this: https://www.sophos.com/en-us/medialibrary/PDFs/documentation/asg_8_HA_deployment_geng.pdf

    It's a bit old, but it describes the fully meshed links needed for a working HA environment...

    greets

    zaphod
    ___________________________________________

    Home: Zotac CI321 (8GB RAM / 120GB SSD)  with latest Sophos UTM
    Work: 2 SG430 Cluster / many other models like SG105/SG115/SG135/SG135w/...

  • OK, now it's starting to make some sense. I'm not sure I'll be able to help you, but if you provide more information, I'll give it a go. Two heads think better than one, right?

    For the sake of sanity, could you repeat a failover and do these tests:

    - Continuous ping from an external source to your WAN IP. Allow ICMP on the WAN port for a short period of time, just for diagnostic purposes.

    - Continuous ping from a device connected to the second server room Cisco 3750.

    - Continuous ping from a device connected to the first server room Cisco 3750.

    - Monitor (and capture) the Cisco 3750 and 2960 consoles for any activity on the ports connected to the UTMs during the failover.

    Also, could you provide us with the configuration of the ports to which the UTMs are connected on the Cisco 2960 and 3750?

    Regards - Giovani
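
The continuous-ping tests suggested above are easier to compare across vantage points if each one produces a number rather than an impression. The following is only a rough sketch of how that could be done, not anything from Sophos or from the posters: `probe` and `longest_outage` are hypothetical helper names, the `ping -c 1 -W 1` flags assume a Linux-style `ping`, and the target address in the usage note is a placeholder.

```python
import subprocess
import time
from typing import Iterable, List, Tuple

# One probe result: (timestamp in seconds, was a reply received?)
Sample = Tuple[float, bool]


def probe(host: str, count: int, interval: float = 1.0) -> List[Sample]:
    """Send one ICMP echo per interval and record whether it was answered.

    Assumption: Linux-style ping, where -c 1 sends a single echo and
    -W 1 waits at most one second for the reply.
    """
    samples: List[Sample] = []
    for _ in range(count):
        started = time.time()
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append((started, result.returncode == 0))
        # Sleep for whatever is left of the interval, if anything.
        time.sleep(max(0.0, interval - (time.time() - started)))
    return samples


def longest_outage(samples: Iterable[Sample]) -> float:
    """Return the longest outage in seconds.

    An outage runs from the first lost probe to the first answered probe
    after it. If the samples end while probes are still being lost, the
    time up to the last sample counts as a lower bound.
    """
    longest = 0.0
    outage_start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts  # first lost probe of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None  # connectivity is back
        last_ts = ts
    if outage_start is not None and last_ts is not None:
        longest = max(longest, last_ts - outage_start)
    return longest
```

One way to use it would be to start something like `longest_outage(probe("192.0.2.1", 180))` on each test machine (the external host and a device on each server room's 3750) just before rebooting the Master, with the placeholder address replaced by the relevant UTM interface. If all three vantage points report roughly the same 70-second gap, that points more towards the UTM takeover itself; if only some do, the switching path between them is the more likely suspect.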