High Availability (Sanity Check)

Hi all,

We recently purchased two SG450 Hardware Appliances to run in Active/Passive (Hot Standby) mode and I'm wondering if anyone could cast their eye over our set-up to confirm if I have configured HA correctly.

Node 1 - Master. Networked, joined to our Domain (to authenticate users), firewall and web profiles configured and operational. HA Operation Mode - Hot Standby (active-passive), Sync NIC = eth8, Device Node ID = 1, Encryption Key configured. Preferred Master = Node 1, No Backup Interface set.

Node 2 - Slave. Networked, HA Operation Mode - Automatic Configuration. Sync NIC = eth8.

The Sync NICs are connected over our switching infrastructure on a non-routed VLAN as the two nodes are in separate rooms.

On the face of it, this appeared to work. My only problem is when I reboot the Master unit and start a continuous ping, there is a discernible loss of connectivity (in excess of 1 minute) in both the pings and general Internet connectivity. Is this not a bit excessive?

Any recommendations/suggestions (and indeed criticisms) would be most welcome.

Many thanks,

John P



This thread was automatically locked due to age.
  • Hi All,

    Just a few more things to add to the mix here (apologies for muddying the waters further!!).

    I have just shut Node 1 (Master) down completely. After a period of approximately 70 seconds or so, ping connectivity to the internal interface was restored, as was Internet connectivity. However, the live Web Filtering log is showing no traffic and Web Filtering policies are not being imposed at all.

    The GUI is saying that Node 2 is now Master and Active.

    Curious, to say the least.

    Needless to say, I'm lost and any suggestions would be most welcome.

    John P

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive

  • Hi All,

    Just a quick update.

    I found out the reason why the Web Filtering log was showing no traffic and why the Web Filtering Policies were not being applied. I had previously created an Exception List on Node 1 to permit all traffic from a specific IP address on our internal network without logging it. I assume this synced across to Node 2. However, when I later deleted this rule from Node 1, it apparently did not sync its deletion to Node 2. This is despite the fact that the rule did not appear in the GUI when I shut down Node 1 completely and Node 2 became Master.

    I turned off HA on Node 2 and reset it to Automatic Configuration once again and synced the two appliances. Normal service is once again resumed.

    However, when rebooting the Active appliance to test HA, I still get a break in continuous pings and Internet connectivity for a period in excess of 70 seconds. I think this is rather excessive, particularly when documentation indicates that the interruption should only be for a few seconds (one ping).

    Our HA Connection runs over our network switching infrastructure on a non-routed VLAN. Could this possibly contribute to the excessive loss of connectivity and would it be better to have a direct link between the two appliances?

    Many thanks for your time and patience.

    John P

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive

  • Hi John,

    Are the two Cisco 2960s stacked so that they work as one switch, or does each have its own ARP table?

    Have you checked this? https://www.sophos.com/en-us/medialibrary/PDFs/documentation/asg_8_HA_deployment_geng.pdf

    It's a bit old, but it describes the fully meshed links needed for a working HA environment.

    greets

    zaphod
    ___________________________________________

    Home: Zotac CI321 (8GB RAM / 120GB SSD)  with latest Sophos UTM
    Work: 2 SG430 Cluster / many other models like SG105/SG115/SG135/SG135w/...

  • OK, now it's starting to make some sense. I'm not sure I'll be able to help you, but if you provide more information, I'll give it a go. Two heads think better than one, right?

    For the sake of sanity, could you repeat a failover and run these tests:

    - Continuous ping from an external source to your WAN IP. Allow ICMP on the WAN port for a short period of time, just for diagnostic purposes.

    - Continuous ping from a device connected to the second server room Cisco 3750.

    - Continuous ping from a device connected to the first server room Cisco 3750.

    - Monitor (and capture) the Cisco 3750 and 2960 consoles for any activity on the ports connected to the UTMs during the failover.
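
For anyone repeating the ping tests above, a minimal Python sketch for timing the outage rather than eyeballing it. The helper names (`longest_outage`, `probe`, `monitor`) are made up for this example, and `probe` assumes a Linux-style `ping` (`-c` for count, `-W` for timeout in seconds); other platforms use different flags.

```python
import subprocess
import time
from typing import Iterable, List, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, reply received?)


def longest_outage(samples: Iterable[Sample]) -> float:
    """Return the longest gap, in seconds, between the first lost probe
    of a run of failures and the next successful probe.
    A run of failures that never recovers is not counted."""
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp          # first lost probe of a run
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None               # connectivity is back
    return longest


def probe(host: str) -> bool:
    """Send one ICMP echo request; True if a reply came back.
    Assumes a Linux-style ping (-c count, -W timeout in seconds)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(host: str, duration: float, interval: float = 1.0) -> List[Sample]:
    """Probe host roughly once per interval for duration seconds."""
    samples: List[Sample] = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe(host)))
        time.sleep(interval)
    return samples
```

Running something like `longest_outage(monitor("192.0.2.1", duration=180))` from each vantage point while the failover is triggered gives one number per location to compare (192.0.2.1 is a documentation address; substitute the UTM interface IP).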

    Also, could you provide us with the configuration of the ports that the UTMs are connected to on the Cisco 2960 and 3750?

    Regards - Giovani

  • Hi Zaphod,

    Many thanks for joining in. Between you and Giovani I think we are making some headway here.

    We physically moved Node 2 in to the same room as Node 1 and connected its internal and external interfaces to the same 3750 and 2960 switches used by Node 1.

    Failover in both directions now works as it should. Only one dropped ping to the internal interface when failing over from Node 1 to Node 2 and only one ping dropped when failing back.

    The logs on the switches showed some port flapping during the failovers, but otherwise everything worked OK. I could see the MAC address disappearing from one port and being assigned to another port almost immediately when the failover was triggered.

    Zaphod, in answer to your question, the 2960 switches are not stacked, therefore they would have their own ARP Tables. I take it that these ARP Tables aren't shared between switches which are not stacked, even if they are linked via a trunk port connection?

    Giovani, we haven't yet moved in to production with our UTMs, so hopefully I'll get a chance to carry out some of the tests you suggested over the next few days. In the meantime, here is the configuration of the ports on the 3750s and 2960s:

    3750:

    interface GigabitEthernet 1/0/2
    description *** L-320 Sophos UTM Internal Interface ***
    switchport access vlan 320
    switchport mode access
    switchport nonegotiate
    no cdp enable
    spanning-tree portfast
    spanning-tree bpduguard enable
    spanning-tree guard root
    end

    2960:

    interface GigabitEthernet 0/2
    description *** L-19 UTM External Interface  ***
    switchport access vlan 19
    switchport mode access
    switchport nonegotiate
    no cdp enable
    no cdp tlv server-location
    no cdp tlv app
    spanning-tree portfast
    spanning-tree bpduguard enable
    spanning-tree guard root
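
When comparing stanzas like the two above across several switches, a small script can flag a port that is missing a line the others have. The sketch below is only an illustration: the function names are made up, and treating `switchport mode access` and `spanning-tree portfast` as the "expected" lines is an assumption for this example, not a Sophos or Cisco requirement. PortFast is included because, without it, a port that flaps has to pass through the spanning-tree listening and learning states before it forwards traffic again.

```python
from typing import Dict, List, Optional, Set

# Lines both port stanzas in this thread have in common; treating them
# as "required" is an assumption made for this sketch.
EXPECTED = {
    "switchport mode access",
    "spanning-tree portfast",
}


def parse_interfaces(config: str) -> Dict[str, Set[str]]:
    """Group IOS-style config lines under their 'interface ...' header."""
    interfaces: Dict[str, Set[str]] = {}
    current: Optional[str] = None
    for raw in config.splitlines():
        line = " ".join(raw.split())   # trim and collapse repeated whitespace
        if not line or line == "end":
            continue
        if line.startswith("interface "):
            current = line[len("interface "):]
            interfaces[current] = set()
        elif current is not None:
            interfaces[current].add(line)
    return interfaces


def missing_lines(config: str) -> Dict[str, List[str]]:
    """Map each interface to the expected lines it lacks."""
    return {
        name: sorted(EXPECTED - lines)
        for name, lines in parse_interfaces(config).items()
    }
```

Feeding it the pasted output of the two stanzas should return an empty list for each interface; any non-empty list names the lines to look at first.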

    Again, guys, many thanks for your input, it has been of great help and is truly appreciated.

    Best regards,

    John P

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive

  • Hi John,

    Glad to help you sort things out :-)


    A trunk between the two Cisco switches is not enough. As you can see, they have two separate ARP tables, which was the main problem in your first attempt.

    I don't know the Cisco 2960 well enough to say whether they can be stacked so that they act as one device with one ARP table for both backplanes.

    If you find a solution, please share it ;-)

    greets

    zaphod
    ___________________________________________

    Home: Zotac CI321 (8GB RAM / 120GB SSD)  with latest Sophos UTM
    Work: 2 SG430 Cluster / many other models like SG105/SG115/SG135/SG135w/...

  • Hi Zaphod,

    I have raised a ticket with Sophos Support to see if they can come up with a solution which allows us to keep our proposed design and will post here any suggestions/recommendations they may have.

    We were hoping to keep the appliances in separate rooms to enhance resiliency. I won't pretend to know or understand how the actual failover mechanism works here; the Sophos documentation appears to be a bit vague in this area. I can't claim to be a Cisco networking expert either. Looks like I have a bit of reading ahead of me to get my head around ARP!!

    However, it has been an interesting learning experience and I am most grateful to both you and Giovani for your kind assistance.

    Best regards,

    John P

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive

  • Hey John.

    I'm with Zaphod on that one. I'm guessing that what you are seeing is caused by the arp tables being different on both switches (or stacks, in case of the 3750s).

    AFAIK, when you build an Active/Passive cluster, Sophos UTM creates a virtual MAC for every interface. When the slave takes over, it basically binds those MACs to its own interfaces. So I'm guessing that, in your case, when the slave takes over, the stack in Room 1 takes a little while to update its ARP table, causing that delay, probably because the MAC was never really updated but simply moved from one stack to the other. Not sure I'm making any sense here; I hope I am. That's why I asked you to test the ping from an endpoint connected to the Room 2 stack, as I'm pretty sure you'll see no delay there.
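
The guess above can be pictured with a toy model of two unstacked switches joined by a trunk. This is only a sketch of the hypothesis, not of how Cisco or Sophos actually implement anything: the `Switch` class, port names, and MAC value are made up, and the model deliberately ignores entry aging and the flushing of entries when a port's link goes down.

```python
from typing import Dict, Optional


class Switch:
    """Toy layer-2 switch: remembers the port each source MAC was last seen on."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.mac_table: Dict[str, str] = {}

    def learn(self, src_mac: str, port: str) -> None:
        self.mac_table[src_mac] = port

    def lookup(self, dst_mac: str) -> Optional[str]:
        return self.mac_table.get(dst_mac)


def send_from_node(mac: str, local: Switch, local_port: str,
                   remote: Switch, trunk_port: str) -> None:
    """A frame from a UTM node is learned on its own switch and, once it
    crosses the trunk, on the other switch as well."""
    local.learn(mac, local_port)
    remote.learn(mac, trunk_port)


VIRTUAL_MAC = "02:00:00:00:00:01"   # made-up value for illustration
room1, room2 = Switch("room1"), Switch("room2")

# Node 1 (Room 1) is master and sends traffic.
send_from_node(VIRTUAL_MAC, room1, "Gi0/2", room2, "trunk")
before = room1.lookup(VIRTUAL_MAC)      # Room 1 points at Node 1's port

# Failover: Node 2 (Room 2) takes over the same MAC. Until one of its
# frames reaches Room 1, that switch still points at the old port.
stale = room1.lookup(VIRTUAL_MAC)

# Node 2 sends its first frame (for example a gratuitous ARP).
send_from_node(VIRTUAL_MAC, room2, "Gi0/2", room1, "trunk")
after = room1.lookup(VIRTUAL_MAC)       # Room 1 now points at the trunk
```

In this model the outage lasts exactly as long as `stale` stays wrong, which is why a ping from a device on the Room 2 switch would be expected to recover first.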

    There is an HA setting intended specifically for running a Sophos UTM cluster as virtual machines, because of some issues with MAC spoofing on virtual machines and so on. Maybe, just maybe, it could help you, as it would not create a virtual MAC but would instead use the MAC addresses of the physical interfaces. I'm thinking that a new interface coming online would force a MAC broadcast on both stacks, reducing that convergence time. Since it's not in production, I think it's worth a try. To disable the virtual MAC for HA, run this command in the shell on both UTMs:

    cc set ha advanced virtual_mac 0

    In case you're not sure how to get a shell on the slave, run ha_utils ssh from the master shell. That will log you in to the slave over SSH.

    As I mentioned earlier, it's just a thought, but I think it might be worth trying.

    Regards - Giovani

    P.S.: I'm glad to help. I think we learn something new every time we try to help someone.

  • Hi Giovani,

    Many thanks for your input. It has certainly made some aspects of what is going on here much clearer.

    The delay in the ARP Table update looks like it indeed could be the main culprit here. I will give that command a try and see what happens. I'll post any further developments.

    Once again, many thanks for your patience and assistance.

    Best regards,

    John

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive

  • Hi all,

    Just a quick update on how things are progressing.

    I moved Node 2 back to Server Room 2 (see above diagram) and was willing to accept the 70+ second interruption to service during failover from Node 1 to Node 2, when lo and behold, testing the failover in both directions (Node 1-Node 2, Node 2-Node 1) now results in only one ping being dropped [*-)]

    To say I'm confused is a bit of an understatement as I have made no changes to the configuration of the UTMs or the network switches. I'm hoping that it isn't something as daft as re-seating the cable which has resolved this issue. If it is, please accept my apologies if I have wasted anyone's time.

    Many thanks for your patience and assistance.

    Best regards,

    John P

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive

  • Hey John.

    I'm just glad it's working as it should. These things have a way of sorting themselves out from time to time. I was glad to help in any way I can, so don't worry about it. If you ever find out what it was, post it back so someone else can be helped by your experience.

    Regards - Giovani

  • Hi Giovani,

    Thanks again for all your support. Sophos were able to confirm that the design we were using is fine and should work with no problems. I'm just happy that it's working now and I'm a step closer to moving it in to our live production environment.

    I'm sure I'll have lots more questions for the community in the coming months, but it is good to know that there are people like you and Zaphod out there who are willing to lend a hand.

    Best regards,

    John P

    2 x SG450 (Version 9.714-4)

    HA = Active-Passive
