This discussion has been locked.

Sophos UTM (SG450) cluster --> Link Aggregation Group failed after switching cluster status

Hello,

We have an SG450 cluster (Sophos UTM version 9.602).

Yesterday we replaced one of the nodes because it displayed a RAID error.

 

We have two 10 Gbit SFP+ transceivers in ports E8 and E9. A link aggregation group (LAG) is defined on these two ports.

Several VLANs are mapped onto this LAG (interface type: Ethernet VLAN), so the Sophos SG450 cluster does the core routing for the subnets.

After changing the HA configuration from "Hot-Standby (active-passive)" to "off" (for testing),

all gateway addresses of the VLANs lost connectivity (no response, no pings, etc.).

 

 

I then accessed the Sophos via "eth0". After taking each VLAN interface offline and then online again,

the network connections came back (pings, network access, etc.).

This behaviour only occurs when I change the HA configuration from active-passive to "off".

It seems that the trunk / LAG has trouble with this change and needs to be reactivated.

 

Can anyone imagine why this happens? 

 

Thanks so far!

 

 



  • Within a cluster, you use a virtual MAC ... and some other cluster-specific functions may be involved ...

    Possibly the switch (or the SG) needs a link-down event to pick up the new addresses/config of the LAG.

    Do you see no problems with a "normal" failover?

    Sometimes I see the LAG ports from both SG nodes ending up in a single LAG group on the switch. That is usually what causes the problem.

     


    Dirk

    Systema Gesellschaft für angewandte Datentechnik mbH  // Sophos Platinum Partner
    Sophos Solution Partner since 2003

  • Hello and thank you for your answer.

     

    Yes, that could be the problem...

    On a normal failover the MAC addresses are "refreshed".

    We use two HP core switches with LACP trunks to the SG450 nodes and replaced a faulty node two weeks ago.

     

    Here it seems that the virtual MAC address of the LAG to the firewall cluster pointed to another port on the switch.

    After disabling and re-enabling the LAG interface (between the firewall nodes and the core switches), the MAC address and port assignment on the HP switches was refreshed and all network traffic went through again.
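
    One plausible reading of this behaviour, sketched as a toy model: a switch forwards frames for a MAC address to the port where it last learned that address, and the entry only changes when the port goes down or a frame arrives from a different port. The `MacTable` class and the MAC address below are made up for illustration; this is not HP firmware behaviour or a Sophos API, only a minimal simulation of a learning table.

```python
from typing import Dict, Optional


class MacTable:
    """Toy model of a switch MAC address table (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def learn(self, mac: str, port: str) -> None:
        # A frame with source address `mac` arrived on `port`.
        self._entries[mac] = port

    def flush_port(self, port: str) -> None:
        # A link-down on `port` removes every entry learned on it.
        self._entries = {m: p for m, p in self._entries.items() if p != port}

    def lookup(self, mac: str) -> Optional[str]:
        # None means "unknown destination"; a real switch would flood the frame.
        return self._entries.get(mac)


VIRTUAL_MAC = "02:00:00:00:00:01"  # made-up example address

table = MacTable()
table.learn(VIRTUAL_MAC, "Trk1")         # cluster MAC first seen on Trk1
stale = table.lookup(VIRTUAL_MAC)        # still Trk1, even if the active node has moved
table.flush_port("Trk1")                 # interface bounced: entry removed
after_flush = table.lookup(VIRTUAL_MAC)  # switch has to relearn the address
table.learn(VIRTUAL_MAC, "Trk2")         # next frame from the firewall arrives on Trk2
relearned = table.lookup(VIRTUAL_MAC)
```

    In this model, bouncing the interface is one way to clear the stale entry; a frame sent by the firewall from the new port would overwrite it as well.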

     

     

     

  • We configure the two LAGs (one per firewall node) by hand.


    Dirk


  • What do you mean by 2 LAGs?

     

    We have one LAG (LAG1), and all Ethernet VLAN interfaces are mapped onto this LAG (tagged VLANs).

     

  • You need one LAG per firewall node on the switches. Otherwise, we have seen LAG problems during failover of the SG/XG.
    This is understandable from the switch's point of view: all known/negotiated ports go down and 2 completely new ones appear ... the existing LAG gets into a crisis.

    SG1-eth8 --> LAG1 - Core1-E3
    SG1-eth9 --> LAG1 - Core2-E3

    SG2-eth8 --> LAG2 - Core1-E6
    SG2-eth9 --> LAG2 - Core2-E6

    On the SG itself, only one LAG can be configured for this.

    This example applies if the switch nodes can form a common LAG port (stack/vStack or similar).

    With 2 stand-alone switches it looks very different.
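
    The "one LAG per firewall node" rule can be checked against a cabling plan with a few lines of Python. This is only a sketch: the `CABLING` list restates the mapping from this post, while `mixed_lags` and the `BAD_CABLING` counter-example are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (firewall node, firewall port, switch-side LAG, switch port), as listed in the post
CABLING: List[Tuple[str, str, str, str]] = [
    ("SG1", "eth8", "LAG1", "Core1-E3"),
    ("SG1", "eth9", "LAG1", "Core2-E3"),
    ("SG2", "eth8", "LAG2", "Core1-E6"),
    ("SG2", "eth9", "LAG2", "Core2-E6"),
]

# Hypothetical counter-example: ports of both nodes end up in one switch LAG.
BAD_CABLING: List[Tuple[str, str, str, str]] = [
    (node, fw_port, "LAG1", sw_port) for node, fw_port, _lag, sw_port in CABLING
]


def mixed_lags(cabling: List[Tuple[str, str, str, str]]) -> Dict[str, List[str]]:
    """Return the switch-side LAGs that bundle ports from more than one firewall node."""
    nodes_per_lag: Dict[str, Set[str]] = defaultdict(set)
    for node, _fw_port, lag, _sw_port in cabling:
        nodes_per_lag[lag].add(node)
    return {lag: sorted(nodes) for lag, nodes in nodes_per_lag.items() if len(nodes) > 1}


good_result = mixed_lags(CABLING)      # no LAG mixes nodes
bad_result = mixed_lags(BAD_CABLING)   # LAG1 contains ports of SG1 and SG2
```

    An empty result means every switch-side LAG belongs to exactly one firewall node; any LAG listed in the result is a candidate for the failover problem described above.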

     

     

     

     

     

     


    Dirk


  • Oh, there was a mistake in my picture...

    We have the following configuration:

     

     

    The HA cluster mode is Hot-Standby (active-passive), so on the cluster we have only one LAG.

    The same goes for the two core switches: Core 1 is active and Core 2 is passive. Trk1 and Trk2 are configured on these core switches.

     

    I tested a failover yesterday and I think it works correctly now:

     

    On both core switches, the virtual MAC address of LAG1 moved from Trk1 to Trk2 after rebooting the Sophos MASTER node.

     

    Does your configuration refer to an active-active HA cluster or to Hot-Standby?

    If Hot-Standby: how can you configure two LAGs?

     

     

    Thank you so far.

     

  • On the Sophos devices you configure only one LAG, with ports 8+9.

    You can't create a config for the second node or use ports from the second node within the config.

    BTW: it is the same with active/standby and active/active; slave ports are always blocked.

    With Cisco switches we repeatedly ran into problems distributing one LAG across the active and the standby device. So we use one LAG for the active device and a different one for the standby device.


    Dirk


  • Ahh ok,

    now I understand your problem (...with the Cisco devices).

     

    Your hint about the virtual MAC address helped me further in my analysis.

    Thank you ;)