
No failover on HA Cluster if NIC is in Error-State

Hello,

We are using two SG 310s as an active/passive cluster.

Unfortunately, we have found that HA is not working.

Approximately once a month we lose our internet connection.

The external WAN interface shows "ERROR" under "Link" and "UP" under "State".

Unfortunately, the cluster doesn't switch to the other node, and the whole office is cut off from the internet.

If we turn the interface off and on again, everything works fine.

Rebooting the affected node works too.

 

Has anyone experienced similar behaviour?

We have to sort this problem out, and support is not very helpful in this case.

Tibor



  • Which cluster state do you see?

    If the cluster shows an "unlinked" state, an interface going down won't result in a cluster failover.


    Dirk

    Systema Gesellschaft für angewandte Datentechnik mbH  // Sophos Platinum Partner
    Sophos Solution Partner since 2003
    If a post solves your question, click the 'Verify Answer' link at this post.

  • The problem is that there is no "interface down". Instead, the WAN interface shows an "error" state.

    The cluster itself is working fine. If I reboot the affected node, the second node takes over immediately.

  • We often see the "error" state when the speed and duplex settings are not the same on the router and the SG.

    For example, you cannot have "auto" on one side only.


    Dirk


  • The true reason for an 'error' state is that it failed the 'uplink monitoring' conditions. 

    Basically, the UTM cannot reach the hosts it checks, on whatever protocols it uses, to determine whether it has internet access. The 'error' state itself is not the problem; what caused it is.

    Do you reboot both nodes when this happens or just one? 

  • I rebooted just one.

    Then the second one takes over and everything is fine.

     

  • If you do a failover test manually, does it cause the problem? It might be the switch configuration if it has port security/spoof protection. 

  • Unfortunately, the error is not reproducible.

    The manual failover solves the problem; it does not cause it.

    In the meantime I read a thread that mentioned a problem at the ARP level between a UTM and a router.

     

    I've arranged to have the provider's router set to a fixed 100 Mbit.

    Then I will set the UTM to 100 Mbit as well.

    Let's see...
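
Once both sides are set to fixed values, it may be worth confirming what the WAN port actually negotiated. The following is only a minimal sketch: it assumes `ethtool` is available on the UTM shell, and the interface name `eth1`, the sample output, and the expected values are all made up for illustration, not taken from this thread. A port reporting half duplex is the classic symptom when one side is fixed and the other is left on "auto".

```python
def parse_ethtool(output: str) -> dict:
    """Pull the link-related fields out of `ethtool <interface>` text output."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:  # skip continuation lines that carry no "key: value" pair
            fields[key.strip()] = value.strip()
    return {
        "speed": fields.get("Speed"),
        "duplex": fields.get("Duplex"),
        "autoneg": fields.get("Auto-negotiation"),
        "link": fields.get("Link detected"),
    }


def mismatches(actual: dict, expected: dict) -> list:
    """List every expected setting that the port does not currently report."""
    return [
        f"{key}: expected {want}, got {actual.get(key)}"
        for key, want in expected.items()
        if actual.get(key) != want
    ]


# Illustrative sample only (hypothetical interface and values).
SAMPLE = """Settings for eth1:
    Speed: 100Mb/s
    Duplex: Half
    Auto-negotiation: on
    Link detected: yes
"""

# Assumed target: fixed 100 Mbit, full duplex, auto-negotiation off.
EXPECTED = {"speed": "100Mb/s", "duplex": "Full", "autoneg": "off"}

if __name__ == "__main__":
    for problem in mismatches(parse_ethtool(SAMPLE), EXPECTED):
        print(problem)
```

In practice the text passed to `parse_ethtool` would be the real output of `ethtool` for the WAN interface, and the same check would be repeated against whatever the provider's router reports for its own port.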

  • When it fails over, the auxiliary will send a gratuitous ARP when taking over. Perhaps this is what makes it work again, but it's better to find out why it's not working when the problem occurs.

    I would recommend connecting to the device while it's in the broken state and running captures with tcpdump on the external interface to see whether it can ARP for the default gateway, reach it, and reach past it.
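
The three checks above (ARP for the gateway, reach it, reach past it) could be prepared ahead of time so they are ready to paste when the fault appears. This is a rough sketch, not a tested procedure: the interface name `eth1` and the gateway address `192.0.2.1` are placeholders, and the helper `capture_commands` is hypothetical. It uses only standard tcpdump options (`-n`, `-e`, `-i`, `-c`) and limits each capture to a fixed packet count so it ends on its own.

```python
import shlex


def capture_commands(interface: str, gateway_ip: str, packet_count: int = 50) -> list:
    """Build one bounded tcpdump command line per check."""
    # -n: no name lookups, -e: show MAC addresses, -c: stop after N packets
    base = ["tcpdump", "-n", "-e", "-i", interface, "-c", str(packet_count)]
    filters = [
        "arp",                              # is ARP for the gateway answered?
        f"icmp and host {gateway_ip}",      # does the gateway answer pings?
        f"icmp and not host {gateway_ip}",  # do replies return from beyond it?
    ]
    return [shlex.join(base + [capture_filter]) for capture_filter in filters]


if __name__ == "__main__":
    # Placeholder values; substitute the real WAN interface and gateway.
    for command in capture_commands("eth1", "192.0.2.1"):
        print(command)
```

The ICMP captures only show something if pings are sent at the same time, for example to the gateway and to a host beyond it, from a second shell session.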

  • Unfortunately, we do not have a second to investigate why it's down when it's down.

    That's why we have a cluster ;-)

    We have to be online 24/7.

    Maybe at midnight we'll have a chance, but I cannot reproduce it.

  • Like Dirk, I also suspect the speed/duplex settings. I think the discussion with MasterRoshi confirms this. I've had several clients with a similar problem. See the solution in #7.7 in Rulz (last updated 2019-04-17).

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA