No failover on HA cluster if NIC is in error state

Hello,

We are using two SG 310s as an active/passive cluster.

Unfortunately, we have found that HA is not working.

Approximately once a month we lose our internet connection.

The external WAN interface shows "ERROR" under "Link" and "UP" under "State".

Unfortunately, the cluster doesn't switch to the other node, and the whole office is cut off from the internet.

If we turn the interface off and on again, everything works fine.

Rebooting the affected node works too.

 

Has anyone experienced similar behaviour?

We have to sort this problem out, and support has not been very helpful in this case.

Tibor

  • Which cluster state do you see?

    If a node is in an "unlinked" state, an interface going down doesn't result in a cluster failover.

  • In reply to dirkkotte:

    The problem is that there is no "interface down". Instead, there is an "error" state on the WAN interface.

    The cluster is working fine. If I reboot the affected node, the second node takes over immediately.

  • In reply to Odi:

    We often see the "error" state when the speed and duplex settings are not the same on the router and the SG.

    For example, you cannot have "auto" on one side only.
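
    To make the "auto on one side only" case concrete, here is a rough Python sketch that parses the text printed by `ethtool <iface>` on a Linux box and flags a one-sided auto-negotiation setup. That `ethtool` is usable from the UTM shell, the interface name `eth1`, the sample output, and the hand-entered router settings are all assumptions for illustration.

```python
import re


def parse_ethtool(output: str) -> dict:
    """Extract speed, duplex and auto-negotiation from `ethtool <iface>` text."""
    fields = {}
    for key in ("Speed", "Duplex", "Auto-negotiation"):
        # Anchor at line start so "Advertised auto-negotiation" is not matched.
        match = re.search(rf"^\s*{re.escape(key)}:\s*(\S+)", output, re.MULTILINE)
        if match:
            fields[key] = match.group(1)
    return fields


def autoneg_mismatch(local: dict, remote: dict) -> bool:
    """True when exactly one side auto-negotiates (the risky combination)."""
    local_auto = local.get("Auto-negotiation") == "on"
    remote_auto = remote.get("Auto-negotiation") == "on"
    return local_auto != remote_auto


# Illustrative sample only; not taken from the affected SG 310.
SAMPLE = """\
Settings for eth1:
        Speed: 100Mb/s
        Duplex: Half
        Auto-negotiation: on
        Link detected: yes
"""

if __name__ == "__main__":
    sg_side = parse_ethtool(SAMPLE)
    router_side = {"Speed": "100Mb/s", "Duplex": "Full", "Auto-negotiation": "off"}
    print(sg_side, "mismatch:", autoneg_mismatch(sg_side, router_side))
```

    When one side is forced and the other is on auto, the auto side typically falls back to half duplex, which is the mismatch this check is meant to catch.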

  • In reply to dirkkotte:

    The true reason for an 'error' state is that it failed the 'uplink monitoring' conditions. 

    Basically, the UTM cannot reach the hosts it uses, on whatever protocols, to check whether it has internet access. The 'error' state itself is not the problem; what caused it is.

    Do you reboot both nodes when this happens, or just one?
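
    As a rough illustration of the uplink monitoring idea described above, the sketch below probes a list of hosts and reports "ERROR" only when none of them answers. This is not the UTM's actual implementation; the "any host reachable" rule, the TCP probe, and all names here are assumptions.

```python
import socket
from typing import Callable, Iterable, Tuple

Target = Tuple[str, int]  # (host, TCP port)


def tcp_probe(target: Target, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the target can be opened."""
    host, port = target
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def link_state(targets: Iterable[Target],
               probe: Callable[[Target], bool] = tcp_probe) -> str:
    """Assumed rule: the link is OK if at least one monitoring host answers."""
    return "OK" if any(probe(t) for t in targets) else "ERROR"


if __name__ == "__main__":
    # Hypothetical monitoring hosts (documentation address range).
    print(link_state([("192.0.2.1", 80), ("192.0.2.2", 443)]))
```

    The point of the sketch is that the physical link can stay up while the check still fails, which matches a "Link: ERROR" with "State: UP".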

  • In reply to MasterRoshi:

    I rebooted just one.

    Then the second one takes over and everything is fine.

     

  • In reply to Odi:

    If you do a failover test manually, does it cause the problem? It might be the switch configuration if it has port security/spoof protection. 

  • In reply to MasterRoshi:

    Unfortunately, the error is not reproducible.

    The manual failover solves the problem but does not cause it.

    In the meantime, I read a thread that mentioned a problem at the ARP level between the UTM and a router.

     

    I've taken steps to have the provider router set to a fixed 100 Mbit/s.

    Then I will set the UTM to 100 Mbit/s.

     

    Let's see...

  • In reply to Odi:

    When it fails over, the auxiliary will do a gratuitous ARP when taking over. Perhaps this is what causes it to work, but it's better to find out why it's not working when it happens.

    I would recommend connecting to the device while it's in the broken state and running captures with tcpdump on the external interface to see whether it can ARP for the default gateway, reach it, and reach past it.
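
    A minimal sketch of such a capture, written as a small Python helper that only builds the tcpdump command line. The interface name `eth1`, the output path, and the packet limit are placeholders, not values from this cluster; the flags used (`-n`, `-e`, `-i`, `-c`, `-w`) are standard tcpdump options.

```python
import shlex
from typing import List


def capture_command(interface: str, pcap_path: str,
                    packet_limit: int = 2000) -> List[str]:
    """Build a tcpdump call that records ARP and ICMP on one interface."""
    return [
        "tcpdump",
        "-n",                      # do not resolve names (DNS may be down)
        "-e",                      # print link-level headers (MAC addresses)
        "-i", interface,           # external interface to listen on
        "-c", str(packet_limit),   # stop after this many packets
        "-w", pcap_path,           # write raw packets for later analysis
        "arp or icmp",             # capture filter: ARP plus ping traffic
    ]


if __name__ == "__main__":
    # Print the command instead of running it; paste it into the UTM shell.
    print(shlex.join(capture_command("eth1", "/tmp/wan.pcap", 500)))
```

    With the capture running, pinging the default gateway and then an outside host would show whether ARP requests get answered and how far the ICMP traffic gets. Since the capture stops on its own at the packet limit, it can be started just before switching nodes so that it costs little downtime.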

  • In reply to MasterRoshi:

    Unfortunately, we do not have a second to investigate why it's down when it's down.

    That's why we have a cluster ;-)

    We have to be online 24/7.

    Maybe at midnight we'll have a chance, but I cannot reproduce it.

  • In reply to Odi:

    Like Dirk, I also suspect the speed/duplex settings. I think the discussion with MasterRoshi confirms this. I've had several clients with a similar problem. See the solution in #7.7 in Rulz (last updated 2019-04-17).

    Cheers - Bob