
ESXi HA Link monitoring

One of my clients had an issue this morning.

They have a UTM HA setup (active/passive) in an ESXi environment (standalone ESXi), with each node running on a different ESXi server. Every ESXi server has two vswitches: one connected to a pair of 1Gbps NICs (carrying, among others, the OOB management and Internet VLANs), and one connected to a pair of 10Gbps NICs (backend VLANs and storage connections).

The heartbeat connection is a VLAN on the 10G vswitch, the backup connection a VLAN on the 1G vswitch.

They experienced an issue with their network stack that caused the 1G vswitch on the server running the active node to crash and stop forwarding traffic to the vNICs. Traffic between VMs still worked fine.

No failover happened, I assume because the link state of the vswitch ports didn't change and there was still heartbeat connectivity. This effectively meant losing all internet connectivity and the ability to manage the virtual servers (or log in to the UTM).

Are there any ideas on how to solve this issue?

I've thought about passing through a physical interface to handle link-state detection, but since the switch issue in this case was forwarding-related, that link would also have stayed up. I assume the UTM doesn't have something like beacon probing or gateway pings to check interface availability?



This thread was automatically locked due to age.
  • Hi  

    If you don't mind, would you please share the high-availability logs from a couple of minutes before the incident happened? That would give us an idea of the status of the heartbeat between the devices. It will also show the monitored ports and their status, which should clear the picture up a bit.

    Regards

    Jaydeep

  • Jaydeep,

    Thanks for your response. The problem is that there aren't any.

    The setup is the following:

    • ESX 6.5 Update 3 patch level 11/2019
    • two teamed 1G interfaces connected to vswitch0
    • two teamed 10G interfaces connected to vswitch1
    • UTM has a heartbeat over dedicated VLAN 2, via vswitch1
    • UTM uses the OOB network as backup over VLAN 4, via vswitch0
    • UTM has 8 other interfaces: Internet on vswitch0, storage and DMZs on vswitch1

    From what I can deduce so far:

    The ign (Intel gigabit) drivers that run the 1Gbps interfaces crashed, causing vswitch0 to stop forwarding packets out of the box. vswitch0 itself remained operational (so inter-VM traffic inside the box wasn't affected), and vswitch1 still worked fine (the 10G drivers are different), so there were no heartbeat problems and therefore no reason to fail over.

    With a physical UTM, the internet port would have gone down and HA link monitoring would have triggered a failover. However, a physical UTM would have had the same problem if the failure had been further up in the infrastructure, for example the failure of an intermediate switch between the UTM and its gateway.

    Other systems mitigate this scenario with beacon or gateway probing: they check not the physical link state but the logical link state, using a "can I still reach this IP address?" test. But AFAIK the UTM doesn't have that feature.

    I'm now contemplating a cron script on the UTM to do the probing, and force a shutdown if the probing fails.
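    A minimal sketch of what such a cron script could look like, as plain POSIX shell. Everything here is a placeholder assumption, not a UTM-specific path or command: the gateway address (192.0.2.1 is a documentation address), the probe count, and the action on failure.

```shell
#!/bin/sh
# Sketch of a gateway-probe watchdog, intended to be run from cron.
# All values below are illustrative placeholders.

# probe_gateway <ip> <tries>: return 0 as soon as one ICMP probe
# succeeds, 1 if all of them fail.
probe_gateway() {
    ip="$1"
    tries="$2"
    n=1
    while [ "$n" -le "$tries" ]; do
        if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}

# In the real cron job the failure branch would run "shutdown -h now"
# (or a reboot) so the passive HA node takes over; here it only echoes
# the command for safety. 192.0.2.1 stands in for the real gateway IP.
if probe_gateway "192.0.2.1" 3; then
    echo "gateway reachable, nothing to do"
else
    echo "gateway unreachable, would run: shutdown -h now"
fi
```

    In practice you'd probably want some hysteresis (several consecutive failed cron runs before halting) and more than one probe target, so a flapping gateway or an upstream outage that affects both nodes equally doesn't needlessly take the active node down.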

    To mitigate the chance of a second incident, I have now moved both UTM nodes to servers that don't use Intel NICs.

