HA Lost Heartbeat 10GbE

Consistently getting these messages in the HA log...

2021:04:14-09:07:17 utm-2 ha_daemon[4811]: id="38A1" severity="warn" sys="System" sub="ha" seq="S: 178 17.839" name="Lost heartbeat message from node 1! Expected 12681 but got 12682"
2021:04:14-09:07:20 utm-1 ha_daemon[4818]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 185 20.483" name="Lost heartbeat message from node 2! Expected 14295 but got 14297"
2021:04:14-09:08:37 utm-1 ha_daemon[4818]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 186 37.561" name="Lost heartbeat message from node 2! Expected 14373 but got 14374"
2021:04:14-09:08:40 utm-1 ha_daemon[4818]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 187 40.565" name="Lost heartbeat message from node 2! Expected 14376 but got 14377"
2021:04:14-09:09:00 utm-1 ha_daemon[4818]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 188 00.585" name="Lost heartbeat message from node 2! Expected 14396 but got 14397"
2021:04:14-09:09:26 utm-1 ha_daemon[4818]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 189 26.610" name="Lost heartbeat message from node 2! Expected 14422 but got 14423"
2021:04:14-09:09:45 utm-1 ha_daemon[4818]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 190 45.628" name="Lost heartbeat message from node 2! Expected 14441 but got 14442"
2021:04:14-09:10:03 utm-1 repctl[31108]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
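
To get a feel for how often heartbeats are being dropped, the warnings above can be tallied per node. The following is only a rough sketch: the function name `summarize_lost_heartbeats` and the `SAMPLE` list are made up for illustration, and it assumes that "got" minus "expected" is the number of heartbeat sequence numbers that were skipped, which is my reading of the message rather than anything documented.

```python
import re
from collections import Counter

# Matches the ha_daemon warning format seen in the log excerpt above.
HEARTBEAT_RE = re.compile(
    r'^(?P<ts>\S+) (?P<host>\S+) ha_daemon\[\d+\]: .*'
    r'name="Lost heartbeat message from node (?P<node>\d+)! '
    r'Expected (?P<expected>\d+) but got (?P<got>\d+)"'
)


def summarize_lost_heartbeats(lines):
    """Return {(reporting_host, peer_node): skipped_sequence_numbers}.

    Assumption: got - expected = number of heartbeats that never arrived.
    """
    lost = Counter()
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m is None:
            continue  # not a lost-heartbeat warning (e.g. repctl lines)
        skipped = int(m["got"]) - int(m["expected"])
        lost[(m["host"], int(m["node"]))] += skipped
    return dict(lost)


# Three lines copied from the log above, used as sample input.
SAMPLE = [
    '2021:04:14-09:07:17 utm-2 ha_daemon[4811]: id="38A1" severity="warn" sys="System" sub="ha" seq="S: 178 17.839" name="Lost heartbeat message from node 1! Expected 12681 but got 12682"',
    '2021:04:14-09:07:20 utm-1 ha_daemon[4818]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 185 20.483" name="Lost heartbeat message from node 2! Expected 14295 but got 14297"',
    '2021:04:14-09:08:37 utm-1 ha_daemon[4818]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 186 37.561" name="Lost heartbeat message from node 2! Expected 14373 but got 14374"',
]

if __name__ == "__main__":
    for (host, node), count in summarize_lost_heartbeats(SAMPLE).items():
        print(f"{host} missed {count} heartbeat(s) from node {node}")
```

Feeding it the full HA log would show whether the losses are symmetric between the two nodes or concentrated on one side.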

Both servers have identical hardware:
Dell PowerEdge R440
2 Intel Xeon Silver 4216 CPU
48 GB RAM
QLogic FastLinQ QL41134H 10GbE Adapter (Jumbo Packets Enabled)
Uplinks connect at 10GbE

Running Windows Server 2016 Hyper-V
Processors 32
Memory 32 GB
Each interface has its own vSwitch on the QLogic FastLinQ card
Jumbo packets enabled on the vSwitch
    [Get-NetAdapterAdvancedProperty]
    -SLOT 2 Port 4, Jumbo Packet, 9014, JumboPacket {9014}
    
UTM9 running version 9.705-3
eth3 (HA) configured for jumbo packets on command line (I know it is not permanent)
ifconfig eth3 mtu 9000
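
As a sanity check on the two jumbo settings, here is a small sketch of the arithmetic. It assumes the Windows "Jumbo Packet" value of 9014 includes the 14-byte Ethernet header (common for NIC drivers, but I have not confirmed it for this QLogic driver), and that the frames are untagged IPv4 without options. The function names are made up for illustration.

```python
ETHERNET_HEADER = 14  # dst MAC + src MAC + EtherType, no VLAN tag (assumed)
IPV4_HEADER = 20      # IPv4 header without options
ICMP_HEADER = 8       # ICMP echo header


def mtu_from_windows_jumbo(jumbo_packet_bytes):
    """IP MTU implied by a Windows 'Jumbo Packet' value.

    Assumption: the driver's value counts the Ethernet header.
    """
    return jumbo_packet_bytes - ETHERNET_HEADER


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


if __name__ == "__main__":
    mtu = mtu_from_windows_jumbo(9014)
    print("Implied MTU:", mtu)
    print("Max unfragmented ping payload:", max_ping_payload(mtu))
```

Under those assumptions the host-side 9014 and the UTM-side `mtu 9000` agree, and a don't-fragment ping with that payload size across the HA link (for example `ping -M do -s 8972` with Linux iputils) would be one way to confirm that jumbo frames pass end to end.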

The reason we started looking deeper into the UTM was that it was having issues with Google Meet when internal traffic started going over 1Gbit

We have approx. 375,000 concurrent connections
Internal Bandwidth Uplink 10GbE
2 External Pipes connected at 10GbE but limited to 2Gbit each

We have tried disabling intrusion protection and adding exceptions
We have tried disabling advanced threat protection and adding exceptions

Everything I have read says this should be working great, and it does when traffic is < 800MB

Any help in troubleshooting or resolving this issue is greatly appreciated.

  • Hi Peter and welcome to the UTM Community!

    I don't recall seeing this problem here before.  What does Sophos Support say about this?

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • UPDATE

    This was/is not an issue with the UTM9 application

    The issue appears to have been with the Windows QLogic Card settings (vSwitches)

    We disabled all hardware offloading settings on the QLogic card and the HA Heartbeat errors have gone away.

    Thank You

    -Peter Mastrangelo