Random WAN Drops

I recently migrated my UTM from an older VMware ESXi platform (5.5) to new hardware running the newest version (6.7).

Since doing so, I've been having what seem to be random WAN drops, and I can't figure out why. My physical connectivity was:

Verizon FiOS ONT -> Dell Server EN0 ethernet port

Within ESXi, I have EN0 configured in a port group on a vSwitch with only the WAN interface of the UTM included. This is the same way it was set up on my ESXi 5.5 box, where it worked perfectly for years. Now my WAN is dropping out, seemingly at random. I've searched through the UTM logs and can't find anything solid indicating a problem; however, I do receive emails saying that prefetch failed, that the spam filter can't reach the database servers, or that the WAN is offline/online.

I started at Layer 1 and changed Ethernet cables. Then I routed my WAN connection through a separate VLAN on my Cisco switch. The new (and current) topology is now:

Verizon FiOS ONT -> Cisco Switch -> Dell Server EN0 ethernet port

The WAN has dropped multiple times since then, with notifications coming through by email. The Cisco switch logs show no Layer 1/2 disconnects, so I think the drop is happening at a higher layer in the stack. When the problem has occurred, I have never seen the WAN interface show as disconnected in the UTM dashboard. However, when I renew the DHCP lease on it, it drops for a few seconds, then pulls the same IP again and everything works. I have also noticed a few times that the uplink monitoring tab says WAN OFFLINE. If I turn uplink monitoring off globally and then back on, it resets to WAN ONLINE and everything works again.

Any pointers as to what I can look for to pinpoint the problem? I could back up the UTM and re-install, but I'd really like to figure out what's causing this instead of just blowing everything away and not knowing whether the problem is actually fixed.

I'm running UTM build 9.605-1.

Hardware is a Dell R720 server.

Midway through trying to post this, I lost the WAN again.  This is getting old!
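
To get exact timestamps for these drops (so they can be lined up against the UTM and switch logs), something like the sketch below could be left running on a LAN client behind the UTM. This is only a rough sketch: the function names are made up for illustration, the `ping` flags assume a Linux client, and 8.8.8.8 is just the public DNS server used as a test target.

```python
import subprocess
import time
from datetime import datetime


def probe(host="8.8.8.8", timeout_s=2):
    """Return True if one ICMP echo to host gets a reply.

    Assumes Linux ping flags: -c 1 (one packet), -W (reply timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_outages(samples):
    """Collapse (timestamp, ok) samples into (first_failure, first_recovery) pairs.

    An outage that is still open when the samples end is reported with an
    end value of None.
    """
    outages = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # first failed probe of a new outage
        elif ok and start is not None:
            outages.append((start, ts))     # first successful probe ends it
            start = None
    if start is not None:
        outages.append((start, None))
    return outages


def monitor(host="8.8.8.8", interval_s=5, count=720):
    """Probe host `count` times, `interval_s` apart, and return the outages seen."""
    samples = []
    for _ in range(count):
        samples.append((datetime.now(), probe(host)))
        time.sleep(interval_s)
    return find_outages(samples)
```

Calling `monitor()` with the defaults probes for roughly an hour and returns a list of (start, end) timestamps, which gives specific times to search for in the logs.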

  • Does changing the NICs to VMXNET3 resolve this issue?

    Cheers - Bob

  • In reply to BAlfson:

    Thank you, sir, for the reply ;)

    I'm already set up with VMXNET3 NICs, and was before the transition over to 6.7.

    I noticed for the first time today that when the WAN is down, the dashboard shows my WAN state as "UP" but the link as "ERROR".  On the interfaces screen, there was no indication of anything wrong.  Turning Uplink Monitoring off and then right back on cleared the error on the dashboard, but the WAN did not actually come back up until I renewed the DHCP lease.

    Where can I look for error logs to at least start digging into the root cause?  I can't find anything of substance to even start tracing...

    Thanks

  • In reply to Tango2:

    I would try a Google with something like site:community.sophos.com/products/unified-threat-management/f state up link error

    Cheers - Bob

  • In reply to BAlfson:

    Thanks Bob.  I did some more searching and tried some more things:

    Set speed/duplex to 1000/Full on the Cisco switch ports between the FiOS router and the WAN NIC in the UTM.  VMXNET3 NICs don't appear to support speed/duplex settings on the UTM, so I had to leave them as is.  Still had the same random dropouts.

    Re-installed Sophos from scratch on the VM.  Tried NIC passthrough to give the VM full control of the hardware.  I couldn't get this to work, maybe because I was trying to pass through only one port of a quad-port card, or maybe I did something else wrong.

    Ended up adding one E1000e NIC to the WAN side of the UTM.  Things seem more stable, however I'm still having problems - but they are slightly different.

    Previously, when I had dropouts, the WAN would go offline in the UTM.  I would get a link error and I could not ping from the gateway to a public DNS server (8.8.8.8).

    Today, I experienced a dropout.  No clients could pass traffic externally, and the dashboard showed 0/0 traffic on the WAN port, yet both state and link were up.  Weird.

    Then I hopped over to the support tab and attempted a ping.  Worked fine, received replies.  As soon as I renewed the IP on the WAN interface, all the clients were able to pass traffic through the UTM and to the internet.

    I'm really at a loss here, unless there are some weird hardware compatibility issues with the quad-port NIC in the new server I've moved to.  The NIC is an I350-t rNDC as reported in the ESXi console.

    Any other ideas?
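
Since renewing the lease is what brings traffic back each time, one thing that might be worth ruling out is whether the drops line up with the DHCP lease timers. The sketch below is only a hypothesis check, not a diagnosis: it assumes the default RFC 2131 timers (renewal T1 at 50% of the lease, rebinding T2 at 87.5%), which a DHCP server can override, and the function names are hypothetical. The lease start and length would have to come from the actual lease on the WAN interface.

```python
from datetime import datetime, timedelta


def lease_timers(lease_start, lease_seconds):
    """Return the default DHCP renewal (T1), rebinding (T2) and expiry times.

    Assumes RFC 2131 defaults: T1 = 0.5 * lease, T2 = 0.875 * lease.
    """
    return {
        "T1": lease_start + timedelta(seconds=lease_seconds * 0.5),
        "T2": lease_start + timedelta(seconds=lease_seconds * 0.875),
        "expiry": lease_start + timedelta(seconds=lease_seconds),
    }


def nearest_timer(outage_start, timers):
    """Return (timer name, seconds from that timer to outage_start).

    A small offset suggests the outage began around a lease event; a large
    one suggests the two are unrelated.
    """
    name = min(
        timers,
        key=lambda k: abs((outage_start - timers[k]).total_seconds()),
    )
    return name, (outage_start - timers[name]).total_seconds()
```

If the outage start times consistently land within a few seconds of T1, T2 or expiry, that would point toward the renewal exchange rather than the NIC hardware; if they are scattered, the lease timing is probably a red herring.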