Random WAN Drops

I recently migrated my UTM from an older VMware ESXi platform (5.5) to new hardware running the newest version (6.7).

Since doing so, I've been having what seem to be random WAN drops, and I can't figure out why.  My original physical connectivity was:

Verizon FiOS ONT -> Dell Server EN0 ethernet port

Within ESXi, I have EN0 configured in a port group on a vSwitch with only the WAN interface of the UTM included.  This is the same way it was set up on my ESXi 5.5 box, where it worked perfectly for years.  Now my WAN is dropping out, seemingly at random.  I've searched through the UTM logs and can't find anything solid indicating a problem; however, I do receive emails saying that prefetch failed, that the spam filter can't reach the database servers, or that the WAN is offline/online.

I started at Layer 1 and changed Ethernet cables.  Then I routed my WAN connection through a separate VLAN on my Cisco switch.  The new (and current) topology is:

Verizon FiOS ONT -> Cisco Switch -> Dell Server EN0 ethernet port

The WAN has dropped multiple times since then, with notifications coming through email.  The Cisco switch logs show no Layer 1/2 disconnects, so I think the drop is happening at a higher layer in the stack.  When the problem has occurred, I have never seen the WAN interface show as disconnected in the UTM dashboard; however, after I renew the lease on it, it drops for a few seconds, then pulls the same IP again and everything works.  I have also noticed a few times that the uplink monitoring tab says WAN OFFLINE.  If I turn uplink monitoring off globally and then back on, it resets to WAN ONLINE and everything works again.

Any pointers as to what I can look for to pinpoint the problem?  I could backup the UTM and re-install, but I'd really like to figure out what's causing this instead of just blowing everything away and not really knowing if the problem is fixed or not.

I'm running 9.605-1 UTM build

Hardware is a Dell R720 server.

Midway through trying to post this, I lost the WAN again.  This is getting old!



This thread was automatically locked due to age.
  • Does changing the NICs to VMXNET3 resolve this issue?

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • Thank you sir, for the reply [;)]

    I'm already set up with VMXNET3 NICs, and was before the transition to 6.7.

    I noticed for the first time today that when the WAN is down, the dashboard shows my WAN state as "UP" but the link as "ERROR".  On the interfaces screen, there was no indication of anything wrong.  Turning Uplink Monitoring off and then right back on cleared the error on the dashboard, but the WAN did not actually come back up until I renewed the DHCP lease.

    Where can I look for error logs to at least start digging into the root cause?  I can't find anything of substance to even start tracing...

    Thanks

  • I would try a Google with something like site:community.sophos.com/products/unified-threat-management/f state up link error

    Cheers - Bob

     
  • Thanks Bob.  I did some more searching and tried some more things:

    Set speed/duplex to 1000/Full on the Cisco switch ports between the FiOS router and the WAN NIC in the UTM.  VMXNET3 NICs don't appear to support speed/duplex settings on the UTM, so I had to leave them as is.  Still had the same random dropouts.

    Re-installed Sophos from scratch on VM.  Tried NIC passthrough to give the VM full control of the hardware.  I couldn't get this to work, maybe because I was trying to pass through only one NIC of a quad port card, or maybe I did something else wrong.

    Ended up adding one E1000e NIC to the WAN side of the UTM.  Things seem more stable, however I'm still having problems - but they are slightly different.

    Previously, when I had dropouts, the WAN would go offline in the UTM.  I would get a link error and I could not ping from the gateway to a public DNS server (8.8.8.8).

    Today, I experienced a dropout.  No clients could pass traffic externally, and the dashboard showed 0/0 traffic on the WAN port even though both state and link were up.  Weird.

    Then I hopped over to the support tab and attempted a ping.  Worked fine, received replies.  As soon as I renewed the IP on the WAN interface, all the clients were able to pass traffic through the UTM and to the internet.

    I'm really at a loss here, unless there are some weird hardware compatibility issues with the quad port NIC in the new server I've moved to.  The NIC is an I350-t rNDC as reported in the ESXi console.

    Any other ideas?

  • I'm still struggling with this, and would love some other ideas.  I'm concerned that there may be something going on with my ISP, but given the circumstances I feel it's unlikely.  I also know that if I call them, I'm going to have to reconfigure my network to use "their" equipment, which I really don't want to do.

    Other things I've noticed:

    With Intel E1000e NIC vs VMXNET3...

    Link/State are now both staying Up/Up, even when the WAN is down

    The UTM can ping using support function even when clients can't access the internet.

    The only way to have clients begin browsing again is to renew the DHCP lease

    When renewing the DHCP lease, I get a new IP address every time.  This is the main reason that I'm concerned it may be an ISP problem, as previously I would almost never get a new WAN IP.

    I think my next step is to capture traffic with Wireshark through a SPAN port on the switch that my WAN now runs through, to see exactly what is going on.  It's strange to me that the WAN shows as up and allows pings from the UTM out to the internet, yet the clients can't get out.  Any other thoughts/ideas would be greatly appreciated.
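
For anyone following along: it can help to log the exact start and end of each dropout from a LAN client, so the times can be lined up with the packet capture. Below is a minimal Python sketch of that idea, not something taken from the UTM itself. The function names (`tcp_probe`, `transitions`, `monitor`) are made up for illustration, and it assumes a TCP connection to 8.8.8.8 on port 53 is a reasonable stand-in for "clients can reach the internet".

```python
import socket
import time
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def tcp_probe(host: str = "8.8.8.8", port: int = 53, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def transitions(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, str]]:
    """Collapse (timestamp, reachable) samples into UP/DOWN state changes."""
    changes: List[Tuple[float, str]] = []
    previous: Optional[bool] = None
    for timestamp, reachable in samples:
        if reachable != previous:
            changes.append((timestamp, "UP" if reachable else "DOWN"))
            previous = reachable
    return changes


def monitor(probe: Callable[[], bool] = tcp_probe,
            interval: float = 5.0,
            count: int = 12) -> Iterator[Tuple[float, bool]]:
    """Run probe `count` times, `interval` seconds apart, yielding each result."""
    for _ in range(count):
        yield time.time(), probe()
        time.sleep(interval)
```

Run from a client behind the UTM, something like `for ts, state in transitions(monitor(count=720)): print(time.ctime(ts), state)` would print one line per state change over an hour. Only the changes are recorded, which keeps the output short enough to compare against capture timestamps by eye.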

    Travis

  • Knock on wood... I think I figured it out.  Fingers crossed that it's fixed now.  Hopefully someone else can benefit from what I found.

    The Dell R720 (and many other "servers") offers IPMI, or in this case iDRAC, which is an out-of-band management platform that provides a remote console over IP, among other things.  My configuration wasn't right, and the iDRAC was fighting for an IP on the same physical interface as the WAN interface of my UTM.  I found this via the packet capture that I performed on the physical interface: I noticed traffic originating from two different MAC addresses (the virtual MAC of the UTM as well as the virtual address of the iDRAC controller).  I immediately checked the iDRAC settings and saw that it was pulling IP addresses from my ISP as well, which conflicted with the IPs I was pulling on the WAN of the UTM.
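
For anyone chasing a similar problem, the check described above (more than one MAC address requesting a lease on the WAN segment) can be sketched in a few lines of Python. This is a rough illustration only: it assumes you already have raw, untagged Ethernet frames as `bytes` (for example, exported from a capture), and the function names `format_mac` and `dhcp_client_macs` are hypothetical, not part of any tool mentioned in this thread.

```python
import struct
from collections import Counter
from typing import Iterable

ETH_HEADER_LEN = 14
ETHERTYPE_IPV4 = 0x0800
IP_PROTO_UDP = 17


def format_mac(raw: bytes) -> str:
    """Render 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def dhcp_client_macs(frames: Iterable[bytes]) -> Counter:
    """Count DHCP client packets (UDP port 68 -> 67) per source MAC address."""
    seen: Counter = Counter()
    for frame in frames:
        # Need at least Ethernet + minimal IPv4 + UDP headers.
        if len(frame) < ETH_HEADER_LEN + 20 + 8:
            continue
        ethertype = struct.unpack("!H", frame[12:14])[0]
        if ethertype != ETHERTYPE_IPV4:  # assumption: no VLAN tag on the frame
            continue
        ip_header_len = (frame[ETH_HEADER_LEN] & 0x0F) * 4
        if frame[ETH_HEADER_LEN + 9] != IP_PROTO_UDP:
            continue
        udp_start = ETH_HEADER_LEN + ip_header_len
        if len(frame) < udp_start + 4:
            continue
        src_port, dst_port = struct.unpack("!HH", frame[udp_start:udp_start + 4])
        if (src_port, dst_port) == (68, 67):
            seen[format_mac(frame[6:12])] += 1
    return seen
```

If the result has more than one key on a link where only the UTM's WAN interface should be asking for an address, something else (here, the iDRAC) is requesting leases on the same wire.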

    I'm 99% sure this case is closed, but I'll monitor for the next day or two and see what happens.