This discussion has been locked.

Random packet drops, poor performance after HDD swap

Hi folks

I've been using Sophos UTM since it was Astaro. I recently "upgraded" my home installation (home license) by replacing an old spinning-disk HDD with an SSHD hybrid drive, as I was doing upgrades across some of my other laptops etc. The migration was performed using a standard install (x64 version) and a restore of the config from backup.

Since then I have been experiencing random packet loss, dropped TCP sessions, and performance problems with Application Control. A continuous ping to the next-hop router will drop packets, even when the interface traffic charts show there wasn't much happening (and certainly not full link saturation!). The one site-to-site VPN I have drops out, and the unit usually notifies me that the internet uplink monitoring has noted a dropped link.

I notice in the install ISO that there's a file that mentions white-listed SSDs, with the implication being that other SSDs would not be good. I'm not sure if this is a hang-over from the distant past though.

So, two questions:

  1. Does anyone have experience with SSHDs or SSDs that would indicate I've made a mistake utilising one? and
  2. Any clues how I would go about troubleshooting whether this is a hardware issue, a broken installation, or something else? What logs should I be referring to?

Any help appreciated.

Cheers, Jeremy.



  • Found this in the kernel log. Is the NIC driver crashing?

    2017:10:19-19:27:50 gateway kernel: [376706.592782] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
    2017:10:19-19:28:03 gateway kernel: [376719.644888] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
    2017:10:19-19:38:58 gateway kernel: [377374.731526] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
    2017:10:19-19:39:10 gateway kernel: [377387.512578] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
    2017:10:19-20:08:41 gateway kernel: [379158.155079] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
    2017:10:19-20:08:54 gateway kernel: [379171.243275] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
    2017:10:19-20:15:44 gateway kernel: [379581.229387] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
    2017:10:19-20:15:56 gateway kernel: [379594.169552] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
    2017:10:19-20:43:52 gateway kernel: [381269.653976] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
    2017:10:19-20:44:05 gateway kernel: [381282.630811] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
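
A quick way to see how often the adapter is resetting is to count those messages per day. The following is only a minimal sketch: the function name `count_resets` is made up for illustration, it assumes the UTM timestamp format shown above (`YYYY:MM:DD-HH:MM:SS` as the first field), and it reads the log from standard input so the log path is left to the reader.

```shell
#!/bin/sh
# Count "Reset adapter unexpectedly" messages per day in a kernel log on stdin.
# count_resets is a hypothetical helper name, not something shipped with UTM.
count_resets() {
    awk '/Reset adapter unexpectedly/ {
             split($1, ts, "-")   # ts[1] is the date part, e.g. 2017:10:19
             n[ts[1]]++
         }
         END { for (d in n) print d, n[d] }'
}
```

Fed the ten log lines above, this prints `2017:10:19 5`. On a live system it would be used as `count_resets < /path/to/kernel.log`, where the path is whatever file holds the kernel log on your installation.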
  • Hi there Jeremy,

    Yes, I'd say so; at the very least there is an issue with your NIC hardware or drivers. Is this a virtualized UTM? If not, I'd try fixing the speed and duplex end to end: set them at the NIC and, if possible, at the switch as well, thus avoiding auto-negotiation.

    If it's a vUTM, I'd swap from the e1000e to the e1000 driver, which I think is among the Sophos recommended drivers.
    https://community.sophos.com/kb/en-us/119230

    Let us know,

    Cheers,

    m-
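
For anyone wanting to try the fixed speed/duplex suggestion from a shell, here is a minimal sketch. It assumes `ethtool` is installed and that the interface is `eth0`; the helper name `force_link_cmd` is hypothetical, and it only prints the command so it can be reviewed before anything is changed. A setting applied with `ethtool` alone will most likely not survive a reboot, so treat it as a test rather than a permanent fix.

```shell
#!/bin/sh
# Build (but do not run) the ethtool command that pins speed and duplex
# and turns auto-negotiation off on one interface.
# force_link_cmd is a hypothetical helper name used only for this example.
force_link_cmd() {
    iface=$1 speed=$2 duplex=$3
    printf 'ethtool -s %s speed %s duplex %s autoneg off\n' "$iface" "$speed" "$duplex"
}
```

`force_link_cmd eth0 100 full` prints `ethtool -s eth0 speed 100 duplex full autoneg off`. The switch port has to be set to the same speed and duplex, otherwise the two ends can end up mismatched.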

  • Not virtual, it's a USFF PC (Lenovo M92P) with an Intel chipset. There seem to be lots of Intel NIC issues; googling that e1000e reset message brings up LOTS of results.

    I've found some suggestions from the beta forums about disabling the TSO, GRO and GSO network options (https://community.sophos.com/products/unified-threat-management/astaroorg/f/utm-9-3-beta/65895/9-302-2-bug-adapter-e1000e-hangs-reset), but I'll try your advice to hard-set the link speed first.

    Cheers, appreciated.

  • Extra info for those that come after: both the switch and the NIC are 1000Mbit/Full "capable" but kept falling back to 100Mbit/Full.

    In my situation, despite both being "capable" of 1000Mbit, I cannot force both to 1000Mbit and get a link. I have had to set both to 100Mbit, which is what they were happier negotiating to in the first place.

    This is likely the source of the issue at this point (not the SSHD after all). Irrespective of whether this turns out to be the final fix, if you find yourself in such a situation you should fix both ends to a common speed and duplex, as otherwise they will keep going through a negotiate/fallback loop.

    Edit: stress tested as much as I can and no faults yet. I also see in the kernel log that adapter TSO is now disabled automatically, which it wasn't doing when the auto-negotiation was occurring. Disabling the adapter TSO option was a recommendation in the beta forums for working around this issue.
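
To confirm what the NIC actually ended up with after hard-setting both ends, the relevant lines can be pulled out of the plain `ethtool <interface>` output. This is a sketch only; `link_state` is a made-up helper name and it assumes the usual `Speed:`, `Duplex:` and `Auto-negotiation:` lines that `ethtool` prints. As background, 1000BASE-T relies on auto-negotiation to bring a link up, which may be why forcing 1000Mbit on both ends gave no link here.

```shell
#!/bin/sh
# Reduce `ethtool <iface>` output (on stdin) to speed, duplex and autoneg state.
# link_state is a hypothetical helper name used only for this example.
link_state() {
    awk -F': *' '/^[ \t]*(Speed|Duplex|Auto-negotiation):/ {
             sub(/^[ \t]+/, "", $1)   # drop the leading tab ethtool prints
             print $1 "=" $2
         }'
}
```

Usage would be `ethtool eth0 | link_state`, which gives one `name=value` line for each of the three settings and makes it easy to compare before and after a reboot.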

  • Although much improved, there are still situations where the UTM simply stops responding or passing packets.

    Looks like the e1000e adapter is still resetting. Seems TSO is not being disabled any more.

    gateway:/root # lspci -vt
    -[0000:00]-+-00.0 Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller
               +-02.0 Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller
               +-14.0 Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller
               +-16.0 Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1
               +-16.3 Intel Corporation 7 Series/C210 Series Chipset Family KT Controller
               +-19.0 Intel Corporation 82579LM Gigabit Network Connection
               +-1a.0 Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2
               +-1b.0 Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller
               +-1d.0 Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1
               +-1e.0-[01]--
               +-1f.0 Intel Corporation Q77 Express Chipset LPC Controller
               +-1f.2 Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode]
               \-1f.3 Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller
    gateway:/root # lspci -vn
    00:19.0 0200: 8086:1502 (rev 04)
    Subsystem: 17aa:3086
    Flags: bus master, fast devsel, latency 0, IRQ 44
    Memory at f7c00000 (32-bit, non-prefetchable) [size=128K]
    Memory at f7c39000 (32-bit, non-prefetchable) [size=4K]
    I/O ports at f080 [size=32]
    Capabilities: [c8] Power Management version 2
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [e0] PCI Advanced Features
    Kernel driver in use: e1000e
    Kernel modules: e1000e
    gateway:/root # ethtool -k eth0
    Features for eth0:
    rx-checksumming: on
    tx-checksumming: on
    tx-checksum-ipv4: off [fixed]
    tx-checksum-ip-generic: on
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
    scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [fixed]
    tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp6-segmentation: on
    udp-fragmentation-offload: off [fixed]
    generic-segmentation-offload: on
    generic-receive-offload: on
    large-receive-offload: off [fixed]
    rx-vlan-offload: on
    tx-vlan-offload: on
    ntuple-filters: off [fixed]
    receive-hashing: on
    highdma: on [fixed]
    rx-vlan-filter: off [fixed]
    vlan-challenged: off [fixed]
    tx-lockless: off [fixed]
    netns-local: off [fixed]
    tx-gso-robust: off [fixed]
    tx-fcoe-segmentation: off [fixed]
    tx-gre-segmentation: off [fixed]
    tx-udp_tnl-segmentation: off [fixed]
    tx-mpls-segmentation: off [fixed]
    fcoe-mtu: off [fixed]
    tx-nocache-copy: off
    loopback: off [fixed]
    rx-fcs: off
    rx-all: off
    tx-vlan-stag-hw-insert: off [fixed]
    rx-vlan-stag-hw-parse: off [fixed]
    rx-vlan-stag-filter: off [fixed]
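
Most of that `ethtool -k` listing is noise for this problem; only the three offloads discussed in the beta forum thread matter. A small filter such as the sketch below (the name `offload_state` is made up for illustration) shows just those. In the output above all three are still `on`, which matches the observation that TSO is no longer being disabled.

```shell
#!/bin/sh
# Keep only the TSO, GSO and GRO summary lines from `ethtool -k <iface>` on stdin.
# offload_state is a hypothetical helper name used only for this example.
offload_state() {
    grep -E '^[[:space:]]*(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload):'
}
```

Usage would be `ethtool -k eth0 | offload_state`.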


  • Hey Jeremy,

    I've battled a lot with e1000e drivers as well; in the future I'll avoid these boards.
    That said, I'm currently running my setup on ESXi 6.5 with natively supported e1000e drivers and it works fine.

    Would you be able to try virtualizing your setup? The free hypervisor from VMware will totally do the trick.
    Honestly, I was reluctant at first to run my UTM as a VM, but I'm clearly not looking back nowadays: so much easier, with fallbacks, clones, storage vMotions...

    Cheers,

    -m-

  • Hey Mokaz

    I totally can do a virtual platform, in fact this all started (in hindsight) when I moved to this mini PC from a VM in my ESXi 5.5 server. Honestly the physical is better, especially running IDS/snort as that engine is single-threaded and my ESXi server is 30% slower on CPU speed (2.4GHz vs 3.4GHz) which makes a massive difference.

    The latest up2date rolled out after my last post, which of course required a reboot. Checking after the reboot, I see that the kernel log shows the TSO option correctly disabled on that adapter. So hard-setting the switch and machine to a consistent duplex & speed really was the solution; it just needed a clean reboot and for me to leave things alone afterwards!

    So far no more issues, and I have specific problems to go looking for if it does reoccur.

    Overall I hear you on the e1000e adapter. Unfortunately all the mini PCs seem to be Intel-based, and it's unusual to find a true mini machine with a PCIe slot for an expansion board.

    Cheers for the help, much appreciated.

  • The various Intel NICs are still an issue, and I've been forced to add a hack (?) to the startup scripts to set the GRO and TSO options.

    Create file '/etc/udev/rules.d/21-my-nic.rules' with content:

    #community.sophos.com/.../9-302-2-bug-adapter-e1000e-hangs-reset
    #ethtool -K eth0 gso off <- Can't automate this!
    #ethtool -K eth0 gro off
    #ethtool -K eth0 tso off
    SUBSYSTEM=="net", ACTION=="add", ATTRS{vendor}=="0x8086", ATTRS{device}=="0x1502", RUN+="/lib/udev/nic-disable-tso"
    SUBSYSTEM=="net", ACTION=="add", ATTRS{vendor}=="0x8086", ATTRS{device}=="0x1502", RUN+="/lib/udev/nic-disable-gro"
     

    I'd also add a line for nic-disable-gso but that script doesn't exist.
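
For readers who want to experiment with the missing piece, a helper along the following lines might fill the gap. This is purely a sketch and untested on a UTM: the file name `/lib/udev/nic-disable-gso` is hypothetical (it just mirrors the two existing scripts named in the rules), the contents of the real `nic-disable-tso` and `nic-disable-gro` scripts are not known here, and it assumes `ethtool` is on the `PATH` that udev uses. It relies on udev exporting `INTERFACE` for net devices. The `ETHTOOL` variable is an addition for dry runs only.

```shell
#!/bin/sh
# Hypothetical helper, e.g. saved as /lib/udev/nic-disable-gso.
# This file does NOT exist on the UTM; the name is an assumption.
# udev sets INTERFACE (e.g. eth0) in the environment for net device events.
# ETHTOOL may be overridden (e.g. ETHTOOL=echo) to see the command without running it.
disable_gso() {
    [ -n "$INTERFACE" ] || return 0    # no interface name given: do nothing
    "${ETHTOOL:-ethtool}" -K "$INTERFACE" gso off
}

disable_gso
```

A third rule line would then mirror the two above, with `RUN+="/lib/udev/nic-disable-gso"`. Running `INTERFACE=eth0 ETHTOOL=echo sh nic-disable-gso` by hand shows the arguments that would be passed to `ethtool` without touching the NIC.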