WAN Interface State=UP Link=DOWN

My external monitor clued me in to two events this week with my SG105.  It has two WAN interfaces; ETH1 uses DHCP and ETH3 uses a static address.  Both are in the Active Interfaces box on the Interfaces > Uplink Balancing screen, ETH1 first, with ETH1 weighted at 90 and ETH3 at 10.  I have a DDNS entry that is successfully updated to the ETH1 interface's address when things are normal.  During initial testing, I pulled the ETH1 cable and the system quickly detected the issue and registered the ETH3 static IP with DDNS as expected.  It also restored the interface and reverted the DDNS entry when I reconnected ETH1.  All good.

This week, the system appears to have detected some fault on ETH1 and failed it.  The first time was mid-weekday and I didn't have time to investigate, so I disabled and re-enabled the ETH1 interface and everything reset back to normal.  It just happened again midday Saturday, when I have more time to investigate before getting screams from users.

The Dashboard page is indicating the State of ETH1 is Up but the Link is Down.  I SSH'd in and ifconfig shows the interface up with the same IP the provider always assigns.  I went poking through the logs and found entries in Fallback Messages.

2016:03:19-16:04:25 remote [daemon:info] nwd[30869]:  Interface eth1 is up but link is down 
2016:03:19-16:04:25 remote [daemon:info] nwd[30869]:  Interface eth1 has link down
2016:03:19-16:04:25 remote [daemon:info] nwd[30869]:  Writing into file  eth1: 0
2016:03:19-16:04:25 remote [daemon:info] nwd[30869]:  Executing Command /var/mdw/scripts/dhcpc renew eth1
2016:03:19-16:04:25 remote [daemon:info] irqd[5883]:  eth1 ether 00:1a:8c:zz:zz:zz <broadcast,multicast,up> group 0 
2016:03:19-16:04:37 remote [daemon:info] irqd[5883]:  eth1 ether 00:1a:8c:zz:zz:zz <broadcast,multicast,up,running,lowerup> group 0 
2016:03:19-16:04:37 remote [daemon:info] nwd[30869]:  Interface eth1 is up and link is back up  
2016:03:19-16:04:37 remote [daemon:info] nwd[30869]:  Interface eth1 has link up
2016:03:19-16:04:37 remote [daemon:info] nwd[30869]:  Writing into file  eth1: 1
2016:03:19-16:04:37 remote [daemon:info] nwd[30869]:  Executing Command /var/mdw/scripts/dhcpc renew eth1
2016:03:19-16:04:40 remote [daemon:info] dhcp_updown[28278]:  Installing IPv4 address: #.#.#.#/255.255.255.0

The address in the last line of the log is my typical ETH1 address so this looks good.  No idea why it thought it went down though.
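
For anyone who wants to size these blips from a saved copy of the log, here is a minimal Python sketch.  It assumes the timestamp layout shown in the excerpt above and only looks at the nwd "has link down" / "has link up" lines; the function name `link_outages` is made up for illustration and is not part of the UTM.

```python
import re
from datetime import datetime

# Matches nwd lines of the form seen in the excerpt above, e.g.
# 2016:03:19-16:04:25 remote [daemon:info] nwd[30869]:  Interface eth1 has link down
LINK_RE = re.compile(
    r"^(\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) .*nwd\[\d+\]:\s+"
    r"Interface (\S+) has link (up|down)\s*$"
)


def link_outages(lines):
    """Pair each 'has link down' with the next 'has link up' per interface.

    Returns a list of (interface, down_time, up_time, seconds) tuples.
    """
    down_since = {}
    outages = []
    for line in lines:
        m = LINK_RE.match(line)
        if not m:
            continue  # ignore everything that is not a link up/down line
        ts = datetime.strptime(m.group(1), "%Y:%m:%d-%H:%M:%S")
        iface, state = m.group(2), m.group(3)
        if state == "down":
            # keep the earliest 'down' if several arrive before an 'up'
            down_since.setdefault(iface, ts)
        elif iface in down_since:
            start = down_since.pop(iface)
            outages.append((iface, start, ts, (ts - start).total_seconds()))
    return outages


# Sample lines copied from the log excerpt above.
sample = [
    "2016:03:19-16:04:25 remote [daemon:info] nwd[30869]:  Interface eth1 is up but link is down ",
    "2016:03:19-16:04:25 remote [daemon:info] nwd[30869]:  Interface eth1 has link down",
    "2016:03:19-16:04:37 remote [daemon:info] nwd[30869]:  Interface eth1 is up and link is back up  ",
    "2016:03:19-16:04:37 remote [daemon:info] nwd[30869]:  Interface eth1 has link up",
]

for iface, start, end, secs in link_outages(sample):
    print(f"{iface}: down {start:%H:%M:%S}, up {end:%H:%M:%S}, {secs:.0f} s")
# prints "eth1: down 16:04:25, up 16:04:37, 12 s"
```

On the lines above this reports a single 12-second link drop on eth1, which only measures how long the physical link was gone, not why it dropped.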

I went ahead and disabled then re-enabled ETH1 again and everything recovered.  I'm looking for some guidance on a couple of fronts.  First, where can I find more info on what happened?  Second, what does State Up but Link Down indicate?

Thanks in advance.

  • This could be the ISP's equipment and your SG's NIC swearing at each other.  Try setting both on fixed speed/duplex to see if that solves the problem.  Start with 100Mbps/Full and then try Half if Full doesn't work.  After applying the new settings on both devices, reboot both.

    Any luck with that?

    Cheers - Bob

  • In reply to BAlfson:

    I thought of that but the setup has been working for a few weeks already and I'd expect to have seen issues with "auto" earlier.  I chose to replace the physical cabling first and, so far, it's been solid.  Will let it ride with a cable lying on the floor for a while and repunch and test the original path in the meantime.  Will try the speed/duplex settings if the issue returns.

  • In reply to Paul Dugas:

    It was solid for more than a week, then started doing it again today.  3 episodes so far.  The kernel is clearly seeing a physical issue given the dmesg content below.

    [2403269.618927] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Down
    [2403282.234570] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [2403444.345974] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Down
    [2403456.928341] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [2403767.253375] net_ratelimit: 1 callbacks suppressed
    [2404900.210003] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Down
    [2404912.807399] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [2405458.639830] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
    [2405458.639841] 8021q: adding VLAN 0 to HW filter on device eth1
    [2405461.648575] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [2405461.648871] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
    [2406682.662082] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
    [2406682.662093] 8021q: adding VLAN 0 to HW filter on device eth1
    [2406685.590843] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [2406685.591139] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
    [2416851.594641] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Down
    [2416863.780448] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [2420105.752894] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
    [2420105.752905] 8021q: adding VLAN 0 to HW filter on device eth1
    [2420108.769593] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [2420108.769887] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
    [2420151.853424] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
    [2420151.853435] 8021q: adding VLAN 0 to HW filter on device eth1
    [2420154.834274] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [2420154.834567] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready

    I know I need to figure out the hardware fault but I'm really curious about how the UTM is responding.  Each time, I'm finding the entry on the UTM's main dashboard showing the State as UP but the Link as DOWN.  At the same time, the interface is [UP] on the Interfaces page.  Running ifconfig in an SSH session shows the interface is UP and using the IPv4 and IPv6 addresses consistently assigned via DHCP by the provider.  Running ethtool also shows good info with the correct negotiated speed and duplex values and "Link detected" as "yes".

    Why isn't the UTM recovering from whatever these blips are? 

  • In reply to Paul Dugas:

    I've seen this too many times.  Did you try the fixed settings and reboots?

    Cheers - Bob

  • In reply to BAlfson:

    I fixed the settings to 100baseT, full-duplex in the UTM UI.  ethtool now reports the same.  I've not rebooted.  Will let it lie for a bit first but that will be the next step.

    On a side note, I also have a home-built machine here running the same UTM firmware, as opposed to the Sophos SG105 hardware in question in this thread.  Both are connected to identical Cisco/Linksys cable modems and Comcast circuits.  I've never seen this issue with the home-built one here.  Have only seen this with the SG105 at this site.  They almost certainly have different hardware interfaces and drivers so it's not immediately relevant, but it's where I'm coming from.  This is why I'm surprised to see these faults.

    Separate from the hardware and potential negotiation issues, I feel like I'm experiencing a situation that the software should be able to recover from on its own.  Am I off base here?  Given the ethtool and ifconfig outputs and the fact that disabling and re-enabling the interface restores things, I don't understand why the UTM dashboard is reporting the link as down.  Why doesn't it recover on its own?

  • In reply to Paul Dugas:

    "I've never seen this issue with the home-built one here." - that would serve to confirm my guess.

    Cheers - Bob

  • In reply to BAlfson:

    It's been more reliable since disabling auto-negotiation on the hardware interface but the problem persists.  Was solid for two weeks then it happened yesterday and again overnight.  Every time, the kernel says the interface is up and it's already reconfigured the interface via DHCP but the UTM says the interface is in error and won't enable it. This is driving me insane. 

  • In reply to Paul Dugas:

    Paul, did you also make the same fixed settings on the ISP's equipment and then reboot both devices afterwards?

    Cheers - Bob

  • In reply to BAlfson:

    Honestly, I asked the on-site guy to do it but I'll be there tomorrow night to reverify.

  • In reply to Paul Dugas:

    Gah!  Cisco and Comcast are going to drive me insane!  I bought this Cisco/Linksys DPC3008 cable modem to avoid the monthly rental fee.  Seems the firmware Linksys ships it with is Comcast-specific.  Cisco won't tell me what the login info is - says, go to Comcast.  Comcast says they don't know the credentials since it's not one they own. Catch-22!  Crazy!

    Anyway, changed tactics...  Inserted a couple ports on my core switch in between the UTM and cable modem.  Partitioned them off into a separate untagged VLAN.  If the cable modem's auto-negotiation fails, I'm hoping the switch handles it and prevents the connection to the UTM from seeing anything weird.  Fingers crossed...

  • Does anyone perhaps know where these types of errors are stored in /log/ on the XG?  I looked in the errors log but don't see any entries for interfaces.

  • In reply to Paul Dugas:

    Do you perhaps know how I can see these types of logs on the XG?