Broken WAN port?

We have a customer with an SG125. They have been experiencing lags every couple of minutes. We have been troubleshooting for some time, and last week we found that running a continuous ping from their terminal server in the data center to anything on their LAN, across an IPsec tunnel, "fixed" the problem: they experienced no lags while the ping was running.

 

Yesterday I decided to fix the problem, no matter what. I found out that the kernel reported link flapping even though the ISP did not see link flapping on their side:

2020:02:23-22:03:37 fw kernel: [420615.139509] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:23-22:03:37 fw kernel: [420615.139572] br0: port 1(eth1) entered disabled state
2020:02:23-22:03:40 fw kernel: [420618.059066] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:23-22:03:40 fw kernel: [420618.059204] br0: port 1(eth1) entered forwarding state
2020:02:23-22:03:40 fw kernel: [420618.059238] br0: port 1(eth1) entered forwarding state
2020:02:23-22:03:55 fw kernel: [420633.079000] br0: port 1(eth1) entered forwarding state
2020:02:23-22:11:02 fw kernel: [421059.919407] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:23-22:11:02 fw kernel: [421059.919460] br0: port 1(eth1) entered disabled state
2020:02:23-22:11:04 fw kernel: [421062.775004] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:23-22:11:04 fw kernel: [421062.775145] br0: port 1(eth1) entered forwarding state
2020:02:23-22:11:04 fw kernel: [421062.775180] br0: port 1(eth1) entered forwarding state
2020:02:23-22:11:19 fw kernel: [421077.802938] br0: port 1(eth1) entered forwarding state

 

The WAN interface was a bridge of eth1 and eth2, with eth2 not connected. So I started by splitting the bridge and setting eth1 up as a regular Ethernet interface, to rule out anything STP-related. The bridge-related log lines disappeared, but the problem was not solved.

 

The kernel log then filled up with these lines:

2020:02:24-14:57:57 fw kernel: [481460.554306] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-14:58:00 fw kernel: [481463.461912] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-14:59:21 fw kernel: [481544.545914] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-14:59:24 fw kernel: [481547.433504] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-14:59:48 fw kernel: [481571.687330] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-15:00:22 fw kernel: [481605.863339] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-15:00:25 fw kernel: [481608.174485] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-15:00:27 fw kernel: [481611.086055] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-15:01:27 fw kernel: [481670.427370] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-15:01:30 fw kernel: [481673.418930] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-15:06:38 fw kernel: [481981.560022] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-15:06:41 fw kernel: [481984.311627] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
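
For anyone wanting to quantify flapping like this, here is a minimal Python sketch that pairs each "Link is Down" line with the following "Link is Up" line and reports how long the link was down. It assumes the log format shown above and uses the kernel uptime stamp in square brackets. The function name `link_outages` and the regular expression are illustrative choices for this example, not part of any Sophos tooling.

```python
import re

# Matches igb link messages such as:
# "... kernel: [481460.554306] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down"
LINK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+igb\s+\S+\s+(?P<iface>\w+):.*NIC Link is (?P<state>Up|Down)"
)


def link_outages(lines, iface="eth1"):
    """Return (down_ts, up_ts, seconds_down) for each Down/Up pair on iface."""
    outages = []
    down_at = None
    for line in lines:
        m = LINK_RE.search(line)
        if not m or m.group("iface") != iface:
            continue  # not a link message for the interface we care about
        ts = float(m.group("ts"))  # kernel uptime in seconds
        if m.group("state") == "Down":
            down_at = ts
        elif down_at is not None:
            outages.append((down_at, ts, round(ts - down_at, 3)))
            down_at = None
    return outages


if __name__ == "__main__":
    sample = [
        "2020:02:24-14:57:57 fw kernel: [481460.554306] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down",
        "2020:02:24-14:58:00 fw kernel: [481463.461912] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None",
    ]
    for down_ts, up_ts, seconds in link_outages(sample):
        print(f"down at {down_ts}, up at {up_ts}, outage {seconds} s")
```

Feeding the whole kernel log through this gives one row per outage, which makes it easy to see whether the outages have a regular duration or interval.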

After trying absolutely everything that seemed related, I moved the WAN interface from eth1 to eth7 and voilà: the link has been stable ever since, closing in on 24 hours now.

I have worked with networking since 1993 and have never seen a single port on a switch, firewall, or router behave like this. Does anyone have an explanation (other than a physically broken port)?

 



  • Hei Rolf-Arne,

    I wonder if this isn't an incompatibility between the SG's NICs and the ISP's equipment.  What happens if you try #7.7 in Rulz (last updated 2019-04-17)?

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • The UTM is at a site quite far away, so I am unable to test this, and I do not think my customer wants to pay for more hours of troubleshooting. I did consider autonegotiation issues as the cause, and checking that would have been my next troubleshooting step.

    Fortunately, changing to another port fixed the problem. I was not tempted to try to get the ISP to change speed/duplex on their router. :)
