Broken WAN port?

We have a customer with an SG125. They have been experiencing lags every couple of minutes. We have been troubleshooting for some time now, and last week we found out that starting a ping from their terminal server in the data center to anything on their LAN over an IPsec tunnel "fixed" the problem. They experienced no lags while the ping was running.

 

Yesterday I decided to fix the problem, no matter what. I found out that the kernel reported link flapping even though the ISP did not see link flapping on their side:

2020:02:23-22:03:37 fw kernel: [420615.139509] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:23-22:03:37 fw kernel: [420615.139572] br0: port 1(eth1) entered disabled state
2020:02:23-22:03:40 fw kernel: [420618.059066] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:23-22:03:40 fw kernel: [420618.059204] br0: port 1(eth1) entered forwarding state
2020:02:23-22:03:40 fw kernel: [420618.059238] br0: port 1(eth1) entered forwarding state
2020:02:23-22:03:55 fw kernel: [420633.079000] br0: port 1(eth1) entered forwarding state
2020:02:23-22:11:02 fw kernel: [421059.919407] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:23-22:11:02 fw kernel: [421059.919460] br0: port 1(eth1) entered disabled state
2020:02:23-22:11:04 fw kernel: [421062.775004] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:23-22:11:04 fw kernel: [421062.775145] br0: port 1(eth1) entered forwarding state
2020:02:23-22:11:04 fw kernel: [421062.775180] br0: port 1(eth1) entered forwarding state
2020:02:23-22:11:19 fw kernel: [421077.802938] br0: port 1(eth1) entered forwarding state

 

The WAN port was a bridge of eth1 and eth2, where eth2 was not connected. So I started by splitting the bridge and setting the WAN up as a regular Ethernet port, to exclude the possibility of something STP-related. Some of the log lines disappeared, but the problem was not solved.

 

The kernel.log now filled up with these lines:

2020:02:24-14:57:57 fw kernel: [481460.554306] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-14:58:00 fw kernel: [481463.461912] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-14:59:21 fw kernel: [481544.545914] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-14:59:24 fw kernel: [481547.433504] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-14:59:48 fw kernel: [481571.687330] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-15:00:22 fw kernel: [481605.863339] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-15:00:25 fw kernel: [481608.174485] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-15:00:27 fw kernel: [481611.086055] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-15:01:27 fw kernel: [481670.427370] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-15:01:30 fw kernel: [481673.418930] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-15:06:38 fw kernel: [481981.560022] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-15:06:41 fw kernel: [481984.311627] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
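
For anyone who wants to quantify flapping like this from a saved kernel.log, here is a minimal Python sketch. It assumes the log lines have exactly the timestamp and igb message format shown above; the function names `parse_link_events` and `flap_outages` are my own and not part of any UTM tooling.

```python
import re
from datetime import datetime

# Captures the timestamp, the interface name, and the link state from
# igb link messages in the format shown in the excerpts above.
LINK_RE = re.compile(
    r"^(\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) .*? (\w+): igb: \w+ NIC Link is (Up|Down)"
)


def parse_link_events(lines):
    """Return (timestamp, interface, state) for every link up/down line."""
    events = []
    for line in lines:
        match = LINK_RE.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y:%m:%d-%H:%M:%S")
            events.append((ts, match.group(2), match.group(3)))
    return events


def flap_outages(events):
    """Pair each Down with the next Up on the same interface.

    Returns (interface, time the link went down, outage length in seconds).
    """
    pending = {}  # interface -> timestamp of the most recent unmatched Down
    outages = []
    for ts, iface, state in events:
        if state == "Down":
            pending[iface] = ts
        elif iface in pending:
            down_ts = pending.pop(iface)
            outages.append((iface, down_ts, (ts - down_ts).total_seconds()))
    return outages


if __name__ == "__main__":
    import sys

    # Usage: python flaps.py /path/to/kernel.log  (path is whatever you saved)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for iface, down_ts, seconds in flap_outages(parse_link_events(handle)):
            print(f"{iface} down at {down_ts} for {seconds:.0f}s")
```

Going by the one-second resolution of the log timestamps, the excerpt above works out to six outages on eth1 in roughly nine minutes, each 2 to 3 seconds long except the one starting at 14:59:48, which lasted 34 seconds.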

After trying absolutely everything seemingly related, I switched the WAN port from eth1 to eth7 and voilà, it has been stable since, closing in on 24 hours now.

After working with networking since 1993, I have never seen a single port on a switch/firewall/router behave like this. Does anyone have an explanation (other than a physically broken port)?

 



  • The UTM is at a site quite far away, so I am unable to test, and I do not think my customer wants to pay for my hours for further troubleshooting. I considered autonegotiation issues as the cause, and that would have been my next troubleshooting step.

    Fortunately, changing to another port fixed the problem. I was not tempted to try to get the ISP to change speed/duplex on the router. :)

  • Actually, your suspicion is plausible. I was a bit surprised when I found out that eth0-eth3 and eth4-eth7 are different NICs. Both Intel, but different models. I wonder if I would have gotten the same symptoms on eth2 and eth3. Too bad I cannot experiment on a customer's production firewall. :)