
Broken WAN port?

We have a customer with an SG125. They have been experiencing lags every couple of minutes. We have been troubleshooting for some time now, and last week we found out that starting a ping from their terminal server in the data center to anything on their LAN over an IPsec tunnel "fixed" the problem. They experienced no lags while the ping was running.

 

Yesterday I decided to fix the problem, no matter what. I found out that the kernel reported link flapping even though the ISP did not see link flapping on their side:

2020:02:23-22:03:37 fw kernel: [420615.139509] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:23-22:03:37 fw kernel: [420615.139572] br0: port 1(eth1) entered disabled state
2020:02:23-22:03:40 fw kernel: [420618.059066] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:23-22:03:40 fw kernel: [420618.059204] br0: port 1(eth1) entered forwarding state
2020:02:23-22:03:40 fw kernel: [420618.059238] br0: port 1(eth1) entered forwarding state
2020:02:23-22:03:55 fw kernel: [420633.079000] br0: port 1(eth1) entered forwarding state
2020:02:23-22:11:02 fw kernel: [421059.919407] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:23-22:11:02 fw kernel: [421059.919460] br0: port 1(eth1) entered disabled state
2020:02:23-22:11:04 fw kernel: [421062.775004] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:23-22:11:04 fw kernel: [421062.775145] br0: port 1(eth1) entered forwarding state
2020:02:23-22:11:04 fw kernel: [421062.775180] br0: port 1(eth1) entered forwarding state
2020:02:23-22:11:19 fw kernel: [421077.802938] br0: port 1(eth1) entered forwarding state

 

The WAN port was a bridge of eth1 and eth2, where eth2 was not connected. So I started by splitting the bridge and setting eth1 up as a regular Ethernet port, to exclude the possibility of something STP-related. The bridge-related log lines disappeared, but the problem was not solved.

 

The kernel.log now filled up with these lines:

2020:02:24-14:57:57 fw kernel: [481460.554306] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-14:58:00 fw kernel: [481463.461912] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-14:59:21 fw kernel: [481544.545914] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-14:59:24 fw kernel: [481547.433504] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-14:59:48 fw kernel: [481571.687330] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-15:00:22 fw kernel: [481605.863339] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-15:00:25 fw kernel: [481608.174485] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-15:00:27 fw kernel: [481611.086055] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-15:01:27 fw kernel: [481670.427370] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-15:01:30 fw kernel: [481673.418930] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2020:02:24-15:06:38 fw kernel: [481981.560022] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Down
2020:02:24-15:06:41 fw kernel: [481984.311627] igb 0000:00:14.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
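
For anyone wanting to quantify flapping like this, here is a minimal Python sketch that pairs each "Link is Down" with the following "Link is Up" and reports how long the link was down. It assumes the exact igb message format shown in the excerpts above; the function name `link_outages` is just an illustrative name, and other drivers may word their messages differently.

```python
import re
from datetime import datetime

# Matches the igb link messages as they appear in the excerpts above.
# Other drivers or log formats may need a different pattern.
LINK_RE = re.compile(
    r"^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] .*? (?P<iface>\w+) NIC Link is (?P<state>Up|Down)"
)


def link_outages(lines, iface="eth1"):
    """Return (down_timestamp, seconds_down) for each Down -> Up pair on iface."""
    outages = []
    down_at = None  # (wall-clock timestamp, kernel uptime) of the pending Down event
    for line in lines:
        m = LINK_RE.match(line)
        if not m or m.group("iface") != iface:
            continue
        uptime = float(m.group("uptime"))
        if m.group("state") == "Down":
            ts = datetime.strptime(m.group("ts"), "%Y:%m:%d-%H:%M:%S")
            down_at = (ts, uptime)
        elif down_at is not None:
            # Use the kernel uptime stamps for sub-second precision.
            outages.append((down_at[0], round(uptime - down_at[1], 3)))
            down_at = None
    return outages


if __name__ == "__main__":
    import sys
    # Usage: python flaps.py < kernel.log
    for when, seconds in link_outages(sys.stdin):
        print(f"{when}  link down for {seconds:.3f} s")
```

Fed the lines above, this would show that most outages last roughly three seconds, with one considerably longer gap between 14:59:48 and 15:00:22.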

After trying absolutely everything seemingly related, I moved the WAN port from eth1 to eth7, and voilà: it has been stable ever since, closing in on 24 hours now.

I have worked with networking since 1993 and have never seen a single port on a switch/firewall/router behave like this. Does anyone have an explanation (other than a physically broken port)?

 



  • Actually, your suspicion is plausible. I was a bit surprised when I found out that eth0-eth3 and eth4-eth7 are different NICs. Both are Intel, but different models. I wonder if I would have gotten the same symptoms on eth2 and eth3. Too bad I cannot experiment on a customer's production firewall. :)
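
    One way to see which ports sit on which controller is to read the driver name and PCI IDs that Linux exposes under /sys/class/net. The sketch below is only an illustration: it assumes a Linux host with Python available and the standard sysfs layout, and the function name `nic_hardware` is made up for this example.

```python
import os


def nic_hardware(sysfs_root="/sys/class/net"):
    """Map interface name -> (driver, pci_vendor_id, pci_device_id) using sysfs."""
    result = {}
    for iface in sorted(os.listdir(sysfs_root)):
        dev = os.path.join(sysfs_root, iface, "device")
        if not os.path.isdir(dev):
            continue  # virtual interfaces (lo, bridges) have no backing device

        def read(name, dev=dev):
            # Return the stripped file contents, or None if the attribute is absent.
            try:
                with open(os.path.join(dev, name)) as f:
                    return f.read().strip()
            except OSError:
                return None

        # "driver" is a symlink whose target directory is named after the driver.
        driver_link = os.path.join(dev, "driver")
        driver = None
        if os.path.exists(driver_link):
            driver = os.path.basename(os.path.realpath(driver_link))
        result[iface] = (driver, read("vendor"), read("device"))
    return result


if __name__ == "__main__":
    for name, (driver, vendor, device) in nic_hardware().items():
        print(f"{name}: driver={driver} vendor={vendor} device={device}")
```

    Ports that report the same driver but different device IDs would be different controller models, which is the situation described here for eth0-eth3 versus eth4-eth7.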