Strange drops

We have a customer with a phone switchboard application that periodically freezes, either at an application level (can't click anything), or it just won't show incoming calls. In both cases it can sometimes unfreeze, and then all the calls that have come in in the meantime suddenly flash on the screen. We've ruled out AV as the cause and are now looking into the problem being at the network layer.

drop-packet-capture shows this at the time of freezing:

2017-05-23 08:58:14 0101021 IP 10.10.90.2.8779 > 10.10.10.112.43470 : proto TCP: P 3007061919:3007062115(196) win 330 checksum : 55314
0x0000:  4500 00ec 18b4 4000 3f06 a9d2 0a0a 5a02  E.....@.?.....Z.
0x0010:  <remainder of the packet redacted>
Date=2017-05-23 Time=08:58:14 log_id=0101021 log_type=Firewall log_component=Firewall_Rule log_subtype=Denied log_status=N/A log_priority=Alert duration=N/A in_dev=Lag.90 out_dev=Lag.10 inzone_id=1 outzone_id=8 source_mac=00:1a:e8:8b:15:b4 dest_mac=00:e0:20:11:08:fc l3_protocol=IP source_ip=10.10.90.2 dest_ip=10.10.10.112 l4_protocol=TCP source_port=8779 dest_port=43470 fw_rule_id=0 policytype=1 live_userid=0 userid=0 user_gp=0 ips_id=0 sslvpn_id=0 web_filter_id=0 hotspot_id=0 hotspotuser_id=0 hb_src=0 hb_dst=0 dnat_done=0 proxy_flags=0 icap_id=0 app_filter_id=0 app_category_id=0 app_id=0 category_id=0 bandwidth_id=0 up_classid=0 dn_classid=0 source_nat_id=0 cluster_node=0 inmark=0x0 nfqueue=101 scanflags=0 gateway_offset=0 max_session_bytes=0 drop_fix=0 ctflags=33554472 connid=2341170016 masterid=0 status=398 state=3 sent_pkts=N/A recv_pkts=N/A sent_bytes=N/A recv_bytes=N/A tran_src_ip=N/A tran_src_port=N/A tran_dst_ip=N/A tran_dst_port=N/A

Then the same again exactly 2 minutes later (even the checksum is the same).
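
To make repeats like this easier to spot in a longer drop log, here is a minimal sketch that groups the header lines of drop-packet-capture output by source, destination, and sequence range. The regular expression is an assumption based only on the one line format shown above, and the second sample line is reconstructed from the description ("the same again exactly 2 minutes later"), not copied from a capture.

```python
import re
from collections import defaultdict
from datetime import datetime

# Pattern for the first line of each drop record, inferred from the single
# example in this thread (an assumption, not a documented format).
HEADER = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<log_id>\d+) IP "
    r"(?P<src>\S+) > (?P<dst>\S+) : proto TCP: (?P<flags>\S+) "
    r"(?P<seq>\d+):(?P<end>\d+)\((?P<length>\d+)\)"
)

def repeated_drops(lines):
    """Return {(src, dst, seq, end): [timestamps]} for segments dropped more than once."""
    seen = defaultdict(list)
    for line in lines:
        m = HEADER.match(line.strip())
        if m is None:
            continue  # hex dump or key=value line, not a header line
        key = (m["src"], m["dst"], int(m["seq"]), int(m["end"]))
        seen[key].append(datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S"))
    return {k: v for k, v in seen.items() if len(v) > 1}

sample = [
    # Copied from the log above.
    "2017-05-23 08:58:14 0101021 IP 10.10.90.2.8779 > 10.10.10.112.43470 : "
    "proto TCP: P 3007061919:3007062115(196) win 330 checksum : 55314",
    # Reconstructed from the description of the second drop (same packet, 2 minutes later).
    "2017-05-23 09:00:14 0101021 IP 10.10.90.2.8779 > 10.10.10.112.43470 : "
    "proto TCP: P 3007061919:3007062115(196) win 330 checksum : 55314",
]

repeats = repeated_drops(sample)
for (src, dst, seq, end), times in repeats.items():
    gap = (times[-1] - times[0]).total_seconds()
    print(f"{src} > {dst} seq {seq}:{end} dropped {len(times)} times over {gap:.0f}s")
```

An identical sequence range in both drops suggests the sender is retransmitting the same segment rather than sending new data, which would fit an application that freezes and then catches up.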

The connection came good another minute later.

Any idea where to look next?

thanks

James

  • James,

    Create a firewall rule from 10.10.90.2 TCP 8779 to 10.10.10.112 port 43470. Log ID 0101021 means that the traffic was dropped by the firewall.

    Regards

     

  • In reply to lferrara:

    You can see from the packet that this is not a SYN packet, this is a packet from an established connection that has been dropped for no obvious reason. The connection has not timed out - it is still present in conntrack. In this case, the connection came good again after a bit and the application unfroze. So for some reason XG is deciding that it occasionally doesn't like something about packets in the middle of a connection.
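
    As a cross-check on the packet itself, the 16 header bytes shown in the hex dump in the original post can be decoded by hand. This is a minimal sketch using only the Python standard library; it covers only the bytes that were posted, so the TCP flags (which sit in the redacted part) still have to be taken from the "P" in the log line.

```python
import socket
import struct

# First 16 bytes of the dropped packet, copied from the hex dump in the original post.
raw = bytes.fromhex("450000ec18b440003f06a9d20a0a5a02")

def decode_ipv4_prefix(data):
    """Decode the first 16 bytes of an IPv4 header (everything up to the source address)."""
    (ver_ihl, _tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src) = struct.unpack("!BBHHHBBH4s", data[:16])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,      # IHL is counted in 32-bit words
        "total_len": total_len,
        "id": ident,
        "dont_fragment": bool(flags_frag & 0x4000),
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": proto,                        # 6 = TCP
        "header_checksum": checksum,
        "src": socket.inet_ntoa(src),
    }

hdr = decode_ipv4_prefix(raw)
tcp_segment_len = hdr["total_len"] - hdr["header_len"]  # TCP header plus payload
tcp_header_len = tcp_segment_len - 196                  # 196 = payload length from the log line

for name, value in hdr.items():
    print(f"{name}: {value}")
print(f"tcp_header_len: {tcp_header_len}")
```

    The decoded values describe an unfragmented TCP packet from 10.10.90.2 with a total length of 236 bytes. Subtracting the 20-byte IP header and the 196-byte payload leaves 20 bytes for the TCP header, which is the minimum size with no TCP options.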

    I have done the following:

    set advanced-firewall bypass-stateful-firewall-config add source_network 10.10.90.0 source_netmask 255.255.255.0 dest_network 10.10.10.0 dest_netmask 255.255.255.0

    set advanced-firewall bypass-stateful-firewall-config add source_network 10.10.10.0 source_netmask 255.255.255.0 dest_network 10.10.90.0 dest_netmask 255.255.255.0

    which disables connection tracking and inspection between the two networks, and the problem has not occurred in the 8ish hours since I put that in place. Previously the problem would have occurred many times in that time.

    I will raise a ticket with Sophos if I get another day of trouble-free connectivity. If the customer is in agreement I might try removing those rules to see if the problem returns, just to prove the fix.

  • In reply to jamesharper:

    jamesharper,

    If those commands resolved it, you have an asymmetric routing issue. XG blocks asymmetric traffic by default, so this behaviour is normal. More info here:

    https://community.sophos.com/products/xg-firewall/f/network-and-routing/89972/asymmetric-routing-question

    Regards

  • In reply to lferrara:

    I don't see how. One device is on Lag.10, and the other device is on Lag.90. It's not possible for there to be asymmetric routing as the Sophos is the only way to get from one network to the other. The only other thing it could possibly be is the HA Passive device interfering somehow, but I don't see any evidence of that. I guess that would be easy enough to test though.

    Also, only occasional packets are being blocked. If there were an asymmetric route, I would expect the connection not to work at all.

  • In reply to jamesharper:

    Thanks James for your answer.

    If that is the situation, something is not working properly (bug?). Open a ticket and let us know.

    Regards

  • In reply to lferrara:

    I will be doing that once I have confirmed that the issue is resolved. This particular issue has appeared to be "fixed" a few times now via various means (disable AV, build new Citrix session host, etc), but never for this long.

    The other thing that bothers me is if this is a bug, it's a pretty major one. I can't be the first one seeing this problem. Maybe the phone system is doing something a bit strange with TCP, but it's Linux based so I wouldn't think it would be a problem.

    This would be much easier to troubleshoot if I could capture a whole day's worth of packets...

  • In reply to jamesharper:

    James,

    I am having the exact same issue as you. Support has not found an answer yet. I had to move all routing to our core switches for it to work properly. I had the exact scenario where I had the firewall rules in place but sometimes the XG just didn't allow the connection. Unfortunately, this is the 3rd case I have open right now with no answers. Support is a joke. The engineer saw this happening live and was baffled. I am very curious as to what they say to you. Please keep us updated. My next step is to reload the primary device from the firewall OS ISO and restore a backup. Once that is restored, I will reinstall the OS on the auxiliary unit and set up HA.

    Mike 

  • In reply to MichaelBolton:

    Hi Mike,

    Thanks for the confirmation that you are seeing similar problems. I wonder if it's possible to log a "class action" support case :)

    I notice you mention you are using HA - have you tried reproducing this problem with HA turned off? That's what I might be trying next, although I should probably let the client have a few days of trouble-free networking before I do that.

    When faced with this sort of problem I normally turn to tcpdump, but to investigate properly I need to capture a whole day's worth of traffic, or at least a lot more than the 100000 packets that XG's tcpdump filedump lets me capture. I can capture on the PC, but the packets are getting dropped on the way from the PBX to the PC, so the PC wouldn't see them and I wouldn't get the whole picture.

    I wonder if tcpdump from Debian would run on XG... even if I have to copy all the deps into a chroot environment to make it work :)

    thanks

    James

  • In reply to jamesharper:

    Hi James,

    I did try with a single unit for a day or so and that did not help in our case. I reloaded the OS on both units but I have not tested it again. I moved all of the routing back to our core switches after fighting it for a few weeks. I hope I get a chance in the next few days to put the gateway back on the XG. I can't believe the XG is failing at the most basic of firewall functions. My output is identical to yours. The connections show in the connection table and have a connection ID. In my case it is computers accessing a SQL database when the connections get dropped. Nothing strange about the packets at all.

    2017-05-11 09:44:40 0101021 IP 10.11.0.31.53689 > 10.10.0.18.1433 : proto TCP: P 1318068363:1318068473(110) win 256 checksum : 5546
    0x0000:<remainder of the packet redacted>
    Date=2017-05-11 Time=09:44:40 log_id=0101021 log_type=Firewall log_component=Firewall_Rule log_subtype=Denied log_status=N/A log_priority=Alert duration=N/A in_dev=Uplink_LAG.19 out_dev=Uplink_LAG inzone_id=1 outzone_id=1 source_mac=18 6:da:1f:38:9b dest_mac=00:e0:20:11:0a:66 l3_protocol=IP source_ip=10.11.0.31 dest_ip=10.10.0.18 l4_protocol=TCP source_port=53689 dest_port=1433 fw_rule_id=0 policytype=1 live_userid=0 userid=0 user_gp=0 ips_id=0 sslvpn_id=0 web_filter_id=0 hotspot_id=0 hotspotuser_id=0 hb_src=0 hb_dst=0 dnat_done=0 proxy_flags=0 icap_id=0 app_filter_id=0 app_category_id=0 app_id=0 category_id=0 bandwidth_id=0 up_classid=0 dn_classid=0 source_nat_id=0 cluster_node=0 inmark=0x0 nfqueue=100 scanflags=0 gateway_offset=0 max_session_bytes=0 drop_fix=0 ctflags=40 connid=180176096 masterid=0 status=398 state=3 sent_pkts=N/A recv_pkts=N/A sent_bytes=N/A recv_bytes=N/A tran_src_ip=N/A tran_src_port=N/A tran_dst_ip=N/A tran_dst_port=N/A

    (IPs changed)
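
    To put "identical" on a firmer footing, the two drop records in this thread can be compared field by field. This is a minimal sketch that parses a subset of the key=value fields copied from the two log lines above; the meaning of the individual ctflags bits is not documented anywhere in this thread, so the bitwise comparison is shown only as arithmetic, not as an interpretation.

```python
import re

def parse_fields(record):
    """Turn 'key=value key=value ...' text into a dict of strings."""
    return dict(re.findall(r"(\w+)=(\S+)", record))

# Subsets of the two drop records posted in this thread.
james = parse_fields(
    "log_id=0101021 log_subtype=Denied fw_rule_id=0 policytype=1 "
    "ctflags=33554472 connid=2341170016 status=398 state=3"
)
mike = parse_fields(
    "log_id=0101021 log_subtype=Denied fw_rule_id=0 policytype=1 "
    "ctflags=40 connid=180176096 status=398 state=3"
)

shared = {k: v for k, v in james.items() if mike.get(k) == v}
differing = {k: (v, mike.get(k)) for k, v in james.items() if mike.get(k) != v}

# Bits set in both ctflags values (what each bit means is unknown here).
ctflags_common = int(james["ctflags"]) & int(mike["ctflags"])

print("shared:", shared)
print("differing:", differing)
print("ctflags:", hex(int(james["ctflags"])), hex(int(mike["ctflags"])),
      "common bits:", hex(ctflags_common))
```

    Both records share fw_rule_id=0, status=398 and state=3, and every ctflags bit set in the second record (0x28) is also set in the first (0x2000028). Those are the values worth quoting in a support case.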

     

    I agree with you on trying to capture packets. My next plan is to port mirror the interface on the switch for one of the computers affected and the database server. Then run Wireshark until I capture it. That may be next week though before I can. 

    Mike

  • In reply to MichaelBolton:

    James,

    I am having another issue with authentication that may or may not be associated. Just trying to get all of the details I can. Is your customer using STAS as well?

    Mike

  • In reply to MichaelBolton:

    Yes, using STAS (and SATC), but not enforced - user information is logged if available but not required by any rules. I had considered SATC as an issue as it integrates into the system a bit deeply, but this is happening on PCs as well as Citrix.

    I do notice from your packet that you seem to be using VLAN on LAG. Are you using LACP too?

    I might try disabling a port so the LAG only uses one port. It's not the same as not using LAG at all but is easy enough to try - disabling the LAG would require an uncomfortable amount of reconfiguration on XG.

    One thing I finally managed to capture is another case that I've seen a bit, where users get "Page could not be displayed". A tcpdump on the PC shows SYN packets being sent but nothing is logged on the router - it's like the packets never reach it.

    If that's happening mid connection too then that might explain the mid-stream packet drops we've both seen.

  • In reply to jamesharper:

    I have been running a tcpdump on one of the servers on this network, filtering by SYN packets, and looking for SYN retransmissions.

    There are occasional retransmissions here and there, which is expected, but there are definite periods with an excessive number of retransmissions, both to internal hosts that I know are up and to external hosts. This matches my experience in the past where suddenly pages stop loading, even though already established connections seem to be okay (and my connection via Citrix appears unaffected).
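
    For anyone wanting to repeat this check, here is a minimal sketch of the counting step. It assumes the SYN-only packets have already been exported from a capture as (timestamp, source, source port, destination, destination port, sequence number) tuples, for example from a capture taken with a filter such as `tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn`. A SYN is treated as a retransmission when the same 4-tuple and initial sequence number have been seen before. The sample records are hypothetical and only illustrate the input shape.

```python
from collections import Counter

def syn_retransmissions_per_minute(syns):
    """Count retransmitted SYNs per minute.

    syns: iterable of (ts_seconds, src, sport, dst, dport, seq) for SYN-only packets.
    Returns a Counter keyed by minute index (ts // 60).
    """
    seen = set()
    per_minute = Counter()
    for ts, src, sport, dst, dport, seq in syns:
        key = (src, sport, dst, dport, seq)
        if key in seen:
            per_minute[int(ts // 60)] += 1  # same connection attempt seen again
        else:
            seen.add(key)
    return per_minute

# Hypothetical records, not taken from a real capture.
syns = [
    (0.0, "10.10.10.5", 50000, "10.10.90.2", 80, 1000),
    (1.0, "10.10.10.5", 50000, "10.10.90.2", 80, 1000),    # retransmission
    (3.0, "10.10.10.5", 50000, "10.10.90.2", 80, 1000),    # retransmission
    (65.0, "10.10.10.5", 50001, "10.10.90.2", 80, 2000),   # new attempt, answered
    (130.0, "10.10.10.5", 50002, "10.10.90.2", 80, 3000),
    (131.0, "10.10.10.5", 50002, "10.10.90.2", 80, 3000),  # retransmission
]

counts = syn_retransmissions_per_minute(syns)
for minute in sorted(counts):
    print(f"minute {minute}: {counts[minute]} retransmitted SYNs")
```

    Bucketing by minute makes the bad periods stand out against the normal background level, and the minute indexes can then be lined up against the times users reported pages failing to load.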

    What could be causing this? I don't have any IPS or DoS turned on on these networks, and there is nothing in any logs at this time.

    The Sophos XG never logs any evidence of these packets, so I don't know if they are reaching the XG or not. My hope would be to run a tcpdump side by side on the server and the XG and compare results, but tcpdump is crippled on XG so this isn't possible.

    I have enough now to open a case though, so I guess I'll do that next.

    James

  • In reply to jamesharper:

    Hi,

    Could you share the case details so we can check on our end?

  • In reply to jamesharper:

    I have the exact same problem, and I've already been told to look for an asymmetric routing problem, but the XG is the only way between my two networks as well. I've had 3 other sets of eyes on this, including 2 CCIEs, and they all point back at the XG.

    "Strange Drops" is a good way to put it.

  • In reply to Matthew Weiskopf:

    Matthew, can you tell me a little about your setup? The other user who reported similar issues is using VLANs on top of a LAG interface. Are you doing this also? Are you using LACP? I have since tried disabling one of the LAG ports so it's now only on one port, but this isn't having any effect. If you are using LAG also then I might try removing the LAG setup to test if that is somehow affecting the problem.

    I have a case open with Sophos for this. I got a great response from support initially, but having tried all the easy things (check asymmetric routing, turn off micro app discovery, etc) I haven't heard anything in 2 days.

    It's an HA setup, so one of the devices will be removed shortly and will have SG put back on it if I don't get anywhere with this. I'm really reluctant to do this, as it means the problem will likely not get solved because I will have no way to test it.

    James