I am using XG v18 and getting intermittent connection drops - how do I find the root cause?

I have a simple home setup using v18 (WAN to LAN with IPS and web filtering). It seems to work fine, except that a few times a day I lose my internet connection for 20-30 seconds (about 10 consecutive ping drops if I leave a ping running). I know the internet on the WAN side is not dropping, as I have a device on the WAN which does not drop pings or lose its connection.

I need some help troubleshooting. I suspect that IPS or some other blocking feature may be kicking in and blocking all of my outgoing traffic for a few seconds. Which logs should I look at? Is there a troubleshooting guide for this type of drop?

Thank you,

- Sam

  • It's quite a coincidence that you post this now as I have been planning to post exactly the same issue.

    This has been an issue for us since the pre-release version of v18, in home edition, Hyper-V VM and our XG 230. At first I let it go as early teething issues with new software and hoped a fix would come as the product matured a little but the issue is still there.

    I too would like suggestions on the best place to start looking at this problem.

    It would also be interesting to hear if anybody else is having the same problem.

  • Hello Samy,

    Thank you for contacting the Sophos Community.

    Please connect to the XG following this KB (https://community.sophos.com/kb/en-us/133678)

    Once in there press number 4 to land in the console and run the following command:

    console > drop-packet-capture 'host X.X.X.X and host 8.8.8.8' (Modify the X.X.X.X to be the Private IP of the computer where you are running the Ping)

    If the XG is dropping the traffic you will see something there.

    You can also check fwlog.log at the time the issue happens.

    In a new PuTTY session/window, go to 5 > 3, then type cd /log and press Enter.

    Then type # less fwlog.log (Shift + G takes you to the last line) and look at the entries around the time the issue happened.

    In addition, I would recommend leaving conntrack running while the issue is happening:

    #conntrack -E -s X.X.X.X 

    Check for unreplied packets.

    Finally, check the IPS.log for anything the XG might be dropping at that time, and while the issue is happening confirm whether the XG itself can ping 8.8.8.8.
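
    To know exactly which minute of the logs to read, it can help to timestamp the drops on the client side. The following is a minimal Python sketch, not an official Sophos tool: ping_once and find_outages are names made up for this example, and it assumes the system ping command accepts -c (or -n on Windows).

```python
import platform
import subprocess


def ping_once(host, timeout_s=2.0):
    """Send one echo request with the system ping; True if it exited with 0."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # give up on this probe if no reply arrives in time
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def find_outages(samples, min_losses=3):
    """Group consecutive failed probes into (first_lost, last_lost, count) windows.

    `samples` is an iterable of (timestamp, ok) pairs in time order.
    Runs shorter than `min_losses` are ignored as ordinary packet loss.
    """
    outages = []
    run = []
    for timestamp, ok in samples:
        if not ok:
            run.append(timestamp)
            continue
        if len(run) >= min_losses:
            outages.append((run[0], run[-1], len(run)))
        run = []
    if len(run) >= min_losses:  # outage still in progress at the end of the data
        outages.append((run[0], run[-1], len(run)))
    return outages
```

    Calling ping_once("8.8.8.8") once a second, collecting (time, result) pairs and passing them to find_outages gives the start and end of each drop, which are the times to look for in fwlog.log and the IPS log.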

    Regards,

  • In reply to emmosophos:

    One of the areas where this is particularly noticeable for us is DNS resolution failures. We use the Cloudflare DNS servers, 1.1.1.1 and 1.0.0.1.

    The drop-packet-capture started showing results fairly quickly; I had 45 drops in 30 minutes:

    drop-packet-capture 'dst host 1.1.1.1'

    2020-06-26 00:27:44 0110021 IP 192.168.1.101.53621 > 1.1.1.1.53 : proto UDP: packet len: 54 checksum : 25859
    0x0000:  4500 004a da27 0000 7e11 a16b c0a8 fe65  E..J.'..~..k...e
    0x0010:  0101 0101 d175 0035 0036 6503 01eb 0100  .....u.5.6e.....
    0x0020:  0001 0000 0000 0001 0377 7777 0667 6f6f  .........www.goo
    0x0030:  676c 6503 636f 6d02 6567 0000 0100 0100  gle.com.eg......
    0x0040:  0029 0fa0 0000 0000 0000                 .)........
    Date=2020-06-26 Time=00:27:44 log_id=0110021 log_type=Firewall log_component=Identity log_subtype=Denied log_status=N/A log_priority=Alert duration=N/A in_dev=Port1 out_dev=Port2 inzone_id=1 outzone_id=2 source_mac=00:88:8b:85:22:f7 dest_mac=7c:5a:55:4d:22:40 bridge_name= l3_protocol=IPv4 source_ip=192.168.1.101 dest_ip=1.1.1.1 l4_protocol=UDP source_port=53621 dest_port=53 fw_rule_id=25 policytype=1 live_userid=0 userid=65535 user_gp=0 ips_id=0 sslvpn_id=0 web_filter_id=16 hotspot_id=0 hotspotuser_id=0 hb_src=0 hb_dst=0 dnat_done=0 icap_id=0 app_filter_id=0 app_category_id=0 app_id=0 category_id=0 bandwidth_id=0 up_classid=0 dn_classid=0 nat_id=0 cluster_node=0 inmark=0x0 nfqueue=0 gateway_offset=0 connid=2612661056 masterid=0 status=256 state=0, flag0=36031545800130560 flags1=8796629893120 pbdid_dir0=0 pbrid_dir1=0
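
    The last line of that capture is a flat list of key=value pairs, and the fields that appear to say why the packet was dropped are easy to miss. A rough Python sketch for pulling them out follows; parse_fields and drop_summary are hypothetical helper names, and SAMPLE is an abridged copy of the line above.

```python
def parse_fields(line):
    """Split a 'key=value key=value ...' log line into a dict of strings."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that contain no '='
            fields[key] = value.rstrip(",")  # some values carry a trailing comma
    return fields


def drop_summary(fields):
    """One-line summary of which component denied which flow."""
    return "{comp}/{sub}: {src} -> {dst}:{port} (rule {rule})".format(
        comp=fields.get("log_component", "?"),
        sub=fields.get("log_subtype", "?"),
        src=fields.get("source_ip", "?"),
        dst=fields.get("dest_ip", "?"),
        port=fields.get("dest_port", "?"),
        rule=fields.get("fw_rule_id", "?"),
    )


# Abridged from the drop-packet-capture output in this post.
SAMPLE = (
    "Date=2020-06-26 Time=00:27:44 log_id=0110021 log_type=Firewall "
    "log_component=Identity log_subtype=Denied in_dev=Port1 out_dev=Port2 "
    "bridge_name= source_ip=192.168.1.101 dest_ip=1.1.1.1 l4_protocol=UDP "
    "dest_port=53 fw_rule_id=25 state=0, flag0=36031545800130560"
)

print(drop_summary(parse_fields(SAMPLE)))
# prints: Identity/Denied: 192.168.1.101 -> 1.1.1.1:53 (rule 25)
```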

    I don't know if there is something wrong with my fwlog.log, but it seems to contain next to nothing. Why is this?

    XG230_WP02_SFOS 18.0.1 MR-1-Build396# tail /log/fwlog.log
    NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
    NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
    NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
    NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
    NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
    NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
    NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
    NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
    NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
    NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
    XG230_WP02_SFOS 18.0.1 MR-1-Build396# 

    I did check the GUI version of the firewall log and it showed nothing blocked for a destination of 1.1.1.1.

    I ran 'conntrack -d 1.1.1.1' but there were so many entries I couldn't find anything useful on my first attempt. I may have another go at this tomorrow, or set up a more specific test than DNS lookups which will produce fewer results.
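
    One way to cut the conntrack output down is to keep only the flows that never got a reply, which conntrack marks with [UNREPLIED]. A small Python sketch, assuming the output has been saved to a text file or pasted into a list; unreplied_flows is a hypothetical helper name, and the sample lines are made up for illustration in the usual conntrack layout.

```python
def unreplied_flows(lines, dst=None):
    """Return conntrack lines marked [UNREPLIED], optionally for one destination."""
    hits = []
    for line in lines:
        if "[UNREPLIED]" not in line:
            continue
        tokens = line.split()
        # The first dst= token belongs to the original (client to server) direction.
        orig_dst = next((t[4:] for t in tokens if t.startswith("dst=")), None)
        if dst is None or orig_dst == dst:
            hits.append(line)
    return hits


# Illustrative lines only, not captured from a real XG.
SAMPLE = [
    "udp 17 28 src=192.168.1.101 dst=1.1.1.1 sport=53621 dport=53 [UNREPLIED] "
    "src=1.1.1.1 dst=203.0.113.7 sport=53 dport=53621 mark=0 use=1",
    "udp 17 170 src=192.168.1.101 dst=1.1.1.1 sport=40000 dport=53 "
    "src=1.1.1.1 dst=203.0.113.7 sport=53 dport=40000 [ASSURED] mark=0 use=1",
    "udp 17 25 src=192.168.1.101 dst=1.1.1.10 sport=40001 dport=53 [UNREPLIED] "
    "src=1.1.1.10 dst=203.0.113.7 sport=53 dport=40001 mark=0 use=1",
]

for flow in unreplied_flows(SAMPLE, dst="1.1.1.1"):
    print(flow)
```

    Comparing whole dst= tokens rather than searching for a substring keeps 1.1.1.10 from being mistaken for 1.1.1.1.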

    I did look at ips.log but struggled again with the number of entries and couldn't get grep to work. I did look at the GUI version and that didn't show anything for a destination of 1.1.1.1.

    Is there any way to download the log files? Finding info in the console can be a bit of a pain if it's not something you are used to doing.

  • In reply to JasP:

    Hi,

    For your home XG, please read the following thread, especially the last entry.

    https://community.sophos.com/products/xg-firewall/f/hardware/121434/interface-issues#pi2151=2

    Ian

  • In reply to rfcat_vk:

    Do you use STAS? 

    Seems like your XG is dropping because of Identity probing. 

    Try adjusting the identity probe settings:

    Test setting 'Restrict client traffic during identity probe' to "No".

    Or test a smaller timeout value (maybe 10).

  • In reply to LuCar Toni:

    Yes, we use STAS.

    I've set 'Restrict client traffic during identity probe' to 'No'; I haven't changed the timeout. For our environment, identifying the user is not critical. I'm far more interested in stopping these drops.

    For my own learning, how did you identify this as a potential issue from the information I supplied?

    Also, is there any way to download the logs from an XG rather than just view them in the console?

  • In reply to JasP:

    In STAS / XG, there is something called quarantine for unauthenticated users. If a client communicates and a user-based rule exists, the XG checks whether the client's IP is authenticated. If it is not, the XG puts the IP into a learning phase and waits to get the live user from STAS. This quarantine lasts for 1-120 seconds, and you can configure whether traffic is dropped during it or the XG only learns (Yes / No).

    As the XG has a cleanup mechanism, a client sometimes gets kicked out and STAS is not able to recover the IP quickly. The XG then starts dropping this client's traffic for 120 seconds, until the client re-authenticates via STAS.
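
    A toy Python model of that behaviour may make it easier to follow. It is only a sketch of the description in this post, not Sophos code: the class and verdict names are invented, and what happens once the probe times out (traffic treated as unauthenticated) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class IdentityProbe:
    timeout_s: float = 120.0       # probe window; the post gives a 1-120 second range
    restrict_traffic: bool = True  # stands in for 'Restrict client traffic during identity probe'
    live_users: set = field(default_factory=set)       # IPs STAS has reported as logged in
    probe_started: dict = field(default_factory=dict)  # IP -> time the probe began

    def verdict(self, ip, now):
        """Classify one packet from `ip` seen at time `now` (in seconds)."""
        if ip in self.live_users:
            self.probe_started.pop(ip, None)  # identity known, probe is over
            return "user"
        started = self.probe_started.setdefault(ip, now)
        if now - started < self.timeout_s:
            # Learning phase: wait for STAS, dropping only if restriction is on.
            return "drop" if self.restrict_traffic else "pass"
        # Assumption: after the timeout the client is handled without a user identity.
        return "unauthenticated"
```

    In this model a client that falls out of live_users gets "drop" for up to timeout_s seconds when restrict_traffic is True, and "pass" when it is False, which corresponds to the Yes / No choice described above.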

     

    See: https://community.sophos.com/kb/en-us/123156#2-Drop%20timeout%20in%20Learning%20Mode // https://community.sophos.com/kb/en-us/125217

  • In reply to LuCar Toni:

    LuCar Toni wrote:
    "It means, if a client communicate and a User based Rule exists, XG checks, if this IP is authenticated. If its not, it put this IP into a learning phase and waits to get the Live User online from STAS."

    The good news is that changing 'Restrict client traffic during identity probe' to 'No' seems to have fixed our issues. I haven't tried altering the timeout to a shorter period.

    What I don't understand is that the example I used (DNS lookups) isn't covered by a 'User based Rule', so why is STAS involved at all?

  • In reply to JasP:

    The XG applies a learning phase of some sort to all traffic, regardless of whether an authentication (user-based) rule or a network rule would match.

    While a client (IP) is in this learning phase, the XG cannot know whether it is a user or a plain IP, so it cannot decide whether to apply a network rule or a user-based rule.

  • In reply to emmosophos:

    Hello emmosophos,

    Thank you for the reply. I still keep getting the drops; however, they are difficult to catch with the 'drop-packet-capture' console command because the SSH session times out quite quickly. I will still try to catch a drop by keeping an eye on it. (It happens a few times a day, and by then SSH has timed out, so I don't have any output so far.)

    STAS was enabled but, as far as I understand, not used in my config (I do not authenticate users). I am going to disable STAS and see if it still happens.

    Is there any setting to stop SSH from timing out for troubleshooting purposes?

    - Sam 

  • In reply to Samy Wee:

    The XG uses an SSH idle timeout.

    To prevent this, use an SSH client that can send keep-alives.
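
    With OpenSSH, for instance, the ServerAliveInterval and ServerAliveCountMax client options send periodic keep-alive messages. Below is a small Python sketch that builds such a command line. The helper name, the admin user and the 192.0.2.1 address are placeholders, and whether keep-alives alone are enough to hold an XG session open is an assumption to verify.

```python
import shlex


def ssh_keepalive_command(host, user="admin", interval_s=30, max_missed=4):
    """Build an OpenSSH argv that sends a keep-alive every `interval_s` seconds."""
    return [
        "ssh",
        "-o", "ServerAliveInterval={}".format(interval_s),
        "-o", "ServerAliveCountMax={}".format(max_missed),
        "{}@{}".format(user, host),
    ]


argv = ssh_keepalive_command("192.0.2.1")  # placeholder address
print(shlex.join(argv))  # copy into a terminal, or pass argv to subprocess.run
```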

  • In reply to LuCar Toni:

    Toni, thank you for the tip. PuTTY can send keep-alives; the setting is under Connection. Now I will try to capture some drops. Thank you.