
Unable to ping certain IP address intermittently

We have had reports that a certain IP address is reachable most of the time, but occasionally it becomes unavailable and the user cannot ping it. We have an XG 135 running SFOS 18.5.1 MR-1-Build326.

I don't see any blocks in the firewall logs for the timeframe in which the problem last occurred. I would like to rule out the XG 135 and would appreciate any suggestions for debugging. This is difficult because the issue is intermittent.



This thread was automatically locked due to age.
  • Is this traffic logged when it works? Do you have a custom rule at the bottom that blocks and logs everything? That rule is not there by default, so you may not see every firewall block.

    You could analyze whether the issues you're reporting overlap with IPS pattern updates. Those are known to cause network disconnects on the small appliances while the Snort services restart. I'd say this is most likely your issue.

    read here:

    https://community.sophos.com/sophos-xg-firewall/f/discussions/128637/dropped-connections-during-pattern-updates

  • Thanks for your reply.

    Is this traffic logged when it works? Do you have a custom rule at the bottom that blocks and logs everything? That rule is not there by default, so you may not see every firewall block.

    Yes, it is logged when it works, but we did not have a rule blocking and logging everything. I have now added the following:

    I did manage to reproduce the problem myself just now after adding the rule. I tried to SSH into the affected IP address and it timed out. I checked the logs but there was nothing blocked. Then I tried again and I could SSH successfully. It is as if the first SSH attempt wakes something up.

    You could analyze whether the issues you're reporting overlap with IPS pattern updates. Those are known to cause network disconnects on the small appliances while the Snort services restart. I'd say this is most likely your issue.

    This sounds familiar. We had this issue before (almost a year ago) and were advised to disable firewall acceleration (https://community.sophos.com/sophos-xg-firewall/f/discussions/127621/intermittent-vpn-issues). We had no further reports of the issue until now, but cannot be sure whether that change ever actually resolved it; I suspect not. We have also had a firmware update since then, and I understood that a fix was going to be included in it. In any case, I can confirm that firewall acceleration is still disabled at our end.

    I will look into the analysis again but this will be quite difficult as I can't control when it happens...

  • Further to this, I was able to reproduce it again and performed a packet capture. The corresponding firewall rule appears to be the built-in "drop all" rule, which is not logged.

    I see the items below repeated over and over in the normal state when accessing over SSH. I note the different port number between the failure (11757) and success (1508) logs.

    I ruled out the "automatic pattern update" issue, as I had reduced the update frequency from every 2 hours to daily before the above. I have now restored it to every 2 hours.

  • That's interesting that your first connection seems to go into violation and the second one works. Some kind of DPI issue - is that traffic going into TLS inspection? Have you already checked the other logs? Sometimes I focus on the firewall log and miss events that are shown in the IPS or TLS log sections.

  • Thanks, good point about the other logs - I also tend to focus on the firewall log. However, I have been through them all, including IPS and TLS, and don't see any corresponding entries.

  • Do you use Heartbeat or Intercept X on the endpoints? A firewall violation can also be caused by a missing heartbeat and a bad health status on a device. Our endpoints get blocked because of a missing heartbeat multiple times a week after the endpoints update some components.

    To rule out the firewall completely, you will need to create a firewall rule at the top for a single host that is known to have this issue and allow this traffic without any security features enabled.

  • No, we don't use any endpoint features. 

    We don't currently have any of the listed security features enabled for our existing VPN rules. Is this what you meant?

  • I haven't changed any firewall rules, but I think I have proved that it is definitely something in the XG that is blocking the traffic. This morning I reproduced the issue and was unable to SSH into the affected server for a few minutes; during that time I could successfully SSH from a machine on the internal network (i.e. bypassing the XG).

  • Yes, I meant, what you shared with the screenshot above.

    I understand from your posts, this is VPN access over a WAN connection which is sometimes going into timeout, true?

    Can you rule out any connection issues or high latency?

    We have sites connected by site-to-site VPN with poor WAN connections, and while the tunnel is up and fine, we're seeing severe timeouts to these sites throughout the day.

    Maybe some of the SAs are temporarily down.

    Please describe your environment and VPN topology.

  • Yes, this is a VPN connection over WAN and any attempt to communicate with a specific server during a short window of failure results in a timeout. I am testing with SSH but the affected user is seeing the issue with a proprietary application and ping. Yesterday the failure period seemed to be happening roughly every hour.

    I think connection/latency issues can be ruled out because I can access other servers during the temporary block of the server that I'm debugging. As I said, I can also access the affected server from the internal network (via remote desktop) so it is VPN specific.

    We use both SSL and IPsec VPN and have equivalent firewall rules set up for each, configured to accept from VPN to LAN with the additional settings in my screenshots above. We already had a rule to drop VPN to WAN, and since yesterday we have had one to drop VPN to any zone. Neither is being triggered or logged. I have reproduced the problem over IPsec while the affected user is using SSL.

    Here is a diagram of the topology.

  • Thanks, I can understand that better now.

    we need to catch this:

    your screenshot

    my screenshot:

    This traffic should hit your rule 11; instead, it only hit your default rule 0, so it did not match all the requirements of rule 11. I suggest you do not limit this rule to VPN only: make it a drop-and-log rule with all zones selected (not Any!).

    Did you already try to catch it on the CLI with

    drppkt host 10.10.10.2

    for a single host, or

    drppkt net 10.10.10

    for the whole range (this should be your SSL or IPsec VPN NAT range)?

    community.sophos.com/.../sophos-xg-cli-troubleshooting-tools
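    To make the suggestion concrete, a minimal sketch of how those captures could look from the appliance's console (the menu path and the 10.10.10.x addresses are assumptions; substitute your own VPN addresses):

    ```shell
    # Sketch only: these commands run in the SFOS admin (device) console,
    # not a regular Linux shell. Addresses below are placeholders.
    drppkt host 10.10.10.2   # watch dropped packets to/from a single host
    drppkt net 10.10.10      # watch dropped packets for the whole VPN range
    ```

    Leave the capture running and reproduce the failure; any packet dropped by the firewall should appear in the output together with the log fields shown further down in this thread.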

  • Thanks for the suggestion about drppkt as I had not come across that tool before. I ran it during the issue and saw several of these messages:

    2022-03-23 12:08:52 0101021 IP *.*.*.*.8464 > *.*.*.*.22 : proto TCP: S 3092190331:3092190331(0) win 65280 checksum : 28126
    0x0000: 4500 0034 de51 4000 7f06 08ce 0a02 0002 E..4.Q@.........
    0x0010: 0a00 00a1 2110 0016 b84f 147b 0000 0000 ....!....O.{....
    0x0020: 8002 ff00 6dde 0000 0204 0550 0103 0308 ....m......P....
    0x0030: 0101 0402 ....
    Date=2022-03-23 Time=12:08:52 log_id=0101021 log_type=Firewall log_component=Firewall_Rule log_subtype=Denied log_status=N/A log_priority=Alert duration=N/A in_dev=ipsec0 out_dev=br0 inzone_id=5 outzone_id=0 source_mac=**:**:**:**:**:** dest_mac=**:**:**:**:**:** bridge_name= l3_protocol=IPv4 source_ip=*.*.*.* dest_ip=*.*.*.* l4_protocol=TCP source_port=8464 dest_port=22 fw_rule_id=0 policytype=0 live_userid=690 userid=6 user_gp=1 ips_id=0 sslvpn_id=0 web_filter_id=0 hotspot_id=0 hotspotuser_id=0 hb_src=0 hb_dst=0 dnat_done=0 icap_id=0 app_filter_id=0 app_category_id=0 app_id=0 category_id=0 bandwidth_id=0 up_classid=0 dn_classid=0 nat_id=0 cluster_node=0 inmark=0x200 nfqueue=0 gateway_offset=0 connid=115465088 masterid=0 status=256 state=1, flag0=618477387776 flags1=34359738368 pbdid_dir0=0 pbrid_dir1=0

    I then went on to add the Drop All rule that you suggested and was then able to detect the blockage:

    I see that the source port was 8464 in the drppkt log but 33996 in the firewall log.

    I've had to disable this new drop-all rule for now, as it resulted in everything being blocked!
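    In case it helps anyone compare the two entries, the key=value portion of these logs can be parsed with a short script (a minimal sketch, not Sophos tooling; the sample line is abbreviated from the log above):

    ```python
    # Minimal sketch: parse the key=value line emitted by drppkt / the firewall
    # log so fields like fw_rule_id and the zone ids can be compared directly.
    def parse_fw_log(line: str) -> dict:
        fields = {}
        for token in line.split():
            if "=" in token:
                key, _, value = token.partition("=")
                fields[key] = value.rstrip(",")  # some values carry a trailing comma, e.g. "state=1,"
        return fields

    sample = ("log_subtype=Denied in_dev=ipsec0 out_dev=br0 inzone_id=5 "
              "outzone_id=0 source_port=8464 dest_port=22 fw_rule_id=0 state=1,")
    fields = parse_fw_log(sample)
    # fw_rule_id=0 is the built-in default drop rule, which is why nothing
    # shows up against the custom rules in Log Viewer.
    print(fields["fw_rule_id"], fields["inzone_id"], fields["outzone_id"])
    # → 0 5 0
    ```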

  • The logs from drppkt and from the Log Viewer are not the same. The different source port indicates a new SSH connection from the remote client.

    There is something else wrong in your setup if you create a custom block rule at the very bottom of your ruleset (but above the default drop rule) and this results in all traffic in your network being blocked. That doesn't make sense to me.

  • This was my mistake, I misunderstood what I was doing and the rule wasn't quite at the bottom. I waited to try again at a less busy time and now have the rule at the bottom.

    Now that I've done this, I'm still not seeing my timed-out SSH connection appear in the firewall log; it only appears in drppkt, as below:

    2022-03-24 06:24:04 0101021 IP *.*.*.*.1698 > *.*.*.*.22 : proto TCP: S 2032111559:2032111559(0) win 65280 checksum : 19504
    0x0000: 4500 0034 2bf7 4000 7f06 bb28 0a02 0002 E..4+.@....(....
    0x0010: 0a00 00a1 06a2 0016 791f 8fc7 0000 0000 ........y.......
    0x0020: 8002 ff00 4c30 0000 0204 0550 0103 0308 ....L0.....P....
    0x0030: 0101 0402 ....
    Date=2022-03-24 Time=06:24:04 log_id=0101021 log_type=Firewall log_component=Firewall_Rule log_subtype=Denied log_status=N/A log_priority=Alert duration=N/A in_dev=ipsec0 out_dev=br0 inzone_id=5 outzone_id=0 source_mac=**:**:**:**:**:** dest_mac=**:**:**:**:**:** bridge_name= l3_protocol=IPv4 source_ip=*.*.*.* dest_ip=*.*.*.* l4_protocol=TCP source_port=1698 dest_port=22 fw_rule_id=0 policytype=0 live_userid=688 userid=6 user_gp=1 ips_id=0 sslvpn_id=0 web_filter_id=0 hotspot_id=0 hotspotuser_id=0 hb_src=0 hb_dst=0 dnat_done=0 icap_id=0 app_filter_id=0 app_category_id=0 app_id=0 category_id=0 bandwidth_id=0 up_classid=0 dn_classid=0 nat_id=0 cluster_node=0 inmark=0x200 nfqueue=0 gateway_offset=0 connid=1985830912 masterid=0 status=256 state=1, flag0=618477387776 flags1=34359738368 pbdid_dir0=0 pbrid_dir1=0

    The inzone_id=5 corresponds to the VPN zone, but I'm not sure what outzone_id=0 means, as it is undefined when I run psql -U nobody -d corporate -c "select * from tblnetworkzone"

  • Looks like your first intuition is correct: at some point the XG somehow needs to re-initialize the connection via a new connection. Possibly this is related to the bridge setup.

    Can you check, for the subsequent allowed packets in your scenario, what the IN and OUT interfaces and zones are? Are they correct, and do they match the zones in the drppkt log?

    I assume that the first packets have different zones than the following packets in that TCP communication, and that's why the rules do not match. Probably something for tech support.

  • The IN and OUT interfaces match between drppkt and the allowed entry in the firewall log. The source VPN zone matches, but the destination is this unidentified zone 0 in drppkt and LAN in the allowed log.

    Thanks for your help with this. I think you are right, I will open a ticket.

  • Just to close this thread, I did open a ticket and they confirmed that the behaviour was unexpected. They advised updating to SFOS 18.5.3 MR-3-Build408 which I did and since then I have not seen this issue again.

  • Thank you for sharing that information here. I hope it helps others.

    We had a different bridge issue with MR1 and that got fixed in MR2.

    Are the zones and interfaces now displayed correctly in the details of the firewall log?

  • Certainly for the valid cases the firewall log is correct; I only saw the discrepancy before, when I was unable to reach some of the servers. I have not reproduced the issue since updating.