Unable to ping certain IP address intermittently

We have had reports that a certain IP address is reachable most of the time but occasionally becomes unavailable, and the user cannot ping it while this is happening. We have an XG 135 running SFOS 18.5.1 MR-1-Build326.

I don't see any blockages in the firewall logs for the timeframe in which the problem last happened. I would like to try and rule out the XG 135 and would appreciate any suggestions for debugging. It is difficult as it is intermittent.



  • Thanks for your reply.

    Is this traffic logged when it works? Do you have a custom rule at the bottom that blocks and logs everything? No such rule exists by default, so you may not see every firewall block.

    Yes, it is logged when it works but we did not have a rule blocking and logging everything. I have now added the following:

    I did manage to reproduce the problem myself just now after adding the rule. I tried to SSH into the affected IP address and it timed out. I checked the logs but there was nothing blocked. Then I tried again and I could SSH successfully. It is as if the first SSH attempt wakes something up.

    You could analyze whether the issues you're reporting overlap with IPS updates. Those are known to cause network disconnects on the small appliances as the Snort services restart. I'd say this is most likely your issue.

    This sounds familiar. We have had this issue before (almost 1 year ago) and were advised to disable firewall acceleration (https://community.sophos.com/sophos-xg-firewall/f/discussions/127621/intermittent-vpn-issues). We have not had any further reports of the issue until now, but we cannot be sure whether it was ever actually resolved by this change; I suspect not. We have also had a firmware update since then, and I understood that a fix was going to be in that update. However, I can confirm that firewall acceleration is still disabled at our end.

    I will look into the analysis again but this will be quite difficult as I can't control when it happens...
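
One way to get timestamps for an intermittent failure like this without waiting for user reports is to leave a small probe running from a VPN-connected client. The sketch below is only an illustration: it assumes the affected server accepts TCP connections on port 22 (SSH, as used in the tests above), and the names `probe` and `monitor` are made up for this example.

```python
import socket
import time
from datetime import datetime


def probe(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals and unreachable networks
        return False


def monitor(host, port=22, interval=30.0, rounds=None,
            probe_fn=probe, log=print, sleep=time.sleep):
    """Probe repeatedly and log a timestamped line on every state change.

    rounds=None means run until interrupted. Returns the list of
    (timestamp, state) transitions that were logged.
    """
    transitions = []
    last = None
    done = 0
    while rounds is None or done < rounds:
        ok = probe_fn(host, port)
        if ok != last:
            stamp = datetime.now().isoformat(timespec="seconds")
            state = "reachable" if ok else "UNREACHABLE"
            transitions.append((stamp, state))
            log(f"{stamp} {host}:{port} {state}")
            last = ok
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)
    return transitions
```

Calling `monitor("10.10.10.2")` (a placeholder address) prints one line each time the server flips between reachable and unreachable, which gives exact times to compare against the firewall, IPS and update logs. One caveat: if the first connection attempt really does "wake something up", frequent probing may mask the problem, so a longer `interval` may be needed to still see it.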

  • Further to this, I was able to reproduce it again and performed a packet capture. The corresponding firewall rule appears to be the built-in "drop all" rule, which is not logged.

    In the normal state, I see the items below repeated over and over when accessing over SSH. I note the different port numbers in the failure (11757) and success (1508) logs.

    I ruled out the "automatic pattern update" issue because I had changed the update interval from every 2 hours to daily before the tests above. I have since restored it to every 2 hours.
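
If failure times and pattern-update times are both available, the overlap check suggested earlier can also be done mechanically instead of by eye. This is a minimal sketch under assumptions: the function name `failures_near_updates` is hypothetical, the timestamps are made up, and both lists would have to be filled in by hand from whatever logs record them.

```python
from datetime import datetime, timedelta


def failures_near_updates(failures, updates, window_minutes=5):
    """Return (failure, nearest_update) pairs that lie within the window."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for failure in failures:
        # Closest update in time, or None when no updates were recorded.
        nearest = min(updates, key=lambda u: abs(failure - u), default=None)
        if nearest is not None and abs(failure - nearest) <= window:
            hits.append((failure, nearest))
    return hits


# Made-up example data, not taken from this thread.
failures = [datetime(2021, 1, 1, 9, 2), datetime(2021, 1, 1, 10, 30)]
updates = [datetime(2021, 1, 1, 9, 0), datetime(2021, 1, 1, 11, 0)]

for failure, update in failures_near_updates(failures, updates):
    print(f"failure at {failure} is close to update at {update}")
```

If most failures have no update nearby, that supports ruling out the update restarts; if nearly all of them do, the update schedule is worth a closer look.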

  • That's interesting: it seems your first connection goes into violation and the second works. Some kind of DPI thing - is that traffic going into TLS inspection? Have you already checked the other logs? Sometimes I focus on the firewall log and miss the events that are shown in the IPS or TLS log sections.

  • Thanks, good point about the other logs - I also tend to focus on the firewall. However, I have been through them all including IPS and TLS and don't see any corresponding entries.

  • Do you use heartbeat, Intercept X on the endpoints? A Firewall Violation can also be caused by a missing heartbeat and a bad health status on a device. Our endpoints get blocked multiple times a week because of a missing heartbeat after the endpoints have updated some components.

    To rule out firewall completely, you will need to create a firewall rule on top for a single host that is known to have this issue and allow this traffic without any Security features enabled.

  • No, we don't use any endpoint features. 

    We don't currently have any of the listed security features enabled for our existing VPN rules. Is this what you meant?

  • I haven't changed any firewall rules, but I think I have proved that it is definitely something in the UTM that is blocking the traffic. This morning I reproduced the issue and was unable to SSH into the affected server for a few minutes. During that time I could successfully SSH from a machine on the internal network (i.e. bypassing the UTM).

  • Yes, I meant, what you shared with the screenshot above.

    From your posts, I understand this is VPN access over a WAN connection that sometimes times out, correct?

    Can you rule out any connection issues or high latency?

    We have sites connected by site-2-site VPN that have poor WAN connections and while the tunnel is up and fine, we're having severe timeouts over the whole day to these sites.

    Maybe some of the SAs are temporarily down.

    Please describe your environment and VPN topology.

  • Yes, this is a VPN connection over WAN and any attempt to communicate with a specific server during a short window of failure results in a timeout. I am testing with SSH but the affected user is seeing the issue with a proprietary application and ping. Yesterday the failure period seemed to be happening roughly every hour.

    I think connection/latency issues can be ruled out because I can access other servers during the temporary block of the server that I'm debugging. As I said, I can also access the affected server from the internal network (via remote desktop) so it is VPN specific.

    We use both SSL and IPsec VPN and have equivalent firewall rules set up for each, set to accept traffic from VPN to LAN with the additional settings in my screenshots above. We already had a rule to drop VPN to WAN and since yesterday have had one to drop VPN to any zone. Neither is being triggered or logged. I have reproduced the problem with IPsec whilst the affected user is using SSL.

    Here is a diagram of the topology.

  • thanks, I can understand that better now.

    we need to catch this:

    your screenshot

    my screenshot:

    This traffic should hit your rule 11. Instead it only hit your default rule 0, so it did not match all the requirements of rule 11. I suggest you do not limit this rule to VPN only: make it a drop-and-log rule with all zones selected individually (not "Any").

    Did you already try to catch it on the CLI with drppkt?

    drppkt host 10.10.10.2
    (single host)

    drppkt net 10.10.10
    (this should be your SSL or IPsec VPN NAT range)

    community.sophos.com/.../sophos-xg-cli-troubleshooting-tools