17.0.8 MR8 is out... Hundreds of "VPN Down" logs in just one evening... Imagine how many I've received since 17.0.8 was installed.

For those who asked, our VPN setup is the result of the intervention of two senior engineers at Sophos, Boston.  I normally need no help to set up an IKEv2 VPN; I was doing it without breaking a sweat on any firewall more than a decade ago.

  • That looks really rough.

    So, fair question for Sophos: who else out there uses strongSwan as the VPN engine in their products?  It doesn't seem like the choice of this engine is paying off.

  • In reply to Bill Roland:

    One more thing: the Sophos VPN client behaves the same.  While I am on my computer at home running that VPN client, I can start TeamViewer at any time while Sophos' VPN is down...

  • In reply to Big_Buck:

    One more thing: clients will behave the same on both XG210s.

  • In reply to Bill Roland:

    Hi Bill,

    I only know that both pfSense and OPNsense (free, open-source firewalls) use strongSwan for their VPNs, and they do a great job with many more VPN capabilities and features... The problem is with Sophos XG.

  • Hi Big_Buck,

    Thank you for the XG feedback.

    Please share the following details:

    • IPsec connection and policy configuration.
    • /log/charon.log and /log/strongswan.log

    Regards,

    Deepti

  • In reply to deeptibhavsar:

    Hello

    I'll send screenshots of both VPN setups and the IPsec policy for both main locations.

    Very basic pre-shared key, site-to-site IPsec IKEv2.  One site initiates the connection while the other waits.  Unlimited key retries.

    Paul Jr

  • In reply to Bill Roland:

    The good old Sophos UTM uses strongSwan.  So do Vyatta/VyOS/EdgeOS.  I think most firewalls use strongSwan if they are not using something proprietary.

    The problem is Sophos XG.

  • We are starting to see this same issue at a few of our sites.  Did Sophos come up with any solution?  It is only going down for a few minutes here and there, but it really shouldn't be going down at all.  I've verified the gateway is staying up, and I can also ping the WAN interface on the XG.  When it is down, I am not able to get into the web interface of the unit, which is interesting.
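
    To help correlate the outages, something like this simple poll can timestamp when the WebAdmin interface stops answering (a rough sketch; 203.0.113.1 is a placeholder for the unit's WAN IP and 4444 is assumed to be the WebAdmin port, so adjust both):

    # Poll the XG's WAN-side WebAdmin port and log when it stops answering,
    # to line the gaps up with the "VPN Down" events.
    import datetime
    import socket
    import time

    HOST, PORT, INTERVAL = "203.0.113.1", 4444, 30   # placeholders, adjust

    while True:
        try:
            with socket.create_connection((HOST, PORT), timeout=5):
                state = "reachable"
        except OSError:
            state = "UNREACHABLE"
        print(f"{datetime.datetime.now().isoformat()} webadmin {state}")
        time.sleep(INTERVAL)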

  • In reply to BrianH:

    Brian,  are you saying you're seeing this after you applied MR8?


    -Scott

  • In reply to BrianH:

    Hello

    I opened a case since this VPN drops every 4 minutes for around one minute.

    I received these answers from the Sophos senior engineer responsible for VPNs:

    Answer One:

    Hi Paul

    I took a look at the log files and it seems there’s massive packet loss (at least for the IKE packets). For the time and tunnel we looked at in the mails, it seems the connection was faulty for some reason.

    The log on QC looks like:

    2018-05-14 11:37:04 06[ENC] <QC_to_AA-1|323> generating INFORMATIONAL request 0 [ ]

    2018-05-14 11:37:08 30[IKE] <QC_to_AA-1|323> retransmit 1 of request with message ID 0

    2018-05-14 11:37:15 22[IKE] <QC_to_AA-1|323> retransmit 2 of request with message ID 0

    2018-05-14 11:37:28 12[IKE] <QC_to_AA-1|323> retransmit 3 of request with message ID 0

    2018-05-14 11:37:52 12[IKE] <QC_to_AA-1|323> retransmit 4 of request with message ID 0

    2018-05-14 11:38:22 26[ENC] <QC_to_AA-1|323> parsed INFORMATIONAL request 16 [ ]

    2018-05-14 11:38:22 26[ENC] <QC_to_AA-1|323> generating INFORMATIONAL response 16 [ ]

    2018-05-14 11:38:34 20[IKE] <QC_to_AA-1|323> retransmit 5 of request with message ID 0

    2018-05-14 11:38:34 06[ENC] <QC_to_AA-1|323> parsed INFORMATIONAL response 0 [ ]

    The log on AA looks like:

    2018-05-14 11:37:04 09[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 11:37:04 09[ENC] <AA_to_QC-1|325> generating INFORMATIONAL response 0 [ ]

    2018-05-14 11:37:04 11[DMN] [GARNER-LOGGING] (child_alert) ALERT: received IKE message with invalid SPI (ADC08921) from other side

    2018-05-14 11:37:08 22[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 11:37:08 22[IKE] <AA_to_QC-1|325> received retransmit of request with ID 0, retransmitting response

    2018-05-14 11:37:08 25[DMN] [GARNER-LOGGING] (child_alert) ALERT: received IKE message with invalid SPI (ADC08921) from other side

    2018-05-14 11:37:15 10[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 11:37:15 10[IKE] <AA_to_QC-1|325> received retransmit of request with ID 0, retransmitting response

    2018-05-14 11:37:15 20[DMN] [GARNER-LOGGING] (child_alert) ALERT: received IKE message with invalid SPI (ADC08921) from other side

    2018-05-14 11:37:28 19[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 11:37:28 19[IKE] <AA_to_QC-1|325> received retransmit of request with ID 0, retransmitting response

    2018-05-14 11:37:28 27[DMN] [GARNER-LOGGING] (child_alert) ALERT: received IKE message with invalid SPI (ADC08921) from other side

    2018-05-14 11:37:52 12[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 11:37:52 12[IKE] <AA_to_QC-1|325> received retransmit of request with ID 0, retransmitting response

    2018-05-14 11:37:52 05[DMN] [GARNER-LOGGING] (child_alert) ALERT: received IKE message with invalid SPI (ADC08921) from other side

    2018-05-14 11:38:22 05[IKE] <AA_to_QC-1|325> sending DPD request

    2018-05-14 11:38:22 05[ENC] <AA_to_QC-1|325> generating INFORMATIONAL request 16 [ ]

    2018-05-14 11:38:22 28[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL response 16 [ ]

    2018-05-14 11:38:34 24[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 11:38:34 24[IKE] <AA_to_QC-1|325> received retransmit of request with ID 0, retransmitting response

    Note the child_alert messages on the AA side. I think that’s where the issues come from. CB seems to never receive the responses before AA sends its own DPD message (here 16). After 16 was processed, all of a sudden the response to message 0 came through as well. I suppose in the case where a message like 16 doesn’t come in time before the 5th retransmit, QB would have detected the peer as dead and shut down the connection, just to re-establish it right away again.

    I will discuss this with the team tomorrow. Currently I’m not sure who’s to blame here. I’ll enable a higher debug level for the daemon, maybe that will reveal more info.

    I’ll keep you updated on the proceedings in this case.
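
    As a side note, the retransmit timing in the QC log above (+4 s, +11 s, +24 s, +48 s, +90 s after the initial request) matches strongSwan's default exponential backoff for IKEv2 retransmits.  A minimal sketch of that schedule, assuming the stock strongswan.conf defaults (retransmit_timeout 4.0, retransmit_base 1.8, retransmit_tries 5), which the SFOS build may well tune differently:

    # Sketch of strongSwan's retransmission schedule, assuming stock defaults;
    # the values actually used by the XG may differ.
    RETRANSMIT_TIMEOUT = 4.0   # seconds before the first retransmit
    RETRANSMIT_BASE = 1.8      # backoff multiplier per attempt
    RETRANSMIT_TRIES = 5       # retransmits before the exchange is given up

    elapsed = 0.0
    for attempt in range(RETRANSMIT_TRIES + 1):
        elapsed += RETRANSMIT_TIMEOUT * RETRANSMIT_BASE ** attempt
        if attempt < RETRANSMIT_TRIES:
            print(f"retransmit {attempt + 1} of request at ~{elapsed:.0f} s")
        else:
            # once the last retransmit also times out, the peer is declared
            # dead and the IKE_SA is torn down and re-established
            print(f"peer declared dead at ~{elapsed:.0f} s")

    With those defaults, an unanswered request survives roughly 165 seconds of retransmits before the tunnel is torn down, which is consistent with the drop-and-reconnect cycle described in this thread.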

    Answer Two:

    Hi Paul

    I think something in your network is not working right. In the tcpdump from AA I see that the response to the DPD packet destined to QB is actually received by AA instead of QB (that’s also where the invalid SPIs come from):

    2018-05-14 14:13:05 03[NET] received packet: from 207.134.161.10[500] to 205.237.70.202[500]

    2018-05-14 14:13:05 03[NET] waiting for data on sockets

    2018-05-14 14:13:05 20[NET] <AA_to_QC-1|334> received packet: from 207.134.161.10[500] to 205.237.70.202[500] (96 bytes)

    2018-05-14 14:13:05 20[ENC] <AA_to_QC-1|334> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 14:13:05 20[IKE] <AA_to_QC-1|334> received retransmit of request with ID 0, retransmitting response

    2018-05-14 14:13:05 20[NET] <AA_to_QC-1|334> sending packet: from 205.237.70.202[500] to 207.134.161.10[500] (96 bytes)

    2018-05-14 14:13:05 04[NET] sending packet: from 205.237.70.202[500] to 207.134.161.10[500]

    2018-05-14 14:13:05 03[NET] received packet: from 205.237.70.202[500] to 205.237.70.202[500]

    2018-05-14 14:13:05 03[NET] waiting for data on sockets

    2018-05-14 14:13:05 24[DMN] [GARNER-LOGGING] (child_alert) ALERT: received IKE message with invalid SPI (375D9B4E) from other side

    2018-05-14 14:13:18 31[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: AA_to_QC count: 15

    14:13:28.740389 Port2, IN: IP 207.134.161.10.500 > 205.237.70.202.500: isakmp: parent_sa inf2

            0x0000:  4500 007c 598a 4000 3711 649e cf86 a10a

            0x0010:  cded 46ca 01f4 01f4 0068 8943 4e9b 5d37

            0x0020:  8c16 11a1 f2c7 99d4 ebf9 cea6 2e20 2500

            0x0030:  0000 0000 0000 0060 0000 0044 1792 d7e8

            0x0040:  416a 0e04 98e3 d72a deff 06c3 b4a1 10ff

            0x0050:  a53e 6010 c634 65cc e48a c74b 66dd cf88

            0x0060:  ac55 3cff e761 5673 fa62 2e67 8a4b 2632

            0x0070:  dfc0 3b59 3de4 f345 ce8c 7ef2

    14:13:28.740614 Port2, OUT: IP 205.237.70.202.500 > 207.134.161.10.500: isakmp: parent_sa inf2

            0x0000:  4500 007c 8945 4000 4011 2be3 cded 46ca

            0x0010:  cf86 a10a 01f4 01f4 0068 d616 4e9b 5d37

            0x0020:  8c16 11a1 f2c7 99d4 ebf9 cea6 2e20 2528

            0x0030:  0000 0000 0000 0060 0000 0044 78b4 8d4b

            0x0040:  a0d0 dbdf 3450 66c8 398f 8c12 118d 4875

            0x0050:  cc8c d769 a5f1 1b1e 4a7b 13f9 3373 5570

            0x0060:  e30a 8e4d 6161 f525 19cb 1ce9 00af 717f

            0x0070:  7fb6 e228 9b93 8c77 6afb cbae

    14:13:28.740880 Port2, IN: IP 205.237.70.202.500 > 205.237.70.202.500: isakmp: parent_sa inf2

            0x0000:  4500 007c 8945 4000 3f11 88bc cded 46ca

            0x0010:  cded 46ca 01f4 01f4 0068 31f0 4e9b 5d37

            0x0020:  8c16 11a1 f2c7 99d4 ebf9 cea6 2e20 2528

            0x0030:  0000 0000 0000 0060 0000 0044 78b4 8d4b

            0x0040:  a0d0 dbdf 3450 66c8 398f 8c12 118d 4875

            0x0050:  cc8c d769 a5f1 1b1e 4a7b 13f9 3373 5570

            0x0060:  e30a 8e4d 6161 f525 19cb 1ce9 00af 717f

            0x0070:  7fb6 e228 9b93 8c77 6afb cbae

    I believe once this issue is fixed your IPsec connections will be stable. One thing that could help you get to the root of this is the fact that the DPD response is actually delivered to QB correctly after AA itself sent a DPD request (16 from the mail below). I’m not an expert in routing or NAT, but would assume that some misplaced rule could be causing this behavior. Please contact support for further assistance.

    Here, I think the engineer meant QC instead of QB.
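
    For what it is worth, decoding the last packet in the dump above (the one received at 14:13:28.740880 with 205.237.70.202 as both source and destination) confirms it carries AA's own reply: the IKE SPIs are the same as in the OUT packet sent just before it, and the Response flag is set.  A quick sketch of that decoding (plain Python, nothing Sophos-specific; it assumes a standard 20-byte IP header and an 8-byte UDP header, with the IKE header following directly since port 500 carries no non-ESP marker):

    # First 56 bytes of the 14:13:28.740880 packet from the dump above.
    pkt = bytes.fromhex(
        "4500007c894540003f1188bccded46ca"
        "cded46ca01f401f4006831f04e9b5d37"
        "8c1611a1f2c799d4ebf9cea62e202528"
        "0000000000000060"
    )

    ihl = (pkt[0] & 0x0F) * 4                       # IP header length (20 bytes here)
    src = ".".join(str(b) for b in pkt[12:16])      # 205.237.70.202
    dst = ".".join(str(b) for b in pkt[16:20])      # 205.237.70.202 again

    ike = pkt[ihl + 8:]                             # skip the 8-byte UDP header
    spi_i, spi_r = ike[0:8].hex(), ike[8:16].hex()  # initiator / responder IKE SPIs
    exchange = ike[18]                              # 37 = INFORMATIONAL
    is_response = bool(ike[19] & 0x20)              # Response flag set

    print(f"{src} -> {dst}")
    print(f"SPI_i {spi_i}  SPI_r {spi_r}  exchange {exchange}  response {is_response}")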

    Conclusion: massive packet loss, probably because of a misplaced rule or routing... So... NAT or networking is screwed up...  I do not have a single route installed myself, though; the only possible routes are those created automatically by the firewall itself.

    I had the VPN go down very often before, but not at that stratospheric level.

    Go figure.

    Paul Jr

  • In reply to Big_Buck:

    I don't buy a routing issue, because I don't have any routes set up either.  I also tested connecting to the same site from three other sites and none of them worked.  I can ping the external interface of the XG as well as the gateway without issue.  However, I cannot get into the XG using the admin interface.  I believe something is freezing up the XG.  I'll be at that site later this week and will attempt to confirm that.

  • In reply to BrianH:

    Brian, do you also see log lines in /log/charon.log where the source and destination addresses are the same, like:

    received packet: from 205.237.70.202[500] to 205.237.70.202[500]
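
    If it helps, a quick scan like this will list them (a rough sketch, not a Sophos tool; adjust the path if your log location differs):

    # Print charon.log "received packet" lines whose source and destination
    # IP addresses are identical.
    import re

    PATTERN = re.compile(
        r"received packet: from (\d+\.\d+\.\d+\.\d+)\[\d+\]"
        r" to (\d+\.\d+\.\d+\.\d+)\[\d+\]"
    )

    with open("/log/charon.log") as log:
        for line in log:
            match = PATTERN.search(line)
            if match and match.group(1) == match.group(2):
                print(line.rstrip())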

  • In reply to HeikoHund:

    Hello Heiko

    I contacted support.  They are asking me to run an IKEv2 setup with DH groups 31 and 18 removed and with SHA1 enabled.

    I am no encryption expert, but weaknesses in SHA1 were demonstrated more than a decade ago, and Microsoft itself has been refusing SHA1 certificates for almost two years now.  Same with everyone else in the industry.  US-CERT has been urging everyone to stop using it for years.

    Even my dog knows SHA1 is dead!!!

    What is going on here?  Why are they asking for something that outdated and insecure?

    Paul Jr

  • In reply to Big_Buck:

    OK...  Sophos support asked me to test this IKEv2 policy.

    Exact same behaviour.  The VPN drops every 4 minutes...  Besides, this policy is absolutely not recommended: SHA1 is dead and buried, the key life is now 15 hours (54000 seconds), and DH groups 31 and 18 are disabled.  Which in the end means quite an insecure and outdated policy.

    But the thing is: reducing the encryption load does not change the VPN dropping every 4 minutes.  The theory that the XG and its own routing are to blame is even more probable.

    I have now been living with VPNs that drop every so often for more than a year.

    Paul Jr 

  • In reply to Big_Buck:

    For those who ask, and as a follow-up: nope!!!  This case is still unsolved.  Apparently, we are not the only ones afflicted by this.