Sophos XG (Case Study) How to Troubleshoot IPsec VPN with multiples SA's when one SA goes down while the other remains active

Disclaimer: This information is posted as-is and the content should be referenced at your own risk

Hello There,

This case study below describes a scenario that has IPsec VPN tunnel site to site VPN between XG (On-Premises) and Azure, with 2 SA’s established, and suddenly one of them get disconnected, and investigation steps that as taken to took to find the root cause of the disconnections events.

 


XG IP is 285.54.258.149 (Response only)
Azure IP: 51.136.198.59 (initiator)

 

 First step - check if the SA’s status has changed, this can be seen looking at the IPsec logs Charon.log (In debug)* by doing the following command:

 NOTE: Enable Debug mode to IPsec Logs : "service strongswan:debug -ds nosync" to disable the  debug you most run the same command again

1- Filter the logs with the following syntax “cat Charon.log | grep conn_name

2019-10-23 20:19:21 23[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 2
2019-10-23 20:22:21 07[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 2
2019-10-23 20:25:21 19[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 2
2019-10-23 20:28:21 25[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 2
2019-10-23 20:31:21 10[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 2
2019-10-23 20:34:21 32[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 2
2019-10-23 20:37:21 27[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 2
2019-10-23 20:40:21 28[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 2
2019-10-23 20:43:21 20[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 1
2019-10-23 20:46:21 30[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 1
2019-10-23 20:49:21 14[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 1
2019-10-23 20:52:21 27[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 1
2019-10-23 20:55:21 30[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 1
2019-10-23 20:58:21 16[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 1
2019-10-23 21:01:21 09[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 1
2019-10-23 21:04:21 19[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: azuregw count: 1

 Note that the count was showing 2 (remember it has 2 SA’s), and then after 2019-10-23 20:43:21 start show count 1, which proves that the second SA got disconnected, otherwise will be always count 2

 

Searching the reason why the disconnection happen digging into the charon.log (with debug enabled):

Narrowed down the logs a bit before that timestamp and I could find by filtering the reference  “TS” inside of the same log file (Charon.log in debug) that means Traffic Selector, I could  find the moment that the SA is disconnect. Note that the Azure side, sends a response that is  actually a solicitation to REKEYING the current installed TS: 

2019-10-23 20:40:07 15[CHD] <aztest1-1|143> SPI 0xc08f3a27, src 51.136.198.59 dst 285.54.258.149
2019-10-23 20:40:07 15[CHD] <aztest1-1|143> registering outbound ESP SA
2019-10-23 20:40:07 15[CHD] <aztest1-1|143> SPI 0xe37df272, src 285.54.258.149 dst 51.136.198.59
2019-10-23 20:40:07 15[CHD] <aztest1-1|143> CHILD_SA azuregw-1{132} state change: INSTALLING => INSTALLED
2019-10-23 20:40:07 15[IKE] <aztest1-1|143> inbound CHILD_SA azuregw-1{132} established with SPIs c08f3a27_i e37df272_o and TS 192.168.1.0/24 === 192.168.64.0/20
2019-10-23 20:40:07 15[CHD] <aztest1-1|143> CHILD_SA azuregw-1{131} state change: INSTALLED => REKEYING
2019-10-23 20:40:07 15[CHD] <aztest1-1|143> CHILD_SA azuregw-1{131} state change: REKEYING => REKEYED
2019-10-23 20:40:07 15[ENC] <aztest1-1|143> generating CREATE_CHILD_SA response 6269 [ SA No TSi TSr ]
2019-10-23 20:40:07 15[NET] <aztest1-1|143> sending packet: from 285.54.258.149[500] to 51.136.198.59[500] (208 bytes)
2019-10-23 20:40:07 04[NET] sending packet: from 285.54.258.149[500] to 51.136.198.59[500]
2019-10-23 20:40:07 03[NET] received packet: from 51.136.198.59[500] to 285.54.258.149[500] on Port2
2019-10-23 20:40:07 03[NET] waiting for data on sockets
2019-10-23 20:40:07 11[NET] <aztest1-1|143> received packet: from 51.136.198.59[500] to 285.54.258.149[500] (80 bytes)
2019-10-23 20:40:07 11[ENC] <aztest1-1|143> parsed INFORMATIONAL request 6270 [ D ]
2019-10-23 20:40:07 11[IKE] <aztest1-1|143> received DELETE for ESP CHILD_SA with SPI 6d8c6dba
2019-10-23 20:40:07 11[IKE] <aztest1-1|143> closing CHILD_SA azuregw-1{131} with SPIs c300308c_i (532547358 bytes) 6d8c6dba_o (35133355 bytes) and TS 192.168.1.0/24 === 192.168.64.0/20
2019-10-23 20:40:07 11[IKE] <aztest1-1|143> sending DELETE for ESP CHILD_SA with SPI c300308c
2019-10-23 20:40:07 11[CHD] <aztest1-1|143> CHILD_SA azuregw-1{131} state change: REKEYED => DELETING
2019-10-23 20:40:07 11[IKE] <aztest1-1|143> CHILD_SA closed
2019-10-23 20:40:07 11[CHD] <aztest1-1|143> CHILD_SA azuregw-1{131} state change: DELETING => DELETED

Conclusion:

The initiator side (Azure), sends a rekey request, however the SA does not re-establish after the rekey, then Sophos XG honor by deleting the TS as requested. In this particular scenario, the issue lays on the Azure side, which will require further investigation on the Azure logs.

  • Hi JuanaMelo,

    Was this down to the Azure dropping the connection due to inactivity? I believe the 'traffic' timeout on the azure GW is 15mins, so if this is a remote site or for example the SSL VPN subnet then without any traffic over that route it would drop - hence why only your primary / main XG lan remains active? 

     

    Let us know what you found on the azure side.. 

     

    Thanks,

    Michael.

  • Hello Michael,

     

    When I came across the first event having this behavior, I had the same line of thought, however, after a long investigation we could determine that the main cause of the disconnection was due to the fact the XG was receiving REKEY request only for one CHILD_SA, and not for another one (both SA's are actives LANs pretty much with the same traffic pattern, not remote SSL VPN subnets). For other CHILD_SA we should be receiving REKEY request from Azure which was not happening.

    From this point forward the Azure supported was engaged to help us identify why the second SA was not receiving rekeying request.

    Conclusion:

    Azure analysis reported that based on their logs, one of the SA was not having enough data flowing, reason why only one was renewed, while the other was simply deleted because it wasn't enough data flowing to prevent the SA pairs from being deleted.

    To work around this issue, it was recommended use route-based VPN connections, rather than policy-based, however, Sophos currently only supports Policy-based. And the final alternative provided was to maintain a continuous ping - all the time from hosts on the Azure side, to keep consistent data flowing coming from the Azure side.

    I hope this clarifies 

     

    Regards,

    Juana

  • Thanks Juana, 

     

    I've also had to set pings running from the azure side to keep tunnels open to connections like the SSL vpn through the primary azure s2s with the XG.  I've not tested V18 on the XG's but looking to see if route based vpn to azure with a higher spec gateway on that side will help these types of problems.