
SG-to-SG IPsec tunnel going down after being stable for years.

Hi

 

A couple of months ago, an IPsec NAT-T tunnel suddenly started going down a couple of times each day. To get it back up, we disable the tunnel on both sides (we control both, or rather all three, firewalls), wait a few minutes and enable it again. When this started, we upgraded and rebooted the outer firewall (see the drawing below). It was then stable for about a month before the problem returned.

We have verified the policies, timeouts, NTP settings and everything else relevant to IPsec; they are all correct on both gateways.

The problem seems to be on the outer firewall, which does the NAT. When we disable the tunnel on both sides and wait for the connections to time out on the NAT firewall, the tunnel comes up again once the gateways are re-enabled.
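
For reference, this is roughly how the tracked IKE/NAT-T sessions can be inspected on the NAT firewall while troubleshooting (a sketch, assuming the conntrack userspace tool is available there; the addresses in our logs below are anonymized placeholders):

# list tracked IKE (UDP 500) and NAT-T (UDP 4500) sessions on the NAT firewall
conntrack -L -p udp --dport 500
conntrack -L -p udp --dport 4500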

To complicate things, both VPN gateways have a lot of other VPN connections that work as expected.

Anybody have any clues?

tcpdump on eth1 on the outer barrier while things are working:

09:47:36.946832 IP 85.x.x.10.4500 > 195.x.x.2.4500: UDP-encap: ESP(spi=0x440a56b0,seq=0xa95), length 1468
09:47:36.946964 IP 85.x.x.10.4500 > 195.x.x.2.4500: UDP-encap: ESP(spi=0x440a56b0,seq=0xa96), length 1468
09:47:36.946965 IP 195.x.x.2.4500 > 85.x.x.10.4500: UDP-encap: ESP(spi=0x367563fc,seq=0x5e4), length 76
09:47:36.947098 IP 85.x.x.10.4500 > 195.x.x.2.4500: UDP-encap: ESP(spi=0x440a56b0,seq=0xa97), length 1468
09:47:36.947421 IP 85.x.x.10.4500 > 195.x.x.2.4500: UDP-encap: ESP(spi=0x440a56b0,seq=0xa98), length 1468
09:47:36.947499 IP 195.x.x.2.4500 > 85.x.x.10.4500: UDP-encap: ESP(spi=0x367563fc,seq=0x5e5), length 76
09:47:36.948005 IP 195.x.x.2.4500 > 85.x.x.10.4500: UDP-encap: ESP(spi=0x367563fc,seq=0x5e6), length 76
09:47:36.948256 IP 195.x.x.2.4500 > 85.x.x.10.4500: UDP-encap: ESP(spi=0x367563fc,seq=0x5e7), length 76
09:47:36.948467 IP 85.x.x.10.4500 > 195.x.x.2.4500: UDP-encap: ESP(spi=0x440a56b0,seq=0xa99), length 1468
09:47:36.948544 IP 195.x.x.2.4500 > 85.x.x.10.4500: UDP-encap: ESP(spi=0x367563fc,seq=0x5e8), length 76
09:47:36.948620 IP 85.x.x.10.4500 > 195.x.x.2.4500: UDP-encap: ESP(spi=0x440a56b0,seq=0xa9a), length 1468
09:47:36.949581 IP 85.x.x.10.4500 > 195.x.x.2.4500: UDP-encap: ESP(spi=0x440a56b0,seq=0xa9b), length 1468
09:47:36.949885 IP 195.x.x.2.4500 > 85.x.x.10.4500: UDP-encap: ESP(spi=0x367563fc,seq=0x5e9), length 76
09:47:36.950019 IP 85.x.x.10.4500 > 195.x.x.2.4500: UDP-encap: ESP(spi=0x440a56b0,seq=0xa9c), length 284
09:47:36.951062 IP 195.x.x.2.4500 > 85.x.x.10.4500: UDP-encap: ESP(spi=0x367563fc,seq=0x5ea), length 76
09:47:36.951535 IP 195.x.x.2.4500 > 85.x.x.10.4500: UDP-encap: ESP(spi=0x367563fc,seq=0x5eb), length 76
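
For reference, a capture like the one above can be taken with something like the following (interface and peer address adjusted to your setup):

# capture IKE (UDP 500) and NAT-T encapsulated ESP (UDP 4500) between the two gateways
tcpdump -ni eth1 'host 195.x.x.2 and (udp port 500 or udp port 4500)'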



  • Hi Rolf-Arne,

    Do you see anything "interesting" in the IPsec log on either endpoint when the tunnel fails?  Doesn't one side or the other immediately try to re-initiate the connection?

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • Hi Bob!

    I am quite certain this is related to the NAT on the outer barrier. When the error occurs, conntrack shows a session on UDP 500. When we disable the tunnel on both sides and wait until this UDP 500 session times out, the tunnel comes up just fine again.

    udp      17 38 src=195.x.x.2 dst=85.x.x.10 sport=500 dport=500 packets=33 bytes=9092 src=10.x.x.1 dst=195.x.x.2 sport=500 dport=500 packets=42 bytes=11572 [ASSURED] mark=528625 delta-time=1306 use=1
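
    Instead of waiting for the timeout, that stuck entry can in principle be removed by hand; something like this should do it (untested on our boxes, and assuming the conntrack tool's delete mode is available on the UTM):

    # delete the stale IKE entry so the next negotiation gets a fresh NAT binding
    conntrack -D -p udp -s 195.x.x.2 -d 85.x.x.10 --sport 500 --dport 500
    # and the NAT-T data channel, if that is stuck as well
    conntrack -D -p udp -s 195.x.x.2 -d 85.x.x.10 --sport 4500 --dport 4500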

     

    On the inner barrier firewall I see the following. The common denominator for all tunnel failures is the DPD "No response from peer" sequence towards the end of this excerpt (originally marked in red):

    2019:04:21-08:25:07 LHL-fw-indre-1 pluto[20855]: | route owner of "S_REF_IpsSitmycompanyos_12" erouted: self; eroute owner: self
    2019:04:21-08:25:07 LHL-fw-indre-1 pluto[20855]: | route_and_eroute with c: S_REF_IpsSitmycompanyos_12 (next: none) ero:S_REF_IpsSitmycompanyos_12 esr:{(nil)} ro:S_REF_IpsSitmycompanyos_12 rosr:{(nil)} and state: 98844
    2019:04:21-08:25:07 LHL-fw-indre-1 pluto[20855]: | route_and_eroute: instance "S_REF_IpsSitmycompanyos_12", setting eroute_owner {spd=0x97528b8,sr=0x97528b8} to #98844 (was #98757) (newest_ipsec_sa=#98757)
    2019:04:21-08:25:07 LHL-fw-indre-1 pluto[20855]: | inI2: instance S_REF_IpsSitmycompanyos_12[0], setting newest_ipsec_sa to #98844 (was #98757) (spd.eroute=#98844)
    2019:04:21-08:25:07 LHL-fw-indre-1 pluto[20855]: "S_REF_IpsSitmycompanyos_12" #98844: IPsec SA established {ESP=>0x4f1970ad <0x40146c29 NATOA=0.0.0.0 DPD}
    2019:04:21-08:25:07 LHL-fw-indre-2 pluto[30877]: | route owner of "S_REF_IpsSitmycompanyos_12" erouted: self; eroute owner: self
    2019:04:21-08:25:07 LHL-fw-indre-2 pluto[30877]: | route owner of "S_REF_IpsSitmycompanyos_12" erouted: self; eroute owner: self
    2019:04:21-08:25:07 LHL-fw-indre-2 pluto[30877]: | route_and_eroute with c: S_REF_IpsSitmycompanyos_12 (next: none) ero:S_REF_IpsSitmycompanyos_12 esr:{(nil)} ro:S_REF_IpsSitmycompanyos_12 rosr:{(nil)} and state: 98844
    2019:04:21-08:25:07 LHL-fw-indre-2 pluto[30877]: | route_and_eroute: instance "S_REF_IpsSitmycompanyos_12", setting eroute_owner {spd=0x9e62678,sr=0x9e62678} to #98844 (was #98757) (newest_ipsec_sa=#98844)
    2019:04:21-08:28:08 LHL-fw-indre-1 pluto[20855]: "S_REF_IpsSitmycompanyos_0" #98645: DPD: Phase1 state #98645 has been superseded by #98722 - timeout ignored
    2019:04:21-08:28:34 LHL-fw-indre-1 pluto[20855]: "S_REF_IpsSitmycompanyos_9" #98722: DPD: No response from peer - declaring peer dead
    2019:04:21-08:28:34 LHL-fw-indre-1 pluto[20855]: "S_REF_IpsSitmycompanyos_9" #98722: DPD: Restarting all connections of peer
    2019:04:21-08:28:34 LHL-fw-indre-1 pluto[20855]: "S_REF_IpsSitmycompanyos_9" #98722: DPD: Terminating all SAs using this connection
    2019:04:21-08:28:34 LHL-fw-indre-1 pluto[20855]: "S_REF_IpsSitmycompanyos_9" #98826: deleting state (STATE_QUICK_R2)


    Then it takes down all connections/SAs, with many lines like the ones below; I won't paste them all, for readability:


    2019:04:21-08:28:34 LHL-fw-indre-1 pluto[20855]: | executing down-client: 2>&1 PLUTO_VERSION='1.1' PLUTO_VERB='down-client' PLUTO_CONNECTION='S_REF_IpsSitmycompanyos_9' PLUTO_NEXT_HOP='195.x.x.2' PLUTO_INTERFACE='eth9.6' PLUTO_REQID='17617' PLUTO_ME='10.x.x.1' PLUTO_MY_ID='85.x.x.10' PLUTO_MY_CLIENT='local.crypto-net/24' PLUTO_MY_CLIENT_NET='local.crypto-net' PLUTO_MY_CLIENT_MASK='255.255.255.0' PLUTO_MY_PORT='0' PLUTO_MY_PROTOCOL='0' PLUTO_PEER='195.x.x.2' PLUTO_PEER_ID='195.x.x.2' PLUTO_PEER_CLIENT='remote.crypto-net/24' PLUTO_PEER_CLIENT_NET='remote.crypto-net' PLUTO_PEER_CLIENT_MASK='255.255.255.0' PLUTO_PEER_PORT='0' PLUTO_PEER_PROTOCOL='0' PLUTO_PEER_CA='' PLUTO_MY_SOURCEIP='10.x.x.1' /usr/libexec/ipsec/updown classic
    2019:04:21-08:28:34 LHL-fw-indre-1 pluto[20855]: | executing unroute-client: 2>&1 PLUTO_VERSION='1.1' PLUTO_VERB='unroute-client' PLUTO_CONNECTION='S_REF_IpsSitmycompanyos_9' PLUTO_NEXT_HOP='195.x.x.2' PLUTO_INTERFACE='eth9.6' PLUTO_REQID='17617' PLUTO_ME='10.x.x.1' PLUTO_MY_ID='85.x.x.10' PLUTO_MY_CLIENT='local.crypto-net/24' PLUTO_MY_CLIENT_NET='local.crypto-net' PLUTO_MY_CLIENT_MASK='255.255.255.0' PLUTO_MY_PORT='0' PLUTO_MY_PROTOCOL='0' PLUTO_PEER='195.x.x.2' PLUTO_PEER_ID='195.x.x.2' PLUTO_PEER_CLIENT='remote.crypto-net/24' PLUTO_PEER_CLIENT_NET='10.246.3.0' PLUTO_PEER_CLIENT_MASK='255.255.255.0' PLUTO_PEER_PORT='0' PLUTO_PEER_PROTOCOL='0' PLUTO_PEER_CA='' PLUTO_MY_SOURCEIP='10.x.x.1' /usr/libexec/ipsec/updown classic

    Then there are a lot of these lines:

    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: DPD: Restarting connection "S_REF_IpsSitmycompanyos_1"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | route owner of "S_REF_IpsSitmycompanyos_1" unrouted: NULL
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | executing unroute-client: 2>&1 PLUTO_VERSION='1.1' PLUTO_VERB='unroute-client' PLUTO_CONNECTION='S_REF_IpsSitmycompanyos_1' PLUTO_NEXT_HOP='195.x.x.2' PLUTO_INTERFACE='eth9.6' PLUTO_REQID='17545' PLUTO_ME='10.x.x.1' PLUTO_MY_ID='85.x.x.10' PLUTO_MY_CLIENT='10.x.x.0/24' PLUTO_MY_CLIENT_NET='10.x.x.0' PLUTO_MY_CLIENT_MASK='255.255.255.0' PLUTO_MY_PORT='0' PLUTO_MY_PROTOCOL='0' PLUTO_PEER='195.x.x.2' PLUTO_PEER_ID='195.x.x.2' PLUTO_PEER_CLIENT='remote.crypto-net/24' PLUTO_PEER_CLIENT_NET='remote.crypto-net' PLUTO_PEER_CLIENT_MASK='255.255.255.0' PLUTO_PEER_PORT='0' PLUTO_PEER_PROTOCOL='0' PLUTO_PEER_CA='' PLUTO_MY_SOURCEIP='10.x.x.1' /usr/libexec/ipsec/updown classic

    Until it seems to give up:


    2019:04:21-08:28:35 LHL-fw-indre-2 pluto[30877]: | route owner of "S_REF_IpsSitmycompanyos_1" prospective erouted: self; eroute owner: self
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_9"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: "S_REF_IpsSitmycompanyos_9" #98847: initiating Main Mode
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_0"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_8"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_7"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_6"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_5"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_4"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_3"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_2"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_19"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_18"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_17"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_16"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_15"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_14"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_13"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_12"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_11"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_10"
    2019:04:21-08:28:35 LHL-fw-indre-1 pluto[20855]: | Queuing pending Quick Mode with 195.x.x.2 "S_REF_IpsSitmycompanyos_1"
    2019:04:21-08:28:45 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:29:05 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:29:45 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:30:25 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:31:05 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:31:45 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:32:25 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:33:05 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:33:45 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:34:25 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:35:05 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:35:45 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:36:25 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:37:05 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:37:45 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:38:25 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:39:05 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:39:45 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:40:25 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:41:05 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:41:45 LHL-fw-indre-1 pluto[20855]: | handling event EVENT_RETRANSMIT for 195.x.x.2 "S_REF_IpsSitmycompanyos_9" #98847
    2019:04:21-08:41:45 LHL-fw-indre-1 pluto[20855]: "S_REF_IpsSitmycompanyos_9" #98847: max number of retransmissions (20) reached STATE_MAIN_I1. No response (or no acceptable response) to our first IKE message
    2019:04:21-08:41:45 LHL-fw-indre-1 pluto[20855]: "S_REF_IpsSitmycompanyos_9" #98847: starting keying attempt 2 of an unlimited number
    2019:04:21-08:41:45 LHL-fw-indre-1 pluto[20855]: "S_REF_IpsSitmycompanyos_9" #98874: initiating Main Mode to replace #98847

    On the other side there are also DPD events that take the SAs down, then try to bring them back up and fail.
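
    The DPD sequence is easy to pull out when it happens; this is roughly what I grep for (the live IPsec log should be /var/log/ipsec.log on the UTM; adjust the path if yours differs):

    # show DPD failures and the resulting restarts around a tunnel drop
    grep -E 'DPD: (No response|Restarting|Terminating)' /var/log/ipsec.log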

  • Hmm - have you checked the Intrusion Prevention logs to be sure there's no problem with UDP Flood Protection?

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • Hi Bob

    There are no traces of the involved IP addresses in the IPS logs.

    This weekend we got this problem on two other NAT-T tunnels as well. There are four NAT-T tunnels (and several other regular UDP 500 tunnels) on this firewall, and we have now seen the problem on three of the four.

    What I noticed this morning was the following:

     

    1. Disabled the tunnels on both ends.

    2. Waited for the sessions to die on the outer barrier (see the sketch after this list for one way to watch them).

    3. Enabled the tunnel on the data center side first, then on the inner barrier. The tunnel did not come up.

    4. Disabled on both ends again, waited for the timeouts, and enabled on the inner barrier first, then on the data center side. The tunnels then came up.
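
    To see when the sessions on the outer barrier actually die, instead of guessing at the timeout, something like this can be used (again assuming the conntrack tool is present; -e DESTROY prints an event when an entry is removed):

    # print an event when the tracked UDP entries for the peer are finally destroyed
    conntrack -E -e DESTROY -p udp -s 195.x.x.2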

     

    So it seems to me that there may be a problem when the tunnel is initiated from the internet side.

     

    But it still puzzles me that this started happening on connections that have behaved nicely for years, without any changes having been made to them.

  • A quick retraction of my previous answer, Bob. I finally found lines in the IPS logs on the outer barrier that correspond to the times of the tunnel failures:

     

    ips-2019-04-28.log.gz:2019:04:28-07:16:03 FW-ytre-2 snort[24327]: id="2101" severity="warn" sys="SecureNet" sub="ips" name="Intrusion protection alert" action="drop" reason="SERVER-OTHER Cisco IOS invalid IKE fragment length memory corruption or exhaustion attempt" group="130" srcip="x.x.x.x" dstip="y.y.y.y" proto="17" srcport="4500" dstport="4500" sid="37675" class="Attempted Administrator Privilege Gain" priority="1" generator="1" msgid="0"

    x.x.x.x=Data center gateway

    y.y.y.y=Inner barrier gateway
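
    For anyone else hitting this: the matching drops can be dug out of the archived IPS logs with something like this, run from the directory where the UTM keeps the ips-YYYY-MM-DD.log.gz archives:

    # search the archived IPS logs for drops triggered by snort rule 37675
    zgrep 'sid="37675"' ips-2019-*.log.gz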

    I have made the proper adjustments in the firewall to remedy this. Thank you for your help, Bob.