Stale IPsec SA prevents IPsec failover despite DPD. Why?

Question

Hello, 
 I have several branch offices connected to headquarters over IPsec tunnels (site-to-site). 
 Both headquarters and the branches have backup lines. 
 When a main line fails, the affected tunnel is supposed to be re-established over the backup line. 
 That works in principle. However, it takes up to an hour (= SA lifetime) for the replacement tunnel to come up. 
 I believe this is because the SA of the dead connection stays active until its lifetime expires. 
 The IPsec log shows the branch trying to establish the replacement tunnel and failing: 
 ... 
 2019:06:22-19:14:31 gate pluto[19213]: "S_Filiale1"[3] 80.187.80.xxx #192: responding to Main Mode from unknown peer 80.187.80.xxx (backup line of the branch) 
 2019:06:22-19:14:31 gate pluto[19213]: "S_Filiale1"[3] 80.187.80.xxx #192: Peer ID is ID_FQDN: 'filiale1.xxxx.de' 
 2019:06:22-19:14:31 gate pluto[19213]: "S_Filiale1"[3] 80.187.80.xxx #195: responding to Quick Mode 
 2019:06:22-19:14:43 gate pluto[19213]: "S_Filiale1"[3] 80.187.80.xxx #195: cannot route -- route already in use for "S_Filiale1" 
 ... 
 2019:06:22-19:42:56 gate pluto[19213]: ERROR: asynchronous network error report on eth1 for message to 84.141.238.xxx port 500, complainant 84.141.238.xxx: No route to host [errno 113, origin ICMP type 3 code 1 (not authenticated)] 
 (Old tunnel: 84.141.238.xxx is the main line in the branch. Its cable is unplugged, so nothing can arrive there.) 
 ... 
 The routing table contains: 
 192.168.50.0/24 dev eth1 proto ipsec scope link src 172.16.0.1 
 (headquarters network: 172.16.0.0/16, branch network: 192.168.50.0/24) 
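 To check whether that IPsec route is still installed while the replacement tunnel is being rejected, the output of `ip route` can be filtered for `proto ipsec` entries that match the branch network. A minimal Python sketch: the function name `ipsec_routes_for` and the default route in the sample are made up for illustration, only the first sample line comes from the routing table quoted above.

```python
import ipaddress


def ipsec_routes_for(route_output, subnet):
    """Return (destination, device) for every 'proto ipsec' route to `subnet`.

    `route_output` is text in the style printed by `ip route`.
    """
    wanted = ipaddress.ip_network(subnet)
    hits = []
    for line in route_output.splitlines():
        fields = line.split()
        if not fields or "proto" not in fields:
            continue
        proto_idx = fields.index("proto")
        if proto_idx + 1 >= len(fields) or fields[proto_idx + 1] != "ipsec":
            continue
        try:
            dest = ipaddress.ip_network(fields[0], strict=False)
        except ValueError:
            continue  # destinations such as "default" are not prefixes
        if dest != wanted:
            continue
        dev = None
        if "dev" in fields and fields.index("dev") + 1 < len(fields):
            dev = fields[fields.index("dev") + 1]
        hits.append((str(dest), dev))
    return hits


sample = (
    "192.168.50.0/24 dev eth1 proto ipsec scope link src 172.16.0.1\n"
    "default via 10.0.0.1 dev eth0 proto static\n"  # hypothetical extra route
)
print(ipsec_routes_for(sample, "192.168.50.0/24"))
```

 If the list is non-empty while the branch is already trying to come in over its backup line, the old route is presumably what the "route already in use" message refers to.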
 
 As soon as the SA of the original connection has expired, the connection is established and data flows over the replacement tunnel. 
 2019:06:22-19:42:58 gate pluto[19213]: "S_Filiale1"[1] 84.141.238.xxx #277: initiating Quick Mode PUBKEY+ENCRYPT+TUNNEL+PFS to replace #276 {using isakmp#187} 
 2019:06:22-19:43:01 gate pluto[19213]: "S_Filiale1"[1] 84.141.238.xxx #277: ERROR: asynchronous network error report on eth1 for message to 84.141.238.xxx port 500, complainant 84.141.238.xxx: No route to host [errno 113, origin ICMP type 3 code 1 (not authenticated)] 
 2019:06:22-19:43:06 gate pluto[19213]: "S_Filiale1"[2] 84.141.238.xxx #187: DPD: No response from peer - declaring peer dead 
 2019:06:22-19:43:06 gate pluto[19213]: "S_Filiale1"[2] 84.141.238.xxx #187: DPD: Terminating all SAs using this connection 
 2019:06:22-19:43:06 gate pluto[19213]: "S_Filiale1"[2] 84.141.238.xxx #187: deleting connection "S_Filiale1"[2] instance with peer 84.141.238.xxx {isakmp=#187/ipsec=#0} 
 2019:06:22-19:43:06 gate pluto[19213]: "S_Filiale1" #187: deleting state (STATE_MAIN_R3) 
 ... 
 2019:06:22-19:43:35 gate pluto[19213]: "S_Filiale1"[25] 80.187.80.xxx:8455 #278: responding to Quick Mode 
 2019:06:22-19:43:36 gate pluto[19213]: id="2203" severity="info" sys="SecureNet" sub="vpn" event="Site-to-site VPN up" variant="ipsec" connection="Filiale1" address="212.161.xxx.xxx" local_net="172.16.0.0/16" remote_net="192.168.50.0/24" 
 2019:06:22-19:43:36 gate pluto[19213]: "S_Filiale1"[25] 80.187.80.xxx:8455 #278: IPsec SA established {ESP=>0xf74dbd44 <0x22148545 NATOA=0.0.0.0 DPD} 
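 The gap between the first rejected attempt and the working replacement tunnel can be read straight from the timestamps. A minimal Python sketch, assuming every log entry starts with a timestamp in the layout shown above; the helper names `parse_ts` and `failover_delay` are made up for illustration, and the three sample entries are copied from the excerpts in this post.

```python
from datetime import datetime

# Timestamp layout of the quoted log entries, e.g. "2019:06:22-19:14:31".
TS_FORMAT = "%Y:%m:%d-%H:%M:%S"


def parse_ts(line):
    """Return the leading timestamp of a log entry as a datetime."""
    return datetime.strptime(line.split(" ", 1)[0], TS_FORMAT)


def failover_delay(lines):
    """Time from the first 'cannot route' entry to the next 'IPsec SA established'.

    Returns None if either event is missing.
    """
    blocked_at = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if blocked_at is None and "cannot route" in line:
            blocked_at = parse_ts(line)
        elif blocked_at is not None and "IPsec SA established" in line:
            return parse_ts(line) - blocked_at
    return None


sample = [
    '2019:06:22-19:14:43 gate pluto[19213]: "S_Filiale1"[3] 80.187.80.xxx #195: '
    'cannot route -- route already in use for "S_Filiale1"',
    '2019:06:22-19:43:06 gate pluto[19213]: "S_Filiale1"[2] 84.141.238.xxx #187: '
    "DPD: No response from peer - declaring peer dead",
    '2019:06:22-19:43:36 gate pluto[19213]: "S_Filiale1"[25] 80.187.80.xxx:8455 #278: '
    "IPsec SA established {ESP=>0xf74dbd44 <0x22148545 NATOA=0.0.0.0 DPD}",
]

print(failover_delay(sample))  # prints 0:28:53
```

 For the excerpts shown here that is just under half an hour, and the DPD "declaring peer dead" entry appears only 30 seconds before the new SA is established.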
 Switching back from the backup line to the main line shows the same behavior. The SA also stays active when I take the branch completely offline (main and backup line). 
 
 Shouldn't DPD make sure that the 'dead' SA is removed before rekeying? 
 Traffic is flowing from both sides (a ping every x seconds to the other side). 
 In the branch (the initiator, because of NAT), the SA is removed after 3 minutes at the latest.

The rest of the setup: 
 
 Uplink Balancing (main: Active / backup: Active, main: 100 / backup: 0) is enabled in the UTM on both sides. 
 The IPsec tunnel is initiated by the branch router (NAT-T, DPD) through an upstream router (NAT on main and backup line) to headquarters (RespondOnly). 
 The target, via an Availability Group, is first the main line and then the backup line of headquarters (both static public IPs). 
 Multipath rules (by interface) bind the traffic between headquarters and the branches to the primary uplink interface.

BAlfson · Answer

Hello, 
 Welcome to the community! 
 (Sorry, my German-speaking brain isn't creating thoughts at the moment.) 
 The two best approaches to this issue are: (1) Auto-Failover IPsec VPN Connections and (2) Sophos UTM multiple S2S IPsec VPN mit Failover - Tutorial (DE). (1) is easier to set up, but failover takes time because a new tunnel has to be built. (2) provides virtually instantaneous failover because the backup tunnel is already established. 
 I would first change your approach to the one described in (1). If you continue to have problems, please post screenshots of the Edit dialogs for the IPsec Connection, Remote Gateway, Availability Group and Interface Group on both sides. Also, confirm that none of the manually created Network/Host definitions violates #3 in Rulz (last updated 2019-04-17). 
 Regards, Bob (Please continue in German.)