Hallo zusammen,
vorab zur Infrastruktur: Wir haben 15 Standorte, die jeweils einen Sophos SG Cluster haben, miteinander über MPLS und VPN vernetzt. Bisher gab es vereinzelt Meldungen über Verbindungsabbrüche, soweit nichts ungewöhnliches, das kommt immer mal vor. Allerdings haben sind die Beschwerden in den letzten Tagen und besonders heute extrem gestiegen und es ist von Abbrüchen im Minutentakt die Rede. Auf der Suche nach der Fehlerquelle bin ich auf unseren Cluster am Haupstandort (mit Rechenzentrum) gestoßen, da mir aufgefallen ist, dass die Verbindungen parallel abbrechen.
Das Fehlerbild sieht aus, als wäre das Routing asynchron bzw. als würde der Sophos Cluster für einen Moment alle Routen vergessen und nur das Default Gateway nehmen, was wiederrum in einem Drop endet.
Im Middleware log sieht es so aus, als würde der Cluster ca. alle 1 - 2 Minuten einen cycle starten und alle Routen neu eintragen.
In diesem Moment werden dementsprechend alle Verbindung an das Default Gateway gesendet. Zeitlich passt es auch genau zu den Drops die ich sehen kann.
Etwa zur gleichen Zeit erscheint im Fallback log diese Meldung: "2019:04:11-15:37:53 fw-xxx-01-2 [daemon:err] nwd[7078]: Netlink: OVERRUN on rtnl socket".
Da ich keinen Lösungsansatz weiß, hoffe ich, dass mir hier jemand helfen kann!
EDIT:
Hier mal ein Logauszug aus dem middleware log - Ersteres der "Problemcluster", zweiteres ein normal funktionierender Cluster
Ich habe aus dem ersten log die vielen "route add" Meldungen entfernt, alle in dieser Art:
2019:04:12-13:44:14 fw-xxx-01-2 middleware[4502]: T utils::IPRoute::ip_route_cmd:869() => /sbin/ip -f inet route add proto static metric 1 xxx.xxx.xxx.xxx/xx via xxx.xxx.xxx.xxx
2019:04:12-13:44:13 fw-xxx-01-2 middleware[4502]: T main::top-level:213() => starting cycle 3131, caught 1 signals
2019:04:12-13:44:13 fw-xxx-01-2 middleware[4502]: T core::Config::Changed:198() => configversion=3156
2019:04:12-13:44:13 fw-xxx-01-2 middleware[4502]: T core::Config::Changed:208() => nodes=0 objects=1 triggers=0
2019:04:12-13:44:13 fw-xxx-01-2 middleware[4502]: T core::Config::load:351() => modules=6,52
2019:04:12-13:44:13 fw-xxx-01-1 middleware[4561]: T main::top-level:213() => starting cycle 2820, caught 1 signals
2019:04:12-13:44:13 fw-xxx-01-1 middleware[4561]: T core::Config::Changed:198() => configversion=2848
2019:04:12-13:44:13 fw-xxx-01-1 middleware[4561]: T core::Config::Changed:208() => nodes=0 objects=1 triggers=0
2019:04:12-13:44:13 fw-xxx-01-1 middleware[4561]: T core::Config::load:351() => modules=6,52
2019:04:12-13:44:15 fw-xxx-01-2 middleware[4502]: T modules::up2date::load:106() => iaas_deployment=
2019:04:12-13:44:15 fw-xxx-01-2 middleware[4502]: T core::Config::load:394() => core::Config: wrote crontab
2019:04:12-13:44:16 fw-xxx-01-2 middleware[4502]: T modules::ipset::deleteUnused:342() => auto#=69/682 confd#=1/341
2019:04:12-13:44:16 fw-xxx-01-2 middleware[4502]: T main::top-level:264() => ending cycle 3131, caught 0 signals, 0 children still running
2019:04:12-13:44:18 fw-xxx-01-2 middleware[4502]: T main::top-level:213() => starting cycle 3132, caught 1 signals
2019:04:12-13:44:18 fw-xxx-01-1 middleware[4561]: T modules::up2date::load:106() => iaas_deployment=
2019:04:12-13:44:18 fw-xxx-01-2 middleware[4502]: T core::Config::Changed:198() => configversion=3157
2019:04:12-13:44:18 fw-xxx-01-2 middleware[4502]: T core::Config::Changed:208() => nodes=0 objects=1 triggers=0
2019:04:12-13:44:18 fw-xxx-01-2 middleware[4502]: T core::Config::load:351() => modules=6,52
2019:04:12-13:44:18 fw-xxx-01-1 middleware[4561]: T core::Config::load:394() => core::Config: wrote crontab
2019:04:12-13:44:19 fw-xxx-01-1 middleware[4561]: T modules::ipset::deleteUnused:342() => auto#=69/682 confd#=1/341
2019:04:12-13:44:19 fw-xxx-01-1 middleware[4561]: T main::top-level:264() => ending cycle 2820, caught 1 signals, 0 children still running
2019:04:12-13:44:19 fw-xxx-01-1 middleware[4561]: T main::top-level:213() => starting cycle 2821, caught 1 signals
2019:04:12-13:44:19 fw-xxx-01-1 middleware[4561]: T core::Config::Changed:198() => configversion=2849
2019:04:12-13:44:19 fw-xxx-01-1 middleware[4561]: T core::Config::Changed:208() => nodes=0 objects=1 triggers=0
2019:04:12-13:44:19 fw-xxx-01-1 middleware[4561]: T core::Config::load:351() => modules=6,52
2019:04:12-13:44:20 fw-xxx-01-2 middleware[4502]: T modules::up2date::load:106() => iaas_deployment=
2019:04:12-13:44:20 fw-xxx-01-2 middleware[4502]: T core::Config::load:394() => core::Config: wrote crontab
2019:04:12-13:44:21 fw-xxx-01-2 middleware[4502]: T modules::ipset::deleteUnused:342() => auto#=69/682 confd#=1/341
2019:04:12-13:44:21 fw-xxx-01-2 middleware[4502]: T main::top-level:264() => ending cycle 3132, caught 0 signals, 0 children still running
2019:04:12-13:44:24 fw-xxx-01-1 middleware[4561]: T modules::up2date::load:106() => iaas_deployment=
2019:04:12-13:44:24 fw-xxx-01-1 middleware[4561]: T core::Config::load:394() => core::Config: wrote crontab
2019:04:12-13:44:25 fw-xxx-01-1 middleware[4561]: T modules::ipset::deleteUnused:342() => auto#=69/682 confd#=1/341
2019:04:12-13:44:25 fw-xxx-01-1 middleware[4561]: T main::top-level:264() => ending cycle 2821, caught 0 signals, 0 children still running
2019:04:12-13:35:29 fw-zzz-01-2 middleware[4248]: T main::top-level:213() => starting cycle 1279, caught 1 signals
2019:04:12-13:35:29 fw-zzz-01-2 middleware[4248]: T core::Config::Changed:198() => configversion=1304
2019:04:12-13:35:29 fw-zzz-01-2 middleware[4248]: T core::Config::Changed:208() => nodes=1 objects=0 triggers=0
2019:04:12-13:35:29 fw-zzz-01-2 middleware[4248]: T core::Config::load:351() => modules=1,1
2019:04:12-13:35:29 fw-zzz-01-1 middleware[4322]: T main::top-level:213() => starting cycle 1274, caught 1 signals
2019:04:12-13:35:29 fw-zzz-01-2 middleware[4248]: T modules::ipset::deleteUnused:342() => auto#=21/682 confd#=1/341
2019:04:12-13:35:29 fw-zzz-01-2 middleware[4248]: T main::top-level:275() => cycle 1279 waiting for 2 children
2019:04:12-13:35:29 fw-zzz-01-1 middleware[4322]: T core::Config::Changed:198() => configversion=1160
2019:04:12-13:35:29 fw-zzz-01-1 middleware[4322]: T core::Config::Changed:208() => nodes=1 objects=0 triggers=0
2019:04:12-13:35:29 fw-zzz-01-1 middleware[4322]: T core::Config::load:351() => modules=1,1
2019:04:12-13:35:30 fw-zzz-01-1 middleware[4322]: T modules::ipset::deleteUnused:342() => auto#=21/682 confd#=1/341
2019:04:12-13:35:30 fw-zzz-01-1 middleware[4322]: T main::top-level:275() => cycle 1274 waiting for 2 children
2019:04:12-13:35:31 fw-zzz-01-2 middleware[4248]: T main::top-level:275() => cycle 1279 waiting for 1 children
2019:04:12-13:35:31 fw-zzz-01-1 middleware[4322]: T main::top-level:275() => cycle 1274 waiting for 1 children
2019:04:12-13:35:33 fw-zzz-01-2 middleware[4248]: T main::top-level:264() => ending cycle 1279, caught 0 signals, 0 children still running
2019:04:12-13:35:33 fw-zzz-01-1 middleware[4322]: T main::top-level:264() => ending cycle 1274, caught 0 signals, 0 children still running
2019:04:12-13:35:49 fw-zzz-01-2 middleware[4248]: T main::top-level:213() => starting cycle 1280, caught 1 signals
2019:04:12-13:35:49 fw-zzz-01-2 middleware[4248]: T core::Config::Changed:198() => configversion=1305
2019:04:12-13:35:49 fw-zzz-01-2 middleware[4248]: T core::Config::Changed:208() => nodes=1 objects=0 triggers=0
2019:04:12-13:35:49 fw-zzz-01-2 middleware[4248]: T core::Config::load:351() => modules=1,1
2019:04:12-13:35:49 fw-zzz-01-1 middleware[4322]: T main::top-level:213() => starting cycle 1275, caught 1 signals
2019:04:12-13:35:49 fw-zzz-01-2 middleware[4248]: T modules::ipset::deleteUnused:342() => auto#=21/682 confd#=1/341
2019:04:12-13:35:49 fw-zzz-01-2 middleware[4248]: T main::top-level:275() => cycle 1280 waiting for 2 children
2019:04:12-13:35:49 fw-zzz-01-1 middleware[4322]: T core::Config::Changed:198() => configversion=1161
2019:04:12-13:35:49 fw-zzz-01-1 middleware[4322]: T core::Config::Changed:208() => nodes=1 objects=0 triggers=0
2019:04:12-13:35:49 fw-zzz-01-1 middleware[4322]: T core::Config::load:351() => modules=1,1
2019:04:12-13:35:49 fw-zzz-01-1 middleware[4322]: T modules::ipset::deleteUnused:342() => auto#=21/682 confd#=1/341
2019:04:12-13:35:49 fw-zzz-01-1 middleware[4322]: T main::top-level:275() => cycle 1275 waiting for 2 children
2019:04:12-13:35:51 fw-zzz-01-2 middleware[4248]: T main::top-level:275() => cycle 1280 waiting for 1 children
2019:04:12-13:35:51 fw-zzz-01-1 middleware[4322]: T main::top-level:275() => cycle 1275 waiting for 1 children
2019:04:12-13:35:53 fw-zzz-01-1 middleware[4322]: T main::top-level:264() => ending cycle 1275, caught 0 signals, 0 children still running
2019:04:12-13:35:53 fw-zzz-01-2 middleware[4248]: T main::top-level:264() => ending cycle 1280, caught 0 signals, 0 children still running
Vielleicht hat jemand mehr Erfahrung mit dem Log, ich werde daraus nicht so richtig schlau
Gruß
Alex
This thread was automatically locked due to age.