Hi everyone,
This morning my colleague noticed that no internet traffic was getting through. It looked like both HA nodes were in the active state at the same time. After shutting down one of the nodes, things started working again. Looking into the logs, I can see this:
2021:07:19-23:04:04 m-2 ha_daemon[4300]: id="38A2" severity="error" sys="System" sub="ha" seq="M: 407 04.766" name="send_backup_heartbeat(): send(): No buffer space available"
2021:07:19-23:00:31 m-2 kernel: [437910.124002] ------------[ cut here ]------------
2021:07:19-23:00:31 m-2 kernel: [437910.124014] WARNING: CPU: 3 PID: 6214 at net/sched/sch_generic.c:264 dev_watchdog+0xe6/0x181()
2021:07:19-23:00:31 m-2 kernel: [437910.124016] NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
2021:07:19-23:00:31 m-2 kernel: [437910.124104] CPU: 3 PID: 6214 Comm: sasi Tainted: G O 3.12.74-0.377903089.g4999875.rb3-smp64 #1
2021:07:19-23:00:31 m-2 kernel: [437910.124106] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
2021:07:19-23:00:31 m-2 kernel: [437910.124107] 0000000000000000 ffffffff8136c181 ffffffff813074b0 ffffffff813074b0
2021:07:19-23:00:31 m-2 kernel: [437910.124109] ffff88023fd83dd0 ffffffff81046a60 ffff880235358000 0000000000000000
2021:07:19-23:00:31 m-2 kernel: [437910.124111] ffff880235358000 ffff880235358348 ffffffff813073ca ffffffff81046b11
2021:07:19-23:00:31 m-2 kernel: [437910.124113] Call Trace:
2021:07:19-23:00:31 m-2 kernel: [437910.124115] <IRQ> [<ffffffff8136c181>] ? dump_stack+0x61/0x80
2021:07:19-23:00:31 m-2 kernel: [437910.124122] [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181
2021:07:19-23:00:31 m-2 kernel: [437910.124125] [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181
2021:07:19-23:00:31 m-2 kernel: [437910.124131] [<ffffffff81046a60>] ? warn_slowpath_common+0x74/0x8b
2021:07:19-23:00:31 m-2 kernel: [437910.124133] [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e
2021:07:19-23:00:31 m-2 kernel: [437910.124135] [<ffffffff81046b11>] ? warn_slowpath_fmt+0x45/0x4a
2021:07:19-23:00:31 m-2 kernel: [437910.124137] [<ffffffff8130738f>] ? netif_tx_lock+0x43/0x7e
2021:07:19-23:00:31 m-2 kernel: [437910.124143] [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e
2021:07:19-23:00:31 m-2 kernel: [437910.124145] [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181
2021:07:19-23:00:31 m-2 kernel: [437910.124152] [<ffffffff81050bc3>] ? call_timer_fn+0x6a/0x10e
2021:07:19-23:00:31 m-2 kernel: [437910.124154] [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e
2021:07:19-23:00:31 m-2 kernel: [437910.124156] [<ffffffff81050ddd>] ? run_timer_softirq+0x176/0x1bd
2021:07:19-23:00:31 m-2 kernel: [437910.124160] [<ffffffff811cf36c>] ? timerqueue_add+0x79/0x94
2021:07:19-23:00:31 m-2 kernel: [437910.124163] [<ffffffff8104ae7a>] ? __do_softirq+0x128/0x24c
2021:07:19-23:00:31 m-2 kernel: [437910.124166] [<ffffffff813772dc>] ? call_softirq+0x1c/0x30
2021:07:19-23:00:31 m-2 kernel: [437910.124173] [<ffffffff8100f6c2>] ? do_softirq+0x3f/0x79
2021:07:19-23:00:31 m-2 kernel: [437910.124174] [<ffffffff8104ac7e>] ? irq_exit+0x46/0xa1
2021:07:19-23:00:31 m-2 kernel: [437910.124180] [<ffffffff810336f6>] ? smp_apic_timer_interrupt+0x22/0x2d
2021:07:19-23:00:31 m-2 kernel: [437910.124184] [<ffffffff8137661d>] ? apic_timer_interrupt+0x6d/0x80
2021:07:19-23:00:31 m-2 kernel: [437910.124185] <EOI>
2021:07:19-23:00:31 m-2 kernel: [437910.124187] ---[ end trace 2ab76b7259a68d8d ]---
2021:07:19-23:00:31 m-2 kernel: [437910.124197] e1000 0000:02:00.0 eth0: Reset adapter
2021:07:19-23:02:03 m-1 kernel: [437746.005143] IPv4: martian source 192.168.173.15 from 192.168.173.15, on dev lo
2021:07:19-23:02:03 m-1 kernel: [437746.005158] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 08 00 ..............
I have not seen the name="send_backup_heartbeat(): send(): No buffer space available" message in the HA logs until now. Has anyone else seen this behaviour, or does anyone have an explanation for what might have happened here? I've attached the full HA log of the firewall that was active after the incident.
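
In case it helps anyone reading the excerpt: the lines are not in chronological order (the 23:04:04 heartbeat error is listed before the 23:00:31 watchdog timeout). Below is a minimal sketch for sorting such lines by their leading timestamp. It assumes every line starts with a `YYYY:MM:DD-HH:MM:SS` field as in the excerpt. The function names are my own, and the sample lines are shortened copies of three lines from the log, not the full entries.

```python
from datetime import datetime


def parse_timestamp(line):
    """Parse the leading 'YYYY:MM:DD-HH:MM:SS' field of a log line."""
    return datetime.strptime(line.split(" ", 1)[0], "%Y:%m:%d-%H:%M:%S")


def sort_chronologically(lines):
    """Return the lines ordered by timestamp.

    sorted() is stable, so lines sharing a timestamp keep their original order.
    """
    return sorted(lines, key=parse_timestamp)


# Shortened sample lines taken from the excerpt (assumption: abbreviated for readability).
sample = [
    "2021:07:19-23:04:04 m-2 ha_daemon[4300]: send_backup_heartbeat(): send(): No buffer space available",
    "2021:07:19-23:00:31 m-2 kernel: NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out",
    "2021:07:19-23:02:03 m-1 kernel: IPv4: martian source 192.168.173.15 from 192.168.173.15, on dev lo",
]

for line in sort_chronologically(sample):
    print(line)
```

Sorted this way, the eth0 transmit queue timeout on m-2 comes first, followed by the martian source message on m-1, and only then the heartbeat send error on m-2.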