Why did HA fail on 9.707-5?


Hi everyone,

This morning my colleague noticed that all internet traffic was down. Both HA nodes appeared to be in the active state at the same time. After one of the nodes was shut down, traffic started working again. Looking into the logs, I can see this:

2021:07:19-23:04:04 m-2 ha_daemon[4300]: id="38A2" severity="error" sys="System" sub="ha" seq="M: 407 04.766" name="send_backup_heartbeat(): send(): No buffer space available"

Kernel log shows this:

2021:07:19-23:00:31 m-2 kernel: [437910.124002] ------------[ cut here ]------------

2021:07:19-23:00:31 m-2 kernel: [437910.124014] WARNING: CPU: 3 PID: 6214 at net/sched/sch_generic.c:264 dev_watchdog+0xe6/0x181()

2021:07:19-23:00:31 m-2 kernel: [437910.124016] NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out

2021:07:19-23:00:31 m-2 kernel: [437910.124104] CPU: 3 PID: 6214 Comm: sasi Tainted: G           O 3.12.74-0.377903089.g4999875.rb3-smp64 #1

2021:07:19-23:00:31 m-2 kernel: [437910.124106] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018

2021:07:19-23:00:31 m-2 kernel: [437910.124107]  0000000000000000 ffffffff8136c181 ffffffff813074b0 ffffffff813074b0

2021:07:19-23:00:31 m-2 kernel: [437910.124109]  ffff88023fd83dd0 ffffffff81046a60 ffff880235358000 0000000000000000

2021:07:19-23:00:31 m-2 kernel: [437910.124111]  ffff880235358000 ffff880235358348 ffffffff813073ca ffffffff81046b11

2021:07:19-23:00:31 m-2 kernel: [437910.124113] Call Trace:

2021:07:19-23:00:31 m-2 kernel: [437910.124115] <IRQ> [<ffffffff8136c181>] ? dump_stack+0x61/0x80

2021:07:19-23:00:31 m-2 kernel: [437910.124122] [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181

2021:07:19-23:00:31 m-2 kernel: [437910.124125] [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181

2021:07:19-23:00:31 m-2 kernel: [437910.124131] [<ffffffff81046a60>] ? warn_slowpath_common+0x74/0x8b

2021:07:19-23:00:31 m-2 kernel: [437910.124133] [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e

2021:07:19-23:00:31 m-2 kernel: [437910.124135] [<ffffffff81046b11>] ? warn_slowpath_fmt+0x45/0x4a

2021:07:19-23:00:31 m-2 kernel: [437910.124137] [<ffffffff8130738f>] ? netif_tx_lock+0x43/0x7e

2021:07:19-23:00:31 m-2 kernel: [437910.124143] [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e

2021:07:19-23:00:31 m-2 kernel: [437910.124145] [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181

2021:07:19-23:00:31 m-2 kernel: [437910.124152] [<ffffffff81050bc3>] ? call_timer_fn+0x6a/0x10e

2021:07:19-23:00:31 m-2 kernel: [437910.124154] [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e

2021:07:19-23:00:31 m-2 kernel: [437910.124156] [<ffffffff81050ddd>] ? run_timer_softirq+0x176/0x1bd

2021:07:19-23:00:31 m-2 kernel: [437910.124160] [<ffffffff811cf36c>] ? timerqueue_add+0x79/0x94

2021:07:19-23:00:31 m-2 kernel: [437910.124163] [<ffffffff8104ae7a>] ? __do_softirq+0x128/0x24c

2021:07:19-23:00:31 m-2 kernel: [437910.124166] [<ffffffff813772dc>] ? call_softirq+0x1c/0x30

2021:07:19-23:00:31 m-2 kernel: [437910.124173] [<ffffffff8100f6c2>] ? do_softirq+0x3f/0x79

2021:07:19-23:00:31 m-2 kernel: [437910.124174] [<ffffffff8104ac7e>] ? irq_exit+0x46/0xa1

2021:07:19-23:00:31 m-2 kernel: [437910.124180] [<ffffffff810336f6>] ? smp_apic_timer_interrupt+0x22/0x2d

2021:07:19-23:00:31 m-2 kernel: [437910.124184] [<ffffffff8137661d>] ? apic_timer_interrupt+0x6d/0x80

2021:07:19-23:00:31 m-2 kernel: [437910.124185] <EOI>

2021:07:19-23:00:31 m-2 kernel: [437910.124187] ---[ end trace 2ab76b7259a68d8d ]---

2021:07:19-23:00:31 m-2 kernel: [437910.124197] e1000 0000:02:00.0 eth0: Reset adapter

2021:07:19-23:02:03 m-1 kernel: [437746.005143] IPv4: martian source 192.168.173.15 from 192.168.173.15, on dev lo

2021:07:19-23:02:03 m-1 kernel: [437746.005158] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 08 00        ..............

The last two lines (the "martian source" entry and the "ll header" entry on m-1) keep repeating.

I had never seen the name="send_backup_heartbeat(): send(): No buffer space available" message in the HA logs before this incident. Has anyone else seen this behaviour, or does anyone have an explanation of what might have happened here? I've attached the full HA log of the firewall that was active after the incident.
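In case it helps anyone line up the timing, here is a rough Python sketch that counts the messages quoted above per node and records when each first appears. The search strings are copied from the excerpts in this post; the function name `summarize`, the labels, and the assumption that every line starts with a timestamp and host name in the format shown above are mine, so treat it as a starting point rather than a proper log parser.

```python
import re
from collections import Counter

# Timestamp and host prefix as seen in the excerpts above,
# e.g. "2021:07:19-23:04:04 m-2 <rest of line>".
LINE_RE = re.compile(r"^(\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) (\S+) (.*)$")

# Substrings taken verbatim from the log lines in this post.
# The labels on the left are made up for this sketch.
PATTERNS = {
    "heartbeat_enobufs": "send_backup_heartbeat(): send(): No buffer space available",
    "tx_timeout": "NETDEV WATCHDOG",
    "adapter_reset": "Reset adapter",
    "martian_source": "martian source",
}

def summarize(lines):
    """Count matching messages per (host, label) and note the first timestamp seen."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank lines or lines in another format
        ts, host, rest = m.groups()
        for label, needle in PATTERNS.items():
            if needle in rest:
                key = (host, label)
                counts[key] += 1
                # Keeps the first occurrence in file order.
                first_seen.setdefault(key, ts)
    return counts, first_seen

# A few lines copied from this post, used only to demonstrate the function.
SAMPLE = [
    '2021:07:19-23:00:31 m-2 kernel: [437910.124016] NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out',
    '2021:07:19-23:00:31 m-2 kernel: [437910.124197] e1000 0000:02:00.0 eth0: Reset adapter',
    '2021:07:19-23:02:03 m-1 kernel: [437746.005143] IPv4: martian source 192.168.173.15 from 192.168.173.15, on dev lo',
    '2021:07:19-23:02:03 m-1 kernel: [437746.005143] IPv4: martian source 192.168.173.15 from 192.168.173.15, on dev lo',
    '2021:07:19-23:04:04 m-2 ha_daemon[4300]: id="38A2" severity="error" sys="System" sub="ha" seq="M: 407 04.766" name="send_backup_heartbeat(): send(): No buffer space available"',
]

if __name__ == "__main__":
    counts, first_seen = summarize(SAMPLE)
    for (host, label), n in sorted(counts.items()):
        print(host, label, n, "first seen", first_seen[(host, label)])
```

To run it against a real file, pass the open file object instead of `SAMPLE`, for example `summarize(open("ha-log-active-firewall.txt"))`. In the excerpts above, the transmit queue timeout and adapter reset on m-2 (23:00:31) are logged a few minutes before the heartbeat send error (23:04:04), which is the kind of ordering this is meant to make easier to spot.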

Regards

asc

ha-log-active-firewall.txt
