
Why did HA fail on 9.707-5?

Hi everyone,

This morning my colleague noticed that all internet traffic had stopped. It looked as if both HA nodes were in the active state. After we shut down one of the nodes, things started working again. Looking into the logs, I can see this:

2021:07:19-23:04:04 m-2 ha_daemon[4300]: id="38A2" severity="error" sys="System" sub="ha" seq="M:  407 04.766" name="send_backup_heartbeat(): send(): No buffer space available"

Kernel log shows this:
 
2021:07:19-23:00:31 m-2 kernel: [437910.124002] ------------[ cut here ]------------
2021:07:19-23:00:31 m-2 kernel: [437910.124014] WARNING: CPU: 3 PID: 6214 at net/sched/sch_generic.c:264 dev_watchdog+0xe6/0x181()
2021:07:19-23:00:31 m-2 kernel: [437910.124016] NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
2021:07:19-23:00:31 m-2 kernel: [437910.124104] CPU: 3 PID: 6214 Comm: sasi Tainted: G           O 3.12.74-0.377903089.g4999875.rb3-smp64 #1
2021:07:19-23:00:31 m-2 kernel: [437910.124106] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
2021:07:19-23:00:31 m-2 kernel: [437910.124107]  0000000000000000 ffffffff8136c181 ffffffff813074b0 ffffffff813074b0
2021:07:19-23:00:31 m-2 kernel: [437910.124109]  ffff88023fd83dd0 ffffffff81046a60 ffff880235358000 0000000000000000
2021:07:19-23:00:31 m-2 kernel: [437910.124111]  ffff880235358000 ffff880235358348 ffffffff813073ca ffffffff81046b11
2021:07:19-23:00:31 m-2 kernel: [437910.124113] Call Trace:
2021:07:19-23:00:31 m-2 kernel: [437910.124115]  <IRQ>  [<ffffffff8136c181>] ? dump_stack+0x61/0x80
2021:07:19-23:00:31 m-2 kernel: [437910.124122]  [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181
2021:07:19-23:00:31 m-2 kernel: [437910.124125]  [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181
2021:07:19-23:00:31 m-2 kernel: [437910.124131]  [<ffffffff81046a60>] ? warn_slowpath_common+0x74/0x8b
2021:07:19-23:00:31 m-2 kernel: [437910.124133]  [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e
2021:07:19-23:00:31 m-2 kernel: [437910.124135]  [<ffffffff81046b11>] ? warn_slowpath_fmt+0x45/0x4a
2021:07:19-23:00:31 m-2 kernel: [437910.124137]  [<ffffffff8130738f>] ? netif_tx_lock+0x43/0x7e
2021:07:19-23:00:31 m-2 kernel: [437910.124143]  [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e
2021:07:19-23:00:31 m-2 kernel: [437910.124145]  [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181
2021:07:19-23:00:31 m-2 kernel: [437910.124152]  [<ffffffff81050bc3>] ? call_timer_fn+0x6a/0x10e
2021:07:19-23:00:31 m-2 kernel: [437910.124154]  [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e
2021:07:19-23:00:31 m-2 kernel: [437910.124156]  [<ffffffff81050ddd>] ? run_timer_softirq+0x176/0x1bd
2021:07:19-23:00:31 m-2 kernel: [437910.124160]  [<ffffffff811cf36c>] ? timerqueue_add+0x79/0x94
2021:07:19-23:00:31 m-2 kernel: [437910.124163]  [<ffffffff8104ae7a>] ? __do_softirq+0x128/0x24c
2021:07:19-23:00:31 m-2 kernel: [437910.124166]  [<ffffffff813772dc>] ? call_softirq+0x1c/0x30
2021:07:19-23:00:31 m-2 kernel: [437910.124173]  [<ffffffff8100f6c2>] ? do_softirq+0x3f/0x79
2021:07:19-23:00:31 m-2 kernel: [437910.124174]  [<ffffffff8104ac7e>] ? irq_exit+0x46/0xa1
2021:07:19-23:00:31 m-2 kernel: [437910.124180]  [<ffffffff810336f6>] ? smp_apic_timer_interrupt+0x22/0x2d
2021:07:19-23:00:31 m-2 kernel: [437910.124184]  [<ffffffff8137661d>] ? apic_timer_interrupt+0x6d/0x80
2021:07:19-23:00:31 m-2 kernel: [437910.124185]  <EOI> 
2021:07:19-23:00:31 m-2 kernel: [437910.124187] ---[ end trace 2ab76b7259a68d8d ]---
2021:07:19-23:00:31 m-2 kernel: [437910.124197] e1000 0000:02:00.0 eth0: Reset adapter
2021:07:19-23:02:03 m-1 kernel: [437746.005143] IPv4: martian source 192.168.173.15 from 192.168.173.15, on dev lo
2021:07:19-23:02:03 m-1 kernel: [437746.005158] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 08 00        ..............

The last two lines keep repeating.

I haven't seen the name="send_backup_heartbeat(): send(): No buffer space available" message in the HA logs until now. Has anyone else seen this behaviour, or does anyone have an explanation for what might have happened here? I've attached the full HA log of the firewall that was active after the incident.

Regards
asc
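
For anyone who wants to line up the events from the HA log and the kernel log, here is a minimal Python sketch. It only assumes the timestamp/host/process layout visible in the excerpts above. The function name `parse_events` and the event labels are made up for illustration; the marker substrings are copied from the log lines in this post.

```python
import re
from datetime import datetime

# Layout seen in the excerpts: "YYYY:MM:DD-HH:MM:SS host process[pid]: message"
LINE_RE = re.compile(
    r"^(\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) (\S+) ([^:\[\s]+)(?:\[\d+\])?: (.*)$"
)

# Substrings copied from the excerpts; the labels on the right are hypothetical.
MARKERS = {
    "NETDEV WATCHDOG": "tx_timeout",
    "Reset adapter": "nic_reset",
    "No buffer space available": "heartbeat_send_failed",
    "martian source": "martian_source",
}


def parse_events(lines):
    """Return (time, host, process, label) tuples ordered by time."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, host, proc, msg = match.groups()
        for needle, label in MARKERS.items():
            if needle in msg:
                when = datetime.strptime(stamp, "%Y:%m:%d-%H:%M:%S")
                events.append((when, host, proc, label))
                break
    # Stable sort on the timestamp only, so same-second lines keep file order.
    return sorted(events, key=lambda event: event[0])


sample = [
    '2021:07:19-23:04:04 m-2 ha_daemon[4300]: id="38A2" severity="error" '
    'sys="System" sub="ha" seq="M:  407 04.766" '
    'name="send_backup_heartbeat(): send(): No buffer space available"',
    "2021:07:19-23:00:31 m-2 kernel: [437910.124016] NETDEV WATCHDOG: "
    "eth0 (e1000): transmit queue 0 timed out",
    "2021:07:19-23:00:31 m-2 kernel: [437910.124197] e1000 0000:02:00.0 "
    "eth0: Reset adapter",
]

if __name__ == "__main__":
    for when, host, proc, label in parse_events(sample):
        print(when.isoformat(), host, proc, label)
```

Run against the three lines quoted above, this puts the eth0 transmit timeout and adapter reset on m-2 (23:00:31) ahead of the heartbeat send error (23:04:04). That ordering alone does not prove cause and effect, but it may help when reading the full log.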


This thread was automatically locked due to age.
  • That error could have a number of causes, from ICMP to a NIC failure, maybe even proxy authentication issues. "No buffer space available" has a number of things tied to it. I'm no expert on HA, but yes, it looks like a communication issue. (I haven't looked at the full log you posted yet; no time at the moment.)

    OPNSense 64-bit | Intel Xeon 4-core v3 1225 3.20Ghz
    16GB Memory | 500GB SSD HDD | ATT Fiber 1GB
    (Former Sophos UTM Veteran, Former XG Rookie)

  • As mentioned by Amodin, it appears to be a network-related issue.

    Which hypervisor are you using? Is it a supported one? Are you keeping the hypervisor updated as well? SG needs jumbo frames on the HA link. Most of the time, it is the (virtual) switches that have problems here.


    Dirk

    Systema Gesellschaft für angewandte Datentechnik mbH  // Sophos Platinum Partner
    Sophos Solution Partner since 2003
    If a post solves your question, click the 'Verify Answer' link at this post.
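
As a follow-up to the jumbo frame remark, a rough Python sketch for listing the MTU of each interface on a Linux guest by reading `/sys/class/net/<iface>/mtu`. This only shows what the guest itself has configured; the virtual switch and port group settings still have to be checked in the hypervisor. The function names are made up for illustration, and the threshold in the usage line is a placeholder, not a documented Sophos value.

```python
from pathlib import Path


def read_mtus(sys_net="/sys/class/net"):
    """Return {interface_name: mtu} from the Linux sysfs tree."""
    mtus = {}
    for iface in sorted(Path(sys_net).iterdir()):
        if not iface.is_dir():
            continue  # skip plain files such as bonding_masters
        mtu_file = iface / "mtu"
        if mtu_file.is_file():
            mtus[iface.name] = int(mtu_file.read_text().strip())
    return mtus


def below_threshold(mtus, min_mtu):
    """Return the interfaces whose MTU is smaller than min_mtu."""
    return {name: mtu for name, mtu in mtus.items() if mtu < min_mtu}


if __name__ == "__main__":
    current = read_mtus()
    print(current)
    # 1501 is a placeholder meaning "anything larger than the standard 1500".
    print(below_threshold(current, 1501))
```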