Heartbeat shows missing or at-risk status on random endpoints

Question

Hi,
 We are having Heartbeat issues. Several times a day, computers report a missing or at-risk status. This seems to happen because the endpoints are unable to communicate with the firewall, or something similar.
 On the endpoints, we have already reinstalled Sophos Central, and they run Windows 10 Professional, always up to date.
 We opened a ticket with Sophos, and support instructed us to re-image. However, almost every case we open ends with the advice to re-image. We find it hard to believe that this is really the solution. We are on version 17.5.9.
 The same or similar errors also occur on new installations that we have set up and that are operating normally, so we are not sure whether the errors are related. Support also mentioned core dumps, but judging by their dates they are old and come from other firmware versions.
 After the Sophos testing, we now have no endpoints connected to Heartbeat, or at least the console does not show any.
 I would like some pointers on how to proceed in this case. In colleagues' experience, is a re-image the only way out?
 
 This was the response from Sophos support:
 
 - Database-related services have been restarted: 
 XG115_XN03_SFOS 17.5.9 MR-9# service postgres:restart -ds nosync
 200 OK
 XG115_XN03_SFOS 17.5.9 MR-9# service sigdb:restart -ds nosync
 503 Service Failed
 XG115_XN03_SFOS 17.5.9 MR-9# service reportdb:restart -ds nosync
 503 Service Failed
 XG115_XN03_SFOS 17.5.9 MR-9# service garner:restart -ds nosync
 200 OK
 XG115_XN03_SFOS 17.5.9 MR-9# service heartbeatd:restart -ds nosync
 400 Service not found
 XG115_XN03_SFOS 17.5.9 MR-9# service heartbeat:restart -ds nosync
 200 OK
 XG115_XN03_SFOS 17.5.9 MR-9#
 
 - We checked the heartbeat log 
 gr_io: Broken pipe, Offset => 0
 2019-11-27 15:21:39 INFO Main.cpp[23648]:140 initLogger - Heartbeat daemon build time: 16:17:07 Nov 1 2019
 2019-11-27 15:21:39 INFO Main.cpp[23648]:219 main - Heartbeat daemon starting
 2019-11-27 15:21:39 INFO Main.cpp[23648]:241 main - Maximum connected clients: 10000
 2019-11-27 15:21:39 INFO EndpointStorage.cpp[23648]:41 EndpointStorage - Working with persistent endpoint storage
 2019-11-27 15:21:39 INFO EndpointStorage.cpp[23648]:43 EndpointStorage - Calling EndpointStorageBackend::get_all_endpoints
 2019-11-27 15:21:39 INFO Main.cpp[23648]:418 main - Heartbeat daemon running
 2019-11-27 15:21:39 INFO EacEventReader.cpp[23648]:128 start - EacEventReader has been successfully started
 2019-11-27 15:21:39 INFO Main.cpp[23648]:115 dropPrivileges - Privdrop to uid 5 with gid 1007 successful
 2019-11-27 15:21:39 INFO Main.cpp[23648]:118 dropPrivileges - reduced capabilities: effective=net_admin, sys_resource, permitted=net_admin, sys_resource
 2019-11-27 15:21:39 INFO Main.cpp[23648]:189 sendHeartbeatReadyOpcode - heartbeat_ready opcode sent.
 2019-11-27 15:21:45 INFO ModuleEac.cpp[23648]:115 handOverEacState - Send EacSwitchRequest to all directly connected endpoints (state=1)
 2019-11-27 15:28:04 INFO GarnerEventReader.cpp[23648]:129 acceptConnectionHandler - Garner plugin connected. Ready to receive garner events.
 2019-11-27 15:28:04 ERROR ModuleStatus.cpp[23648]:111 update - mac address is invalid
 2019-11-27 15:28:29 ERROR ModuleStatus.cpp[23648]:111 update - mac address is invalid
 2019-11-27 15:28:31 ERROR ModuleStatus.cpp[23648]:111 update - mac address is invalid
 2019-11-27 15:28:31 ERROR ModuleStatus.cpp[23648]:111 update - mac address is invalid
 XG115_XN03_SFOS 17.5.9 MR-9#
 
 - We checked the Garner log:
 =========================== 
 nov 27 16:15:09: OPPOSTGRES: oppostgres_output: log event couldn't inserted
 nvram_get failed with -12
 ERROR Nov 27 16:15:09 [4123261760]: [SCM::get_is_password_random] '/bin/nvram get scm.RandomAdminPass' failed
 ERROR Nov 27 16:15:09 [4123261760]: [SCM::who_was_killer] '/bin/nvram get scm.RandomAdminPass' terminated with exit code 244
 nvram_get(): failed with -16
 ERROR Nov 27 16:15:09 [4123261760]: [SCM::scm_get_expire_days] scm_get_expire_days: lic_get_details failed for 'li.epsup'
 nvram_get(): failed with -16
 ERROR Nov 27 16:15:10 [4123261760]: [SCM::scm_get_module_status] scm_get_module_status: lic_get_details failed for 'li.epsup'
 ================================== 
 - We see the Heartbeat 
 - We restarted the log settings
 - We checked the disk usage and saw that /var is 86% used:
 XG115_XN03_SFOS 17.5.9 MR-9# df -h
 Filesystem                Size      Used Available Use% Mounted on
 rootfs                  301.5M      2.6M    279.0M   1% /
 df: /newroot: No such file or directory
 df: /newroot/dev: No such file or directory
 df: /newrootrw: No such file or directory
 none                    301.5M      2.6M    279.0M   1% /
 none                      1.9G     36.0K      1.9G   0% /dev
 none                      1.9G     36.2M      1.8G   2% /tmp
 none                      1.9G     14.7M      1.8G   1% /dev/shm
 /dev/conf               385.4M     74.4M    311.0M  19% /conf
 /dev/content              5.6G    384.7M      5.2G   7% /content
 /dev/var                 46.6G     40.3G      6.3G  86% /var
 XG115_XN03_SFOS 17.5.9 MR-9#
 - We identified some core dumps:
 xrwx 2 root 0 4.0K Apr 13 2019 .
 drwxr-xr-x 37 root 0 4.0K Nov 27 15:20 ..
 -rw------- 1 root 0 482.1M Apr 13 2019 core.avd
 -rw------- 1 root 0 35.2M Oct 20 2018 core.awed
 XG115_XN03_SFOS 17.5.9 MR-9#
 Based on the logs, the nvram failures, and the core dumps, we recommend a re-image.
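 
 For anyone who wants to go through a longer heartbeat log than the excerpt above, the entries can be tallied by level and message to see which errors dominate. This is only a rough Python sketch: it assumes every entry follows the "date time LEVEL source[pid]:line function - message" layout seen in the excerpt, and the name tally_heartbeat_log is made up for illustration. Lines that do not fit that layout (such as the "gr_io: Broken pipe" line) are simply skipped.

```python
import re
from collections import Counter

# Layout assumed from the excerpt above, e.g.:
# 2019-11-27 15:28:04 ERROR ModuleStatus.cpp[23648]:111 update - mac address is invalid
ENTRY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<source>\S+)\[(?P<pid>\d+)\]:(?P<line>\d+) "
    r"(?P<func>\S+) - (?P<msg>.*)$"
)


def tally_heartbeat_log(lines):
    """Count (level, source file, message) triples; non-matching lines are skipped."""
    counts = Counter()
    for raw in lines:
        m = ENTRY.match(raw.strip())
        if m:
            counts[(m["level"], m["source"], m["msg"])] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "gr_io: Broken pipe, Offset => 0",
    "2019-11-27 15:21:39 INFO Main.cpp[23648]:219 main - Heartbeat daemon starting",
    "2019-11-27 15:28:04 ERROR ModuleStatus.cpp[23648]:111 update - mac address is invalid",
    "2019-11-27 15:28:29 ERROR ModuleStatus.cpp[23648]:111 update - mac address is invalid",
]

for (level, source, msg), n in tally_heartbeat_log(sample).most_common():
    print(n, level, source, msg)
```

 To run it against a real log, the sample list would be replaced by the lines of a copy of the log file downloaded from the firewall.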
 
 Thanks

Vishal_R · Answer

Hi Christovam 
 To investigate this in detail, you can start the "heartbeat" service in debug and trace mode to capture detailed logs. The command to start the service in debug mode is:
 # service -t json -b '{"debug":"2"}' -ds nosync heartbeat:debug
 To stop debug mode, use the command below:
 # service -t json -b '{"debug":"0"}' -ds nosync heartbeat:debug
 For any machine where you find a missing event or a status change, you can check the heartbeatd.log file:
 # grep " Connectivity changed for" /log/heartbeatd.log
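 If the grep returns many lines, it may help to see whether the changes cluster at particular times of day. Below is a minimal Python sketch for a copy of the log taken off the firewall. It assumes each line begins with a "YYYY-MM-DD HH:MM:SS" timestamp, as in the log excerpt in the question; the function name and the sample lines are hypothetical placeholders, since the real text after "Connectivity changed for" is not shown in this thread.

```python
from collections import Counter

MARKER = " Connectivity changed for"  # same search string as the grep command


def connectivity_changes_per_hour(lines):
    """Count lines containing MARKER, grouped by their 'YYYY-MM-DD HH' prefix."""
    per_hour = Counter()
    for line in lines:
        if MARKER in line:
            # Assumption: the line starts with 'YYYY-MM-DD HH:MM:SS',
            # so the first 13 characters identify the date and hour.
            per_hour[line[:13]] += 1
    return per_hour


# Hypothetical sample lines; only the timestamp prefix and the marker matter here.
sample = [
    "2019-11-27 15:28:04 INFO <source> - Connectivity changed for <endpoint>",
    "2019-11-27 15:41:10 INFO <source> - Connectivity changed for <endpoint>",
    "2019-11-27 15:45:00 INFO <source> - some other message",
    "2019-11-27 16:02:33 INFO <source> - Connectivity changed for <endpoint>",
]

for hour, n in sorted(connectivity_changes_per_hour(sample).items()):
    print(hour, n)
```

 The hours with the most changes are the ones worth comparing against the missing-heartbeat notifications.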
 Also, on the firewall you can keep a packet capture running on the port below. As soon as you get a notification for a missing heartbeat, check the heartbeat log and the captured packets:
 # tcpdump 'port 8347'
 Also, what notification are you getting in Sophos Central? Can you share a screenshot?