Heartbeat missing or at-risk status on random endpoints

Hi,

We are having heartbeat issues. Several times a day, computers report a missing or at-risk status, apparently because the endpoints temporarily lose communication with the firewall.
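
To rule out basic connectivity, one quick check is whether a machine on the endpoint subnet can reach the firewall's heartbeat listener on TCP 8347 (the same port support suggests capturing on below). A minimal sketch, assuming netcat is available on a machine next to an affected endpoint and using 172.16.16.16 as a placeholder for the firewall's LAN IP:

nc -zv 172.16.16.16 8347    # exit code 0 means the heartbeat port is reachable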

On the endpoints we have already reinstalled the Sophos Central agent; they run Windows 10 Professional and are always up to date.

We opened a ticket with Sophos and were instructed to re-image the firewall. Almost every case we open ends with the same re-image advice, and we find it hard to believe that this is really the solution.
We are on SFOS 17.5.9 MR-9.

Similar errors occur on new installations we have already deployed and that are operating normally, so we are not sure the errors are related.
Support also mentions core dumps, but judging by their dates they are old and come from earlier firmware versions.

After Sophos ran its tests, we now have no endpoints connected via Heartbeat, or at least the console does not show any.

I would appreciate some guidance on how to proceed. In colleagues' experience, is re-imaging really the only way out?

This was Sophos support's response:

- Database-related services were restarted:

XG115_XN03_SFOS 17.5.9 MR-9# service postgres:restart -ds nosync
200 OK
XG115_XN03_SFOS 17.5.9 MR-9# service sigdb:restart -ds nosync
503 Service Failed
XG115_XN03_SFOS 17.5.9 MR-9# service reportdb:restart -ds nosync
503 Service Failed
XG115_XN03_SFOS 17.5.9 MR-9# service garner:restart -ds nosync
200 OK
XG115_XN03_SFOS 17.5.9 MR-9# service heartbeatd:restart -ds nosync
400 Service not found
XG115_XN03_SFOS 17.5.9 MR-9# service heartbeat:restart -ds nosync
200 OK
XG115_XN03_SFOS 17.5.9 MR-9#
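
Since sigdb and reportdb both answer 503, it can help to retry just those two and keep the output for the ticket. A minimal sketch using only the restart syntax shown above (assuming the advanced console shell accepts standard BusyBox loops):

for s in sigdb reportdb; do
    echo "== restarting $s =="     # label each attempt in the output
    service $s:restart -ds nosync
done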

- We checked the heartbeat log:

gr_io: Broken pipe, Offset => 0
2019-11-27 15:21:39 INFO Main.cpp[23648]:140 initLogger - Heartbeat daemon build time: 16:17:07 Nov 1 2019
2019-11-27 15:21:39 INFO Main.cpp[23648]:219 main - Heartbeat daemon starting
2019-11-27 15:21:39 INFO Main.cpp[23648]:241 main - Maximum connected clients: 10000
2019-11-27 15:21:39 INFO EndpointStorage.cpp[23648]:41 EndpointStorage - Working with persistent endpoint storage
2019-11-27 15:21:39 INFO EndpointStorage.cpp[23648]:43 EndpointStorage - Calling EndpointStorageBackend::get_all_endpoints
2019-11-27 15:21:39 INFO Main.cpp[23648]:418 main - Heartbeat daemon running
2019-11-27 15:21:39 INFO EacEventReader.cpp[23648]:128 start - EacEventReader has been successfully started
2019-11-27 15:21:39 INFO Main.cpp[23648]:115 dropPrivileges - Privdrop to uid 5 with gid 1007 successful
2019-11-27 15:21:39 INFO Main.cpp[23648]:118 dropPrivileges - reduced capabilities: effective=net_admin, sys_resource, permitted=net_admin, sys_resource
2019-11-27 15:21:39 INFO Main.cpp[23648]:189 sendHeartbeatReadyOpcode - heartbeat_ready opcode sent.
2019-11-27 15:21:45 INFO ModuleEac.cpp[23648]:115 handOverEacState - Send EacSwitchRequest to all directly connected endpoints (state=1)
2019-11-27 15:28:04 INFO GarnerEventReader.cpp[23648]:129 acceptConnectionHandler - Garner plugin connected. Ready to receive garner events.
2019-11-27 15:28:04 ERROR ModuleStatus.cpp[23648]:111 update - mac address is invalid
2019-11-27 15:28:29 ERROR ModuleStatus.cpp[23648]:111 update - mac address is invalid
2019-11-27 15:28:31 ERROR ModuleStatus.cpp[23648]:111 update - mac address is invalid
2019-11-27 15:28:31 ERROR ModuleStatus.cpp[23648]:111 update - mac address is invalid
XG115_XN03_SFOS 17.5.9 MR-9#
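
To see whether these errors line up with the times endpoints go missing, the log can be summarised per day with standard BusyBox tools (a sketch; the date is the first field of each log line):

grep "mac address is invalid" /log/heartbeatd.log | awk '{print $1}' | sort | uniq -c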

- The garner log was checked:

===========================

nov 27 16:15:09: OPPOSTGRES: oppostgres_output: log event couldn't inserted
nvram_get failed with -12
ERROR Nov 27 16:15:09 [4123261760]: [SCM::get_is_password_random] '/bin/nvram get scm.RandomAdminPass' failed
ERROR Nov 27 16:15:09 [4123261760]: [SCM::who_was_killer] '/bin/nvram get scm.RandomAdminPass' terminated with exit code 244
nvram_get(): failed with -16
ERROR Nov 27 16:15:09 [4123261760]: [SCM::scm_get_expire_days] scm_get_expire_days: lic_get_details failed for 'li.epsup'

nvram_get(): failed with -16
ERROR Nov 27 16:15:10 [4123261760]: [SCM::scm_get_module_status] scm_get_module_status: lic_get_details failed for 'li.epsup'

==================================
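
The nvram/license failures can be pulled out of the garner log the same way (a sketch; /log/garner.log is an assumption, adjust the path if your unit logs elsewhere):

grep -E "nvram_get|lic_get_details" /log/garner.log | tail -n 20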

- We checked the Heartbeat status

- We restarted the log settings

- We checked disk usage and found /var at 86% used:

XG115_XN03_SFOS 17.5.9 MR-9# df -h
Filesystem Size Used Available Use% Mounted on
rootfs 301.5M 2.6M 279.0M 1% /
df: /newroot: No such file or directory
df: /newroot/dev: No such file or directory
df: /newrootrw: No such file or directory
none 301.5M 2.6M 279.0M 1% /
none 1.9G 36.0K 1.9G 0% /dev
none 1.9G 36.2M 1.8G 2% /tmp
none 1.9G 14.7M 1.8G 1% /dev/shm
/dev/conf 385.4M 74.4M 311.0M 19% /conf
/dev/content 5.6G 384.7M 5.2G 7% /content
/dev/var 46.6G 40.3G 6.3G 86% /var
XG115_XN03_SFOS 17.5.9 MR-9#
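
With /var at 86%, it is worth finding which directories are consuming the space before considering a re-image (a sketch using BusyBox du and sort; sizes are in MB):

du -sm /var/* 2>/dev/null | sort -n | tail -n 10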


- We can identify some core dumps:

xrwx 2 root 0 4.0K Apr 13 2019 .
drwxr-xr-x 37 root 0 4.0K Nov 27 15:20 ..
-rw------- 1 root 0 482.1M Apr 13 2019 core.avd
-rw------- 1 root 0 35.2M Oct 20 2018 core.awed
XG115_XN03_SFOS 17.5.9 MR-9#
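
Since the dump dates predate the current firmware, they can be located and aged with a sketch like the one below (the directory of the listing above is not shown, so this searches all of /var):

find /var -name 'core.*' -mtime +90 -exec ls -lh {} \;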


Looking at the logs, the nvram failures, and the core dumps, we recommend a re-image.

Thanks

  • Hi  

    Sorry for the inconvenience caused! Thank you for the detailed post.

    It would be great if you could PM us the service request number so we can check the history and provide further assistance.

  • Hi  

    To investigate this in more detail, you can start the "heartbeat" service in debug mode to capture more detailed logs.

    Command to start the service in debug mode:

    #service -t json -b '{"debug":"2"}' -ds nosync heartbeat:debug

    To stop debugging, use the command below:

    # service -t json -b '{"debug":"0"}' -ds nosync heartbeat:debug

    For any machine where you see a missing event or status change, you can check the logs in the heartbeatd.log file:

    #grep " Connectivity changed for" /log/heartbeatd.log

    Also, on the firewall you can keep a packet capture running on the heartbeat port (8347); as soon as you get a missing-heartbeat notification, check the heartbeat log together with the captured packets:

    #tcpdump 'port 8347'

    Also, what notification is appearing in Sophos Central? Can you share a screenshot?
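
    While the capture is running you can also follow the heartbeat log live, so new status changes show up as they happen (a sketch using the same log file and pattern as above with standard BusyBox tools):

    tail -f /log/heartbeatd.log | grep " Connectivity changed for"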

  • Hi  

    It appears that your report database and signature database services are failing to restart.

    I recommend rebooting to try to recover the services to a normal state. However, please take a backup first in case the device comes up in failsafe mode; you will be able to restore easily from it.

    If the device comes up fine, check whether any services are in a "DEAD" or "STOPPED" state by running the command service -S from the advanced console over an SSH session.
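
    A quick filter for unhealthy services, as a sketch (DEAD and STOPPED are the states mentioned above):

    service -S | grep -E "DEAD|STOPPED"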

    Regarding your original problem of endpoints going into a "missing" heartbeat state, please check whether the devices are going into hibernation or sleep mode at those specific times.

    You should also be receiving only one notification per day per endpoint.

    Will wait for your response.

    Thanks!