Hardware sensors offline after update to 1.8.0-2366

Hello community,

We have 4 hardware sensors running in our environment, and many more are waiting to be deployed. All of them were set up with version 1.7.1-2263. The 4 sensors already in the field were working without issues. On the 15th of October they all received the update to version 1.8.0-2366. 2 of them came back up and are online and working without issue, but the other 2 stayed offline in Sophos Central.

Even though they show as offline in Sophos Central, we can ping them and can reach the login page of the Appliance Management Console. However, when we try to log in, it fails with the message "Invalid username or password". SSH access works fine with the same password.

The upgrade_progress.log file shows that the update to the new version completed, and this file is identical on all 4 sensors. In the syslog file, however, we see the following messages on the 2 non-working sensors:

Oct 15 00:05:54 localhost ndrsensorapi[1980653]: Network settings are not configured correctly
Oct 15 00:05:54 localhost ndrsensorapi[1980653]: cni0 Interface is not available, falling back to localhost
Oct 15 00:05:54 localhost ndrsensorapi[1980653]: Starting the Sensor API on localhost...
Oct 15 00:05:54 localhost ndrsensorapi[1981175]: {"level":"info","message":"Reading Interface Mapping","timestamp":"2024-10-15T00:05:54Z"}

On the working sensors, the following messages appear instead:

Oct 15 00:07:17 localhost ndrsensorapi[1465992]: 5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
Oct 15 00:07:17 localhost ndrsensorapi[1465899]: cni0 Interface is up
Oct 15 00:07:17 localhost ndrsensorapi[1465899]: Starting the Sensor API Server on cni0...
Oct 15 00:07:17 localhost ndrsensorapi[1465996]: {"level":"info","message":"Reading Interface Mapping","timestamp":"2024-10-15T00:07:17Z"}
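
For anyone who wants to check several sensors for the same symptom, the comparison above can be sketched as a small Python script. This is only a rough sketch: it assumes the syslog lines have exactly the format shown in the two excerpts, the two marker strings are copied from those excerpts, and the function name classify_sensor_api is made up for illustration.

```python
import re

# Message fragments copied from the syslog excerpts in this post.
FALLBACK_MARKER = "cni0 Interface is not available, falling back to localhost"
HEALTHY_MARKER = "cni0 Interface is up"

# Matches lines such as:
# "Oct 15 00:05:54 localhost ndrsensorapi[1980653]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s\S+\sndrsensorapi\[\d+\]:\s(?P<msg>.*)$"
)


def classify_sensor_api(lines):
    """Return (state, timestamp) for the last ndrsensorapi start seen in lines.

    state is "fallback" (API bound to localhost), "healthy" (API bound to
    cni0), or "unknown" if neither marker was found.
    """
    state, ts = "unknown", None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an ndrsensorapi line
        msg = match.group("msg")
        if FALLBACK_MARKER in msg:
            state, ts = "fallback", match.group("ts")
        elif HEALTHY_MARKER in msg:
            state, ts = "healthy", match.group("ts")
    return state, ts
```

Feeding it the lines of a sensor's syslog file (for example via open(path) on a copy pulled over SSH) gives "fallback" for a sensor in the broken state and "healthy" for a working one. Only the most recent start is reported, so a sensor that recovered later shows as "healthy".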

The 2 broken sensors also show version 2.2.0 in the datalake. Probably a default value or something like that.

Since they were all working before the update and there was no change in the config our guess is that it has to do with the update.

Is anyone else having this issue? Is there a known way to fix it, maybe a way to restart the ndrsensorapi? Our guess is that a restart will fix them, but we don't want to make the issue worse: the sensors are in the field, and we would have to drive there if they became unreachable via SSH.

UPDATE 18.10.2024
Both broken sensors came back online around 6:30pm yesterday. Both of them received some kind of update just before they came online again. Our guess is that the update triggered a restart of the ndrsensorapi service. The other 2 sensors received the same update without restarting the ndrsensorapi service, so this may just be a coincidence. Restarting the ndrsensorapi service therefore probably fixes the issue, but we are still waiting to hear back from our Sophos Partner and Sophos Support. Once we receive updated information I will document it here in case anyone else is facing the same issue.

Oct 17 18:26:46 localhost systemd[1]: Starting Update APT News...
Oct 17 18:26:46 localhost systemd[1]: Starting Update the local ESM caches...
Oct 17 18:26:46 localhost systemd[1]: apt-news.service: Succeeded.
Oct 17 18:26:46 localhost systemd[1]: Finished Update APT News.
Oct 17 18:26:46 localhost systemd[1]: esm-cache.service: Succeeded.
Oct 17 18:26:46 localhost systemd[1]: Finished Update the local ESM caches.
Oct 17 18:26:47 localhost dbus-daemon[800]: [system] Activating via systemd: service name='org.freedesktop.PackageKit' unit='packagekit.service' requested by ':1.3059' (uid=0 pid=3174970 comm="/usr/bin/gdbus call --system --dest org.freedeskto" label="unconfined")
Oct 17 18:26:47 localhost systemd[1]: Starting PackageKit Daemon...
Oct 17 18:26:47 localhost PackageKit: daemon start
Oct 17 18:26:47 localhost dbus-daemon[800]: [system] Successfully activated service 'org.freedesktop.PackageKit'
Oct 17 18:26:47 localhost systemd[1]: Started PackageKit Daemon.
Oct 17 18:26:52 localhost systemd[1]: Stopping Start/stop NDR Sensor API...
Oct 17 18:26:52 localhost systemd[1]: ndrsensorapi.service: Killing process 1981184 (ndrsensorapi) with signal SIGKILL.
Oct 17 18:26:52 localhost systemd[1]: ndrsensorapi.service: Succeeded.
Oct 17 18:26:52 localhost systemd[1]: Stopped Start/stop NDR Sensor API.
Oct 17 18:26:52 localhost systemd[1]: Started Start/stop NDR Sensor API.
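
To check whether other sensors saw the same service restart, the systemd lines above can be scanned with a short Python sketch. It assumes the unit description "Start/stop NDR Sensor API" and the line format are exactly as in this excerpt; the function name find_restarts is made up for illustration.

```python
import re

# Unit description exactly as it appears in the systemd lines above.
UNIT_DESCRIPTION = "Start/stop NDR Sensor API"

# Matches lines such as:
# "Oct 17 18:26:52 localhost systemd[1]: <message>"
SYSTEMD_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s\S+\ssystemd\[1\]:\s(?P<msg>.*)$"
)


def find_restarts(lines):
    """Return the timestamps at which the unit was started after a stop."""
    restarts = []
    stopped = False
    for line in lines:
        match = SYSTEMD_RE.match(line.strip())
        if not match:
            continue  # not a systemd[1] line
        msg = match.group("msg")
        if msg == f"Stopped {UNIT_DESCRIPTION}.":
            stopped = True
        elif msg == f"Started {UNIT_DESCRIPTION}." and stopped:
            restarts.append(match.group("ts"))
            stopped = False
    return restarts
```

A "Started" line is only counted when a "Stopped" line came before it, so a normal start at boot is not reported as a restart.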



[edited by: Jens Frankiewicz at 9:10 AM (GMT -7) on 18 Oct 2024]
  • According to Sophos Support, this issue was caused by degraded Ubuntu update servers. The developers are working on a fix so that it doesn't happen again in the future.

    To resolve the issue, the recommendation is to restart the system.

  • Hello Jens,

    I had the same issue on almost all of our virtual NDR sensors during the past week: offline in Central and unable to log in to their web consoles due to an invalid user/password error. When restarted, each sensor threw an error in its application startup sequence about not being able to reach security.ubuntu.com and concluded that it had no Internet connection at all. Sophos only responded with this information today (I must remember to keep checking these forums, even after the issue is submitted to Support!), and after a reboot they are all working again.

    Looking at the data lake query 'NDR - Application usage-BAR CHART' it does look like they were actually offline, not just a cosmetic bug. I've asked Sophos for verification on this.

    If they were offline, I think it is unfathomable that a DDoS attack (as Support said was the case) or outage at security.ubuntu.com would cause a worldwide outage of every Sophos NDR sensor, as it's beginning to look like that's the case here. Let's hope the devs implement a failsafe for this quickly, as the impact is quite severe.
