This discussion has been locked.

Linux antivirus update causing system load to increase indefinitely

Every few weeks, we lose the ability to SSH to one of our CentOS 8 Linux servers. Monitoring shows that the system load increases to unreasonable levels and systemd-logind stops responding.

Here are the last messages from /var/log/messages prior to having to reboot.

Dec 9 12:42:15 rml-dev06 systemd[1]: Starting "Sophos Anti-Virus update"...
Dec 9 12:42:23 rml-dev06 savd[1203]: update.updated: Updating from versions - SAV: 10.5.2, Engine: 3.79.0, Data: 5.80
Dec 9 12:42:23 rml-dev06 savd[1203]: update.updated: Updating Sophos Anti-Virus....#012Updating SAVScan on-demand scanner#012Updating Virus Engine and Data#012Updating Manifest#012Update completed.
Dec 9 12:42:23 rml-dev06 savd[1203]: update.updated: Updated to versions - SAV: 10.5.2, Engine: 3.79.0, Data: 5.80
Dec 9 12:42:23 rml-dev06 savd[1203]: update.updated: Successfully updated Sophos Anti-Virus from sdds:SOPHOS
Dec 9 12:42:23 rml-dev06 systemd[1]: Started "Sophos Anti-Virus update".

I don't have the screenshots of the system load and CPU usage, but around 12:42:30 there was a brief spike to 100% CPU usage for the sav-protect service (presumably corresponding to the update and restart logged above). Immediately after this, the system load started climbing and didn't come back down until we rebooted. Over a few hours it reached loads greater than 100.

Some processes keep running, e.g. webservers, but given that we can't login we have to reboot.

Any ideas what's causing this or how we can investigate further?



This thread was automatically locked due to age.
  • Hello Carl Fischer,

    10.5.2 is the Central managed version, therefore I've moved this thread.
    This one looks like just an IDE update, anything abnormal in the savlog?

    Christian

  • Hi Christian,

    Thanks for moving the thread (although I still see it under Sophos Anti-Virus for Linux Basic).

    savlog just says the same thing in a slightly different format. Nothing remarkable.

    Wed 09 Dec 2020 12:42:23 GMT: update.updated Updating from versions - SAV: 10.5.2, Engine: 3.79.0, Data: 5.80
    Wed 09 Dec 2020 12:42:23 GMT: update.updated Updating Sophos Anti-Virus....
    Updating SAVScan on-demand scanner
    Updating Virus Engine and Data
    Updating Manifest
    Update completed.
    Wed 09 Dec 2020 12:42:23 GMT: update.updated Updated to versions - SAV: 10.5.2, Engine: 3.79.0, Data: 5.80
    Wed 09 Dec 2020 12:42:23 GMT: update.updated Successfully updated Sophos Anti-Virus from sdds:SOPHOS

    There have been identical updates over the past few weeks, including one exactly 6 hours earlier, that haven't caused any problems. Usually there are a few weeks or months between this issue occurring but at least once it reoccurred within a few days.

    Looking in sav-protect.log, I see a number of lines such as MountMonitor: EXCLUDING UNKNOWN FILESYSTEM FROM SCANNING: nsfs at /run/docker/netns/a48db6bfcc5c. We do use Docker so this isn't surprising.

    The mtdd logs include <warning> (RawSocketPacketFilter) Warning, losing packets! quite frequently. There are a number of other log files, some quite verbose, but nothing stands out as being an issue. 
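
To compare update times against the moment the load starts climbing, the savlog entries quoted above can be reduced to a list of timestamps. This is only a minimal sketch: it assumes the savlog text has already been exported to a string in exactly the format shown in this thread, with timestamps in GMT. The function name `update_times` and the `SAMPLE` constant are made up for illustration, and weekday and month names are parsed using the current locale.

```python
from datetime import datetime

# Lines copied from the savlog excerpt in this thread.
SAMPLE = """\
Wed 09 Dec 2020 12:42:23 GMT: update.updated Updating from versions - SAV: 10.5.2, Engine: 3.79.0, Data: 5.80
Wed 09 Dec 2020 12:42:23 GMT: update.updated Updating Sophos Anti-Virus....
Updating SAVScan on-demand scanner
Updating Virus Engine and Data
Updating Manifest
Update completed.
Wed 09 Dec 2020 12:42:23 GMT: update.updated Updated to versions - SAV: 10.5.2, Engine: 3.79.0, Data: 5.80
Wed 09 Dec 2020 12:42:23 GMT: update.updated Successfully updated Sophos Anti-Virus from sdds:SOPHOS
"""


def update_times(text):
    """Return one datetime per 'Successfully updated' entry in savlog text."""
    times = []
    for line in text.splitlines():
        # Entries look like '<timestamp> GMT: <message>'; continuation
        # lines such as 'Updating Manifest' have no timestamp and are skipped.
        stamp, sep, message = line.partition(" GMT: ")
        if not sep or "Successfully updated" not in message:
            continue
        try:
            times.append(datetime.strptime(stamp.strip(), "%a %d %b %Y %H:%M:%S"))
        except ValueError:
            continue  # not a timestamp in the assumed format
    return times


for moment in update_times(SAMPLE):
    print(moment.isoformat())
```

Running the same function over the full log would show which updates were followed by a load increase and which (such as the one 6 hours earlier) were not.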

  • FormerMember

    Have you gathered an SDU from an affected machine? 

    This would let us see the logs and machine state.

    However, an update shouldn't cause any slowdown unless it takes a long time to download, and those timestamps are all within the same minute, so that shouldn't be an issue.

    Are you able to get a top from an affected machine before reboot?

  • Hi RichardP,

    The update isn't causing a slow down as such. As far as I can tell the update completes just fine, but it does something to the system that prevents other processes from functioning correctly. Maybe it's keeping an essential file locked so that other processes end up stuck and contribute to the high load. Or maybe it's creating a deadlock for itself and creating a new process every few minutes. I'm just guessing from seeing the symptoms.

    Unfortunately, when this occurs I'm no longer able to log in to the machine remotely and don't have anyone with physical access, so I'm not able to provide much more info. On the one occasion when someone was able to check the screen in person, the repeated message was systemd[...]: systemd-journald.service: Failed to execute command: Operation not permitted.

    I've just run savdstatus --diagnose on the affected machine; however, there's no issue at the moment, so I don't know how useful this is. I can still send this across if it would be helpful.

    I don't have screenshots of today's issue but found some from a previous occurrence. These are taken from Netdata and show the spike in CPU usage during the antivirus update (no problem there) followed by the load steadily increasing (this carries on for several hours until we hard reset the computer).
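
Since nobody can log in once the problem starts, one option is to leave a small logger running beforehand so that a top-like record survives the reboot. The sketch below is an assumption-laden illustration, not a Sophos tool. It relies on the fact that the Linux load average counts tasks in uninterruptible sleep (state `D`) as well as runnable ones, so a load that climbs without CPU usage usually means tasks are piling up in `D` state. Whether that is what happens here is a guess. The log path, the interval, and all function names are hypothetical.

```python
import glob
import os
import time


def parse_stat(text):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return (tid, comm) for every thread currently in state D."""
    blocked = []
    pattern = os.path.join(proc_root, "[0-9]*", "task", "[0-9]*", "stat")
    for path in glob.glob(pattern):
        try:
            with open(path) as handle:
                tid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the task exited while we were reading it
        if state == "D":
            blocked.append((tid, comm))
    return blocked


def snapshot(proc_root="/proc"):
    """Build one log line: timestamp, 1-minute load, and blocked tasks."""
    with open(os.path.join(proc_root, "loadavg")) as handle:
        load1 = handle.read().split()[0]
    tasks = ", ".join(f"{comm}[{tid}]" for tid, comm in sorted(blocked_tasks(proc_root)))
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} load1={load1} blocked: {tasks or 'none'}"


def main(log_path="/var/tmp/load-watch.log", interval=30):
    """Append a snapshot every `interval` seconds (hypothetical path)."""
    with open(log_path, "a") as log:
        while True:
            log.write(snapshot() + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before a hard reset
            time.sleep(interval)
```

Calling `main()` from a long-running session or service would start the loop. If the underlying problem also blocks disk writes, the logger itself could hang, so the last few lines before the reboot are the interesting ones.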

  • FormerMember

    Okay, I am going to have someone look at this for you.

  • Is Talpa being used as the driver for on-access scanning?

    If so, I believe you are probably hitting the conflict between Docker and Talpa. There are a few known issues caused by using Talpa in Docker container environments, some of which seriously impact performance. As a result, support for Docker with Talpa was dropped in March. The recommendation is to switch to fanotify as the on-access driver.

    https://support.sophos.com/support/s/article/KB-000039332?language=en_US

    https://support.sophos.com/support/s/article/KB-000034610

    I hope this helps,

    RickS

    Senior Global Escalations Engineer

  • Hi Rick.

    Thanks for the suggestions.

    We hit the Docker/Talpa issue already and to fix it we switched to fanotify and also upgraded from CentOS 7 to 8 in order to get some recent kernel fixes to fanotify. If I remember correctly, with Talpa any Docker operations were very slow unless we stopped the antivirus. Then after switching to fanotify we were getting issues similar to the current one. After reading reports of bugs in fanotify that particularly affected antivirus scanners [HPE] [McAfee], we upgraded the OS and kernel but the issue seems to persist.

    Currently, we have disableFanotify=false, disableTalpa=true, preferFanotify=false. I assume that last one makes no difference if Talpa is disabled. Is that correct?

    Kernel is 4.18.0-193.28.1.el8_2.x86_64 on CentOS 8.

  • Hi, good to hear that you have switched to fanotify for Docker.

    We have not seen any performance related issues from savupdate, so an investigation of the SDU would be helpful. The only other thing that sometimes occurs with an update is a talpa reload which should not affect you now.

    One more basic thing to confirm is that fanotify is not trying to scan any NFS4 filesystems, as NFS4 can cause issues too.

    Beyond that, we would need a support case raised where we can investigate logs and if necessary, set flags to collect scan logs to see what is happening at the on-access level.

    Kind regards

    RickS

    Senior Global Escalations Engineer

  • I'll see if our IT team can raise a support ticket. I've been trying to resolve this myself due to their lack of familiarity with Linux but I don't have access to any portals, licensing details, etc.

    We don't use NFS but we do have lots of CIFS shares.
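
To double-check which network filesystems are actually mounted (the NFS4 question above, plus the CIFS shares), the mount table can be filtered by filesystem type. This is a minimal sketch that only lists mounts; it says nothing about whether Sophos scans or excludes them. The function name `network_mounts` and the sample device and mount point names are made up for illustration. Only the `/proc/mounts` column layout (device, mount point, type, options, dump, pass) is relied on.

```python
def network_mounts(mounts_text, fstypes=("nfs", "nfs4", "cifs")):
    """Return (mount_point, fstype) for mounts whose type is in fstypes."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        mount_point, fstype = fields[1], fields[2]
        if fstype in fstypes:
            found.append((mount_point, fstype))
    return found


# Hypothetical sample in /proc/mounts format; not taken from the affected server.
SAMPLE_MOUNTS = """\
/dev/sda1 / xfs rw,relatime 0 0
//fileserver/projects /mnt/projects cifs rw,relatime,vers=3.0 0 0
nsfs /run/docker/netns/example nsfs rw 0 0
"""

for mount_point, fstype in network_mounts(SAMPLE_MOUNTS):
    print(f"{fstype}\t{mount_point}")
```

On a real machine the input would come from reading `/proc/mounts`, for example `network_mounts(open("/proc/mounts").read())`.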
