multiple rrdtool high (100%) cpu usage

Since 02:20 this morning, 3 separate systems I look after for friends have shown this problem.

Each is running at 100% CPU load and the firewall is slow.

After SSH'ing in, I found that there are many (20+) instances of rrdtool running.
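
For reference, a minimal sketch for counting the instances from the shell. It assumes a Linux /proc filesystem and that the processes show up under the name rrdtool; count_procs is just an illustrative helper name, not something shipped on the UTM.

```shell
#!/bin/sh
# Count running processes whose command name is exactly NAME,
# by reading /proc/<pid>/comm (Linux only).
count_procs() {
    name=$1
    count=0
    for f in /proc/[0-9]*/comm; do
        # A process may exit while we iterate, so skip unreadable entries.
        [ -r "$f" ] || continue
        if [ "$(cat "$f" 2>/dev/null)" = "$name" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

echo "rrdtool instances: $(count_procs rrdtool)"
```

Where procps is installed, `pgrep -c -x rrdtool` should give the same number.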

This problem looks identical to the earlier thread "rrdtool high cpu usage".

I have tried what is suggested there, which kills the rrdtool task, but after a while the instances start again. For now I have commented out the lines in /etc/crontab.rrd.

What is the permanent solution to this?

Jeff



  • Is there any update on resolving this issue? One of our clients' UTMs in HA is maxed out at 100% CPU, which is a complete outage for 1000 users. CPU usage has been climbing since the early hours of Sunday morning. It looks related to the issue raised here, as it is the same firmware version, 9.714-4. I have already raised it with Sophos support (case 06378016) and was told that someone from technical would ring me back in 2 hours. This is unacceptable as we have a complete outage for this client.

    If I were in your situation I'd have implemented the workaround which JeffreyLewcock suggested last night. This kills the rrd processes and then prevents the cron jobs from running by commenting them out. It takes two minutes [assuming you know how to use vi] and it doesn't require a reboot.

    I SSH'd in, then sudo'd to root:

    killall /usr/local/bin/create_rrd_graphs.plx
    killall rrdtool

    I then edited

    /etc/crontab
    /etc/crontab.rrd

    to comment out the cron entry that restarts the process

    If rrdtool has respawned, you might have to re-run the killall commands.

    This prevents rrdtool from running; however, graphing is then stopped.

    Remove the commenting when you have a proper solution (and tell me !!)
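
For anyone who would rather script the crontab edit than do it in vi, here is a rough sketch. The helper name comment_out_cron and the sample cron lines are made up for illustration; I have not confirmed what the real entries in /etc/crontab.rrd look like, so check the file before pointing this at it. It keeps a .bak copy and only touches lines that match the pattern and are not already commented.

```shell
#!/bin/sh
# Comment out every not-yet-commented line in FILE that matches PATTERN.
# PATTERN is a basic regex and must not contain '/'.
# The original file is saved as FILE.bak first.
comment_out_cron() {
    file=$1
    pattern=$2
    cp -p "$file" "$file.bak" || return 1
    sed '/^[[:space:]]*#/!{/'"$pattern"'/s/^/#/;}' "$file.bak" > "$file"
}

# Demo on a scratch file; these cron lines are invented samples,
# not copied from a real UTM.
demo=$(mktemp)
printf '%s\n' \
    '# sample crontab, not the real UTM file' \
    '*/5 * * * * root /usr/local/bin/create_rrd_graphs.plx' \
    '0 1 * * * root /bin/true' > "$demo"
comment_out_cron "$demo" 'create_rrd_graphs'
cat "$demo"
rm -f "$demo" "$demo.bak"
```

On a real box that would be something like `comment_out_cron /etc/crontab.rrd create_rrd_graphs` run as root, assuming the entry mentions that script name. Copying the .bak file back undoes the change.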

    I'm surprised there aren't more people having this problem, TBH.

  • My system was also set to Europe / London

    Jeff

  • So, 12 hours later, my CPU usage is still in single digits. 

    I did the following:

    Killed the rrdtool processes using the two commands above

    Rebooted the UTM

    Generated an Executive report

    Once it had generated, I rebooted the UTM again.

    It's been stable since. I might have just gotten lucky, so I asked a friend to do the same on his UTM: same outcome, now stable at around 15% usage for him and 5% for me.

    I did not comment out any jobs etc.

  • Hey,

    Thank you for reaching out to the community. This is a known issue, NUTM-14089, and our Dev team is working on it.

    Please try the workaround below to see if it fixes the issue:

    • Change the time zone
    • Reboot or kill rrdtool

    Thanks & Regards,
    Vivek Jagad | Team Lead, Technical Support, Global Customer Experience

  • We are a support company and we have several hundred devices with several thousand users, all of which are affected.

    Edit: All devices are running 9.714.004

    All appliances are set to Europe / London. We have had to manually log in via WebAdmin and force a reboot, which is painfully slow, taking 10-15 minutes per device because rrdtool (multiple instances) is using 100% CPU.

    On those devices where SSH is enabled, you can SSH in and run:
    su root
    restart -r -t 1 now

    to force a reboot

    We have now spent over 10 man-hours of support rebooting devices.

    Please note we have not had to change the time zone on any device. Those devices that have been rebooted are acting normally.

  • Before Sophos Support came back to me, as I was facing a complete outage, I followed the advice above and ran the following commands, initially on the slave node and then on the master node:

    killall /usr/local/bin/create_rrd_graphs.plx
    killall rrdtool

    CPU dropped off immediately once both commands had been run on both nodes.

    I didn't amend crontab, as I wasn't seeing the rrdtool process respawn over an hour of monitoring, and I was prepared to re-run the two commands when needed.

    I spoke to Sophos support 2 hours after logging the call, and they recommended changing the time zone and then rebooting or killing rrdtool.

    I am scheduling a change this evening, out of hours, to set the time zone to UTC and reboot both nodes just in case.

    As per Thomas, the nodes seem to be OK at the moment, and the time zone is still set to Europe/London.

    Is this issue only present when the daylight saving change occurs, and only on firmware 9.714-4? If so, the next potential occurrence would be in the autumn if you were still running this firmware. Or is the rrdtool process respawned automatically during the day, which then triggers multiple rrdtool processes unless you change the time zone to UTC?

  • Hello Community,

    This is being investigated under NUTM-14089.

    The issue has been identified, and the fixed version has been set for the next release (9.716). No ETA at the moment.

    The current workaround is to change your time zone to "Etc/UTC" (any zone other than IST/BST) and reboot your device.

    If the issue persists after this, please open a case with support and mention NUTM-14089 so it can be investigated further.

    Regards,


     
    Emmanuel (EmmoSophos)
    Technical Team Lead, Global Community Support
  • Setting the time zone to UTC puts the dashboard time back 1 hour, so the time no longer matches client devices and domain controllers, which will be set to British Summer Time. Wouldn't this cause issues for UTMs set up as forward proxies, and for Kerberos tickets being assigned?
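
To illustrate the one-hour gap being asked about, here is a small sketch that prints the UTC offset a zone applies at a given instant. utc_offset is a hypothetical helper; the POSIX rule string is my stand-in for Europe/London so the example does not depend on the zone database being installed, and `date -d @...` assumes GNU or BusyBox date.

```shell
#!/bin/sh
# Print the UTC offset (+hhmm) that zone ZONE applies at epoch second EPOCH.
utc_offset() {
    TZ="$1" date -d "@$2" +%z
}

# POSIX TZ rule mirroring UK time: GMT in winter, BST (+1h) from the
# last Sunday in March to the last Sunday in October.
UK_RULE='GMT0BST,M3.5.0/1,M10.5.0'

SUMMER=1625140800   # 2021-07-01 12:00:00 UTC
WINTER=1609502400   # 2021-01-01 12:00:00 UTC

utc_offset "$UK_RULE" "$SUMMER"   # prints +0100
utc_offset "$UK_RULE" "$WINTER"   # prints +0000
utc_offset UTC0 "$SUMMER"         # prints +0000
```

The epoch value passed in is the same instant in every case; only the offset used to display it as local time changes.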

  • From what I've seen on the 4 systems I look after, a simple reboot now clears the problem.

    Obviously it will probably be back at the next DST change; however, Sophos might have a fix for the root cause by then.

    My systems have now been up for 20 minutes, 4 hours, 5.5 hours and 14 hours with the crontab entries uncommented and no problems.

    Jeff

  • Are we any closer to a permanent fix for this, please? All that has been mentioned so far are workarounds.