This discussion has been locked.

Multiple rrdtool processes, high (100%) CPU usage

Since 02:20 this morning, 3 separate systems I look after for friends have shown this problem.

Each is running at 100% CPU load and the FW is slow.

After SSH'ing in, I found that there are many (20+) instances of rrdtool running.
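
For anyone wanting to check for the same symptom, here is a minimal sketch of counting the instances from the shell. It assumes the UTM's `ps` accepts the procps-style `-e -o comm=` options, which is not confirmed anywhere in this thread, and `count_instances` is a hypothetical helper name.

```shell
# Count how many lines on stdin are exactly the command name given as $1.
# Reading from stdin keeps the helper independent of any particular ps.
count_instances() {
    name="$1"
    # -c counts matching lines, -x matches whole lines, -F treats NAME literally.
    # grep exits non-zero when the count is 0, so "|| true" keeps the status clean.
    grep -c -x -F -- "$name" || true
}

# On a live system (assumption: ps supports -e and -o comm=):
#   ps -e -o comm= | count_instances rrdtool
```

The affected systems in this thread showed 20 or more instances.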

This problem looks identical to rrdtool high cpu usage.

I have tried what is suggested there, which kills the rrdtool task, but after a while the instances start again, so I have commented out the lines in /etc/crontab.rrd for now.

What is the permanent solution to this?

Jeff

BlueSmoke over 1 year ago in reply to StepNo6

StepNo6 said:
Any update on resolving this issue? One of our client UTMs in HA is maxed out at 100% CPU, which is a complete outage for 1000 users. CPU has gone up since the early hours of Sunday morning. This looks to be related to the issue raised here, as it is the same firmware version, 9.714-4. Already raised with Sophos support as 06378016. I was told that someone from technical would ring me back in 2 hours. This is unacceptable as we have a complete outage for this client.

If I were in your situation I'd have implemented the workaround which JeffreyLewcock suggested last night. This kills the rrd processes and then prevents the cron jobs from running by commenting them out. It takes two minutes [assuming you know how to use vi] and it doesn't require a reboot.

JeffreyLewcock said:
I ssh'd in then sudo'd to root

killall /usr/local/bin/create_rrd_graphs.plx
killall rrdtool

I then edited

/etc/crontab
/etc/crontab.rrd

to comment out the cron entry that restarts the process

If rrdtool has respawned, you might have to re-run the killall commands.

This prevents rrdtool from running; however, the graphing is then stopped.

Remove the commenting when you have a proper solution (and tell me!!)

I'm surprised there aren't more people having this problem, TBH.
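
The crontab edit described above can also be scripted rather than done by hand in vi. This is only a sketch: `comment_out_entries` is a hypothetical helper, and the search string `create_rrd_graphs.plx` assumes the cron entries reference that script by name, which should be checked in /etc/crontab and /etc/crontab.rrd before running anything.

```shell
# Comment out every active line in FILE that contains the literal text PATTERN.
# A copy is saved as FILE.bak first so the change is easy to revert.
comment_out_entries() {
    file="$1"
    pattern="$2"
    cp -p "$file" "$file.bak" || return 1
    awk -v pat="$pattern" '
        /^[ \t]*#/     { print; next }          # already commented: leave as-is
        index($0, pat) { print "#" $0; next }   # literal substring match: disable
                       { print }                # everything else: unchanged
    ' "$file.bak" > "$file"
}

# Example (as root, after the two killall commands):
#   comment_out_entries /etc/crontab.rrd create_rrd_graphs.plx
#   comment_out_entries /etc/crontab     create_rrd_graphs.plx
```

To undo it later, copy the `.bak` file back over the original. As noted above, graphing stays off while the entries are commented out.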

JeffreyLewcock over 1 year ago in reply to Raphael Alganes

My system was also set to Europe / London

Jeff

lms87 over 1 year ago in reply to JeffreyLewcock

So, 12 hours later, my CPU usage is still in single digits.

I did the following:

Killed the processes using the two commands above

Rebooted the UTM

Generated an Executive report

Once it had generated, I rebooted the UTM again

It's been stable since. I might have just gotten lucky, so I asked a friend to do the same on his UTM: same outcome, now stable at around 15% usage for him and 5% for me.

I did not comment out any jobs etc.

Vivek Jagad over 1 year ago in reply to JeffreyLewcock
Hey JeffreyLewcock,

Thank you for reaching out to the community. This is a known issue, NUTM-14089. Our Dev team is working on it.

Please try the workaround below to see if it fixes the issue:

Change the time zone.

Reboot or kill rrdtool.

Thanks & Regards,
Vivek Jagad | Team Lead, Technical Support, Global Customer Experience

Thomas Groom over 1 year ago

We are a support company and we have several hundred devices with several thousand users, all of which are affected.

Edit: All devices are running 9.714.004

All appliances are set to Europe / London. We have had to manually log in via WebAdmin and force a reboot, which is painfully slow, taking 10-15 minutes per device due to 100% CPU usage from rrdtool (multiple instances).

On devices where SSH is enabled, you can SSH in and run:
su root
restart -r -t 1 now

to force a reboot

We have now spent over 10 man-hours of support time rebooting devices.

Please note we have not had to change the time zone on any device. The devices that have been rebooted are acting normally.

StepNo6 over 1 year ago in reply to Thomas Groom

As I was facing a complete outage, before Sophos Support came back to me I followed the advice above and ran the following commands, initially on the slave node and then on the master node:

killall /usr/local/bin/create_rrd_graphs.plx
killall rrdtool

CPU dropped off immediately once both commands had been run on both nodes.

I didn't amend crontab, as I wasn't seeing the rrdtool process respawn over an hour of monitoring the processes, and I was prepared to re-run the 2 commands when needed.
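
The "monitor and re-run when needed" approach could be sketched as a small watch loop. This is illustrative only: `check_once`, `kill_rrd` and `watch_rrd` are hypothetical names, the threshold of 5 is an arbitrary value picked for the example, and the `ps -e -o comm=` call assumes a procps-style `ps` on the UTM.

```shell
# Decide whether a given instance count needs a cleanup.
# Prints a status line and returns 1 when COUNT is above THRESHOLD.
check_once() {
    count="$1"
    threshold="$2"
    if [ "$count" -gt "$threshold" ]; then
        echo "rrdtool instances: $count (above $threshold), cleanup needed"
        return 1
    fi
    echo "rrdtool instances: $count (ok)"
}

# The two commands quoted in this thread, unchanged.
kill_rrd() {
    killall /usr/local/bin/create_rrd_graphs.plx
    killall rrdtool
}

# Poll once a minute until interrupted with Ctrl-C.
# The threshold defaults to 5, an arbitrary value chosen for illustration.
watch_rrd() {
    threshold="${1:-5}"
    while true; do
        n=$(ps -e -o comm= | grep -c -x rrdtool)
        check_once "$n" "$threshold" || kill_rrd
        sleep 60
    done
}
```

Nothing runs until `watch_rrd` is called from a root shell, and unlike the crontab edit this leaves graphing enabled.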

I spoke to Sophos support 2 hours after logging the call, and they recommended changing the time zone and then rebooting or killing rrdtool.

I am scheduling a change for this evening, out of hours, to set the time zone to UTC and reboot both nodes, just in case.

As per Thomas, the nodes seem to be OK at the moment, and the time zone is still set to Europe/London.

Is this issue only present when daylight saving time changes, and on firmware 9.714-4? If so, the next potential occurrence would be in the autumn, if you were still running this firmware. Or is the rrdtool process respawned automatically during the day, which then triggers multiple rrdtool processes unless you change the time zone to UTC?

emmosophos over 1 year ago

Hello Community,

This is being investigated under NUTM-14089.

The issue has been identified, and the fixed version has been set for the next release (9.716). No ETA at the moment.

The current workaround is to change your time zone to "ETC/UTC" (or any zone other than IST/BST) and reboot your device.

If the issue persists after this, please open a case with support and mention NUTM-14089 so it can be investigated further.

Regards,

Emmanuel (EmmoSophos)

Technical Team Lead, Global Community Support

StepNo6 over 1 year ago in reply to emmosophos

Setting the time zone to UTC sets the dashboard time back 1 hour, so won't the time then differ from client devices and domain controllers, which will be set to British Summer Time? Wouldn't this cause issues for UTMs set up as forward proxies, and for Kerberos tickets being assigned?
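
On the one-hour difference: a time zone setting normally changes only how an instant is displayed, not the underlying clock, and Kerberos timestamps are expressed in UTC. Whether the UTM behaves this way end to end is not confirmed in this thread. The display-only effect can be sketched as below, assuming a `date` that accepts `-d @SECONDS` as GNU date does. `show_instant` and `UK_TZ` are hypothetical names and the instant is an arbitrary example.

```shell
# Print one absolute instant (seconds since the Unix epoch) as local time in zone $1.
# Assumption: date(1) accepts "-d @SECONDS", as GNU date does.
show_instant() {
    TZ="$1" date -d "@$2" '+%Y-%m-%d %H:%M %z'
}

# UK rules written as a POSIX TZ string, so no zoneinfo files are needed.
UK_TZ='GMT0BST,M3.5.0/1,M10.5.0/2'

# The same instant rendered two ways (arbitrary example during BST):
#   show_instant UTC0     1680000000   # 2023-03-28 10:40 +0000
#   show_instant "$UK_TZ" 1680000000   # 2023-03-28 11:40 +0100
```

Both lines describe the same moment. Only the wall-clock rendering differs by the hour seen on the dashboard.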

JeffreyLewcock over 1 year ago

From what I've seen on the 4 systems I look after, a simple reboot now clears the problem.

Obviously it will probably be back at the next DST change; however, Sophos might have a fix for the root cause by then.

My systems have now been up for 20 minutes, 4 hours, 5.5 hours and 14 hours with the crontabs uncommented and no problems.

Jeff

BlueSmoke over 1 year ago in reply to emmosophos

emmosophos said:
This is being investigated under NUTM-14089.

The issue has been identified, and the fixed version has been set for the next release (9.716). No ETA at the moment.

The current workaround is to change your time zone to "ETC/UTC" (or any zone other than IST/BST) and reboot your device.

If the issue persists after this, please open a case with support and mention NUTM-14089 so it can be investigated further.

Are we any closer to a permanent fix for this, please? All that has been mentioned so far are workarounds.