This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

[Solved]UTM problems with rrdcached since 9.315-2

Hi.

Since the update to 9.315-2, the UTM 320 is slow on speed and we got tons of erros from rrdcached. The "Fallback" log fills up with messages like this:

2015:08:18-08:43:44 han-e7-secapp [daemon:notice] rrdcached[10060]: queue_thread_main: rrd_update_r (/var/log/reporting/rrd/apusage_A400175F685D801.rrd) failed with status -1. (/var/log/reporting/rrd/apusage_A400175F685D801.rrd: illegal attempt to update using time 1439323024 when last update time is 1439323085 (minimum one second step))

and this

2015:08:18-08:44:19 han-e7-secapp [daemon:info] rrdcached[10124]: starting up
2015:08:18-08:44:19 han-e7-secapp [daemon:info] rrdcached[10124]: checking for journal files
2015:08:18-08:44:19 han-e7-secapp [daemon:notice] rrdcached[10124]: replaying from journal: /var/log/reporting/rrd/rrd.journal.1439316807.508822

Every hour we get emails with this subject:

[INFO-192] RRD cache daemon not running - restarted

Has anybody seen this before or has any hit to solve this?

I tried google on this but it seams to me, that his error is a kind of special on our system.

Any hint is welcome.

Best regards, Christian

This thread was automatically locked due to age.

0 cwoller over 9 years ago

To be precise:

- "slow on speed" means that the web interface needs 4-5 seconds to switch to the next option.

- The "Self monitoring deamon" protocol repeates

2015:08:18-08:58:59 han-e7-secapp selfmonng[3951]: W triggerAction: 'cmd'
2015:08:18-08:58:59 han-e7-secapp selfmonng[3951]: W actionCmd(+): '/etc/init.d/rrdcache restart'
2015:08:18-08:59:00 han-e7-secapp selfmonng[3951]: W child returned status: exit='0' signal='0'
2015:08:18-08:59:25 han-e7-secapp selfmonng[3951]: I check Failed increment rrdcache_running counter 1 - 3
2015:08:18-08:59:30 han-e7-secapp selfmonng[3951]: I check Failed increment rrdcache_running counter 2 - 3
2015:08:18-08:59:35 han-e7-secapp selfmonng[3951]: W check Failed increment rrdcache_running counter 3 - 3
2015:08:18-08:59:35 han-e7-secapp selfmonng[3951]: W NOTIFYEVENT Name=rrdcache_running Level=INFO Id=192 suppressed

- and the "System Messages" protocol tells me

2015:08:18-09:00:02 han-e7-secapp system-reporter.pl: WARN - Error updating rrd
2015:08:18-09:00:02 han-e7-secapp system-reporter.pl: WARN - updating rrd: connect: Connection refused
2015:08:18-09:00:08 han-e7-secapp ips-reporter.pl: WARN - updating rrd: sending command: Broken pipe
2015:08:18-09:00:29 han-e7-secapp postgres[3856]: [10353-1] WARNING: pgstat wait timeout
2015:08:18-09:01:05 han-e7-secapp postgres[3856]: [10354-1] WARNING: pgstat wait timeout

- Date and time is correct, ntp is used.

I am diving deeper:

- 1439323024 is 11.08.2015 / 21:57:04
- 1439323085 is 11.08.2015 - 21:58:05

How can I change the first value to the actual date?

Christian
Cancel
Vote Up 0 Vote Down

Cancel
0 BAlfson over 9 years ago

Christian, try restoring the config backup made just prior to the Up2Date. If that doesn't help, try a reboot. Any luck?

Cheers - Bob

Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA
Cancel
Vote Up 0 Vote Down

Cancel
0 cwoller over 9 years ago in reply to BAlfson

Hi Bob.

Thanks for your reply.

The update we did a couple of days before, was from 9.313-3 over 9.314-13 to 9.315-2. Then the error happend.

Now, I recovered the UTM to a 9.313-3 config. After recovery, I waited for 10 minutes, watching the logs. After no changes happend (failure still existed), I rebooted the system - without any effect on the failure.

Do I need to make a full restore with an image or do you have any more ideas for me? :-)

Best regards,

Christian
Cancel
Vote Up 0 Vote Down

Cancel
+1 BAlfson over 9 years ago
NOTE: If you have this issue, please read the rest of the thread before trying anything.

Re-imaging from ISO might be necessary, but I definitely would get Sophos Support involved first.

If Support tells you to re-image, try the following first as root at the command line:
cd /var/up2date/sys
echo ' 9.313003'>/etc/version
wget ftp.astaro.com/.../u2d-sys-9.313003-314013.tgz.gpg
wget ftp.astaro.com/.../u2d-sys-9.314013-315002.tgz.gpg
/sbin/auisys.plx --showdesc

Then install in WebAdmin.

That has the effect of re-applying the Up2Dates as one of them might have been corrupted. The advantage is that it preserves Reporting, Graphs and Logs.

If that doesn't solve it, try the following to re-initialize the data behind the graphs:
rm /var/log/reporting/rrd/*
/etc/init.d/syslogng restart
/etc/init.d/rrdcache restart
/usr/local/bin/reporter/system-reporter.pl

Finally, as a last resort, try reinitializing the PostgreSQL data bases used by Reporting:
/etc/init.d/postgresql92 rebuild

Please let us know if you try any of that and what, if anything, solved the issue.

Cheers - Bob
Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA
Cancel
Vote Up 0 Vote Down

Cancel
0 cwoller over 9 years ago in reply to BAlfson

rm /var/log/reporting/rrd/*
/etc/init.d/syslogng restart
/etc/init.d/rrdcache restart
/usr/local/bin/reporter/system-reporter.pl

/etc/init.d/postgresql92 rebuild

Good morning Bob.

I tried the above commands and for the last 15 minutes, I dont get any error messages. rrdcached cames up with

2015:08:20-07:29:31 han-e7-secapp [daemon:info] rrdcached[28380]: clean shutdown; all RRDs flushed
2015:08:20-07:29:31 han-e7-secapp [daemon:info] rrdcached[28380]: removing journals
2015:08:20-07:29:31 han-e7-secapp [daemon[:D]ebug] rrdcached[28380]: removing old journal /var/log/reporting/rrd/rrd.journal.1440048086.534901
2015:08:20-07:29:31 han-e7-secapp [daemon:info] rrdcached[28380]: goodbye
2015:08:20-07:29:32 han-e7-secapp [daemon:info] rrdcached[29948]: starting up
2015:08:20-07:29:32 han-e7-secapp [daemon:info] rrdcached[29948]: checking for journal files
2015:08:20-07:29:32 han-e7-secapp [daemon[:D]ebug] rrdcached[29948]: started new journal /var/log/reporting/rrd/rrd.journal.1440048572.492666
2015:08:20-07:29:32 han-e7-secapp [daemon:info] rrdcached[29948]: journal processing complete
2015:08:20-07:29:32 han-e7-secapp [daemon:info] rrdcached[29948]: listening for connections

and I am not an expert, but this looks pretty good to me :-) The Webadmin is pretty much more responsive than before.

I will watch this over the day.

Thank you Bob for your help :-)

Best regards, Christian
Cancel
Vote Up 0 Vote Down

Cancel
0 cwoller over 9 years ago in reply to cwoller

Hi Bob.

I wanted to let you know, that the error not happend again.

So, the case is closed.

My UTM is happy again, so am I [:D]

Once again, thank you Bob.

Best regards, Christian

PS: Ok, this may be a stupid question but how can I marke this thread as solved? Changing the title from my first post does not work as expected...
Cancel
Vote Up 0 Vote Down

Cancel
0 BAlfson over 9 years ago

Christian, Do you think the problem had been solved by the second step, re-initializing the data bases behind the graphs? Or, was it not solved until the last command using rebuild?

Cheers - Bob
PS Changed the title.

Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA
Cancel
Vote Up 0 Vote Down

Cancel
0 cwoller over 9 years ago in reply to BAlfson

Christian, Do you think the problem had been solved by the second step, re-initializing the data bases behind the graphs? Or, was it not solved until the last command using rebuild?

Cheers - Bob
PS Changed the title.

Hi Bob.

As I said, I am not an UTM expert but if I should make a guess, it was the combination of cleaning the rrd folder with re-initializing the db.

Finally, the error went away after the db re-initializing.

What I did was

1) issued rm /var/log/reporting/rrd/*
2) issued /etc/init.d/syslogng restart
3) issued /etc/init.d/rrdcache restart

... see another error about wrong time ("illegal attempt to update using time...")

4) issued /etc/init.d/postgresql92 rebuild
5) issued /etc/init.d/syslogng restart
6) issued /etc/init.d/rrdcache restart
... no more errors :-)

Nice side effect:

Some users told me yesterday, that while the rrdcached problem occurs, they had some "Proxy does not respond" errors within their browsers while surfing the web. I've seen this in my browser too but couldnt reproduce it. The proxy error came at random and went away after pressing the "reload" button.

Could this be a performance Problem on the UTM due to the rrdcached error?

I mean, I had a fallback log size growth of 6-7 GB/day and I could imagine that the UTM 320 had something else to do than deliver data to the internal network [;)]

Now, after the rrdcached problem, I havent seen the proxy errors anymore, and the whole UTM is much more responsive (WebAdmin, throughput while surfing the Web, ...) .

Best regards,

Christian
Cancel
Vote Up 0 Vote Down

Cancel
0 BAlfson over 9 years ago

Could this be a performance Problem on the UTM due to the rrdcached error?

Absolutely!

Thanks for responding! Since I think the rebuild command also re-initializes the graphing databases, I believe that the final step could have been the first step. If anyone else has this issue, start with /etc/init.d/postgresql92 rebuild and then comment in this thread if that fixed your problem.

Cheers - Bob

Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA
Cancel
Vote Up 0 Vote Down

Cancel