This discussion has been locked.

UTM, HA - FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 00000001000007B1000000FB has already been removed

I noticed today that we seem to have issues on our Sophos UTM 9.510-5 system in HA mode. We are getting continuous errors like this:

sophosutm-2 postgres[18415]: [4-1] FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 00000001000007B1000000FB has already been removed

and

sophosutm-1 postgres[14369]: [3-1] FATAL: requested WAL segment 00000001000007B1000000FB has already been removed

They occur every 5 seconds, so there are literally thousands of these lines in the logs.

sophosutm-2 postgres[21561]: [3-1] LOG: streaming replication successfully connected to primary

Note: the reported WAL segment is always the same 00000001000007B1000000FB.
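For readers unfamiliar with the naming: a PostgreSQL WAL segment file name is 24 hex digits, made up of an 8-digit timeline ID, an 8-digit log file number and an 8-digit segment number. The following is a minimal sketch that splits the name from the messages above into those three fields; the function name `parse_wal_segment` is made up for illustration and is not part of any Sophos or PostgreSQL tool.

```python
def parse_wal_segment(name):
    """Split a 24-hex-digit WAL segment file name into its three fields."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    timeline = int(name[0:8], 16)   # timeline ID
    log = int(name[8:16], 16)       # log file number (high part of the WAL position)
    segment = int(name[16:24], 16)  # segment number within that log file
    return timeline, log, segment


# The segment both nodes keep complaining about:
print(parse_wal_segment("00000001000007B1000000FB"))  # (1, 1969, 251)
```

Because the fields are fixed-width hex, two names on the same timeline can also be compared as plain strings to see which one is older.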

The system otherwise seems to be running fine, but these messages are making us feel uneasy.

Ideas?

0 Alexander Busch over 6 years ago

Check PostgreSQL for corruption. I am not sure of the exact syntax, but you should be able to find it by searching the forum.

Best

Alex

0 BAlfson over 6 years ago

Hi Jan and welcome to the UTM Community!

I was going to suggest a failover, but there might be an issue. I think you should urgently get Sophos Support involved and then report back to us here.

Cheers - Bob

Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA
0 mouk over 6 years ago in reply to BAlfson

Hi Bob,

"There might be an issue" meaning: some error affecting everybody on this version of Sophos UTM? Or something specific on my side?

I have read here about re-initializing the PostgreSQL database on the UTM. While the command itself is simple and I could run it on the failing node, I am unsure about the consequences in terms of data loss...

So: what is contained in the PostgreSQL database? I could run "/etc/init.d/postgresql92 rebuild", but what would I lose and what would I keep?
0 mouk over 6 years ago in reply to mouk

I was advised to start with rebooting the slave node, as that would "force a new database sync", as they described it.

And after doing that, the databases became synced:

sophosutm-1 postgres[16564]: [3-1] NOTICE: pg_stop_backup complete, all required WAL segments have been archived
sophosutm-2 postgres[11496]: [2-1] LOG: database system is ready to accept read only connections
sophosutm-2 postgres[11497]: [6-1] LOG: consistent recovery state reached at 7CF/EFE60CA0
sophosutm-2 postgres[11517]: [2-1] LOG: streaming replication successfully connected to primary
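As a rough cross-check of the recovery message above, a WAL position written as "high/low" can be mapped to the segment file that contains it: the high part is the log file number, and the low part divided by the segment size is the segment number. This is only a sketch under two assumptions that the thread does not confirm: the default 16 MB segment size, and timeline 1 (taken from the segment names in the error messages). The helper name `wal_segment_for_lsn` is made up for illustration.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: PostgreSQL's default segment size


def wal_segment_for_lsn(lsn, timeline=1):
    """Return the name of the WAL segment file containing position 'high/low'."""
    high, low = (int(part, 16) for part in lsn.split("/"))
    # timeline=1 is an assumption based on the names seen in the error messages
    return "%08X%08X%08X" % (timeline, high, low // WAL_SEGMENT_SIZE)


recovered = wal_segment_for_lsn("7CF/EFE60CA0")
print(recovered)                               # 00000001000007CF000000EF
print(recovered > "00000001000007B1000000FB")  # True
```

Under those assumptions, the standby reached a consistent state in a segment well past the one it had been requesting, which fits the reboot having triggered a fresh base sync rather than a replay from the missing segment.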

Thanks both for your suggestions. :-)
0 Alexander Busch over 6 years ago in reply to mouk

Glad to hear that.

Best regards

Alex

0 mouk over 5 years ago in reply to Alexander Busch

Hi.

For the archives: the issue kept reappearing. We replaced both spinning disks with SSDs, which didn't help. Sophos support was clueless: they found no real cause and offered no fix.

So instead of fixing this, they have now offered us an upgrade from the UTM to an XG appliance.

I suggest that everyone keep an eye on their log files and NOT trust the WebAdmin HA status, as it simply reports:

HA/Cluster is active in mode HA with 2/2 nodes

But under the hood (in syslog) you could see endless messages (many thousands of them) like these:

utm-2 postgres[14278]: [4-1] FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 0000000100000BCA0000007D has already been removed

indicating that *actually* the database is NOT synced between the two hosts.

Since it does not show up in the GUI, it would not surprise us if more people are affected by this without realising it.
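For anyone who wants to check their own logs, here is a minimal sketch that counts these messages per host and per segment. It only relies on the message text quoted in this thread; the function name `count_missing_segment_errors` and the sample list are made up for illustration, and where the syslog file lives on a given UTM is not covered here.

```python
import re
from collections import Counter

# Matches the "requested WAL segment ... has already been removed" text from this
# thread and captures the host name that precedes "postgres[pid]:".
PATTERN = re.compile(
    r"(?P<host>\S+) postgres\[\d+\]: .*?"
    r"requested WAL segment (?P<segment>[0-9A-F]{24}) has already been removed"
)


def count_missing_segment_errors(lines):
    """Count matching log lines, keyed by (host, segment)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("host"), match.group("segment"))] += 1
    return counts


# Sample lines copied from this thread; the last one must not be counted.
SAMPLE = [
    "utm-2 postgres[14278]: [4-1] FATAL: could not receive data from WAL stream: "
    "FATAL: requested WAL segment 0000000100000BCA0000007D has already been removed",
    "sophosutm-1 postgres[14369]: [3-1] FATAL: requested WAL segment "
    "00000001000007B1000000FB has already been removed",
    "sophosutm-2 postgres[21561]: [3-1] LOG: streaming replication successfully "
    "connected to primary",
]

for (host, segment), n in sorted(count_missing_segment_errors(SAMPLE).items()):
    print(host, segment, n)
```

The function accepts any iterable of lines, so an open log file can be passed in directly. A count that keeps growing for the same segment is the pattern described above.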