This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

UTM, HA - FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 00000001000007B1000000FB has already been removed

I noticed today that we seem to have issues on our Sophos UTM 9.510-5 system in HA mode. We are getting continuous errors like this:

sophosutm-2 postgres[18415]: [4-1] FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 00000001000007B1000000FB has already been removed

and

sophosutm-1 postgres[14369]: [3-1] FATAL: requested WAL segment 00000001000007B1000000FB has already been removed

They occur every 5 seconds, so there are literally thousands of these lines in the logs.

sophosutm-2 postgres[21561]: [3-1] LOG: streaming replication successfully connected to primary

Note: the reported WAL segment is always the same 00000001000007B1000000FB.
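
For anyone trying to read the segment name in these messages: a PostgreSQL WAL file name is 24 hex digits, made up of an 8-digit timeline ID, an 8-digit log ID and an 8-digit segment number. The sketch below splits the name and derives the LSN at which the segment starts. It assumes the default 16 MiB segment size, which I have not verified on the UTM; the function names are made up for illustration.

```python
import re

# Assumption: default 16 MiB WAL segments (PostgreSQL's compile-time default).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024

_WAL_NAME = re.compile(r"[0-9A-Fa-f]{24}")


def parse_wal_segment(name):
    """Split a 24-hex-digit WAL file name into (timeline, log id, segment number)."""
    if not _WAL_NAME.fullmatch(name):
        raise ValueError("not a WAL segment file name: %r" % name)
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, log_id, segment


def segment_start_lsn(name):
    """Return the LSN of the first byte of the segment, in PostgreSQL's X/X notation."""
    _, log_id, segment = parse_wal_segment(name)
    offset = segment * WAL_SEGMENT_SIZE
    return "%X/%X" % (log_id, offset)
```

Under that assumption, `segment_start_lsn("00000001000007B1000000FB")` gives `7B1/FB000000`, which is the position the standby keeps asking for and the primary no longer has.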

The system seems to be running fine, otherwise, but these messages are making us feel uneasy.

Ideas?



  • Check postgres for corruption. I am a little unsure about the syntax, but in the forum you should find something.

    Best

    Alex


  • Hoi Jan and welcome to the UTM Community!

    I was going to suggest a failover, but there might be an issue. I think you should urgently get Sophos Support involved and then report back to us here.

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • Hi Bob,

    "There might be an issue" meaning: some error affecting everybody on this version of Sophos UTM? Or something specific on my side?

    I have read here about re-initializing the PostgreSQL database on the UTM. While the command itself is simple and I could run it on the failing node, I am unsure about the consequences in terms of data loss...

    So: what is contained in the PostgreSQL database? I could run "/etc/init.d/postgresql92 rebuild", but what would I lose and what would I keep?

  • I was advised to start with rebooting the slave node, as that would "force a new database sync", as they described it.

    And after doing that, the databases became synced:

    sophosutm-1 postgres[16564]: [3-1] NOTICE:  pg_stop_backup complete, all required WAL segments have been archived
    sophosutm-2 postgres[11496]: [2-1] LOG:  database system is ready to accept read only connections
    sophosutm-2 postgres[11497]: [6-1] LOG:  consistent recovery state reached at 7CF/EFE60CA0
    sophosutm-2 postgres[11517]: [2-1] LOG:  streaming replication successfully connected to primary

    Thanks both for your suggestions. :-)

  • Glad to hear that.

    Best regards

    Alex


  • Hi.

     

    For the archives: the issue kept reappearing. We replaced both spinning disks with SSDs, which didn't help. Sophos support was clueless: no real cause and no fix.

    So instead of fixing this, they have now offered us an upgrade from the UTM to an XG appliance.

     

    I suggest everyone keep an eye on their log files and NOT trust the WebAdmin HA status, as this simply reports:

    HA/Cluster is active in mode HA with 2/2 nodes

    But under the hood (in syslog) you could see endless messages (many thousands of them) like these:

    utm-2 postgres[14278]: [4-1] FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 0000000100000BCA0000007D has already been removed

    indicating that *actually* the database is NOT synced between the two hosts.

     

    Since it does not show up in the GUI, it would not surprise us if more people are affected by this without realising it.
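
As a rough way to act on the advice above, here is a minimal sketch that counts the "has already been removed" messages per host and WAL segment. The pattern is based only on the log lines quoted in this thread and assumes the host name is the token directly before `postgres[pid]:`; the function name is hypothetical, and the location of the log file on a UTM is not something I can confirm.

```python
import re
from collections import Counter

# Pattern derived from the messages quoted in this thread. Assumption: the
# host name is the token immediately before "postgres[pid]:".
_MISSING_WAL = re.compile(
    r"(?P<host>\S+) postgres\[\d+\]: .*"
    r"requested WAL segment (?P<segment>[0-9A-F]{24}) has already been removed"
)


def count_missing_wal(lines):
    """Count 'requested WAL segment ... has already been removed' per (host, segment)."""
    counts = Counter()
    for line in lines:
        match = _MISSING_WAL.search(line)
        if match:
            counts[(match.group("host"), match.group("segment"))] += 1
    return counts
```

Any iterable of lines works, so an open log file can be passed in directly. A non-empty result, especially the same segment repeated many times, would point to the situation described here: WebAdmin reports 2/2 nodes while the standby cannot catch up.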