
Sophos UTM 9.509-3 Home Edition as HA - Hourly messages: Pop3 proxy not running, ACC device Agent not running, HA selfcheck

Hi all,

since yesterday I have had an issue with my Sophos UTM 9.509-3 Home Edition. I'm running it as two virtual machines on free ESXi 6.5 in HA mode, distributed across two host systems, with node 1 as the preferred master.

Hourly there are the following messages:

  1. From master and slave: Pop3 proxy not running - restarted
  2. From slave: ACC device Agent not running - restarted
  3. From slave: HA selfcheck: HA SELFMON WARN: Restarting repctl for SLAVE(ACTIVE)

Yesterday, it wasn't possible to connect to the internet, and I couldn't access the webadmin UI until I stopped the master node. Then I started node 1 (the preferred master) and HA mode came back, but the synchronization never completed. So I shut down node 2 to stop the synchronization. Yes, I know this isn't a good solution, but the synchronization had been running for over half an hour, which isn't normal. After I brought node 2 back online the synchronization worked fine, but the above messages came back. Internet and webadmin UI access were still possible.

Then I updated the ESXi hosts with the latest patch from VMware. For this I had to stop the UTM nodes one after another, beginning with master node 1. The failover to node 2 worked fine, but after node 1 came back online the synchronization didn't finish. I waited a very long time and then shut down node 2 to install the ESXi update on the host of my second UTM node. After I brought this node back online, the synchronization finished in a normal time. But again, the above messages came back. Internet and webadmin UI access were still possible.

It also looks like Up2Date isn't working anymore, because I no longer get Up2Date messages. The webadmin UI shows that firmware and patterns are up to date. There was also no daily executive report last night, and it isn't possible to create this report manually.

What can I do to resolve this?



This thread was automatically locked due to age.
  • If you search here, you will see that this problem with VMs in HA first appeared in 9.508.  I don't think it was fixed in 9.509.  I think you will find that others solved the problem by going back to 9.506.

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • Hi BAlfson,

    sorry, I didn't find similar posts in the forum. I didn't have this issue with 9.508, and with 9.509 there was no issue from March 28 until yesterday.

    Can you give me a link to another post with this issue, please? I want to see if this is really the same issue. Thank you.

    Kind Regards

    TheExpert

  • Hello,

    I don't know whether this is the same problem.  I just remember that there were problems with VMs in HA in 9.508.  I would try a Google search:

    site:community.sophos.com/products/unified-threat-management/f "9.508" VM

    Cheers - Bob

     
  • Hi BAlfson,

    these posts are about HTTP proxy. I have the issue with Pop3 proxy.

    Today, I switched the preferred master to node 2 to see if this helps. Interestingly, the CPU load on node 1 is still very high (95 %) even though it's not the master anymore. Normally, the CPU load on the master averages up to 5 % in my environment. On node 2 the CPU is running at its normal load level. After I restarted node 1, the CPU load first returned to normal, but after a successful HA synchronization it increased to 45 %. After some time it returned to normal again.

    After switching the master to node 2 Up2Date is working again. The patterns were updated and I got a success message per mail as configured. And node 1, now slave, got the updates after the restart, too.

    Now, I want to see if there are hourly messages about "Pop3 proxy not running", "ACC device agent not running" and "HA selfcheck", today.

    The daily executive report isn't working. When I start it manually, a small popup window appears with "Please be patient, while the report is generated." But the picture isn't shown, the window doesn't close automatically, and I don't get the report.

    UPDATE, April 7, 2018, 10:05 CEST: I still get hourly messages regarding "Pop3 proxy not running" from both nodes and "HA selfcheck". Only the ACC messages are gone. The CPU load of both nodes is normal.

    UPDATE, April 8, 2018, 06:55 CEST: There are still hourly messages regarding "Pop3 proxy not running" from both nodes and "HA selfcheck" from the master node. I still don't get the daily executive report. The CPU load is at a normal level. I will now stop the slave node for today to see whether this prevents the issue.

    UPDATE, April 8, 2018, 20:05 CEST: There are still hourly messages regarding "Pop3 proxy not running". And of course there are messages that the HA group is broken.

    UPDATE, April 9, 2018, 06:20 CEST: There are still hourly messages regarding "Pop3 proxy not running", and the daily executive report still isn't working. I'm starting node 1 to get HA running again.

    Kind Regards

    TheExpert

  • Hi all,

    after trying some things this weekend (see my previous post with the updates), I don't know whether this issue is related to a bug in Sophos UTM 9.509 or whether there's an issue with the nodes themselves. I updated to 9.509 soon after the announcement, and at first I didn't have any errors or problems. And even with these messages, everything seems to work except the daily executive report.

    Are there any more ideas about what to check? Thank you.

    Kind Regards

    TheExpert

  • Hi all,

    does no one else have an issue like mine?

    I'm thinking about reinstalling both of my nodes from the ISO file of Sophos UTM 9.509-3 and restoring my last good backup from the same version. Maybe this will stop these messages, and hopefully I will then get the daily executive reports again. Or does anyone have other ideas on how to solve these issues?

    Will the Postgres database then be 64 bit? In the past there was a way to update the database from 32 to 64 bit manually on the console. But this isn't needed anymore, right?

    Kind Regards

    TheExpert

  • Hi all,

    tonight I installed a fresh HA pair of Sophos UTM from the ISO file of 9.509-3 and restored a backup from a time when there was no issue. For the moment it looks like the issue is resolved. There are no more error messages. It's now possible to create a daily executive report manually and to get this report as a scheduled task again.

    Regarding my question about the Postgres database: the web site https://community.sophos.com/kb/en-us/126593 describes the upgrade procedure. But if Sophos UTM is installed directly with version 9.5x (no upgrade from 9.x), Postgres should be running as a 64-bit module. How can I check which version of Postgres is running and whether it's 32 or 64 bit?

    Before I set this thread as solved I will check if the issue is coming back again after some days.

    Kind Regards

    TheExpert

  • Nice work!

    Check the version with:

    grep POSTGRES_ARCH /var/storage/pgsql92/postgres.default

    You should get POSTGRES_ARCH="64"

    Taken from the KB article Sophos UTM: Upgrading PostgreSQL to 64-bit on UTM 9.5.
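
The grep check above can be wrapped in a small function so that it prints a plain answer instead of the raw line. This is only a rough sketch: the function name pg_arch and its output strings are made up for illustration, and it assumes the file contains a line of the form POSTGRES_ARCH="64" as shown above.

```shell
#!/bin/sh
# pg_arch: hypothetical helper (not part of the UTM) that reports the
# architecture recorded in the PostgreSQL defaults file.
# The default path is the one quoted in the post above.
pg_arch() {
    file="${1:-/var/storage/pgsql92/postgres.default}"

    # Without a readable file there is nothing to check.
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 2
    fi

    # Pull out the value after POSTGRES_ARCH=, with or without quotes.
    arch=$(sed -n 's/.*POSTGRES_ARCH="\{0,1\}\([^"[:space:]]*\).*/\1/p' "$file" | tail -n 1)

    case "$arch" in
        64) echo "64-bit"; return 0 ;;
        "") echo "unknown"; return 2 ;;
        *)  echo "not 64-bit ($arch)"; return 1 ;;
    esac
}
```

Calling pg_arch with no argument reads the path from the post; passing a file name as the first argument checks a copy of the file instead.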

    Cheers - Bob

     
  • Hi all,

    after one week with a newly installed HA cluster in my virtual Sophos UTM environment, I can confirm that everything is working fine. There are no more issues or error messages, and I get my daily executive reports.

    And I didn't have to upgrade the Postgres database because it was installed as the 64-bit version.

    Kind Regards

    TheExpert

  • Hi all,

    today I updated my VMware ESXi hosts to 6.7. For this I shut down the slave node from the management UI of the UTM. This worked as it should, and I was able to upgrade the ESXi host. Then I waited until the slave node was back online and stopped the master node. From that point on my internet connection didn't work correctly: some web sites were reachable and others (for example HTTPS sites) were not, and it wasn't possible to download files. There was also no way to access the management UI of the UTM. After waiting a very long time I saw that the master node hadn't shut down, so I stopped it with the VMware UI. Then the internet connection worked again without issues. So I upgraded my other ESXi host and then started the UTM master node (preferred master). Everything seemed to work fine: the master node came back online and the failback of the HA cluster worked.

    But now the error messages have come back, and the daily executive report has stopped working again :-(. It looks like there's a bug in UTM 9.509-3 with HA clusters. Maybe files get corrupted when an HA cluster failover happens.

    I can't open a ticket because I'm a Home Edition user. Does anyone know whether this is a known bug and when Sophos will resolve it? Or is the only solution to reinstall the nodes once again and hope never to have a cluster failover, or to go back to an older release?

    Kind Regards

    TheExpert