

HA rsync error on active/passive HA cluster

Hi guys,

ugh, with 9.5 Sophos did a bad job, once again.

Since upgrading our SG330 active/passive cluster to 9.5 (now 9.502-4), we have had several problems. SSO didn't update the group entries from AD (i.e. group memberships in the "internet allow" AD group weren't updated, so users newly added to that group couldn't use the internet, while users removed from it still could).

Second, the DB syncing of the HA cluster seems to fail. Even though the Status tab in HA shows the master as Active and the slave as Ready, the HA log produces errors every 10 seconds:

2017:08:14-09:01:11 xxxx-01-2 repctl[3738]: [i] execute(1768): rsync: failed to connect to Connection refused (111)
2017:08:14-09:01:11 xxxx-01-2 repctl[3738]: [i] execute(1768): rsync error: error in socket IO (code 10) at clientserver.c(122) [receiver=3.0.4]
2017:08:14-09:01:11 xxxx-01-2 repctl[3738]: [c] standby_clone(936): rsync failed on $VAR1 = {
2017:08:14-09:01:11 xxxx-01-2 repctl[3738]: [c] standby_clone(936): 'path' => '/postgres.default',
2017:08:14-09:01:11 xxxx-01-2 repctl[3738]: [c] standby_clone(936): 'dst' => '/var/storage/pgsql92/',
2017:08:14-09:01:11 xxxx-01-2 repctl[3738]: [c] standby_clone(936): 'module' => 'postgres-default'

After several retries, it stops:

repctl[3738]: [w] master_connection(2015): check_dbh: -1
2017:08:14-08:55:40 xxxx-01-2 repctl[3738]: [i] stop_backup_mode(765): stopped backup mode at 0000000100000020000000D3
2017:08:14-08:55:40 xxxx-01-2 repctl[3738]: [c] standby_clone(950): standby_clone failed: sync aborted (never executed successfully)
2017:08:14-08:55:40 xxxx-01-2 repctl[3738]: [e] prepare_secondary(346): prepare_secondary: clone failed
2017:08:14-08:55:40 xxxx-01-2 repctl[3738]: [c] prepare_secondary(360): failed to get database up, waiting for retry
2017:08:14-08:55:40 xxxx-01-2 repctl[3738]: [c] setup_replication(274): setup_replication was not properly executed
2017:08:14-08:55:40 xxxx-01-2 repctl[3738]: [i] setup_replication(278): checkinterval 300
2017:08:14-09:00:39 xxxx-01-2 repctl[3738]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 0
2017:08:14-09:00:39 xxxx-01-2 repctl[3738]: [i] execute(1768): pg_ctl: no server running
2017:08:14-09:00:39 xxxx-01-2 ha_daemon[4346]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 462 39.097" name="HA control: cmd = 'sync start 1 database'"
2017:08:14-09:00:39 xxxx-01-2 ha_daemon[4346]: id="38A1" severity="info" sys="System" sub="ha" seq="S: 463 39.097" name="control_sync(): we are not in state SYNCING, ignoring sync for database/1"
2017:08:14-09:00:39 xxxx-01-2 repctl[3738]: [i] execute(1768): pg_ctl: PID file "/var/storage/pgsql92/data/" does not exist
2017:08:14-09:00:39 xxxx-01-2 repctl[3738]: [i] execute(1768): Is server running?


and every 55 minutes we get a message from the alerting system:

HA SELFMON WARN: Restarting repctl for SLAVE(ACTIVE)


Any ideas? Apart from rebuilding the PostgreSQL DBs, or releasing the HA status, factory-resetting the slave, and rebuilding HA.



  • Hi Thomas,


    Check your partitions with the SSH command "df -h" and post your results.

    I had this problem once in combination with a full disk.
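
    A quick way to spot near-full filesystems from the shell is a plain df/awk one-liner (nothing UTM-specific; the 90% threshold is just an example):

        df -h | awk 'NR==1 || $5+0 >= 90'    # header plus any filesystem at 90% use or more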




  •  Hi Daniel,

    Thanks for the reply, but the partitions are far from full. Usage is between 1% and 60%:

    Filesystem                        Size  Used Avail Use% Mounted on

    /dev/sda6                         5.2G  3.0G  2.0G  60% /
    udev                              5.9G  120K  5.9G   1% /dev
    tmpfs                             5.9G  104K  5.9G   1% /dev/shm
    /dev/sda1                         331M   16M  295M   5% /boot
    /dev/sda5                          62G   13G   47G  21% /var/storage
    /dev/sda7                          81G   44G   33G  58% /var/log
    /dev/sda8                         3.5G  237M  3.1G   8% /tmp
    /dev                              5.9G  120K  5.9G   1% /var/storage/chroot-clientlessvpn/dev
    tmpfs                             5.9G     0  5.9G   0% /var/sec/chroot-httpd/dev/shm
    /dev                              5.9G  120K  5.9G   1% /var/sec/chroot-openvpn/dev
    /dev                              5.9G  120K  5.9G   1% /var/sec/chroot-ppp/dev
    /dev                              5.9G  120K  5.9G   1% /var/sec/chroot-pppoe/dev
    /dev                              5.9G  120K  5.9G   1% /var/sec/chroot-pptp/dev
    /dev                              5.9G  120K  5.9G   1% /var/sec/chroot-pptpc/dev
    /dev                              5.9G  120K  5.9G   1% /var/sec/chroot-restd/dev
    tmpfs                             5.9G     0  5.9G   0% /var/storage/chroot-reverseproxy/dev/shm
    /opt/tmpfs                        5.2G  3.0G  2.0G  60% /var/sec/chroot-httpd/opt/tmpfs
    /var/storage                       62G   13G   47G  21% /var/sec/chroot-snmp/var/storage
    /var/log                           81G   44G   33G  58% /var/sec/chroot-snmp/var/log
    /etc/nwd.d/route                  5.2G  3.0G  2.0G  60% /var/sec/chroot-ipsec/etc/nwd.d/route
    /var/storage/chroot-smtp/spool     62G   13G   47G  21% /var/sec/chroot-httpd/var/spx/spool
    /var/storage/chroot-smtp/spx       62G   13G   47G  21% /var/sec/chroot-httpd/var/spx/public/images/spx
    tmpfs                             5.9G  128K  5.9G   1% /var/storage/chroot-smtp/tmp/ram
    tmpfs                             5.9G  214M  5.7G   4% /var/storage/chroot-http/tmp
    /var/sec/chroot-afc/var/run/navl  5.2G  3.0G  2.0G  60% /var/storage/chroot-http/var/run/navl


    So the disk filling up won't be the problem.

  • Hi Thomas,


    Please check the slave node too.

    You can connect to the slave node with the command "ha_utils ssh". Check the filesystem and the path of the PostgreSQL DB: /var/storage/pgsql92/data/
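
    For example (a sketch: "ha_utils ssh" drops you into a shell on the slave node, the other two are standard commands):

        ha_utils ssh                         # connect from the master to the slave node
        df -h /var/storage                   # check the partition holding the Postgres DB
        ls -la /var/storage/pgsql92/data/    # verify the Postgres data directory is present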




  • Hi Daniel,


    Of course I've checked both systems before.

    /var/storage is at 21%,

    and /var/storage/pgsql92/data is in use, i.e. there are files with current data like postmaster.opts and directories like pg_xlog,

    so the partition/mountpoint can be written to.
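
    (A quick way to double-check writability is creating and removing a scratch file; the file name here is arbitrary:)

        touch /var/storage/pgsql92/data/.writetest && rm /var/storage/pgsql92/data/.writetest && echo writable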




  • Hi Thomas,


    So I think rebuilding the database would be best. But it's not recommended without Sophos Support.

    Look at this:


    If you can't contact Support, you can try to break the cluster, reset the slave to factory default, and rejoin it to the master.




  • Hi Thomas,

    Daniel has a potential answer to your question: it could be that the slave's partition filled up. Also, check whether you see any log lines similar to:

    "[user:notice] cluster_sync[1590]: rsync: write failed on "/var/up2date/aptp/u2d-aptp-9.23906.tgz.gpg" (in cluster_sync): No space left on device (28)

    You can check in fallback.log; execute:

    grep -i "no space" /var/log/fallback.log

    If that doesn't help, rebuild Postgres; that should resolve the issue.


    Sachin Gurung
    Team Lead | Sophos Technical Support

  • Hi,


    Thanks for the information.

    As written before, it's not a problem with "no space". All partitions, even on the slave, have enough space.

    And I asked for a solution that involves neither breaking up the cluster nor rebuilding Postgres.


    But if this is the only way, I'll have to do so.



  • Hi


    Did you manage to solve this issue?

    I now have exactly the same situation and also do not want to break the cluster.



  • Hi aedii,


    It's long ago, but in the end we broke up the cluster, reinitialised the slave machine, and then rejoined it to the cluster.

    Perhaps there are better solutions in the meantime...?




  • Hi tkaufi,

    Thanks for your reply.

    I rebuilt the Postgres DB on both machines.


    I connected to both machines with ssh and then issued these commands:

    - killall repctl

    - /etc/init.d/postgresql92 rebuild

    - repctl

    - reboot slave node


    and then it synced again without breaking the cluster, but the DB contents were lost.
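
    Condensed into one block (the same steps as above; run the first three commands on both nodes, then reboot only the slave):

        killall repctl                      # stop the replication controller
        /etc/init.d/postgresql92 rebuild    # re-initialise the Postgres 9.2 instance (wipes the DB)
        repctl                              # start the replication controller again
        # finally, reboot the slave node so it re-syncs from the master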