SG450 A/S Node1 Stuck in up2date "No up2date path to '9.712012', try to fix it"

UPDATE: scroll down for fix.

big thanks to: dirkkotte and solae

tl;dr: I need access to the UTM up2date package u2d-sys-9.711005-712012.tgz.gpg, which Sophos removed from the download page.

Our customer bricked his SG450 A/S cluster today by trying to upgrade from 9.711005 to 9.712013.

The full upgrade path was shown as:

9.711005 to 9.712012

9.712012 to 9.712013

Unit2 managed to upgrade to 9.712012.

Unit1 is stuck in the up2date from 9.711005 to 9.712012 because the download was removed.

Neither unit can go to 9.712013! A direct upgrade from 9.711005 to 9.712013 is not possible while one box is already on 9.712012.

Unit2 is blocked from going to 9.712013 because Unit1 is still trying to upgrade to 9.712012.

Unit1 can't go directly to 9.712013 because Unit2 already has 9.712012 installed.

After some research, it looks like Sophos removed 9.712012 and replaced it with 9.712013.

Both release notes:

https://community.sophos.com/utm-firewall/b/blog/posts/utm-update-9-712-13-released

https://community.sophos.com/utm-firewall/b/blog/posts/utm-up2date-9-712-released-1300703171

refer to the new package: u2d-sys-9.711005-712013.tgz.gpg

But to fix the cluster, I would first need to install u2d-sys-9.711005-712012.tgz.gpg on both appliances.
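
The file names in this thread appear to encode the version a package upgrades from and the version it upgrades to (the log below shows u2d-sys-9.712012-712013.tgz.gpg unpacking as version 9.712013). Here is a minimal sketch, assuming that naming scheme holds, which splits a file name into both versions and checks whether a package bridges two given versions. u2d_versions and u2d_bridges are hypothetical helper names, not part of the UTM tooling.

```shell
#!/bin/sh
# Assumption (inferred from the logs in this thread, not from Sophos docs):
# package names look like u2d-sys-<major>.<from>-<to>.tgz.gpg and the
# target version shares the major number of the source version.

# Print "<from-version> <to-version>" for an up2date package file name.
u2d_versions() {
    name=$(basename "$1" .tgz.gpg)   # u2d-sys-9.711005-712012
    name=${name#u2d-sys-}            # 9.711005-712012
    from=${name%-*}                  # 9.711005
    to=${from%%.*}.${name##*-}       # 9.712012
    echo "$from $to"
}

# Succeed only if the package goes from version $1 to version $2.
u2d_bridges() {
    [ "$(u2d_versions "$3")" = "$1 $2" ]
}
```

For Unit1, `u2d_bridges 9.711005 9.712012 u2d-sys-9.711005-712013.tgz.gpg` returns false, which is consistent with the "doesn't fit" message in the log below.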

2022:09:26-19:02:10 ictrz-fw-01-2 auisys[30172]: Showdesc ok.
2022:09:26-19:02:10 ictrz-fw-01-2 auisys[30172]: [INFO-301] New Firmware Up2Date is ready for installation
2022:09:26-19:02:16 ictrz-fw-01-2 audld[29943]: id="3701" severity="info" sys="system" sub="up2date" name="Authentication successful"
2022:09:26-19:02:16 ictrz-fw-01-2 audld[29943]: Using static download server list in HA mode
2022:09:26-19:02:16 ictrz-fw-01-2 audld[29943]: id="3707" severity="info" sys="system" sub="up2date" name="Successfully synchronized fileset" status="success" action="download" package="sys"
2022:09:26-19:02:17 ictrz-fw-01-2 auisys[30295]: running on HA master system or cluster node
2022:09:26-19:02:17 ictrz-fw-01-2 auisys[30295]: >=========================================================================
2022:09:26-19:02:17 ictrz-fw-01-2 auisys[30295]: Another instance of auisys is already running.
2022:09:26-19:02:17 ictrz-fw-01-2 auisys[30295]: Aappending job to queue! Exiting
2022:09:26-19:02:22 ictrz-fw-01-2 auisys[30341]: running on HA master system or cluster node
2022:09:26-19:02:22 ictrz-fw-01-2 auisys[30341]: >=========================================================================
2022:09:26-19:02:22 ictrz-fw-01-2 auisys[30341]: Another instance of auisys is already running.
2022:09:26-19:02:22 ictrz-fw-01-2 auisys[30341]: Aappending job to queue! Exiting
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <man9> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <aws> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <clvbrowser> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <appctrl43> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <ohelp9> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <aptp> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <cadata> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <geoip> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <man9> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <aws> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <clvbrowser> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <appctrl43> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <ohelp9> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <aptp> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <cadata> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: No suitable packages of type <geoip> found, skipping
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: Install u2d packages <sys>
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: Starting installing up2date packages for type 'sys'
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: unpacking up2date package: /var/up2date/sys/u2d-sys-9.712012-712013.tgz.gpg
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: unpacking up2date package version: 9.712013
2022:09:26-19:02:31 ictrz-fw-01-2 auisys[30172]: Verifying up2date package signature
2022:09:26-19:02:32 ictrz-fw-01-2 auisys[30172]: Unpacking installation instructions
2022:09:26-19:02:32 ictrz-fw-01-2 auisys[30172]: parsing installation instructions
2022:09:26-19:02:32 ictrz-fw-01-2 auisys[30172]: Showdesc ok.
2022:09:26-19:02:32 ictrz-fw-01-2 auisys[30172]: [INFO-301] New Firmware Up2Date is ready for installation
2022:09:26-19:02:53 ictrz-fw-01-2 auisys[30172]: Doing HA sync
2022:09:26-19:02:53 ictrz-fw-01-2 auisys[30172]: calling: </usr/local/bin/up2date_sync.sh>
2022:09:26-19:02:53 ictrz-fw-01-2 auisys[30172]: id="3720" severity="info" sys="system" sub="up2date" name="Successfully triggered up2date sync" status="success" action="sync"
2022:09:26-19:02:53 ictrz-fw-01-2 auisys[30172]: Up2Date Package Installer finished, exiting
2022:09:26-19:02:53 ictrz-fw-01-2 auisys[30172]: id="3716" severity="info" sys="system" sub="up2date" name="Up2Date Package Installer finished, exiting"
2022:09:26-19:04:00 ictrz-fw-01-1 audld[11557]: running on HA slave system or cluster node
2022:09:26-19:04:00 ictrz-fw-01-1 audld[11557]: patch up2date possible
2022:09:26-19:04:00 ictrz-fw-01-1 audld[11557]: Starting Secured Up2Date Package Downloader
2022:09:26-19:04:00 ictrz-fw-01-1 audld[11557]: Using static update server list in HA mode
2022:09:26-19:04:01 ictrz-fw-01-1 audld[11557]: Secured Up2date Authentication
2022:09:26-19:04:02 ictrz-fw-01-1 audld[11557]: id="3701" severity="info" sys="system" sub="up2date" name="Authentication successful"
2022:09:26-19:04:02 ictrz-fw-01-1 audld[11557]: Using static download server list in HA mode
2022:09:26-19:04:02 ictrz-fw-01-1 auisys[11640]: running on HA slave system or cluster node
2022:09:26-19:04:02 ictrz-fw-01-1 auisys[11640]: running on slave/cluster node, skipping license check
2022:09:26-19:04:02 ictrz-fw-01-1 auisys[11640]: waiting for db_verify to return (30 seconds max)
2022:09:26-19:04:03 ictrz-fw-01-1 auisys[11640]: removing '/var/up2date/sys-install'
2022:09:26-19:04:03 ictrz-fw-01-1 auisys[11640]: Starting Up2Date Package Installer
2022:09:26-19:04:03 ictrz-fw-01-1 auisys[11640]: version of package '/var/up2date/sys/u2d-sys-9.711005-712013.tgz.gpg' doesn't fit, skipping
2022:09:26-19:04:03 ictrz-fw-01-1 auisys[11640]: No suitable packages of type <sys> found, skipping
2022:09:26-19:04:04 ictrz-fw-01-1 auisys[11640]: Up2Date Package Installer finished, exiting

STEPS to fix broken slave:

Requirements:

- SSH access to master with loginuser

- root password.

0) Check whether a preferred master is set in the GUI and make sure it's not the slave node. That way you avoid a failback after the upgrade is finished.

1) Log in to the current master node via SSH as user "loginuser" and download the up2date file:

ictrz-fw-01:~ # cd /home/login/
ictrz-fw-01:/home/login # wget https://www.show-run.ch/u2d-sys-9.711005-712012.tgz.gpg
--2022-09-28 15:45:47--  https://www.show-run.ch/u2d-sys-9.711005-712012.tgz.gpg
Resolving www.show-run.ch... 149.126.4.86, 2a01:ab20:0:4::86
Connecting to www.show-run.ch|149.126.4.86|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 268409635 (256M) [application/octet-stream]
Saving to: `u2d-sys-9.711005-712012.tgz.gpg.1'

100%[===============================================================================================================================================================================================>] 268,409,635 12.4M/s   in 23s     

2022-09-28 15:46:10 (11.3 MB/s) - `u2d-sys-9.711005-712012.tgz.gpg.1' saved [268409635/268409635]
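
Note that wget saved the file with a ".1" suffix in the transcript above, which it does when a file with the same name already exists, so make sure you inspect the right file. A small sketch for checking that the file on disk has the size wget reported under "Length:" (check_size is a hypothetical helper name):

```shell
#!/bin/sh
# Compare a file's size in bytes with an expected value.
# Prints a short status line and returns non-zero on mismatch.
check_size() {
    file=$1
    expected=$2
    actual=$(wc -c < "$file" | tr -d ' ')
    if [ "$actual" -eq "$expected" ]; then
        echo "size ok: $actual bytes"
    else
        echo "size mismatch: got $actual, expected $expected" >&2
        return 1
    fi
}
```

With the numbers from the transcript this would be `check_size u2d-sys-9.711005-712012.tgz.gpg 268409635`. A matching size does not prove the content is intact. The signature check that auisys performs later is the real gate.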

NOTE from Sophos: We don't endorse the use of 3rd party links, the official download link is 

https://download.astaro.com/UTM/v9/up2date/u2d-sys-9.711005-712013.tgz.gpg

The command to download from the CLI of the UTM would be:

# wget https://download.astaro.com/UTM/v9/up2date/u2d-sys-9.711005-712013.tgz.gpg

2) SSH to the slave box via the HA link. (It's either node1 or node2, depending on which box is currently slave. In my case it's node1.)

ictrz-fw-01:/home/login # ssh loginuser@node1
loginuser@node1's password: 
Last login: Wed Sep 28 13:22:41 2022 from node2


Sophos UTM
(C) Copyright 2000-2022 Sophos Limited and others. All rights reserved.
Sophos is a registered trademark of Sophos Limited and Sophos Group.
All other product and company names mentioned are trademarks or registered
trademarks of their respective owners.

For more copyright information look at /doc/astaro-license.txt
or http://www.astaro.com/doc/astaro-license.txt

NOTE: If not explicitly approved by Sophos support, any modifications
      done by root will void your support.

<M> loginuser@ictrz-fw-01:/home/login > 

3) Now copy the u2d file from the master node over to the slave node:

<S> loginuser@fw-01:/home/login > scp loginuser@node2:/home/login/u2d-sys-9.711005-712012.tgz.gpg /home/login/u2d-sys-9.711005-712012.tgz.gpg
loginuser@node2's password: 
u2d-sys-9.711005-712012.tgz.gpg 

<M> loginuser@ictrz-fw-01:/home/login > ls -lha
total 512M
drwxr-xr-x 3 loginuser users 4.0K Sep 28 13:08 .
drwxr-xr-x 3 root      root  4.0K Jul 27  2010 ..
-rw------- 1 loginuser users 1.8K Sep 28 14:02 .bash_history
drwx------ 2 loginuser users 4.0K Sep 28 13:01 .ssh
-rw-r--r-- 1 loginuser users 256M Sep 28 13:02 u2d-sys-9.711005-712012.tgz.gpg
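
Before installing, it may be worth confirming that the copy on the slave is identical to the one on the master. A minimal sketch, assuming `cksum` is available on the appliance (I have not verified this on every UTM version); u2d_sum is a hypothetical helper name:

```shell
#!/bin/sh
# Print checksum and byte count of a file without the file name,
# so the output from two nodes can be compared as plain strings.
u2d_sum() {
    cksum < "$1"
}

# Usage on the slave, comparing against the master's copy (node2 in this thread):
#   here=$(u2d_sum /home/login/u2d-sys-9.711005-712012.tgz.gpg)
#   there=$(ssh loginuser@node2 'cksum < /home/login/u2d-sys-9.711005-712012.tgz.gpg')
#   [ "$here" = "$there" ] && echo "copies match"
```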

4) sudo to root and copy the file to the up2date folder

sudo su

cp /home/login/u2d-sys-9.711005-712012.tgz.gpg /var/up2date/sys
cd /var/up2date/sys

fw-01:/var/up2date/sys # ls -lha
total 512M
drwxr-xr-x  2 root root 4.0K Sep 28 13:04 .
drwxr-xr-x 13 root root 4.0K Sep 26 19:04 ..
-rw-r--r--  1 root root 256M Sep 28 13:04 u2d-sys-9.711005-712012.tgz.gpg
-rw-r--r--  1 root root 256M Sep 26 07:18 u2d-sys-9.711005-712013.tgz.gpg

5) Delete every file in this folder except u2d-sys-9.711005-712012.tgz.gpg!

fw-01:/var/up2date/sys # rm u2d-sys-9.711005-712013.tgz.gpg
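
If the folder contains more files than in my case, a dry run that only lists what would be deleted can help avoid removing the wrong package. A small sketch; list_others is a hypothetical helper name and it deletes nothing:

```shell
#!/bin/sh
# Print every regular file in directory $1 except the one named $2.
list_others() {
    dir=$1
    keep=$2
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        [ "$(basename "$f")" = "$keep" ] || echo "$f"
    done
}
```

Running `list_others /var/up2date/sys u2d-sys-9.711005-712012.tgz.gpg` should print only the files you then remove by hand with rm.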

6) Optionally run an up2date simulation first:

fw-01:/var/up2date/sys # auisys.plx -simulation --verbose
'simulation' mode implicits sets noqueue!
running on HA slave system or cluster node
running on slave/cluster node, skipping license check
removing '/var/up2date/appctrl43-install'
removed directory: `/var/up2date/appctrl43-install'
removing '/var/up2date/aptp-install'
removed directory: `/var/up2date/aptp-install'
removing '/var/up2date/aws-install'
removed directory: `/var/up2date/aws-install'
removing '/var/up2date/cadata-install'
removed directory: `/var/up2date/cadata-install'
removing '/var/up2date/clvbrowser-install'
removed directory: `/var/up2date/clvbrowser-install'
removing '/var/up2date/geoip-install'
removed directory: `/var/up2date/geoip-install'
removing '/var/up2date/man9-install'
removed directory: `/var/up2date/man9-install'
removing '/var/up2date/ohelp9-install'
removed directory: `/var/up2date/ohelp9-install'
removing '/var/up2date/sys-install'
removed `/var/up2date/sys-install/u2d-sys-9.712012/install-sys-9.712012.xml'
removed directory: `/var/up2date/sys-install/u2d-sys-9.712012'
removed directory: `/var/up2date/sys-install'
<<<<---- Simulation enabled ---->>>>
(simulation) Starting Up2Date Package Installer
(simulation) No suitable packages of type <man9> found, skipping
(simulation) No suitable packages of type <aws> found, skipping
(simulation) No suitable packages of type <clvbrowser> found, skipping
(simulation) No suitable packages of type <appctrl43> found, skipping
(simulation) No suitable packages of type <ohelp9> found, skipping
(simulation) No suitable packages of type <aptp> found, skipping
(simulation) No suitable packages of type <cadata> found, skipping
(simulation) No suitable packages of type <geoip> found, skipping
(simulation) Install u2d packages <sys>
(simulation) Starting installing up2date packages for type 'sys'
(simulation) Installing up2date package: /var/up2date/sys/u2d-sys-9.711005-712012.tgz.gpg
(simulation) Verifying up2date package signature
(simulation) Unpacking installation instructions
(simulation) parsing installation instructions
(simulation) Unpacking up2date package container
(simulation) Running pre-installation checks
(simulation) Not installing optional aws-cfn-bootstrap

....and so on....

    Would do  7, 0 [ENV   300] sh -c exec /var/up2date/sys-install/u2d-sys-9.712012/update9.712012post_start
    Would do  9, 0 [NOENV  no] rm /var/up2date/sys/u2d-sys-9.711005-712012.tgz.gpg
    Would do  9, 1 [NOENV  no] sync
    Would touch '/tmp/.u2d-sys-9.711-9.712-5.12.1.tgz'
(simulation) New system version: 9.711005
(simulation) Up2Date Package Installer finished, exiting
(simulation) Simulation enabled. Would do a reboot now

Only continue if the simulation finished without errors.
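
The simulation output is long, so saving it to a file and searching it is easier than reading it in the terminal. A minimal sketch, assuming problems show up as lines containing "error", "fail" or the "doesn't fit" message seen in the log earlier in this post. The exact wording auisys uses for other failures is not shown in this thread, so treat a clean result as a hint only. sim_log_clean and the log path are hypothetical.

```shell
#!/bin/sh
# Search a saved simulation log for suspicious lines.
# Prints matching lines with line numbers and returns 1 if any are found.
sim_log_clean() {
    if grep -i -n -E "error|fail|doesn.t fit" "$1"; then
        return 1
    fi
    echo "no error-like lines in $1"
}

# Usage (run as root on the slave):
#   auisys.plx -simulation --verbose 2>&1 | tee /tmp/u2d-sim.log
#   sim_log_clean /tmp/u2d-sim.log
```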

7) run the update

fw-01:/var/up2date/sys # auisys.plx --rpmargs --force --verbose

'verbose' mode implicits set noqueue option!
running on HA slave system or cluster node
running on slave/cluster node, skipping license check
waiting for db_verify to return (30 seconds max)
removing '/var/up2date/appctrl43-install'
removed directory: `/var/up2date/appctrl43-install'
removing '/var/up2date/aptp-install'
removed directory: `/var/up2date/aptp-install'
removing '/var/up2date/aws-install'
removed directory: `/var/up2date/aws-install'
removing '/var/up2date/cadata-install'
removed directory: `/var/up2date/cadata-install'
removing '/var/up2date/clvbrowser-install'
removed directory: `/var/up2date/clvbrowser-install'
removing '/var/up2date/geoip-install'
removed directory: `/var/up2date/geoip-install'
removing '/var/up2date/man9-install'
removed directory: `/var/up2date/man9-install'
removing '/var/up2date/ohelp9-install'
removed directory: `/var/up2date/ohelp9-install'
removing '/var/up2date/sys-install'
removed directory: `/var/up2date/sys-install'
Starting Up2Date Package Installer
No suitable packages of type <man9> found, skipping
No suitable packages of type <aws> found, skipping
No suitable packages of type <clvbrowser> found, skipping
No suitable packages of type <appctrl43> found, skipping
No suitable packages of type <ohelp9> found, skipping
No suitable packages of type <aptp> found, skipping
No suitable packages of type <cadata> found, skipping
No suitable packages of type <geoip> found, skipping
Install u2d packages <sys>
Starting installing up2date packages for type 'sys'
Installing up2date package: /var/up2date/sys/u2d-sys-9.711005-712012.tgz.gpg
Verifying up2date package signature
Unpacking installation instructions
parsing installation instructions
....
Installing rpm package: ep-webadmin-contentmanager-9.70-64.g56528fb.rb2.i686.rpm OK
Installing rpm package: chroot-reverseproxy-2.4.54-0.gdfdca5f.rb2.i686.rpm OK
Installing rpm package: ep-httpproxy-9.70-290.g6e88177f.rb3.i686.rpm       OK
Installing rpm package: kernel-smp64-3.12.74-0.424574463.ge309b77.rb7.x86_64.rpm OK
Installing rpm package: ep-release-9.712-12.noarch.rpm                     OK
....
New system version: 9.712012
Up2Date Package Installer finished, exiting
Initiating reboot

Broadcast message from root (pts/0) (Wed Sep 28 13:15:51 2022):

The system is going down for reboot NOW!

Your SSH session should now drop and you should be back on the master node.
You can watch the process in the HA log from the CLI or go back to the GUI.

fw-01:/home/login # tail -f /var/log/high-availability.log
2022:09:28-13:16:34 fw-01-2 ha_daemon[5014]: id="38A3" severity="debug" sys="System" sub="ha" seq="M:   45 34.586" name="Netlink: Found link beat on eth16 again!"
2022:09:28-13:16:25 fw-01-2 conntrack-tools[5505]: no dedicated links available!<27>Sep 28 13:16:34 conntrack-tools[5505]: no dedicated links available!
2022:09:28-13:16:34 fw-01-2 ha_daemon[5014]: id="38A3" severity="debug" sys="System" sub="ha" seq="M:   46 34.879" name="Netlink: Found link beat on eth3 again!"
2022:09:28-13:16:35 fw-01-2 ha_daemon[5014]: id="38A3" severity="debug" sys="System" sub="ha" seq="M:   47 35.586" name="Netlink: Lost link beat on eth16!"
2022:09:28-13:16:35 fw-01-2 ha_daemon[5014]: id="38A3" severity="debug" sys="System" sub="ha" seq="M:   48 35.586" name="Netlink: Lost link beat on eth3!"
2022:09:28-13:16:35 fw-01-2 conntrack-tools[5505]: no dedicated links available!
2022:09:28-13:16:37 fw-01-2 conntrack-tools[5505]: no dedicated links available!
2022:09:28-13:16:37 fw-01-2 ha_daemon[5014]: id="38A3" severity="debug" sys="System" sub="ha" seq="M:   49 37.223" name="Netlink: Found link beat on eth16 again!"
2022:09:28-13:16:39 fw-01-2 conntrack-tools[5505]: no dedicated links available!
2022:09:28-13:16:39 fw-01-2 ha_daemon[5014]: id="38A3" severity="debug" sys="System" sub="ha" seq="M:   50 39.614" name="Netlink: Lost link beat on eth16!"
2022:09:28-13:16:44 fw-01-2 ha_daemon[5014]: id="38A3" severity="debug" sys="System" sub="ha" seq="M:   51 44.175" name="Netlink: Found link beat on eth3 again!"
2022:09:28-13:16:45 fw-01-2 ha_daemon[5014]: id="38A3" severity="debug" sys="System" sub="ha" seq="M:   52 45.703" name="Netlink: Found link beat on eth16 again!"


eventually you should see something like this:

2022:09:28-13:18:21 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   43 21.924" name="HA control: cmd = 'build'"
2022:09:28-13:18:21 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   44 21.926" name="HA control: cmd = 'up2date successful'"
2022:09:28-13:18:21 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   45 21.926" name="Set UTM version to 9.712012
2022:09:28-13:18:21 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   46 21.926" name="up2date to 9.712012 successful"
2022:09:28-13:18:21 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   47 21.926" name="start/reset initial synchronization timer = 300"
2022:09:28-13:18:21 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   48 21.926" name="state change UP2DATE(256) -> UP2DATE(258)"
2022:09:28-13:18:21 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   49 21.926" name="state change UP2DATE(258) -> SYNCING(2)"
2022:09:28-13:18:21 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   50 21.926" name="Executing (nowait) /etc/init.d/ha_mode enable"
2022:09:28-13:18:21 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   51 21.926" name="--- Node is enabled ---"
2022:09:28-13:18:21 fw-01-1 ha_mode[8063]: calling enable
2022:09:28-13:18:21 fw-01-1 ha_mode[8063]: enable: waiting for last ha_mode done
2022:09:28-13:18:21 fw-01-1 ha_mode[8063]: Switching enable mode
2022:09:28-13:18:22 fw-01-1 ha_mode[8063]: repctl[8096]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:09:28-13:18:22 fw-01-1 repctl[8096]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:09:28-13:18:22 fw-01-1 ha_mode[8063]: enable done (started at 13:18:21)
2022:09:28-13:18:22 fw-01-2 ha_daemon[5014]: id="38A0" severity="info" sys="System" sub="ha" seq="M:   60 22.095" name="Node 1 changed version! 9.711005 -> 9.712012"
2022:09:28-13:18:22 fw-01-2 ha_daemon[5014]: id="38A0" severity="info" sys="System" sub="ha" seq="M:   61 22.096" name="Node 1 changed state: UP2DATE(256) -> SYNCING(2)"
2022:09:28-13:18:22 fw-01-2 ha_daemon[5014]: id="38A0" severity="info" sys="System" sub="ha" seq="M:   62 22.096" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
2022:09:28-13:18:22 fw-01-2 ha_mode[29359]: calling topology_changed
2022:09:28-13:18:22 fw-01-2 ha_mode[29359]: topology_changed: waiting for last ha_mode done
2022:09:28-13:18:22 fw-01-1 repctl[8096]: [i] execute(1768): pg_ctl: server is running (PID: 5142)
2022:09:28-13:18:22 fw-01-1 repctl[8096]: [i] execute(1768): /usr/pgsql92-64/bin/postgres "-D" "/var/storage/pgsql92/data"
2022:09:28-13:18:22 fw-01-2 ha_mode[29359]: repctl[29376]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:09:28-13:18:22 fw-01-2 repctl[29376]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:09:28-13:18:22 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   52 22.355" name="HA control: cmd = 'sync start 2 database'"
2022:09:28-13:18:22 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   53 22.355" name="Activating sync process for database on node 2"
2022:09:28-13:18:22 fw-01-1 repctl[8096]: [i] execute(1768): waiting for server to shut down...
2022:09:28-13:18:22 fw-01-1 repctl[8096]: [i] execute(1768): .
2022:09:28-13:18:22 fw-01-2 ha_mode[29359]: topology_changed done (started at 13:18:22)
2022:09:28-13:18:22 fw-01-2 repctl[29376]: [i] execute(1768): pg_ctl: server is running (PID: 5170)
2022:09:28-13:18:22 fw-01-2 repctl[29376]: [i] execute(1768): /usr/pgsql92-64/bin/postgres "-D" "/var/storage/pgsql92/data"
2022:09:28-13:18:22 fw-01-2 repctl[29376]: [i] execute(1768): pg_ctl: server is running (PID: 5170)
2022:09:28-13:18:22 fw-01-2 repctl[29376]: [i] execute(1768): /usr/pgsql92-64/bin/postgres "-D" "/var/storage/pgsql92/data"
2022:09:28-13:18:22 fw-01-2 repctl[29376]: [i] setup_replication(278): checkinterval 300
2022:09:28-13:18:23 fw-01-1 repctl[8096]: [i] execute(1768):  done
2022:09:28-13:18:23 fw-01-1 repctl[8096]: [i] execute(1768): server stopped
2022:09:28-13:18:25 fw-01-1 repctl[8096]: [i] start_backup_mode(744): starting backup mode at 00000001000006A600000025
2022:09:28-13:18:25 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   54 25.281" name="HA control: cmd = 'sync start 2 database'"
2022:09:28-13:18:25 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   55 25.281" name="Activating sync process for database on node 2"
2022:09:28-13:18:27 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   56 27.562" name="Monitoring interfaces for link beat: lag0 eth17"
2022:09:28-13:18:28 fw-01-2 ha_daemon[5014]: id="38A0" severity="info" sys="System" sub="ha" seq="M:   63 28.026" name="Set syncing.files for node 1"
2022:09:28-13:18:44 fw-01-2 ha_daemon[5014]: id="38A0" severity="info" sys="System" sub="ha" seq="M:   64 44.091" name="Clear syncing.files for node 1"
2022:09:28-13:18:45 fw-01-1 ha_daemon[5025]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   57 45.950" name="Monitoring interfaces for link beat: lag0 eth17"
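
Instead of watching the whole log scroll by, you can pull out just the success message. A minimal sketch that extracts the version from the `up2date to <version> successful` line shown above; ha_up2date_version is a hypothetical helper name:

```shell
#!/bin/sh
# Print the version from the most recent "up2date to <version> successful"
# message in the given HA log file. Prints nothing if no such line exists.
ha_up2date_version() {
    sed -n 's/.*name="up2date to \([0-9.]*\) successful".*/\1/p' "$1" | tail -n 1
}

# Usage on the master:
#   ha_up2date_version /var/log/high-availability.log
```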

In my case node1 got stuck in "syncing".
Maybe I did not wait long enough, but I decided to rebuild the DB on node1 to do a fresh resync.

DON'T DO THIS IF YOUR CLUSTER IS OKAY AT THIS POINT

On node1 (slave node)


fw-01:/var/up2date/sys # killall repctl
fw-01:/var/up2date/sys # /etc/init.d/postgresql92 rebuild
Rebuilding PostgreSQL database, all reporting data will be lost!
Enter "yes" to continue...
yes
:: Stopping PostgreSQL
pg_ctl: PID file "/var/storage/pgsql92/data/postmaster.pid" does not exist
Is server running?
:: Initializing the PostgreSQL database
:: Starting PostgreSQL                                    done
:: Restarting SMTP Proxy
:: Stopping SMTP Proxy
[ ok ]
:: Starting SMTP Proxy
[ ok ]
[ ok ]




fw-01:/var/up2date/sys # /usr/local/bin/repctl
repctl[14832]: [i] daemonize_check(1480): daemonized, see syslog for further messages


It took around 45 minutes to sync the complete logs (around 80 GB),
but eventually the cluster was active/standby again.
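
From the figures above (around 80 GB in around 45 minutes) you can get a rough idea of how long your own resync might take. A back-of-the-envelope sketch using integer arithmetic. Both helper names are hypothetical, and the rate on your HA link may differ a lot.

```shell
#!/bin/sh
# Average rate in MB/s for a sync of $1 GB that took $2 minutes.
sync_rate_mbs() {
    echo $(( $1 * 1024 / ($2 * 60) ))
}

# Estimated duration in minutes for $1 GB at $2 MB/s.
sync_eta_min() {
    echo $(( $1 * 1024 / $2 / 60 ))
}
```

`sync_rate_mbs 80 45` gives about 30 MB/s for my cluster. At that rate `sync_eta_min 200 30` estimates roughly 113 minutes for 200 GB of logs.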



  • awesome.

    Independently of you, I was able to fix my cluster with the same method as well. I was just about to post when I saw your reply.

    There are some additional steps that are worth mentioning so people can reproduce this.

    I'll update my initial post with a tutorial for better visibility.
