
SG230 - HA Slave stays SYNCING

 Hello,

We're having some trouble with our SG230 cluster.

We have two SG230s in two different datacenters, in an HA active/passive configuration.

Everything ran fine for nearly 300 days, but last week there was an electrical incident in the SLAVE's datacenter.

The SG230 MASTER in the other datacenter stayed online; no impact on production.

 

We started the SLAVE, but it shut down on its own after a few minutes, and its status stayed DEAD in the WebAdmin HA status.

After reading some similar cases, we decided to:

- delete the HA configuration from the master

- factory reset the SLAVE

- recreate the HA configuration from the master

- recreate a basic configuration on the SLAVE and HA for Node 2

 

But the SLAVE status stays SYNCING:

I noticed a few errors during the slave's boot, but I don't know if they're critical.

 

 

And here is the HA live log.

But I'm not sure how to identify the relevant information or what I need to do.

 

Thanks in advance 

Best regards

 

 

2016:11:24-12:19:44 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 41 44.752" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"

2016:11:24-12:19:44 dcexresort-master-1 ha_mode[21139]: calling topology_changed

2016:11:24-12:19:44 dcexresort-master-1 ha_mode[21139]: topology_changed: waiting for last ha_mode done

2016:11:24-12:19:44 dcexresort-master-1 ha_mode[21139]: daemonized...

2016:11:24-12:19:44 dcexresort-master-1 repctl[21160]: [i] daemonize_check(1362): trying to signal daemon

2016:11:24-12:19:45 dcexresort-master-1 ha_mode[21139]: topology_changed done (started at 12:19:44)

2016:11:24-12:19:45 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 42 45.042" name="Reading cluster configuration"

2016:11:24-12:19:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 43 50.622" name="Set syncing.files for node 2"

2016:11:24-12:19:58 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 44 58.686" name="Node 2 changed state: SYNCING(2) -> SYNCING(3)"

2016:11:24-12:20:00 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 45 00.172" name="Monitoring interfaces for link beat: eth4 eth1 eth0"

2016:11:24-12:24:59 dcexresort-master-2 repctl[4237]: [i] start_backup_mode(643): starting backup mode at 000000010000011200000082
2016:11:24-12:24:59 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 62 59.911" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-12:24:59 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 63 59.911" name="Activating sync process for database on node 1"
2016:11:24-12:24:59 dcexresort-master-2 repctl[4237]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-12:24:59 dcexresort-master-2 repctl[4237]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-12:25:01 dcexresort-master-2 repctl[4237]: [i] stop_backup_mode(664): stopped backup mode at 000000010000011200000082
2016:11:24-12:25:01 dcexresort-master-2 repctl[4237]: [c] standby_clone(837): sync aborted
2016:11:24-12:25:01 dcexresort-master-2 repctl[4237]: [e] prepare_secondary(293): clone failed
2016:11:24-12:25:01 dcexresort-master-2 repctl[4237]: [c] prepare_secondary(305): failed to get database up, waiting for retry
2016:11:24-12:25:01 dcexresort-master-2 repctl[4237]: [i] setup_replication(229): checkinterval 300


2016:11:24-12:29:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 51 50.622" name="Set syncing.files for node 2"
2016:11:24-12:30:33 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 52 33.412" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-12:30:33 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 53 33.412" name="Clear syncing.files for node 2"
2016:11:24-12:34:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 54 50.622" name="Set syncing.files for node 2"
2016:11:24-12:35:11 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 55 11.733" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-12:35:11 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 56 11.733" name="Clear syncing.files for node 2"
2016:11:24-12:39:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 57 50.622" name="Set syncing.files for node 2"
2016:11:24-12:40:09 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 58 09.601" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-12:40:09 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 59 09.601" name="Clear syncing.files for node 2"
2016:11:24-12:44:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 60 50.622" name="Set syncing.files for node 2"
2016:11:24-12:45:22 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 61 22.984" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-12:45:22 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 62 22.984" name="Clear syncing.files for node 2"
2016:11:24-12:50:16 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 64 16.155" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-12:50:16 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 65 16.155" name="Clear syncing.files for node 2"
2016:11:24-12:54:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 66 50.622" name="Set syncing.files for node 2"
2016:11:24-12:55:17 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 67 17.758" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-12:55:17 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 68 17.758" name="Clear syncing.files for node 2"
ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 69 50.622" name="Set syncing.files for node 2"
2016:11:24-13:00:39 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 70 39.097" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-13:00:39 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 71 39.097" name="Clear syncing.files for node 2"
2016:11:24-13:04:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 72 50.622" name="Set syncing.files for node 2"
2016:11:24-13:05:05 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 73 05.081" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-13:05:05 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 74 05.081" name="Clear syncing.files for node 2"
2016:11:24-13:09:44 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 75 44.622" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip 198.19.250.2"
2016:11:24-13:09:44 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 76 44.697" name="Executing (nowait) /etc/init.d/ha_mode check"
2016:11:24-13:09:44 dcexresort-master-1 ha_mode[1708]: calling check
2016:11:24-13:09:44 dcexresort-master-1 ha_mode[1708]: check: waiting for last ha_mode done
2016:11:24-13:09:44 dcexresort-master-1 ha_mode[1708]: check_ha() role=MASTER, status=UNLINKED
2016:11:24-13:09:44 dcexresort-master-1 ha_mode[1708]: check done (started at 13:09:44)
2016:11:24-13:09:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 77 50.622" name="Set syncing.files for node 2"
2016:11:24-13:10:04 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 78 04.331" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-13:10:04 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 79 04.331" name="Clear syncing.files for node 2"

2016:11:24-13:14:44 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 64 44.586" name="Executing (wait) /usr/local/bin/confd-setha mode slave"
2016:11:24-13:14:44 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 65 44.740" name="Executing (nowait) /etc/init.d/ha_mode check"
2016:11:24-13:14:44 dcexresort-master-2 ha_mode[12353]: calling check
2016:11:24-13:14:44 dcexresort-master-2 ha_mode[12353]: check: waiting for last ha_mode done
2016:11:24-13:14:44 dcexresort-master-2 ha_mode[12353]: check_ha() role=SLAVE, status=SYNCING
2016:11:24-13:14:44 dcexresort-master-2 repctl[4237]: [i] execute(1627): pg_ctl: PID file "/var/storage/pgsql92/data/postmaster.pid" does not exist
2016:11:24-13:14:44 dcexresort-master-2 repctl[4237]: [i] execute(1627): Is server running?
2016:11:24-13:14:44 dcexresort-master-2 repctl[4237]: [i] execute(1627): starting server anyway
2016:11:24-13:14:44 dcexresort-master-2 repctl[4237]: [i] execute(1627): pg_ctl: could not read file "/var/storage/pgsql92/data/postmaster.opts"
2016:11:24-13:14:47 dcexresort-master-2 ha_mode[12353]: daemonized...
2016:11:24-13:14:47 dcexresort-master-2 repctl[12375]: [i] execute(1627): pg_ctl: no server running
2016:11:24-13:14:47 dcexresort-master-2 ha_mode[12353]: HA SELFMON WARN: Restarting repctl for SLAVE(SYNCING)
2016:11:24-13:14:47 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 66 47.928" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:14:47 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 67 47.928" name="Activating sync process for database on node 1"
2016:11:24-13:14:47 dcexresort-master-2 repctl[12375]: [i] execute(1627): pg_ctl: PID file "/var/storage/pgsql92/data/postmaster.pid" does not exist
2016:11:24-13:14:47 dcexresort-master-2 repctl[12375]: [i] execute(1627): Is server running?
2016:11:24-13:14:47 dcexresort-master-2 ha_mode[12353]: check done (started at 13:14:44)
2016:11:24-13:14:48 dcexresort-master-2 repctl[12375]: [i] start_backup_mode(643): starting backup mode at 000000010000011200000085
2016:11:24-13:14:48 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 68 48.475" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:14:48 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 69 48.475" name="Activating sync process for database on node 1"
2016:11:24-13:14:48 dcexresort-master-2 repctl[12375]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-13:14:48 dcexresort-master-2 repctl[12375]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-13:14:49 dcexresort-master-2 repctl[12375]: [i] stop_backup_mode(664): stopped backup mode at 000000010000011200000085
2016:11:24-13:14:49 dcexresort-master-2 repctl[12375]: [c] standby_clone(837): sync aborted
2016:11:24-13:14:49 dcexresort-master-2 repctl[12375]: [e] prepare_secondary(293): clone failed
2016:11:24-13:14:49 dcexresort-master-2 repctl[12375]: [i] start_backup_mode(643): starting backup mode at 000000010000011200000087
2016:11:24-13:14:49 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 70 49.809" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:14:49 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 71 49.809" name="Activating sync process for database on node 1"
2016:11:24-13:14:49 dcexresort-master-2 repctl[12375]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-13:14:49 dcexresort-master-2 repctl[12375]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-13:14:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 80 50.622" name="Set syncing.files for node 2"
2016:11:24-13:14:50 dcexresort-master-2 repctl[12375]: [i] stop_backup_mode(664): stopped backup mode at 000000010000011200000087
2016:11:24-13:14:50 dcexresort-master-2 repctl[12375]: [c] standby_clone(837): sync aborted
2016:11:24-13:14:50 dcexresort-master-2 repctl[12375]: [e] prepare_secondary(293): clone failed
2016:11:24-13:14:51 dcexresort-master-2 repctl[12375]: [i] start_backup_mode(643): starting backup mode at 000000010000011200000089
2016:11:24-13:14:51 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 72 51.072" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:14:51 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 73 51.072" name="Activating sync process for database on node 1"
2016:11:24-13:14:51 dcexresort-master-2 repctl[12375]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-13:14:51 dcexresort-master-2 repctl[12375]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-13:14:52 dcexresort-master-2 repctl[12375]: [i] stop_backup_mode(664): stopped backup mode at 000000010000011200000089
2016:11:24-13:14:52 dcexresort-master-2 repctl[12375]: [c] standby_clone(837): sync aborted
2016:11:24-13:14:52 dcexresort-master-2 repctl[12375]: [e] prepare_secondary(293): clone failed
2016:11:24-13:14:52 dcexresort-master-2 repctl[12375]: [c] prepare_secondary(305): failed to get database up, waiting for retry
2016:11:24-13:14:52 dcexresort-master-2 repctl[12375]: [i] setup_replication(229): checkinterval 300
2016:11:24-13:15:34 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 81 34.499" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-13:15:34 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 82 34.499" name="Clear syncing.files for node 2"
2016:11:24-13:15:57 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 74 57.413" name="Monitoring interfaces for link beat: eth4 eth1 eth0"
2016:11:24-13:16:02 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 75 02.652" name="Monitoring interfaces for link beat: eth4 eth1 eth0"
2016:11:24-13:16:07 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 83 07.736" name="Monitoring interfaces for link beat: eth4 eth1 eth0"


2016:11:24-13:19:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 84 50.622" name="Set syncing.files for node 2"
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [e] db_connect(2058): error while connecting to database(DBI:Pg:dbname=repmgr): could not connect to server: No such file or directory
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [c] local_connection(1946): cannot connect to local database: could not connect to server: No such file or directory
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [w] recheck(1030): re-initialising replication
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [i] execute(1627): pg_ctl: no server running
2016:11:24-13:19:52 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 76 52.201" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:19:52 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 77 52.201" name="Activating sync process for database on node 1"
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [i] execute(1627): pg_ctl: PID file "/var/storage/pgsql92/data/postmaster.pid" does not exist
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [i] start_backup_mode(643): starting backup mode at 00000001000001120000008B
2016:11:24-13:19:52 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 78 52.776" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:19:52 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 79 52.776" name="Activating sync process for database on node 1"
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-13:19:53 dcexresort-master-2 repctl[12375]: [i] stop_backup_mode(664): stopped backup mode at 00000001000001120000008B
2016:11:24-13:19:53 dcexresort-master-2 repctl[12375]: [c] standby_clone(837): sync aborted
2016:11:24-13:19:53 dcexresort-master-2 repctl[12375]: [e] prepare_secondary(293): clone failed
2016:11:24-13:19:54 dcexresort-master-2 repctl[12375]: [i] start_backup_mode(643): starting backup mode at 00000001000001120000008D
2016:11:24-13:19:54 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 80 54.116" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:19:54 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 81 54.116" name="Activating sync process for database on node 1"
2016:11:24-13:19:54 dcexresort-master-2 repctl[12375]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-13:19:54 dcexresort-master-2 repctl[12375]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-13:19:55 dcexresort-master-2 repctl[12375]: [i] stop_backup_mode(664): stopped backup mode at 00000001000001120000008D
2016:11:24-13:19:55 dcexresort-master-2 repctl[12375]: [c] standby_clone(837): sync aborted
2016:11:24-13:19:55 dcexresort-master-2 repctl[12375]: [e] prepare_secondary(293): clone failed
2016:11:24-13:19:55 dcexresort-master-2 repctl[12375]: [i] start_backup_mode(643): starting backup mode at 00000001000001120000008F
2016:11:24-13:19:55 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 82 55.959" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:19:55 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 83 55.959" name="Activating sync process for database on node 1"
2016:11:24-13:19:56 dcexresort-master-2 repctl[12375]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-13:19:56 dcexresort-master-2 repctl[12375]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-13:19:57 dcexresort-master-2 repctl[12375]: [i] stop_backup_mode(664): stopped backup mode at 00000001000001120000008F
2016:11:24-13:19:57 dcexresort-master-2 repctl[12375]: [c] standby_clone(837): sync aborted
2016:11:24-13:19:57 dcexresort-master-2 repctl[12375]: [e] prepare_secondary(293): clone failed
2016:11:24-13:19:57 dcexresort-master-2 repctl[12375]: [c] prepare_secondary(305): failed to get database up, waiting for retry
2016:11:24-13:19:57 dcexresort-master-2 repctl[12375]: [i] setup_replication(229): checkinterval 300

 



  • The first sync on a new slave sometimes hangs.

    Go to Management > High Availability and try to reboot the slave.

    This usually solves the problem.


    Dirk

    Systema Gesellschaft für angewandte Datentechnik mbH  // Sophos Platinum Partner
    Sophos Solution Partner since 2003
    If a post solves your question, click the 'Verify Answer' link at this post.

  • Thank you for getting back to me.

    I had already rebooted the slave.

    But I tried again and the log looks the same.

     


    Live Log: High availability
    2016:11:25-11:14:51 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 525 51.089" name="Activating sync process for database on node 1"
    2016:11:25-11:14:51 dcexresort-master-2 repctl[2106]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
    2016:11:25-11:14:51 dcexresort-master-2 repctl[2106]: [c] standby_clone(825): rsync failed on $VAR1 = {
    2016:11:25-11:14:52 dcexresort-master-2 repctl[2106]: [i] stop_backup_mode(664): stopped backup mode at 0000000100000113000000E0
    2016:11:25-11:14:52 dcexresort-master-2 repctl[2106]: [c] standby_clone(837): sync aborted
    2016:11:25-11:14:52 dcexresort-master-2 repctl[2106]: [e] prepare_secondary(293): clone failed
    2016:11:25-11:14:52 dcexresort-master-2 repctl[2106]: [c] prepare_secondary(305): failed to get database up, waiting for retry
    2016:11:25-11:14:52 dcexresort-master-2 repctl[2106]: [i] setup_replication(229): checkinterval 300
    2016:11:25-11:15:33 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 939 33.401" name="filesync_end(): initial sync failed, status = 0x200"
    2016:11:25-11:15:33 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 940 33.401" name="Clear syncing.files for node 2"
    2016:11:25-11:17:46 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 941 46.280" name="HA control: cmd = 'reboot 2'"
    2016:11:25-11:17:46 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 526 46.170" name="Received command to reboot!"
    2016:11:25-11:17:47 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 527 47.290" name="HA daemon shutting down (SIGTERM)"
    2016:11:25-11:17:47 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 528 47.290" name="Executing (nowait) /etc/init.d/ha_mode disable"
    2016:11:25-11:17:47 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 529 47.290" name="--- Node is disabled ---"
    2016:11:25-11:17:47 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 530 47.290" name="Executing (nowait) /etc/init.d/ha_mode shutdown"
    2016:11:25-11:17:47 dcexresort-master-2 ha_mode[2887]: calling shutdown
    2016:11:25-11:17:47 dcexresort-master-2 ha_mode[2886]: calling disable
    2016:11:25-11:17:47 dcexresort-master-2 ha_mode[2886]: disable: waiting for last ha_mode done
    2016:11:25-11:17:47 dcexresort-master-2 ha_mode[2886]: Switching disable mode
    2016:11:25-11:17:47 dcexresort-master-2 ha_mode[2886]: disable done (started at 11:17:47)
    2016:11:25-11:17:47 dcexresort-master-2 repctl[2106]: [i] execute(1627): pg_ctl: PID file "/var/storage/pgsql92/data/postmaster.pid" does not exist
    2016:11:25-11:17:47 dcexresort-master-2 ha_mode[2887]: shutdown: waiting for last ha_mode done
    2016:11:25-11:17:47 dcexresort-master-2 ha_mode[2887]: /var/mdw/scripts/confd-sync: /usr/local/bin/confd-sync stopped
    2016:11:25-11:17:47 dcexresort-master-2 ha_mode[2887]: shutdown done (started at 11:17:47)
    2016:11:25-11:17:47 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 531 47.398" name="HA daemon exits (SIGTERM)"
    2016:11:25-11:17:50 dcexresort-master-1 ha_daemon[11985]: id="38C1" severity="error" sys="System" sub="ha" seq="M: 942 50.622" name="Node 2 is dead, received no heart beats"
    2016:11:25-11:17:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 943 50.623" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip ''"
    2016:11:25-11:17:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 944 50.701" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
    2016:11:25-11:17:50 dcexresort-master-1 ha_mode[11160]: calling topology_changed
    2016:11:25-11:17:50 dcexresort-master-1 ha_mode[11160]: topology_changed: waiting for last ha_mode done
    2016:11:25-11:17:50 dcexresort-master-1 ha_mode[11160]: daemonized...
    2016:11:25-11:17:50 dcexresort-master-1 repctl[11177]: [i] daemonize_check(1362): trying to signal daemon
    2016:11:25-11:17:50 dcexresort-master-1 ha_mode[11160]: topology_changed done (started at 11:17:50)
    2016:11:25-11:17:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 945 50.968" name="Reading cluster configuration"
    2016:11:25-11:18:06 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 946 06.077" name="Monitoring interfaces for link beat: eth1 eth0"
    2016:11:25-11:19:03 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 947 03.679" name="Access granted to remote node 2!"
    2016:11:25-11:19:03 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 948 03.769" name="Node 2 changed version! 0.000000 -> 9.315002"
    2016:11:25-11:19:03 dcexresort-master-1 ha_daemon[11985]: id="38C0" severity="info" sys="System" sub="ha" seq="M: 949 03.769" name="Node 2 is alive"
    2016:11:25-11:19:03 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 950 03.769" name="Node 2 changed state: DEAD(2048) -> SYNCING(2)"
    2016:11:25-11:19:03 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 951 03.769" name="Node 2 changed role: DEAD -> SLAVE"
    2016:11:25-11:19:03 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 952 03.769" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip 198.19.250.2"
    2016:11:25-11:19:03 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 953 03.860" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
    2016:11:25-11:19:03 dcexresort-master-1 ha_mode[11586]: calling topology_changed
    2016:11:25-11:19:03 dcexresort-master-1 ha_mode[11586]: topology_changed: waiting for last ha_mode done
    2016:11:25-11:19:04 dcexresort-master-1 ha_mode[11586]: daemonized...
    2016:11:25-11:19:04 dcexresort-master-1 repctl[11610]: [i] daemonize_check(1362): trying to signal daemon
    2016:11:25-11:19:04 dcexresort-master-1 ha_mode[11586]: topology_changed done (started at 11:19:03)
    2016:11:25-11:19:04 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 954 04.204" name="Reading cluster configuration"
    2016:11:25-11:19:09 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 955 09.622" name="Set syncing.files for node 2"
    2016:11:25-11:19:19 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 956 19.319" name="Monitoring interfaces for link beat: eth1 eth0"
    2016:11:25-11:19:35 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 957 35.369" name="filesync_end(): initial sync failed, status = 0x200"
    2016:11:25-11:19:35 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 958 35.369" name="Clear syncing.files for node 2"
    2016:11:25-11:24:09 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 959 09.622" name="Set syncing.files for node 2"
    2016:11:25-11:24:10 dcexresort-master-2 repctl[4225]: [e] db_connect(2058): error while connecting to database(DBI:Pg:dbname=repmgr): could not connect to server: No such file or directory
    2016:11:25-11:24:10 dcexresort-master-2 repctl[4225]: [w] recheck(1030): re-initialising replication
    2016:11:25-11:24:10 dcexresort-master-2 repctl[4225]: [i] execute(1627): pg_ctl: no server running
    2016:11:25-11:24:10 dcexresort-master-2 ha_daemon[4178]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 45 10.810" name="HA control: cmd = 'sync start 1 database'"
    2016:11:25-11:24:10 dcexresort-master-2 ha_daemon[4178]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 46 10.810" name="Activating sync process for database on node 1"
    2016:11:25-11:24:10 dcexresort-master-2 repctl[4225]: [i] execute(1627): pg_ctl: PID file "/var/storage/pgsql92/data/postmaster.pid" does not exist
    2016:11:25-11:24:11 dcexresort-master-2 repctl[4225]: [i] start_backup_mode(643): starting backup mode at 0000000100000113000000E8
    2016:11:25-11:24:11 dcexresort-master-2 ha_daemon[4178]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 47 11.296" name="HA control: cmd = 'sync start 1 database'"
    2016:11:25-11:24:11 dcexresort-master-2 ha_daemon[4178]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 48 11.296" name="Activating sync process for database on node 1"
    2016:11:25-11:24:11 dcexresort-master-2 repctl[4225]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
    2016:11:25-11:24:11 dcexresort-master-2 repctl[4225]: [i] execute(1627):
    2016:11:25-11:24:11 dcexresort-master-2 repctl[4225]: [i] execute(1627): rsync error: errors selecting input/output files, dirs (code 3) at main.c(614) [receiver=3.0.4]
    2016:11:25-11:24:11 dcexresort-master-2 repctl[4225]: [i] execute(1627):
    2016:11:25-11:24:11 dcexresort-master-2 repctl[4225]: [c] standby_clone(825): rsync failed on $VAR1 = {
    2016:11:25-11:24:12 dcexresort-master-2 repctl[4225]: [i] stop_backup_mode(664): stopped backup mode at 0000000100000113000000E8
    2016:11:25-11:24:12 dcexresort-master-2 repctl[4225]: [c] standby_clone(837): sync aborted
    2016:11:25-11:24:12 dcexresort-master-2 repctl[4225]: [e] prepare_secondary(293): clone failed
    2016:11:25-11:24:12 dcexresort-master-2 repctl[4225]: [i] start_backup_mode(643): starting backup mode at 0000000100000113000000EA
    2016:11:25-11:24:12 dcexresort-master-2 ha_daemon[4178]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 49 12.764" name="HA control: cmd = 'sync start 1 database'"
    2016:11:25-11:24:12 dcexresort-master-2 ha_daemon[4178]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 50 12.764" name="Activating sync process for database on node 1"
    2016:11:25-11:24:12 dcexresort-master-2 repctl[4225]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
    2016:11:25-11:24:12 dcexresort-master-2 repctl[4225]: [c] standby_clone(825): rsync failed on $VAR1 = {
    2016:11:25-11:24:13 dcexresort-master-2 repctl[4225]: [i] stop_backup_mode(664): stopped backup mode at 0000000100000113000000EA
    2016:11:25-11:24:13 dcexresort-master-2 repctl[4225]: [c] standby_clone(837): sync aborted
    2016:11:25-11:24:13 dcexresort-master-2 repctl[4225]: [e] prepare_secondary(293): clone failed
    2016:11:25-11:24:14 dcexresort-master-2 repctl[4225]: [i] start_backup_mode(643): starting backup mode at 0000000100000113000000EC
    2016:11:25-11:24:14 dcexresort-master-2 ha_daemon[4178]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 51 14.599" name="HA control: cmd = 'sync start 1 database'"
    2016:11:25-11:24:14 dcexresort-master-2 ha_daemon[4178]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 52 14.599" name="Activating sync process for database on node 1"
    2016:11:25-11:24:14 dcexresort-master-2 repctl[4225]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
    2016:11:25-11:24:14 dcexresort-master-2 repctl[4225]: [c] standby_clone(825): rsync failed on $VAR1 = {
    2016:11:25-11:24:15 dcexresort-master-2 repctl[4225]: [i] stop_backup_mode(664): stopped backup mode at 0000000100000113000000EC
    2016:11:25-11:24:15 dcexresort-master-2 repctl[4225]: [c] standby_clone(837): sync aborted
    2016:11:25-11:24:15 dcexresort-master-2 repctl[4225]: [e] prepare_secondary(293): clone failed
    2016:11:25-11:24:15 dcexresort-master-2 repctl[4225]: [c] prepare_secondary(305): failed to get database up, waiting for retry
    2016:11:25-11:24:15 dcexresort-master-2 repctl[4225]: [i] setup_replication(229): checkinterval 300
    2016:11:25-11:24:24 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 960 24.841" name="filesync_end(): initial sync failed, status = 0x200"
    2016:11:25-11:24:24 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 961 24.841" name="Clear syncing.files for node 2"

    2016:11:25-11:29:09 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 962 09.622" name="Set syncing.files for node 2"
    2016:11:25-11:29:56 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 963 56.346" name="filesync_end(): initial sync failed, status = 0x200"
    2016:11:25-11:29:56 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 964 56.346" name="Clear syncing.files for node 2"
  • Hi Sebastian,

    - recreate a basic configuration on the SLAVE and HA for Node 2

    That's wrong; you have configured two masters. Just reset the appliance, plug in the HA cable, and start the slave. By default the SG appliance is configured for automatic HA configuration.

    Regards

    Mod

  • Hi everybody,

     

    Thank you for your help and comments.

    I solved my problem.

    In fact it was not an HA problem but a corrupted file system.

    In my first post, the first screenshot shows a PostgreSQL error and a password error; that's not normal.
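
    For anyone hitting the same symptom: the repeated "rsync: change_dir '/var/storage/pgsql92/data/global' failed: No such file or directory" lines in the log above suggested the slave's PostgreSQL data directory was missing or unreadable. A minimal, hedged diagnostic sketch (generic Linux commands run from the slave's shell; only the path comes from the log, the rest is an assumption about what is worth checking) could be:

    # Does the data directory rsync could not enter actually exist?
    ls -ld /var/storage/pgsql92/data /var/storage/pgsql92/data/global
    # Is /var/storage mounted and does it still have free space?
    df -h /var/storage
    # Did the kernel report disk or filesystem errors after the power incident?
    dmesg | grep -i -E 'i/o error|ext[0-9]|read-only'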

    After reading some posts, I determined that I needed to reinstall the UTM.

    https://community.sophos.com/kb/hu-hu/115879

     

    My process:

    - Download the UTM image from Sophos: http://www.sophos.com/en-us/support/utm-downloads.aspx

    - I tried to install from a USB boot, but it didn't work for me despite following some tutorials (see the note after this list)

    - Only booting from a USB CD-ROM drive worked for me

    - After the reinstall there were no more PostgreSQL errors

    - Plugged in the slave and all the network cables

    - Started the slave

    - Recreated the HA from the master in automatic mode

    - And the SLAVE status went to READY after a few minutes
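
    (Note on the USB boot step: the usual way to prepare a USB stick on a Linux workstation is a raw copy of the downloaded ISO with dd, shown here only as a hedged sketch; the ISO filename and /dev/sdX are placeholders, not values from this thread, and in my case it still did not boot, so the USB CD-ROM drive was the fallback.)

    # Placeholder names: replace utm-installer.iso and /dev/sdX with the real ones.
    # WARNING: dd overwrites the target device, double-check it first.
    sudo dd if=utm-installer.iso of=/dev/sdX bs=4M
    sync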

     

    Best regards
