Hello,
We're having some trouble with our SG230 cluster.
We have two SG230s in two different datacenters, in an active/passive HA configuration.
Everything was fine for nearly 300 days, but last week there was an electrical incident in the SLAVE's datacenter.
The SG230 MASTER in the other datacenter stayed online, with no impact on production.
We started the SLAVE, but it shut down by itself after several minutes, and its status stayed DEAD on the web console HA Status page.
After reading about some similar cases, we decided to:
- delete the HA configuration on the master
- factory reset the SLAVE
- recreate the HA configuration on the master
- recreate the basic configuration on the SLAVE and set up HA for Node2
But the SLAVE status stays at SYNCING.
I noticed a few errors during the slave's boot, but I don't know whether they are critical.
And here is the HA live log.
I'm not sure how to identify the relevant information in it, or what I need to do next.
Thanks in advance
Best regards
2016:11:24-12:19:44 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 41 44.752" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
2016:11:24-12:19:44 dcexresort-master-1 ha_mode[21139]: calling topology_changed
2016:11:24-12:19:44 dcexresort-master-1 ha_mode[21139]: topology_changed: waiting for last ha_mode done
2016:11:24-12:19:44 dcexresort-master-1 ha_mode[21139]: daemonized...
2016:11:24-12:19:44 dcexresort-master-1 repctl[21160]: [i] daemonize_check(1362): trying to signal daemon
2016:11:24-12:19:45 dcexresort-master-1 ha_mode[21139]: topology_changed done (started at 12:19:44)
2016:11:24-12:19:45 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 42 45.042" name="Reading cluster configuration"
2016:11:24-12:19:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 43 50.622" name="Set syncing.files for node 2"
2016:11:24-12:19:58 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 44 58.686" name="Node 2 changed state: SYNCING(2) -> SYNCING(3)"
2016:11:24-12:20:00 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 45 00.172" name="Monitoring interfaces for link beat: eth4 eth1 eth0"
2016:11:24-12:24:59 dcexresort-master-2 repctl[4237]: [i] start_backup_mode(643): starting backup mode at 000000010000011200000082
2016:11:24-12:24:59 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 62 59.911" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-12:24:59 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 63 59.911" name="Activating sync process for database on node 1"
2016:11:24-12:24:59 dcexresort-master-2 repctl[4237]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-12:24:59 dcexresort-master-2 repctl[4237]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-12:25:01 dcexresort-master-2 repctl[4237]: [i] stop_backup_mode(664): stopped backup mode at 000000010000011200000082
2016:11:24-12:25:01 dcexresort-master-2 repctl[4237]: [c] standby_clone(837): sync aborted
2016:11:24-12:25:01 dcexresort-master-2 repctl[4237]: [e] prepare_secondary(293): clone failed
2016:11:24-12:25:01 dcexresort-master-2 repctl[4237]: [c] prepare_secondary(305): failed to get database up, waiting for retry
2016:11:24-12:25:01 dcexresort-master-2 repctl[4237]: [i] setup_replication(229): checkinterval 300
2016:11:24-12:25:01 dcexresort-master-2 repctl[4237]: [i] setup_replication(229): checkinterval 300
2016:11:24-12:29:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 51 50.622" name="Set syncing.files for node 2"
2016:11:24-12:30:33 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 52 33.412" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-12:30:33 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 53 33.412" name="Clear syncing.files for node 2"
2016:11:24-12:34:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 54 50.622" name="Set syncing.files for node 2"
2016:11:24-12:35:11 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 55 11.733" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-12:35:11 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 56 11.733" name="Clear syncing.files for node 2"
2016:11:24-12:39:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 57 50.622" name="Set syncing.files for node 2"
2016:11:24-12:40:09 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 58 09.601" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-12:40:09 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 59 09.601" name="Clear syncing.files for node 2"
2016:11:24-12:44:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 60 50.622" name="Set syncing.files for node 2"
2016:11:24-12:45:22 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 61 22.984" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-12:45:22 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 62 22.984" name="Clear syncing.files for node 2"
2016:11:24-12:50:16 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 64 16.155" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-12:50:16 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 65 16.155" name="Clear syncing.files for node 2"
2016:11:24-12:54:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 66 50.622" name="Set syncing.files for node 2"
2016:11:24-12:55:17 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 67 17.758" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-12:55:17 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 68 17.758" name="Clear syncing.files for node 2"
ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 69 50.622" name="Set syncing.files for node 2"
2016:11:24-13:00:39 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 70 39.097" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-13:00:39 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 71 39.097" name="Clear syncing.files for node 2"
2016:11:24-13:04:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 72 50.622" name="Set syncing.files for node 2"
2016:11:24-13:05:05 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 73 05.081" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-13:05:05 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 74 05.081" name="Clear syncing.files for node 2"
2016:11:24-13:09:44 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 75 44.622" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip 198.19.250.2"
2016:11:24-13:09:44 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 76 44.697" name="Executing (nowait) /etc/init.d/ha_mode check"
2016:11:24-13:09:44 dcexresort-master-1 ha_mode[1708]: calling check
2016:11:24-13:09:44 dcexresort-master-1 ha_mode[1708]: check: waiting for last ha_mode done
2016:11:24-13:09:44 dcexresort-master-1 ha_mode[1708]: check_ha() role=MASTER, status=UNLINKED
2016:11:24-13:09:44 dcexresort-master-1 ha_mode[1708]: check done (started at 13:09:44)
2016:11:24-13:09:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 77 50.622" name="Set syncing.files for node 2"
2016:11:24-13:10:04 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 78 04.331" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-13:10:04 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 79 04.331" name="Clear syncing.files for node 2"
2016:11:24-13:14:44 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 64 44.586" name="Executing (wait) /usr/local/bin/confd-setha mode slave"
2016:11:24-13:14:44 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 65 44.740" name="Executing (nowait) /etc/init.d/ha_mode check"
2016:11:24-13:14:44 dcexresort-master-2 ha_mode[12353]: calling check
2016:11:24-13:14:44 dcexresort-master-2 ha_mode[12353]: check: waiting for last ha_mode done
2016:11:24-13:14:44 dcexresort-master-2 ha_mode[12353]: check_ha() role=SLAVE, status=SYNCING
2016:11:24-13:14:44 dcexresort-master-2 repctl[4237]: [i] execute(1627): pg_ctl: PID file "/var/storage/pgsql92/data/postmaster.pid" does not exist
2016:11:24-13:14:44 dcexresort-master-2 repctl[4237]: [i] execute(1627): Is server running?
2016:11:24-13:14:44 dcexresort-master-2 repctl[4237]: [i] execute(1627): starting server anyway
2016:11:24-13:14:44 dcexresort-master-2 repctl[4237]: [i] execute(1627): pg_ctl: could not read file "/var/storage/pgsql92/data/postmaster.opts"
2016:11:24-13:14:47 dcexresort-master-2 ha_mode[12353]: daemonized...
2016:11:24-13:14:47 dcexresort-master-2 repctl[12375]: [i] execute(1627): pg_ctl: no server running
2016:11:24-13:14:47 dcexresort-master-2 ha_mode[12353]: HA SELFMON WARN: Restarting repctl for SLAVE(SYNCING)
2016:11:24-13:14:47 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 66 47.928" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:14:47 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 67 47.928" name="Activating sync process for database on node 1"
2016:11:24-13:14:47 dcexresort-master-2 repctl[12375]: [i] execute(1627): pg_ctl: PID file "/var/storage/pgsql92/data/postmaster.pid" does not exist
2016:11:24-13:14:47 dcexresort-master-2 repctl[12375]: [i] execute(1627): Is server running?
2016:11:24-13:14:47 dcexresort-master-2 ha_mode[12353]: check done (started at 13:14:44)
2016:11:24-13:14:48 dcexresort-master-2 repctl[12375]: [i] start_backup_mode(643): starting backup mode at 000000010000011200000085
2016:11:24-13:14:48 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 68 48.475" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:14:48 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 69 48.475" name="Activating sync process for database on node 1"
2016:11:24-13:14:48 dcexresort-master-2 repctl[12375]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-13:14:48 dcexresort-master-2 repctl[12375]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-13:14:49 dcexresort-master-2 repctl[12375]: [i] stop_backup_mode(664): stopped backup mode at 000000010000011200000085
2016:11:24-13:14:49 dcexresort-master-2 repctl[12375]: [c] standby_clone(837): sync aborted
2016:11:24-13:14:49 dcexresort-master-2 repctl[12375]: [e] prepare_secondary(293): clone failed
2016:11:24-13:14:49 dcexresort-master-2 repctl[12375]: [i] start_backup_mode(643): starting backup mode at 000000010000011200000087
2016:11:24-13:14:49 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 70 49.809" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:14:49 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 71 49.809" name="Activating sync process for database on node 1"
2016:11:24-13:14:49 dcexresort-master-2 repctl[12375]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-13:14:49 dcexresort-master-2 repctl[12375]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-13:14:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 80 50.622" name="Set syncing.files for node 2"
2016:11:24-13:14:50 dcexresort-master-2 repctl[12375]: [i] stop_backup_mode(664): stopped backup mode at 000000010000011200000087
2016:11:24-13:14:50 dcexresort-master-2 repctl[12375]: [c] standby_clone(837): sync aborted
2016:11:24-13:14:50 dcexresort-master-2 repctl[12375]: [e] prepare_secondary(293): clone failed
2016:11:24-13:14:51 dcexresort-master-2 repctl[12375]: [i] start_backup_mode(643): starting backup mode at 000000010000011200000089
2016:11:24-13:14:51 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 72 51.072" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:14:51 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 73 51.072" name="Activating sync process for database on node 1"
2016:11:24-13:14:51 dcexresort-master-2 repctl[12375]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-13:14:51 dcexresort-master-2 repctl[12375]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-13:14:52 dcexresort-master-2 repctl[12375]: [i] stop_backup_mode(664): stopped backup mode at 000000010000011200000089
2016:11:24-13:14:52 dcexresort-master-2 repctl[12375]: [c] standby_clone(837): sync aborted
2016:11:24-13:14:52 dcexresort-master-2 repctl[12375]: [e] prepare_secondary(293): clone failed
2016:11:24-13:14:52 dcexresort-master-2 repctl[12375]: [c] prepare_secondary(305): failed to get database up, waiting for retry
2016:11:24-13:14:52 dcexresort-master-2 repctl[12375]: [i] setup_replication(229): checkinterval 300
2016:11:24-13:15:34 dcexresort-master-1 ha_daemon[11985]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 81 34.499" name="filesync_end(): initial sync failed, status = 0x200"
2016:11:24-13:15:34 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 82 34.499" name="Clear syncing.files for node 2"
2016:11:24-13:15:57 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 74 57.413" name="Monitoring interfaces for link beat: eth4 eth1 eth0"
2016:11:24-13:16:02 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 75 02.652" name="Monitoring interfaces for link beat: eth4 eth1 eth0"
2016:11:24-13:16:07 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 83 07.736" name="Monitoring interfaces for link beat: eth4 eth1 eth0"
2016:11:24-13:19:50 dcexresort-master-1 ha_daemon[11985]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 84 50.622" name="Set syncing.files for node 2"
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [e] db_connect(2058): error while connecting to database(DBI:Pg:dbname=repmgr): could not connect to server: No such file or directory
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [c] local_connection(1946): cannot connect to local database: could not connect to server: No such file or directory
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [w] recheck(1030): re-initialising replication
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [i] execute(1627): pg_ctl: no server running
2016:11:24-13:19:52 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 76 52.201" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:19:52 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 77 52.201" name="Activating sync process for database on node 1"
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [i] execute(1627): pg_ctl: PID file "/var/storage/pgsql92/data/postmaster.pid" does not exist
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [i] start_backup_mode(643): starting backup mode at 00000001000001120000008B
2016:11:24-13:19:52 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 78 52.776" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:19:52 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 79 52.776" name="Activating sync process for database on node 1"
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-13:19:52 dcexresort-master-2 repctl[12375]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-13:19:53 dcexresort-master-2 repctl[12375]: [i] stop_backup_mode(664): stopped backup mode at 00000001000001120000008B
2016:11:24-13:19:53 dcexresort-master-2 repctl[12375]: [c] standby_clone(837): sync aborted
2016:11:24-13:19:53 dcexresort-master-2 repctl[12375]: [e] prepare_secondary(293): clone failed
2016:11:24-13:19:54 dcexresort-master-2 repctl[12375]: [i] start_backup_mode(643): starting backup mode at 00000001000001120000008D
2016:11:24-13:19:54 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 80 54.116" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:19:54 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 81 54.116" name="Activating sync process for database on node 1"
2016:11:24-13:19:54 dcexresort-master-2 repctl[12375]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-13:19:54 dcexresort-master-2 repctl[12375]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-13:19:55 dcexresort-master-2 repctl[12375]: [i] stop_backup_mode(664): stopped backup mode at 00000001000001120000008D
2016:11:24-13:19:55 dcexresort-master-2 repctl[12375]: [c] standby_clone(837): sync aborted
2016:11:24-13:19:55 dcexresort-master-2 repctl[12375]: [e] prepare_secondary(293): clone failed
2016:11:24-13:19:55 dcexresort-master-2 repctl[12375]: [i] start_backup_mode(643): starting backup mode at 00000001000001120000008F
2016:11:24-13:19:55 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 82 55.959" name="HA control: cmd = 'sync start 1 database'"
2016:11:24-13:19:55 dcexresort-master-2 ha_daemon[4189]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 83 55.959" name="Activating sync process for database on node 1"
2016:11:24-13:19:56 dcexresort-master-2 repctl[12375]: [i] execute(1627): rsync: change_dir#3 "/var/storage/pgsql92/data/global" failed: No such file or directory (2)
2016:11:24-13:19:56 dcexresort-master-2 repctl[12375]: [c] standby_clone(825): rsync failed on $VAR1 = {
2016:11:24-13:19:57 dcexresort-master-2 repctl[12375]: [i] stop_backup_mode(664): stopped backup mode at 00000001000001120000008F
2016:11:24-13:19:57 dcexresort-master-2 repctl[12375]: [c] standby_clone(837): sync aborted
2016:11:24-13:19:57 dcexresort-master-2 repctl[12375]: [e] prepare_secondary(293): clone failed
2016:11:24-13:19:57 dcexresort-master-2 repctl[12375]: [c] prepare_secondary(305): failed to get database up, waiting for retry
2016:11:24-13:19:57 dcexresort-master-2 repctl[12375]: [i] setup_replication(229): checkinterval 300
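To pick out the relevant entries from a log like the one above, a small filter can separate the routine messages from the failures. The following is a minimal Python sketch, not a Sophos tool. The function name `summarize_problems` is hypothetical, and reading the repctl level letters as `[c]` critical, `[e]` error, `[w]` warning and `[i]` info is an assumption based only on how they appear in this log. Because the rsync failure is logged at `[i]` level, the sketch also flags any repctl line containing "failed".

```python
import re
from collections import Counter

# "<timestamp> <host> <process>[<pid>]: <message>", as seen in the log above.
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) '
    r'(?P<host>\S+) (?P<proc>[\w-]+)\[\d+\]: (?P<msg>.*)$'
)
# repctl messages look like "[c] standby_clone(837): sync aborted".
REPCTL_RE = re.compile(r'^\[(?P<level>[a-z])\] \w+\(\d+\): (?P<text>.*)$')
# ha_daemon messages carry key="value" pairs, including severity and name.
SEVERITY_RE = re.compile(r'severity="(?P<sev>\w+)"')
NAME_RE = re.compile(r'name="(?P<name>.*)"$')


def summarize_problems(lines):
    """Count non-routine messages per (host, process, level, text)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines without the timestamp/host prefix
        host, proc, msg = m['host'], m['proc'], m['msg']
        if proc == 'repctl':
            r = REPCTL_RE.match(msg)
            # Assumption: anything other than [i] is a problem, and an
            # [i] line that mentions "failed" is worth seeing as well.
            if r and (r['level'] != 'i' or 'failed' in r['text']):
                counts[(host, proc, r['level'], r['text'])] += 1
        else:
            s = SEVERITY_RE.search(msg)
            if s and s['sev'] != 'info':
                n = NAME_RE.search(msg)
                counts[(host, proc, s['sev'], n['name'] if n else msg)] += 1
    return counts


if __name__ == '__main__':
    demo = [
        '2016:11:24-12:25:01 dcexresort-master-2 repctl[4237]: '
        '[c] standby_clone(837): sync aborted',
        '2016:11:24-12:19:44 dcexresort-master-1 ha_mode[21139]: daemonized...',
    ]
    for key, n in summarize_problems(demo).most_common():
        print(n, *key)
```

Run over the full log, a summary like this groups the repeated retries into a handful of distinct messages, which makes the first failing step in each cycle (the rsync `change_dir` error on the database directory) easier to spot than in the raw output.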