
Up2Date stuck on slave HA

SG 310 

 

Here is what the live log says....

 

2019:08:16-09:17:29 epi-colo-fw01-2 repctl[6582]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2019:08:16-09:22:29 epi-colo-fw01-2 repctl[6582]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2019:08:16-09:27:29 epi-colo-fw01-2 repctl[6582]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2019:08:16-09:32:29 epi-colo-fw01-2 repctl[6582]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2019:08:16-09:37:29 epi-colo-fw01-2 repctl[6582]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2019:08:16-09:42:29 epi-colo-fw01-2 repctl[6582]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2019:08:16-09:47:29 epi-colo-fw01-2 repctl[6582]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2019:08:16-09:52:29 epi-colo-fw01-2 repctl[6582]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2019:08:16-09:57:29 epi-colo-fw01-2 repctl[6582]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2019:08:16-10:02:29 epi-colo-fw01-2 repctl[6582]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2019:08:16-10:07:30 epi-colo-fw01-2 repctl[6582]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2019:08:16-10:12:30 epi-colo-fw01-2 repctl[6582]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
 
 
I was thinking of running auisys.plx --upto 9.605-1.
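For reference, this is roughly what that looks like from the node's shell. It is only a sketch, assuming SSH shell access is enabled (log in as loginuser, then su - to root); the auisys.plx path and the package directory below are assumptions on my part, not something confirmed in this thread:

# run on the stuck node as root -- sketch only
ls -l /var/up2date/sys/                       # downloaded system Up2Date packages (path assumed)
/usr/local/bin/auisys.plx --upto 9.605-1      # apply system Up2Dates up to the quoted version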


  • And looking back at it again, I see the slave is marked as a Worker.

  • 1. I would try to reboot the slave (a shell sketch follows below).

    2. Did you switch to an active/active cluster within the HA config?
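
    A minimal sketch of the reboot from the shell, assuming you have shell access on the master and that ha_utils ssh is available to reach the other node over the HA link (both are assumptions, not confirmed here):

    ha_utils ssh    # open a shell on the slave/worker node over the HA sync interface
    # ...then, at the slave's prompt:
    reboot          # reboots only that node; the master keeps handling traffic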

     

    Dirk

    Systema Gesellschaft für angewandte Datentechnik mbH  // Sophos Platinum Partner
    Sophos Solution Partner since 2003
    If a post solves your question, click the 'Verify Answer' link at this post.

  • Hello Dirk,

     

    I am going to reboot the stuck node at some point this week and I will follow up with my findings.

     

    Thanks for the reply. 

  • And we are ACTIVE ACTIVE! 

     

    The reboot did the trick! 

     

    Thanks 

  • OK, great.
    Is active/active planned, or was it configured by mistake?
    With active/active you have double the licensing costs.


    Dirk

    Systema Gesellschaft für angewandte Datentechnik mbH  // Sophos Platinum Partner
    Sophos Solution Partner since 2003
    If a post solves your question, click the 'Verify Answer' link at this post.

  • Active Active was planned. 

     

    That's how it was before the Up2Date.

  • Having the same issue on my SG 125s: after running an Up2Date procedure last night, I can't get my slave to sync to the same Up2Date version as the master. It's been stuck for 12+ hours (ever since the Up2Date task). The Master is at 9.700-5, while the Slave is at 9.605-1.

    I have rebooted both the Master and the Slave with no effect. Not sure what to try next (see the shell sketch after the log below).

    Config:

    Log:

    2019:11:30-16:16:21 ddpnet-2 ha_daemon[3970]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 62 21.059" name="Executing (wait) /usr/local/bin/confd-setha mode slave master_ip '' slave_ip 198.19.250.2"
    2019:11:30-16:16:21 ddpnet-2 ha_daemon[3970]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 63 21.375" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
    2019:11:30-16:16:21 ddpnet-2 ha_mode[11673]: calling topology_changed
    2019:11:30-16:16:21 ddpnet-2 ha_mode[11673]: topology_changed: waiting for last ha_mode done
    2019:11:30-16:16:21 ddpnet-2 ha_mode[11673]: topology_changed done (started at 16:16:21)
    2019:11:30-16:16:21 ddpnet-2 ha_daemon[3970]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 64 21.974" name="Reading cluster configuration"
    2019:11:30-16:18:45 ddpnet-1 ha_daemon[3976]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 33 45.190" name="Monitoring interfaces for link beat: eth1 eth0"
    2019:11:30-16:18:45 ddpnet-1 ha_daemon[3976]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 34 45.190" name="Netlink: Lost link beat on eth1!"
    2019:11:30-16:18:45 ddpnet-1 ha_daemon[3976]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 35 45.191" name="Netlink: Found link beat on eth1 again!"
    2019:11:30-16:19:44 ddpnet-1 ha_daemon[3976]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 36 44.017" name="Monitoring interfaces for link beat: eth1 eth0"
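
    To compare what each node actually has installed, a minimal shell sketch; /etc/version and ha_utils ssh are assumptions here, not anything confirmed in this thread:

    cat /etc/version    # firmware version of the node you are logged in to
    ha_utils ssh        # hop to the peer node over the HA sync link
    cat /etc/version    # compare; after a successful Up2Date both nodes should match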
  • Maybe factory reset the slave from the front panel?

     

  • With the SG 1xx there is no front panel.

    But deleting the device from the HA config should factory-reset it too.


    Dirk

    Systema Gesellschaft für angewandte Datentechnik mbH  // Sophos Platinum Partner
    Sophos Solution Partner since 2003
    If a post solves your question, click the 'Verify Answer' link at this post.

  • So I ended up finding a solution. Here's what I did to work around the problem and ultimately fix it.

    1. Shut down the Master ASG device (the one with the current Up2Date) through WebAdmin. This forced the Slave device with the older Up2Date to take over as the primary (only) LAN/WAN gateway/firewall at the same IP address.

    2. Logged into WebAdmin, which was now the ASG Slave box. I had to wait about 7-10 minutes before I was able to log in, for some reason. My pings to 192.168.0.1 were dead until then. Maybe it took that long for the Slave to become the stud duck.

    3. Once logged in, I did a manual Up2Date to the latest version. Interestingly, on the Dashboard it said the latest version was ready to upload, but when I clicked that link to take me to the Up2Date page to install it, the Up2Date page said I was current at the older version. Again, I had to wait about 10 minutes until things caught up and the Up2Date page finally showed the latest Up2Date to install. Installed it and rebooted.

    4. Logged back into the Slave and confirmed it was now on the latest Up2Date version. Network functioning okay, but obviously HA was broken. Logged out.

    5. Restarted the Master ASG manually on the device with the power button. Took about 10 minutes before I could log back into the ASG, which was now the Master and in ACTIVE status.

    6. In the WebAdmin HA status it showed the Master as primary and the Slave as backup. The status on the backup was that it was syncing. Took about 15 minutes before syncing finished and it went to READY status, fully synced (a quick check is sketched after this list).

    7. Interestingly, it appears the HA had been boogered up since perhaps August, because when I shut down the Master ASG and logged into the Slave ASG, the mail manager queue was full of quarantined mail from August that was waiting to be sorted through by a human. Obviously this had already been done when the Master was in charge. So I think things have been screwball since August. But I'm sure I've done an Up2Date since then. Hmm. May need to look at more logs to figure out why the Up2Date failed in HA to begin with.
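
    As a quick check on step 6, a sketch for watching the sync finish from the shell; the log file path is an assumption on my part (the same messages appear in WebAdmin's High Availability live log):

    tail -f /var/log/ha.log    # watch the backup node go from SYNCING to READY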