Flapping HA 8.203 (VMware vSphere)

Question

Hi there!

Last weekend, I migrated part of our setup to an ESXi 5 virtualization solution. Before that, we used two ASG 220s in HA failover mode to achieve the exact same goal: a high-availability VPN gateway.
Since this migration, I have been seeing some strange behavior with the HA system. To give you a bit more insight, here are parts of today's HA logs:

2011:12:12-06:03:38 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 2! Expected 29082 but got 29083"

2011:12:12-06:05:44 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 2! Expected 29207 but got 29208"

2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Initializing tinyproxy ...

2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Reloading config file

2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Listening on IP 0.0.0.0

2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Setting "Via" header to 'Astaro HA Proxy'

2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Adding Port [443] to the list allowed by CONNECT

2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Now running as group "nogroup".

2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Now running as user "nobody".

2011:12:12-06:05:56 vpn-2 ha_proxy[32051]: Creating child number 1 of 1 ...

2011:12:12-06:05:56 vpn-2 ha_proxy[32051]: Finished creating all children.

2011:12:12-06:05:56 vpn-2 ha_proxy[32051]: Setting the various signals.

2011:12:12-06:05:56 vpn-2 ha_proxy[32051]: Starting main loop. Accepting connections.

2011:12:12-06:05:56 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Another master around!"

2011:12:12-06:05:56 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="Node 1 joined with version 8.203"

2011:12:12-06:05:56 vpn-2 ha_daemon[6211]: id="38C0" severity="info" sys="System" sub="ha" name="Node 1 is alive!"

2011:12:12-06:05:56 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="Node 1 changed state: DEAD -> ACTIVE"

2011:12:12-06:05:56 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="Node 1 changed mode: SLAVE -> MASTER"

2011:12:12-06:05:56 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Received backup heartbeats from master node!"

2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Another master around!"

2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 2! Expected 29215 but got 29221"

2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Node 2 changed mode: SLAVE -> MASTER"

2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Resending gratuitous arp"

2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Received backup heartbeats from master node!"

2011:12:12-06:05:57 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Another master around!"

2011:12:12-06:05:57 vpn-2 ha_daemon[6211]: id="38Bb" severity="info" sys="System" sub="ha" name="Going slave mode in favour of node 1 (-29303 sec)"

2011:12:12-06:05:57 vpn-2 ha_daemon[6211]: id="38B1" severity="info" sys="System" sub="ha" name="Switching to Slave mode"

2011:12:12-06:05:57 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="cluster mode: set master id to 1"

2011:12:12-06:05:58 vpn-2 slon_control[6358]: Killing slon pop3 [8753]

2011:12:12-06:05:58 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Another master around!"

2011:12:12-06:05:58 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Resending gratuitous arp"

2011:12:12-06:05:58 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Activating sync process for database on node 2"

2011:12:12-06:05:59 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Node 2 changed mode: MASTER -> SLAVE"

2011:12:12-06:06:01 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Received no backup heartbeats from master node!"

2011:12:12-06:06:02 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 1! Expected 30373 but got 30374"

2011:12:12-06:06:02 vpn-2 slon_control[6358]: Killing slon epp [8754]

2011:12:12-06:06:02 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="Reading cluster configuration"

2011:12:12-06:06:02 vpn-2 ha_proxy[32051]: Shutting down.

2011:12:12-06:06:02 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Received backup heartbeats from master node!"

2011:12:12-06:06:04 vpn-2 slon_control[6358]: Starting controlled switchover from Node 1 to 2

2011:12:12-06:06:04 vpn-2 slon_control[6358]: Slonik error, process exited with value 255

2011:12:12-06:06:04 vpn-2 slon_control[6358]: Starting controlled switchover from Node 1 to 2

2011:12:12-06:06:04 vpn-2 slon_control[6358]: Slonik error, process exited with value 255

2011:12:12-06:06:04 vpn-2 slon_control[6358]: Starting controlled switchover from Node 1 to 2

2011:12:12-06:06:04 vpn-2 slon_control[6358]: Slonik error, process exited with value 255

2011:12:12-06:06:04 vpn-2 slon_control[6358]: Started slon process 32671 for reporting

2011:12:12-06:06:04 vpn-2 slon_control[6358]: Started slon process 32672 for pop3

2011:12:12-06:06:04 vpn-2 slon_control[6358]: Started slon process 32673 for epp

2011:12:12-06:06:05 vpn-2 slon_control[6358]: Set mode to SLAVE

2011:12:12-06:06:05 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="Deactivating sync process for database on node 1"

2011:12:12-06:06:05 vpn-2 slon_control[6358]: Skipping slony cleanup for reporting

2011:12:12-06:06:05 vpn-2 slon_control[6358]: Skipping slony cleanup for pop3

2011:12:12-06:06:05 vpn-2 slon_control[6358]: Skipping slony cleanup for epp

2011:12:12-06:06:10 vpn-1 ha_daemon[6169]: id="38C1" severity="info" sys="System" sub="ha" name="Node 2 is dead, received no heart beats!"

2011:12:12-06:06:14 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Node 2 joined with version 8.203"

2011:12:12-06:06:14 vpn-1 ha_daemon[6169]: id="38C0" severity="info" sys="System" sub="ha" name="Node 2 is alive!"

2011:12:12-06:06:14 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Node 2 changed state: DEAD -> ACTIVE"

2011:12:12-06:06:15 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 1! Expected 30381 but got 30387"

2011:12:12-06:06:34 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="Monitoring interfaces for link beat: eth0 eth1 eth3 "

2011:12:12-06:11:30 vpn-1 slon_control[6309]: Initial synchronization for node 2 finished!

2011:12:12-06:11:30 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Deactivating sync process for database on node 2"

If I am not misinterpreting something, the master node sometimes fails to receive a heartbeat signal from the slave, even though they are directly connected with a vSwitch (port: eth2). Has anyone else had these issues with a virtual ASG on the same physical host?
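
To put a number on how often this happens, the "Lost heartbeat" warnings can be pulled out of the HA log with a small script. This is only a rough sketch: the regular expression is modeled on the lines pasted above, the function names (lost_heartbeats, summarize) are made up for illustration, and treating "got minus expected" as the number of missed heartbeats is my assumption about what the sequence numbers mean, not something confirmed by the documentation.

```python
import re
from collections import Counter

# Pattern modeled on the "Lost heartbeat" warnings in the log excerpt above.
LOST_RE = re.compile(
    r'^(?P<ts>\S+) (?P<host>\S+) ha_daemon\[\d+\]: .*'
    r'name="Lost heartbeat message from node (?P<node>\d+)! '
    r'Expected (?P<expected>\d+) but got (?P<got>\d+)"'
)


def lost_heartbeats(lines):
    """Return one (timestamp, reporting host, peer node, gap) tuple per warning."""
    events = []
    for line in lines:
        m = LOST_RE.match(line.strip())
        if m:
            # Assumption: the difference between the sequence numbers is the
            # number of heartbeats that went missing.
            gap = int(m["got"]) - int(m["expected"])
            events.append((m["ts"], m["host"], int(m["node"]), gap))
    return events


def summarize(events):
    """Map each reporting host to (number of warnings, sum of gaps)."""
    counts = Counter()
    gaps = Counter()
    for _ts, host, _node, gap in events:
        counts[host] += 1
        gaps[host] += gap
    return {host: (counts[host], gaps[host]) for host in counts}


# Four lines copied from the log excerpt above.
SAMPLE = [
    '2011:12:12-06:03:38 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 2! Expected 29082 but got 29083"',
    '2011:12:12-06:05:44 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 2! Expected 29207 but got 29208"',
    '2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 2! Expected 29215 but got 29221"',
    '2011:12:12-06:06:02 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 1! Expected 30373 but got 30374"',
]

events = lost_heartbeats(SAMPLE)
print(summarize(events))
```

Fed the whole log file instead of SAMPLE (for example via open(path) on an exported HA log), this should show whether the single-heartbeat losses are spread evenly over the day or cluster around the moments when both nodes claim to be master.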

I am thinking about using the "cc set ha advanced virtual_mac 0" trick, but I fear that when there is no "clean" failover, both nodes will announce themselves as the default gateway with two different MAC addresses, and that would mess up a lot of things...

Any ideas on how to fix that? Thanks for helping me out.

Cheers,
Manuel
