Last weekend, I migrated part of our setup to an ESXi 5 virtualization solution. Before that, we used two ASG 220 appliances in HA failover mode to achieve the exact same goal: a high-availability VPN gateway.
Since the migration, I have been seeing some strange behavior from the HA system. To give you a bit more insight, here are parts of today's HA logs:
2011:12:12-06:03:38 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 2! Expected 29082 but got 29083"
2011:12:12-06:05:44 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 2! Expected 29207 but got 29208"
2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Initializing tinyproxy ...
2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Reloading config file
2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Listening on IP 0.0.0.0
2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Setting "Via" header to 'Astaro HA Proxy'
2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Adding Port [443] to the list allowed by CONNECT
2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Now running as group "nogroup".
2011:12:12-06:05:56 vpn-2 ha_proxy[32048]: Now running as user "nobody".
2011:12:12-06:05:56 vpn-2 ha_proxy[32051]: Creating child number 1 of 1 ...
2011:12:12-06:05:56 vpn-2 ha_proxy[32051]: Finished creating all children.
2011:12:12-06:05:56 vpn-2 ha_proxy[32051]: Setting the various signals.
2011:12:12-06:05:56 vpn-2 ha_proxy[32051]: Starting main loop. Accepting connections.
2011:12:12-06:05:56 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Another master around!"
2011:12:12-06:05:56 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="Node 1 joined with version 8.203"
2011:12:12-06:05:56 vpn-2 ha_daemon[6211]: id="38C0" severity="info" sys="System" sub="ha" name="Node 1 is alive!"
2011:12:12-06:05:56 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="Node 1 changed state: DEAD -> ACTIVE"
2011:12:12-06:05:56 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="Node 1 changed mode: SLAVE -> MASTER"
2011:12:12-06:05:56 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Received backup heartbeats from master node!"
2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Another master around!"
2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 2! Expected 29215 but got 29221"
2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Node 2 changed mode: SLAVE -> MASTER"
2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Resending gratuitous arp"
2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Received backup heartbeats from master node!"
2011:12:12-06:05:57 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Another master around!"
2011:12:12-06:05:57 vpn-2 ha_daemon[6211]: id="38Bb" severity="info" sys="System" sub="ha" name="Going slave mode in favour of node 1 (-29303 sec)"
2011:12:12-06:05:57 vpn-2 ha_daemon[6211]: id="38B1" severity="info" sys="System" sub="ha" name="Switching to Slave mode"
2011:12:12-06:05:57 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="cluster mode: set master id to 1"
2011:12:12-06:05:58 vpn-2 slon_control[6358]: Killing slon pop3 [8753]
2011:12:12-06:05:58 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Another master around!"
2011:12:12-06:05:58 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Resending gratuitous arp"
2011:12:12-06:05:58 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Activating sync process for database on node 2"
2011:12:12-06:05:59 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Node 2 changed mode: MASTER -> SLAVE"
2011:12:12-06:06:01 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Received no backup heartbeats from master node!"
2011:12:12-06:06:02 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 1! Expected 30373 but got 30374"
2011:12:12-06:06:02 vpn-2 slon_control[6358]: Killing slon epp [8754]
2011:12:12-06:06:02 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="Reading cluster configuration"
2011:12:12-06:06:02 vpn-2 ha_proxy[32051]: Shutting down.
2011:12:12-06:06:02 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Received backup heartbeats from master node!"
2011:12:12-06:06:04 vpn-2 slon_control[6358]: Starting controlled switchover from Node 1 to 2
2011:12:12-06:06:04 vpn-2 slon_control[6358]: Slonik error, process exited with value 255
2011:12:12-06:06:04 vpn-2 slon_control[6358]: Starting controlled switchover from Node 1 to 2
2011:12:12-06:06:04 vpn-2 slon_control[6358]: Slonik error, process exited with value 255
2011:12:12-06:06:04 vpn-2 slon_control[6358]: Starting controlled switchover from Node 1 to 2
2011:12:12-06:06:04 vpn-2 slon_control[6358]: Slonik error, process exited with value 255
2011:12:12-06:06:04 vpn-2 slon_control[6358]: Started slon process 32671 for reporting
2011:12:12-06:06:04 vpn-2 slon_control[6358]: Started slon process 32672 for pop3
2011:12:12-06:06:04 vpn-2 slon_control[6358]: Started slon process 32673 for epp
2011:12:12-06:06:05 vpn-2 slon_control[6358]: Set mode to SLAVE
2011:12:12-06:06:05 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="Deactivating sync process for database on node 1"
2011:12:12-06:06:05 vpn-2 slon_control[6358]: Skipping slony cleanup for reporting
2011:12:12-06:06:05 vpn-2 slon_control[6358]: Skipping slony cleanup for pop3
2011:12:12-06:06:05 vpn-2 slon_control[6358]: Skipping slony cleanup for epp
2011:12:12-06:06:10 vpn-1 ha_daemon[6169]: id="38C1" severity="info" sys="System" sub="ha" name="Node 2 is dead, received no heart beats!"
2011:12:12-06:06:14 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Node 2 joined with version 8.203"
2011:12:12-06:06:14 vpn-1 ha_daemon[6169]: id="38C0" severity="info" sys="System" sub="ha" name="Node 2 is alive!"
2011:12:12-06:06:14 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Node 2 changed state: DEAD -> ACTIVE"
2011:12:12-06:06:15 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 1! Expected 30381 but got 30387"
2011:12:12-06:06:34 vpn-2 ha_daemon[6211]: id="38A0" severity="info" sys="System" sub="ha" name="Monitoring interfaces for link beat: eth0 eth1 eth3 "
2011:12:12-06:11:30 vpn-1 slon_control[6309]: Initial synchronization for node 2 finished!
2011:12:12-06:11:30 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Deactivating sync process for database on node 2"
If I am not misinterpreting something, the master node sometimes fails to receive a heartbeat from the slave, even though the two nodes are directly connected through a vSwitch (port: eth2). Has anyone else had these issues with virtual ASGs running on the same physical host?
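To make the heartbeat gaps easier to quantify, here is a minimal Python sketch that pulls the "Lost heartbeat" warnings out of the HA log and reports how many sequence numbers were skipped each time. It assumes the heartbeat counter increases by one per message (my reading of the "Expected X but got Y" text, not something I have confirmed), and the names `LOST_RE` and `lost_heartbeats` are made up for this example.

```python
import re

# Pattern for the "Lost heartbeat" warnings exactly as they appear in the log above.
LOST_RE = re.compile(
    r'^(?P<ts>\S+) (?P<host>\S+) ha_daemon\[\d+\]: .*'
    r'name="Lost heartbeat message from node (?P<node>\d+)! '
    r'Expected (?P<expected>\d+) but got (?P<got>\d+)"'
)


def lost_heartbeats(lines):
    """Yield (timestamp, reporting_host, sending_node, skipped) per warning.

    'skipped' is got - expected, i.e. the number of sequence numbers that
    never arrived, assuming the counter increments by one per heartbeat.
    """
    for line in lines:
        m = LOST_RE.match(line)
        if m:
            skipped = int(m.group("got")) - int(m.group("expected"))
            yield m.group("ts"), m.group("host"), int(m.group("node")), skipped


# A few lines copied from the log above; the third one is not a heartbeat warning.
sample = [
    '2011:12:12-06:03:38 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 2! Expected 29082 but got 29083"',
    '2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 2! Expected 29215 but got 29221"',
    '2011:12:12-06:05:57 vpn-1 ha_daemon[6169]: id="38A0" severity="info" sys="System" sub="ha" name="Resending gratuitous arp"',
    '2011:12:12-06:06:15 vpn-2 ha_daemon[6211]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost heartbeat message from node 1! Expected 30381 but got 30387"',
]

results = list(lost_heartbeats(sample))
for ts, host, node, skipped in results:
    print(f"{ts} {host}: {skipped} heartbeat(s) from node {node} skipped")
```

Run against the full log, this should show whether the gaps are isolated single drops or bursts of several heartbeats, such as the ones right around the split-brain at 06:05:57.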
I am thinking about using the "cc set ha advanced virtual_mac 0" trick, but I fear that without a "clean" failover, both nodes will announce themselves as the default gateway with two different MAC addresses, and that would mess up a lot of things.
Any idea how to fix this? Thanks for helping me out.
Cheers,
Manuel