
HA Issues

Sophos UTM 9.5 X2

ESXi 6.5

I have two UTMs running on different servers, connected with a direct cable for HA. Last year I had to turn off one of the servers because the logs filled up with messages like the ones below.

I thought the second UTM might be broken, so this morning I wiped it, reinstalled from the ISO, and turned on HA. The errors started again the moment I turned on HA. Also, traffic on the HA interface is constantly at 500 to 1000 Mbit/s, which I don't think is normal. I had another set of UTMs in the company with the same setup, and they never had this issue. For HA, is it better to use a direct cable or a switch? Is there anything I need to change in ESXi to make this work better?

Below is a graph of the traffic on eth2, the HA interface, in Mbit/s.

2018:05:06-03:52:46 gateway-2 ha_daemon[7734]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   78 46.243" name="Activating sync process for database on node 1"
2018:05:06-03:51:47 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   53 47.064" name="Lost heartbeat message from node 2! Expected 550 but got 551"
2018:05:06-03:53:01 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   79 01.206" name="Lost heartbeat message from node 1! Expected 747 but got 748"
2018:05:06-03:53:18 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   80 18.371" name="Lost heartbeat message from node 1! Expected 764 but got 765"
2018:05:06-03:52:11 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   54 11.242" name="Lost heartbeat message from node 2! Expected 574 but got 575"
2018:05:06-03:53:22 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   81 22.273" name="Lost heartbeat message from node 1! Expected 768 but got 769"
2018:05:06-03:53:25 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   82 25.283" name="Lost heartbeat message from node 1! Expected 771 but got 772"
2018:05:06-03:52:25 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   55 25.287" name="Lost heartbeat message from node 2! Expected 588 but got 589"
2018:05:06-03:52:27 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   56 27.289" name="Lost heartbeat message from node 2! Expected 590 but got 591"
2018:05:06-03:53:37 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   83 37.397" name="Lost heartbeat message from node 1! Expected 783 but got 784"
2018:05:06-03:52:34 gateway-1 repctl[21098]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 1
2018:05:06-03:53:18 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   57 18.550" name="Lost heartbeat message from node 2! Expected 641 but got 642"
2018:05:06-03:54:26 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   84 26.304" name="Lost heartbeat message from node 1! Expected 832 but got 833"
2018:05:06-03:53:22 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   58 22.554" name="Lost heartbeat message from node 2! Expected 645 but got 646"
2018:05:06-03:54:30 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   85 30.297" name="Lost heartbeat message from node 1! Expected 836 but got 837"
2018:05:06-03:54:35 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   86 35.299" name="Lost heartbeat message from node 1! Expected 841 but got 842"
2018:05:06-03:53:35 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   59 35.592" name="Lost heartbeat message from node 2! Expected 658 but got 659"
2018:05:06-03:54:44 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   87 44.542" name="Lost heartbeat message from node 1! Expected 849 but got 851"
2018:05:06-03:53:40 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   88 40.491" name="Lost heartbeat message from node 1! Expected 854 but got 855"
2018:05:06-03:53:47 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   60 47.745" name="Lost heartbeat message from node 2! Expected 668 but got 671"
2018:05:06-03:53:49 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   61 49.748" name="Lost heartbeat message from node 2! Expected 672 but got 673"
2018:05:06-03:53:50 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   89 50.637" name="Lost heartbeat message from node 1! Expected 864 but got 865"
2018:05:06-03:53:53 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   90 53.462" name="Lost heartbeat message from node 1! Expected 867 but got 868"
2018:05:06-03:53:54 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   62 54.754" name="Lost heartbeat message from node 2! Expected 677 but got 678"
2018:05:06-03:53:54 gateway-1 ha_daemon[20000]: id="38A0" severity="info" sys="System" sub="ha" seq="M:   63 54.848" name="Reading cluster configuration"
2018:05:06-03:53:59 gateway-2 ha_daemon[7734]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   91 59.195" name="Reading cluster configuration"
2018:05:06-03:53:59 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   92 59.723" name="Lost heartbeat message from node 1! Expected 872 but got 874"
2018:05:06-03:54:00 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   64 00.787" name="Lost heartbeat message from node 2! Expected 683 but got 684"
2018:05:06-03:54:01 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   93 01.507" name="Lost heartbeat message from node 1! Expected 875 but got 876"
2018:05:06-03:54:02 gateway-1 ha_daemon[20000]: id="38A0" severity="info" sys="System" sub="ha" seq="M:   65 02.756" name="Reading cluster configuration"
2018:05:06-03:54:02 gateway-1 ha_daemon[20000]: id="38A0" severity="info" sys="System" sub="ha" seq="M:   66 02.756" name="Starting use of backup interface 'eth1'"
2018:05:06-03:54:04 gateway-2 ha_daemon[7734]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   94 04.904" name="Monitoring interfaces for link beat: eth1"
2018:05:06-03:54:05 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   95 05.473" name="Lost heartbeat message from node 1! Expected 879 but got 880"
2018:05:06-03:54:09 gateway-1 ha_daemon[20000]: id="38A0" severity="info" sys="System" sub="ha" seq="M:   67 09.934" name="Monitoring interfaces for link beat: eth1"
2018:05:06-03:54:15 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   68 15.858" name="Lost heartbeat message from node 2! Expected 698 but got 699"
2018:05:06-03:54:28 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   69 28.901" name="Lost heartbeat message from node 2! Expected 711 but got 712"
2018:05:06-03:54:31 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   96 31.733" name="Lost heartbeat message from node 1! Expected 905 but got 906"
2018:05:06-03:54:39 gateway-2 ha_daemon[7734]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   97 39.199" name="Reading cluster configuration"
2018:05:06-03:54:39 gateway-2 ha_daemon[7734]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   98 39.199" name="Starting use of backup interface 'eth1'"
2018:05:06-03:54:43 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   70 43.939" name="Lost heartbeat message from node 2! Expected 725 but got 727"
2018:05:06-03:54:44 gateway-2 ha_daemon[7734]: id="38A0" severity="info" sys="System" sub="ha" seq="S:   99 44.814" name="Monitoring interfaces for link beat: eth1"
2018:05:06-03:54:46 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:  100 46.590" name="Lost heartbeat message from node 1! Expected 920 but got 921"
2018:05:06-03:54:52 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:  101 52.516" name="Lost heartbeat message from node 1! Expected 924 but got 927"
2018:05:06-03:54:53 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:  102 53.808" name="Received no backup heartbeats at interface 'eth1'"
2018:05:06-03:55:12 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:  103 12.554" name="Lost heartbeat message from node 1! Expected 946 but got 947"
2018:05:06-03:55:19 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:  104 19.542" name="Lost heartbeat message from node 1! Expected 953 but got 954"
2018:05:06-03:55:22 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:  105 22.547" name="Lost heartbeat message from node 1! Expected 956 but got 957"
2018:05:06-03:55:39 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:  106 39.574" name="Lost heartbeat message from node 1! Expected 972 but got 974"
2018:05:06-03:55:47 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:  107 47.572" name="Lost heartbeat message from node 1! Expected 981 but got 982"
2018:05:06-03:56:13 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   71 13.600" name="Lost heartbeat message from node 2! Expected 815 but got 816"
2018:05:06-03:56:30 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   72 30.635" name="Lost heartbeat message from node 2! Expected 831 but got 833"
2018:05:06-03:56:33 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:  110 33.646" name="Lost heartbeat message from node 1! Expected 1027 but got 1028"
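
To get a rough picture of how many heartbeats each side is missing, the "Lost heartbeat" warnings above can be tallied per node. The following is a minimal sketch, not a Sophos tool: the function name `summarize_lost_heartbeats` is hypothetical, and it assumes that the difference between the "got" and "Expected" numbers is the count of heartbeats skipped in that warning. The three sample lines are copied from the excerpt above.

```python
import re

# Matches the "Lost heartbeat" warnings in the format shown in the excerpt above.
LOST_RE = re.compile(
    r'^(?P<ts>\S+) (?P<host>\S+) ha_daemon\[\d+\]: .*'
    r'name="Lost heartbeat message from node (?P<peer>\d+)! '
    r'Expected (?P<expected>\d+) but got (?P<got>\d+)"'
)


def summarize_lost_heartbeats(lines):
    """Return {(reporting_host, peer_node): [warning_count, skipped_heartbeats]}.

    Assumption: "got - expected" is the number of heartbeats skipped.
    """
    summary = {}
    for line in lines:
        m = LOST_RE.match(line.strip())
        if not m:
            continue  # ignore info lines and other daemons
        key = (m["host"], int(m["peer"]))
        skipped = int(m["got"]) - int(m["expected"])
        entry = summary.setdefault(key, [0, 0])
        entry[0] += 1
        entry[1] += skipped
    return summary


# Three lines taken verbatim from the log excerpt.
sample = [
    '2018:05:06-03:51:47 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   53 47.064" name="Lost heartbeat message from node 2! Expected 550 but got 551"',
    '2018:05:06-03:54:44 gateway-2 ha_daemon[7734]: id="38A1" severity="warn" sys="System" sub="ha" seq="S:   87 44.542" name="Lost heartbeat message from node 1! Expected 849 but got 851"',
    '2018:05:06-03:53:47 gateway-1 ha_daemon[20000]: id="38A1" severity="warn" sys="System" sub="ha" seq="M:   60 47.745" name="Lost heartbeat message from node 2! Expected 668 but got 671"',
]

for (host, peer), (warnings, skipped) in sorted(summarize_lost_heartbeats(sample).items()):
    print(f"{host} lost heartbeats from node {peer}: {warnings} warnings, {skipped} skipped")
```

Running the same function over the full HA log would show whether both nodes are losing heartbeats at a similar rate, which is what the excerpt suggests, or whether the loss is mostly in one direction.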


  • Hello JayMan,

    For the initial HA connection, a data rate of up to 1000 Mbit/s is OK, but not on a running system. It looks like there could be a network issue, because each warning shows a heartbeat arriving with a higher sequence number than expected. This happens on the backup interface eth1 too.

    Could you share some more details about the network setup of your system?

    Best

    Alex

