Hello all, I'm having some issues with a pair of SG-330s running in HA Active-Passive mode.
When I power Node2 on, it stays up for about a minute, begins synchronizing, and then powers off with seemingly no warning.
When I power Node2 on with neither the replication cable nor any other network cable connected, it stays up until the replication interface is patched back in.
Logs from Node1 are as follows:
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 504 02.037" name="Netlink: Found link beat on eth3 again!"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 505 02.235" name="Another master around!"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 506 02.235" name="Node 2 changed version! 0.000000 -> 9.710001"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 507 02.235" name="Lost heartbeat message from node 2! Expected 11 but got 77"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38C0" severity="info" sys="System" sub="ha" seq="M: 508 02.235" name="Node 2 is alive"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 509 02.235" name="Node 2 changed state: DEAD(2048) -> UNLINKED(1)"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 510 02.235" name="Node 2 changed role: DEAD -> MASTER"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 511 02.235" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 512 02.235" name="Enforce MASTER, Resending gratuitous arp"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 513 02.235" name="Executing (nowait) /etc/init.d/ha_mode enforce_master"
2022:07:26-13:26:02 portal-1 ha_mode[23714]: calling topology_changed
2022:07:26-13:26:02 portal-1 ha_mode[23714]: topology_changed: waiting for last ha_mode done
2022:07:26-13:26:02 portal-1 ha_mode[23716]: calling enforce_master
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 514 02.312" name="Node 2 changed state: UNLINKED(1) -> SYNCING(3)"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 515 02.312" name="Node 2 changed role: MASTER -> SLAVE"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 516 02.313" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip 198.19.250.2"
2022:07:26-13:26:02 portal-1 ha_mode[23714]: repctl[23789]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:02 portal-1 repctl[30889]: [i] recheck(1057): got HUP: replication recheck triggered Setup_replication_done = 0
2022:07:26-13:26:02 portal-1 repctl[23789]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:02 portal-1 repctl[23789]: [i] daemonize_check(1497): trying to signal daemon and exit
2022:07:26-13:26:02 portal-1 repctl[30889]: [i] execute(1768): pg_ctl: no server running
2022:07:26-13:26:02 portal-1 ha_mode[23714]: topology_changed done (started at 13:26:02)
2022:07:26-13:26:02 portal-1 ha_mode[23716]: enforce_master: waiting for last ha_mode done
2022:07:26-13:26:02 portal-1 ha_mode[23716]: enforce_master
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 517 02.566" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
2022:07:26-13:26:02 portal-1 ha_mode[23798]: calling topology_changed
2022:07:26-13:26:02 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:02 portal-1 repctl[30889]: [e] master_connection(2045): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:03 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 518 03.401" name="Reading cluster configuration"
2022:07:26-13:26:05 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:05 portal-1 repctl[30889]: [e] master_connection(2045): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:07 portal-1 ha_mode[23716]: /var/mdw/scripts/confd-sync: /usr/local/bin/confd-sync stopped
2022:07:26-13:26:07 portal-1 ha_mode[23716]: /var/mdw/scripts/confd-sync: /usr/local/bin/confd-sync started
2022:07:26-13:26:07 portal-1 ha_mode[23716]: enforce_master done (started at 13:26:02)
2022:07:26-13:26:07 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 519 07.992" name="Set syncing.files for node 2"
2022:07:26-13:26:08 portal-1 ha_mode[23798]: topology_changed: waiting for last ha_mode done
2022:07:26-13:26:08 portal-1 ha_mode[23798]: repctl[24905]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:08 portal-1 repctl[24905]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:08 portal-1 repctl[24905]: [i] daemonize_check(1497): trying to signal daemon and exit
2022:07:26-13:26:08 portal-1 ha_mode[23798]: topology_changed done (started at 13:26:02)
2022:07:26-13:26:09 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:09 portal-1 repctl[30889]: [e] master_connection(2045): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:16 portal-1 ha_daemon[6982]: id="38C1" severity="error" sys="System" sub="ha" seq="M: 520 16.992" name="Node 2 is dead, received no heart beats"
2022:07:26-13:26:16 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 521 16.992" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip ''"
2022:07:26-13:26:17 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 522 17.222" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
2022:07:26-13:26:17 portal-1 ha_mode[26600]: calling topology_changed
2022:07:26-13:26:17 portal-1 ha_mode[26600]: topology_changed: waiting for last ha_mode done
2022:07:26-13:26:17 portal-1 ha_mode[26600]: repctl[26650]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:17 portal-1 ha_mode[26600]: topology_changed done (started at 13:26:17)
2022:07:26-13:26:17 portal-1 repctl[26650]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:17 portal-1 repctl[26650]: [i] daemonize_check(1497): trying to signal daemon and exit
2022:07:26-13:26:18 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 523 18.000" name="Reading cluster configuration"
2022:07:26-13:26:18 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 524 18.543" name="Found no interfaces in /etc/ha/lbeat_interfaces for link beat monitoring"
2022:07:26-13:26:42 portal-1 repctl[30889]: [e] db_connect(2203): timeout while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2)
2022:07:26-13:26:42 portal-1 repctl[30889]: [e] master_connection(2045): (timeout)
2022:07:26-19:26:42 portal-1 conntrack-tools[7742]: no dedicated links available!
2022:07:26-13:26:45 portal-1 repctl[30889]: [c] prepare_secondary(315): prepare_secondary failed because master db's status can't be determined! Maybe unreachable?
2022:07:26-13:26:45 portal-1 repctl[30889]: [c] setup_replication(274): setup_replication was not properly executed
2022:07:26-13:26:45 portal-1 repctl[30889]: [i] setup_replication(278): checkinterval 300
2022:07:26-13:26:45 portal-1 repctl[30889]: [i] recheck(1057): got HUP: replication recheck triggered Setup_replication_done = 0
2022:07:26-13:26:45 portal-1 repctl[30889]: [i] execute(1768): pg_ctl: no server running
2022:07:26-13:26:46 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 525 46.385" name="Found no interfaces in /etc/ha/lbeat_interfaces for link beat monitoring"
2022:07:26-13:26:46 portal-1 ha_daemon[6982]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 526 46.385" name="Netlink: Lost link beat on eth3!"
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: No route to host
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] db_connect(2206): Is the server running on host "198.19.250.2" and accepting
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] db_connect(2206): TCP/IP connections on port 5432?
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] master_connection(2045): could not connect to server: No route to host
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] master_connection(2045): Is the server running on host "198.19.250.2" and accepting
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] master_connection(2045): TCP/IP connections on port 5432?
This strikes me as bizarre behavior; any pointers would be much appreciated.
I'm not sure whether the pair has ever worked, as I have only just taken over this network and the device was powered off when I first laid eyes on it.