
SG-330 Node2 Powers off during HA Replication

Hello all, I'm having some issues with a pair of SG-330s running in HA Active-Passive mode.

When I power Node2 on, it stays up for about a minute, begins synchronizing, and then powers off with seemingly no warning.

When I power Node2 on without the replication cable or any other network cable connected, it stays up until the replication interface is patched back in.

Logs from Node1 are as follows:

2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 504 02.037" name="Netlink: Found link beat on eth3 again!"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 505 02.235" name="Another master around!"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 506 02.235" name="Node 2 changed version! 0.000000 -> 9.710001"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 507 02.235" name="Lost heartbeat message from node 2! Expected 11 but got 77"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38C0" severity="info" sys="System" sub="ha" seq="M: 508 02.235" name="Node 2 is alive"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 509 02.235" name="Node 2 changed state: DEAD(2048) -> UNLINKED(1)"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 510 02.235" name="Node 2 changed role: DEAD -> MASTER"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 511 02.235" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 512 02.235" name="Enforce MASTER, Resending gratuitous arp"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 513 02.235" name="Executing (nowait) /etc/init.d/ha_mode enforce_master"
2022:07:26-13:26:02 portal-1 ha_mode[23714]: calling topology_changed
2022:07:26-13:26:02 portal-1 ha_mode[23714]: topology_changed: waiting for last ha_mode done
2022:07:26-13:26:02 portal-1 ha_mode[23716]: calling enforce_master
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 514 02.312" name="Node 2 changed state: UNLINKED(1) -> SYNCING(3)"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 515 02.312" name="Node 2 changed role: MASTER -> SLAVE"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 516 02.313" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip 198.19.250.2"
2022:07:26-13:26:02 portal-1 ha_mode[23714]: repctl[23789]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:02 portal-1 repctl[30889]: [i] recheck(1057): got HUP: replication recheck triggered Setup_replication_done = 0
2022:07:26-13:26:02 portal-1 repctl[23789]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:02 portal-1 repctl[23789]: [i] daemonize_check(1497): trying to signal daemon and exit
2022:07:26-13:26:02 portal-1 repctl[30889]: [i] execute(1768): pg_ctl: no server running
2022:07:26-13:26:02 portal-1 ha_mode[23714]: topology_changed done (started at 13:26:02)
2022:07:26-13:26:02 portal-1 ha_mode[23716]: enforce_master: waiting for last ha_mode done
2022:07:26-13:26:02 portal-1 ha_mode[23716]: enforce_master
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 517 02.566" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
2022:07:26-13:26:02 portal-1 ha_mode[23798]: calling topology_changed
2022:07:26-13:26:02 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:02 portal-1 repctl[30889]: [e] master_connection(2045): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:03 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 518 03.401" name="Reading cluster configuration"
2022:07:26-13:26:05 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:05 portal-1 repctl[30889]: [e] master_connection(2045): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:07 portal-1 ha_mode[23716]: /var/mdw/scripts/confd-sync: /usr/local/bin/confd-sync stopped
2022:07:26-13:26:07 portal-1 ha_mode[23716]: /var/mdw/scripts/confd-sync: /usr/local/bin/confd-sync started
2022:07:26-13:26:07 portal-1 ha_mode[23716]: enforce_master done (started at 13:26:02)
2022:07:26-13:26:07 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 519 07.992" name="Set syncing.files for node 2"
2022:07:26-13:26:08 portal-1 ha_mode[23798]: topology_changed: waiting for last ha_mode done
2022:07:26-13:26:08 portal-1 ha_mode[23798]: repctl[24905]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:08 portal-1 repctl[24905]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:08 portal-1 repctl[24905]: [i] daemonize_check(1497): trying to signal daemon and exit
2022:07:26-13:26:08 portal-1 ha_mode[23798]: topology_changed done (started at 13:26:02)
2022:07:26-13:26:09 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:09 portal-1 repctl[30889]: [e] master_connection(2045): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:16 portal-1 ha_daemon[6982]: id="38C1" severity="error" sys="System" sub="ha" seq="M: 520 16.992" name="Node 2 is dead, received no heart beats"
2022:07:26-13:26:16 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 521 16.992" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip ''"
2022:07:26-13:26:17 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 522 17.222" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
2022:07:26-13:26:17 portal-1 ha_mode[26600]: calling topology_changed
2022:07:26-13:26:17 portal-1 ha_mode[26600]: topology_changed: waiting for last ha_mode done
2022:07:26-13:26:17 portal-1 ha_mode[26600]: repctl[26650]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:17 portal-1 ha_mode[26600]: topology_changed done (started at 13:26:17)
2022:07:26-13:26:17 portal-1 repctl[26650]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:17 portal-1 repctl[26650]: [i] daemonize_check(1497): trying to signal daemon and exit
2022:07:26-13:26:18 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 523 18.000" name="Reading cluster configuration"
2022:07:26-13:26:18 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 524 18.543" name="Found no interfaces in /etc/ha/lbeat_interfaces for link beat monitoring"
2022:07:26-13:26:42 portal-1 repctl[30889]: [e] db_connect(2203): timeout while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2)
2022:07:26-13:26:42 portal-1 repctl[30889]: [e] master_connection(2045): (timeout)
2022:07:26-19:26:42 portal-1 conntrack-tools[7742]: no dedicated links available!
2022:07:26-13:26:45 portal-1 repctl[30889]: [c] prepare_secondary(315): prepare_secondary failed because master db's status can't be determined! Maybe unreachable?
2022:07:26-13:26:45 portal-1 repctl[30889]: [c] setup_replication(274): setup_replication was not properly executed
2022:07:26-13:26:45 portal-1 repctl[30889]: [i] setup_replication(278): checkinterval 300
2022:07:26-13:26:45 portal-1 repctl[30889]: [i] recheck(1057): got HUP: replication recheck triggered Setup_replication_done = 0
2022:07:26-13:26:45 portal-1 repctl[30889]: [i] execute(1768): pg_ctl: no server running
2022:07:26-13:26:46 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 525 46.385" name="Found no interfaces in /etc/ha/lbeat_interfaces for link beat monitoring"
2022:07:26-13:26:46 portal-1 ha_daemon[6982]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 526 46.385" name="Netlink: Lost link beat on eth3!"
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: No route to host
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] db_connect(2206): Is the server running on host "198.19.250.2" and accepting
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] db_connect(2206): TCP/IP connections on port 5432?
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] master_connection(2045): could not connect to server: No route to host
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] master_connection(2045): Is the server running on host "198.19.250.2" and accepting
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] master_connection(2045): TCP/IP connections on port 5432?
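
For anyone reading along, the Node 2 state/role changes and the repctl errors are easier to follow when pulled out of the noise. Below is a minimal sketch in Python, assuming the line format seen in the excerpt above (timestamp, host, process[pid], message). The function name `summarize` and the regular expressions are my own, not part of any Sophos tooling, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpt: <timestamp> <host> <proc>[<pid>]: <message>
LINE_RE = re.compile(r'^(?P<ts>\S+) (?P<host>\S+) (?P<proc>[\w-]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$')
# ha_daemon messages such as: name="Node 2 changed state: DEAD(2048) -> UNLINKED(1)"
CHANGE_RE = re.compile(
    r'Node (?P<node>\d+) changed (?P<what>state|role|version)[!:] '
    r'(?P<old>[^"\s]+) -> (?P<new>[^"\s]+)"'
)
# repctl messages such as: [e] db_connect(2206): error while connecting ...
REPCTL_RE = re.compile(r'^\[(?P<level>[a-z])\] (?P<func>\w+)\((?P<line>\d+)\): (?P<text>.*)$')

SAMPLE = '''
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 509 02.235" name="Node 2 changed state: DEAD(2048) -> UNLINKED(1)"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 515 02.312" name="Node 2 changed role: MASTER -> SLAVE"
2022:07:26-13:26:02 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:02 portal-1 repctl[30889]: [e] master_connection(2045): FATAL: password authentication failed for user "repmgr"
'''


def summarize(lines):
    """Return (changes, errors) extracted from HA log lines.

    changes: list of (timestamp, node, what, old, new) from ha_daemon
    errors:  Counter of repctl function name -> count of [e]/[c] lines
    """
    changes = []
    errors = Counter()
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip blank or unrecognized lines
        msg = m.group("msg")
        if m.group("proc") == "ha_daemon":
            c = CHANGE_RE.search(msg)
            if c:
                changes.append((m.group("ts"), int(c.group("node")),
                                c.group("what"), c.group("old"), c.group("new")))
        elif m.group("proc") == "repctl":
            e = REPCTL_RE.match(msg)
            if e and e.group("level") in ("e", "c"):
                errors[e.group("func")] += 1
    return changes, errors


if __name__ == "__main__":
    changes, errors = summarize(SAMPLE.splitlines())
    for ts, node, what, old, new in changes:
        print(f"{ts} node {node} {what}: {old} -> {new}")
    for func, count in errors.items():
        print(f"repctl {func}: {count} error line(s)")
```

Feeding it the full log instead of `SAMPLE` (for example, the lines of a saved log file) gives a compact timeline of what Node 2 did before it dropped off, next to a count of the replication errors.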

This strikes me as bizarre behavior. I'd much appreciate it if anyone has any pointers.

I'm not sure whether they ever worked in the past, as I am just adopting this network and the device was powered off when I first laid eyes on it.



  • There are a few possible causes:

    1. A missing license

    2. Firmware differences between the nodes

    3. A new slave device, while the old slave still occupies the slot

    4. Possibly others

    There may be more information in the HA log on the slave.


    Dirk

    Systema Gesellschaft für angewandte Datentechnik mbH  // Sophos Platinum Partner
    Sophos Solution Partner since 2003
