
SG-330 Node2 Powers off during HA Replication

Hello all, I'm having some issues with a pair of SG-330s running in HA Active-Passive mode.

When I power Node2 on, it stays up for about a minute, begins synchronizing, and then powers off with seemingly no warning.

When I power Node2 on without the replication or other network cable connected, it stays alive until the replication interface is patched back in.

Logs from Node1 are as follows:

2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 504 02.037" name="Netlink: Found link beat on eth3 again!"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 505 02.235" name="Another master around!"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 506 02.235" name="Node 2 changed version! 0.000000 -> 9.710001"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A1" severity="warn" sys="System" sub="ha" seq="M: 507 02.235" name="Lost heartbeat message from node 2! Expected 11 but got 77"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38C0" severity="info" sys="System" sub="ha" seq="M: 508 02.235" name="Node 2 is alive"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 509 02.235" name="Node 2 changed state: DEAD(2048) -> UNLINKED(1)"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 510 02.235" name="Node 2 changed role: DEAD -> MASTER"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 511 02.235" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 512 02.235" name="Enforce MASTER, Resending gratuitous arp"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 513 02.235" name="Executing (nowait) /etc/init.d/ha_mode enforce_master"
2022:07:26-13:26:02 portal-1 ha_mode[23714]: calling topology_changed
2022:07:26-13:26:02 portal-1 ha_mode[23714]: topology_changed: waiting for last ha_mode done
2022:07:26-13:26:02 portal-1 ha_mode[23716]: calling enforce_master
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 514 02.312" name="Node 2 changed state: UNLINKED(1) -> SYNCING(3)"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 515 02.312" name="Node 2 changed role: MASTER -> SLAVE"
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 516 02.313" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip 198.19.250.2"
2022:07:26-13:26:02 portal-1 ha_mode[23714]: repctl[23789]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:02 portal-1 repctl[30889]: [i] recheck(1057): got HUP: replication recheck triggered Setup_replication_done = 0
2022:07:26-13:26:02 portal-1 repctl[23789]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:02 portal-1 repctl[23789]: [i] daemonize_check(1497): trying to signal daemon and exit
2022:07:26-13:26:02 portal-1 repctl[30889]: [i] execute(1768): pg_ctl: no server running
2022:07:26-13:26:02 portal-1 ha_mode[23714]: topology_changed done (started at 13:26:02)
2022:07:26-13:26:02 portal-1 ha_mode[23716]: enforce_master: waiting for last ha_mode done
2022:07:26-13:26:02 portal-1 ha_mode[23716]: enforce_master
2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 517 02.566" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
2022:07:26-13:26:02 portal-1 ha_mode[23798]: calling topology_changed
2022:07:26-13:26:02 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:02 portal-1 repctl[30889]: [e] master_connection(2045): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:03 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 518 03.401" name="Reading cluster configuration"
2022:07:26-13:26:05 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:05 portal-1 repctl[30889]: [e] master_connection(2045): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:07 portal-1 ha_mode[23716]: /var/mdw/scripts/confd-sync: /usr/local/bin/confd-sync stopped
2022:07:26-13:26:07 portal-1 ha_mode[23716]: /var/mdw/scripts/confd-sync: /usr/local/bin/confd-sync started
2022:07:26-13:26:07 portal-1 ha_mode[23716]: enforce_master done (started at 13:26:02)
2022:07:26-13:26:07 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 519 07.992" name="Set syncing.files for node 2"
2022:07:26-13:26:08 portal-1 ha_mode[23798]: topology_changed: waiting for last ha_mode done
2022:07:26-13:26:08 portal-1 ha_mode[23798]: repctl[24905]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:08 portal-1 repctl[24905]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:08 portal-1 repctl[24905]: [i] daemonize_check(1497): trying to signal daemon and exit
2022:07:26-13:26:08 portal-1 ha_mode[23798]: topology_changed done (started at 13:26:02)
2022:07:26-13:26:09 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:09 portal-1 repctl[30889]: [e] master_connection(2045): FATAL: password authentication failed for user "repmgr"
2022:07:26-13:26:16 portal-1 ha_daemon[6982]: id="38C1" severity="error" sys="System" sub="ha" seq="M: 520 16.992" name="Node 2 is dead, received no heart beats"
2022:07:26-13:26:16 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 521 16.992" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.1 slave_ip ''"
2022:07:26-13:26:17 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 522 17.222" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
2022:07:26-13:26:17 portal-1 ha_mode[26600]: calling topology_changed
2022:07:26-13:26:17 portal-1 ha_mode[26600]: topology_changed: waiting for last ha_mode done
2022:07:26-13:26:17 portal-1 ha_mode[26600]: repctl[26650]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:17 portal-1 ha_mode[26600]: topology_changed done (started at 13:26:17)
2022:07:26-13:26:17 portal-1 repctl[26650]: [i] daemonize_check(1480): daemonized, see syslog for further messages
2022:07:26-13:26:17 portal-1 repctl[26650]: [i] daemonize_check(1497): trying to signal daemon and exit
2022:07:26-13:26:18 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 523 18.000" name="Reading cluster configuration"
2022:07:26-13:26:18 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 524 18.543" name="Found no interfaces in /etc/ha/lbeat_interfaces for link beat monitoring"
2022:07:26-13:26:42 portal-1 repctl[30889]: [e] db_connect(2203): timeout while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2)
2022:07:26-13:26:42 portal-1 repctl[30889]: [e] master_connection(2045): (timeout)
2022:07:26-19:26:42 portal-1 conntrack-tools[7742]: no dedicated links available!
2022:07:26-13:26:45 portal-1 repctl[30889]: [c] prepare_secondary(315): prepare_secondary failed because master db's status can't be determined! Maybe unreachable?
2022:07:26-13:26:45 portal-1 repctl[30889]: [c] setup_replication(274): setup_replication was not properly executed
2022:07:26-13:26:45 portal-1 repctl[30889]: [i] setup_replication(278): checkinterval 300
2022:07:26-13:26:45 portal-1 repctl[30889]: [i] recheck(1057): got HUP: replication recheck triggered Setup_replication_done = 0
2022:07:26-13:26:45 portal-1 repctl[30889]: [i] execute(1768): pg_ctl: no server running
2022:07:26-13:26:46 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 525 46.385" name="Found no interfaces in /etc/ha/lbeat_interfaces for link beat monitoring"
2022:07:26-13:26:46 portal-1 ha_daemon[6982]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 526 46.385" name="Netlink: Lost link beat on eth3!"
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: No route to host
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] db_connect(2206): Is the server running on host "198.19.250.2" and accepting
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] db_connect(2206): TCP/IP connections on port 5432?
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] master_connection(2045): could not connect to server: No route to host
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] master_connection(2045): Is the server running on host "198.19.250.2" and accepting
2022:07:26-13:27:03 portal-1 repctl[30889]: [e] master_connection(2045): TCP/IP connections on port 5432?

This strikes me as bizarre behavior. I'd much appreciate any pointers.

I'm not sure whether they have ever worked, as I am just adopting this network and the device was powered off when I first laid eyes on it.
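
For anyone reading a log like the one above, here is a minimal Python sketch (not a Sophos tool) that pulls out the Node 2 alive/dead, state, and role events so the sequence is easier to follow. The regular expression is an assumption inferred only from the ha_daemon lines quoted in this post, and `node_events` and `SAMPLE` are hypothetical names used for illustration.

```python
import re
from datetime import datetime

# Timestamp layout as it appears in the log, e.g. 2022:07:26-13:26:02
TS_FORMAT = "%Y:%m:%d-%H:%M:%S"

# Assumed pattern, inferred only from the ha_daemon lines quoted in this post.
EVENT_RE = re.compile(
    r'^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) \S+ ha_daemon\[\d+\]: '
    r'.*name="(?P<msg>Node \d+ (?:changed (?:state|role): [^"]*|is alive|is dead[^"]*))"'
)


def node_events(lines):
    """Return (timestamp, message) pairs for node state, role, alive and dead events."""
    events = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match:
            when = datetime.strptime(match.group("ts"), TS_FORMAT)
            events.append((when, match.group("msg")))
    return events


# Four lines copied from the log above.
SAMPLE = [
    '2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38C0" severity="info" sys="System" sub="ha" seq="M: 508 02.235" name="Node 2 is alive"',
    '2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 509 02.235" name="Node 2 changed state: DEAD(2048) -> UNLINKED(1)"',
    '2022:07:26-13:26:02 portal-1 ha_daemon[6982]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 514 02.312" name="Node 2 changed state: UNLINKED(1) -> SYNCING(3)"',
    '2022:07:26-13:26:16 portal-1 ha_daemon[6982]: id="38C1" severity="error" sys="System" sub="ha" seq="M: 520 16.992" name="Node 2 is dead, received no heart beats"',
]

if __name__ == "__main__":
    events = node_events(SAMPLE)
    for when, msg in events:
        print(when.time(), msg)
    elapsed = (events[-1][0] - events[0][0]).total_seconds()
    print("seconds from first to last event:", elapsed)
```

On the lines quoted, Node 2 is reported alive at 13:26:02 and dead at 13:26:16, so the master saw heartbeats for roughly 14 seconds after the node reached SYNCING.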



  • There are several possibilities:

    1. Missing license

    2. Firmware differences between the nodes

    3. A new slave device, while the old slave still occupies the slot

    4. And some more...

    There may be more information in the HA log on the slave.


    Dirk

    Systema Gesellschaft für angewandte Datentechnik mbH  // Sophos Platinum Partner
    Sophos Solution Partner since 2003
    If a post solves your question, click the 'Verify Answer' link at this post.

  • Hi Elliana and welcome to the UTM Community!

    I just did a five-year search in a client's high-availability log for 'repmgr' and found nothing.  It seems that your problem is

          timeout while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2)

    My guess is that you will need to disconnect the Slave and re-image it from ISO.  My standard suggestion after that is:

         1. If needed, do a quick, temporary install so that the Slave can download Up2Dates.
         2. Apply the Up2Dates to the same version as the current Master, do a factory reset and shutdown.
         3. On the current Master, on the 'Configuration' tab of 'High Availability':
              a. Disable and then enable Hot-Standby
              b. Select eth3 as the Sync NIC
              c. Configure it as Node_1
              d. Enter an encryption key (I've never found a need to remember it)
              e. Select 'Enable automatic configuration of new devices'
              f. I prefer to use 'Preferred Master: None' and 'Backup interface: Internal'
         4. Cable eth3 to eth3 on the new device.
         5. Cable all of the other NICs exactly as they are on the original UTM.
         6. Power up the new device and wait for the good news. ;-)

    You should open a case with Sophos Support and get their agreement.  Any luck with that?

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA