This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

UTM slave node became unlicensed after being left in "reserved" state?

Hi,

We are running an active/passive cluster of two SG 135 units. Some time ago we performed an update to 9.702-1 and left one node in the "reserved" state. Unfortunately, we left it in that state for a couple of days.

Today at midnight, the slave went to "DEAD". Today is also the end date of a license that has already been renewed, but I guess that because the slave was reserved, replication had stopped, so the renewed license was never replicated to it. As a result, the slave became unlicensed and went into the DEAD state.

How can I get the slave back into the licensed cluster?

 

Thanks for any advice.

Cheers

Philipp



  • Hey Philipp,

    I would try to reset / remove the dead node and then get the node back in the cluster via auto configuration.
    That can be done without any downtime. Until the end of the process I would detach all Ethernet cables except eth3 from the dead node.

    Best regards 

    Alex

    -

  • Hey Alex,

    thanks for your advice. Do you think this will work even though the dead slave node isn't on the same firmware release as the current master?

    Regards,

    Philipp

    Yes, the 'new' node will run the Up2Date while joining the cluster. I already took that into account.

    Best regards 

    Alex 

    -

  • Perfect! I'll try that ASAP and keep you posted. Many thanks.

  • Hi Alex,

    I just tried that together with the staff of the data center where the appliance is housed.

    Unfortunately, this approach did not work. All interfaces on the slave were unlinked except the heartbeat interface.

    Log messages on the master:

    2020:04:14-17:19:03 xxx-2 ha_daemon[4280]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 516 03.563" name="Netlink: Found link beat on eth3 again!"
    2020:04:14-17:19:08 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 517 08.376" name="Access granted to remote node 1!"
    2020:04:14-17:19:08 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 518 08.587" name="Node 1 changed version! 0.000000 -> 9.701006"
    2020:04:14-17:19:08 xxx-2 ha_daemon[4280]: id="38C0" severity="info" sys="System" sub="ha" seq="M: 519 08.587" name="Node 1 is alive"
    2020:04:14-17:19:08 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 520 08.587" name="Node 1 changed state: DEAD(2048) -> RESERVED(4096)"
    2020:04:14-17:19:08 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 521 08.587" name="Node 1 changed role: DEAD -> SLAVE"
    2020:04:14-17:19:08 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 522 08.589" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.2 slave_ip 198.19.250.1"
    2020:04:14-17:19:08 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 523 08.811" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
    2020:04:14-17:19:08 xxx-2 ha_mode[2161]: calling topology_changed
    2020:04:14-17:19:08 xxx-2 ha_mode[2161]: topology_changed: waiting for last ha_mode done
    2020:04:14-17:19:09 xxx-2 ha_mode[2161]: repctl[2181]: [i] daemonize_check(1480): daemonized, see syslog for further messages
    2020:04:14-17:19:09 xxx-2 repctl[2181]: [i] daemonize_check(1480): daemonized, see syslog for further messages
    2020:04:14-17:19:09 xxx-2 repctl[2181]: [i] daemonize_check(1497): trying to signal daemon and exit
    2020:04:14-17:19:09 xxx-2 repctl[7290]: [i] recheck(1057): got HUP: replication recheck triggered Setup_replication_done = 1
    2020:04:14-17:19:09 xxx-2 ha_mode[2161]: topology_changed done (started at 17:19:08)
    2020:04:14-17:19:09 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 524 09.593" name="Reading cluster configuration"
    2020:04:14-17:19:22 xxx-1 ha_daemon[4234]: id="38A1" severity="warn" sys="System" sub="ha" seq="S: 32 22.516" name="No link on interface eth1"
    2020:04:14-17:19:22 xxx-1 ha_daemon[4234]: id="38A3" severity="debug" sys="System" sub="ha" seq="S: 33 22.516" name="Netlink: Lost link beat on eth0!"
    2020:04:14-17:19:22 xxx-1 ha_daemon[4234]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 35 22.516" name="state change RESERVED(4096) -> RESERVED(4097)"
    2020:04:14-17:19:22 xxx-1 ha_daemon[4234]: id="38A3" severity="debug" sys="System" sub="ha" seq="S: 34 22.516" name="Netlink: Lost link beat on eth0!"
    2020:04:14-17:19:22 xxx-1 ha_daemon[4234]: id="38A3" severity="debug" sys="System" sub="ha" seq="S: 37 22.517" name="Netlink: Lost link beat on eth1!"
    2020:04:14-17:19:22 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 525 22.601" name="Node 1 changed state: RESERVED(4096) -> RESERVED(4097)"
    2020:04:14-17:19:25 xxx-1 ha_daemon[4234]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 52 25.000" name="HA daemon shutting down (SIGTERM)"
    2020:04:14-17:19:25 xxx-1 ha_daemon[4234]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 53 25.000" name="Executing (nowait) /etc/init.d/ha_mode shutdown"
    2020:04:14-17:19:25 xxx-1 ha_mode[6683]: calling shutdown
    2020:04:14-17:19:25 xxx-1 ha_mode[6683]: shutdown: waiting for last ha_mode done
    2020:04:14-17:19:25 xxx-1 ha_mode[6683]: /var/mdw/scripts/confd-sync: /usr/local/bin/confd-sync stopped
    2020:04:14-17:19:25 xxx-1 ha_daemon[4234]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 54 25.013" name="HA daemon exits (SIGTERM)"
    2020:04:14-17:19:25 xxx-1 ha_mode[6683]: shutdown done (started at 17:19:25)
    2020:04:14-17:19:25 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 526 25.105" name="Monitoring interfaces for link beat: eth1 eth0"
    2020:04:14-17:19:27 xxx-2 ha_daemon[4280]: id="38C1" severity="error" sys="System" sub="ha" seq="M: 527 27.856" name="Node 1 is dead, received no heart beats"
    2020:04:14-17:19:27 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 528 27.858" name="Executing (wait) /usr/local/bin/confd-setha mode master master_ip 198.19.250.2 slave_ip ''"
    2020:04:14-17:19:28 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 529 28.088" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
    2020:04:14-17:19:28 xxx-2 ha_mode[2550]: calling topology_changed
    2020:04:14-17:19:28 xxx-2 ha_mode[2550]: topology_changed: waiting for last ha_mode done
    2020:04:14-17:19:28 xxx-2 ha_mode[2550]: repctl[2567]: [i] daemonize_check(1480): daemonized, see syslog for further messages
    2020:04:14-17:19:28 xxx-2 repctl[2567]: [i] daemonize_check(1480): daemonized, see syslog for further messages
    2020:04:14-17:19:28 xxx-2 repctl[2567]: [i] daemonize_check(1497): trying to signal daemon and exit
    2020:04:14-17:19:28 xxx-2 repctl[7290]: [i] recheck(1057): got HUP: replication recheck triggered Setup_replication_done = 1
    2020:04:14-17:19:28 xxx-2 ha_mode[2550]: topology_changed done (started at 17:19:28)
    2020:04:14-17:19:28 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 530 28.818" name="Reading cluster configuration"
    2020:04:14-17:19:44 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 531 44.374" name="Monitoring interfaces for link beat: eth1 eth0"
    2020:04:14-17:19:49 xxx-2 conntrack-tools[4701]: no dedicated links available!
    2020:04:14-17:19:49 xxx-2 ha_daemon[4280]: id="38A3" severity="debug" sys="System" sub="ha" seq="M: 532 49.290" name="Netlink: Lost link beat on eth3!"
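A log like the one above is easier to follow once the state transitions are pulled out. The following is only a minimal sketch in Python, assuming the line layout seen in this thread (timestamp, host name, `ha_daemon[pid]:`, then `key="value"` fields). The function name `state_changes` and the regular expressions are illustrative and not part of any Sophos tool.

```python
import re

# Layout assumed from the lines quoted in this thread only.
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) (?P<host>\S+) '
    r'ha_daemon\[\d+\]: (?P<fields>.*)$'
)
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')
# Matches both 'Node N changed state: A(x) -> B(y)' and 'state change A(x) -> B(y)'.
STATE_RE = re.compile(
    r'(?:Node \d+ changed state: |state change )'
    r'(?P<old>\w+)\((?P<oldc>\d+)\) -> (?P<new>\w+)\((?P<newc>\d+)\)'
)

def state_changes(lines):
    """Return (timestamp, host, old_state, old_code, new_state, new_code) tuples."""
    result = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an ha_daemon line (e.g. ha_mode, repctl)
        fields = dict(FIELD_RE.findall(m.group("fields")))
        s = STATE_RE.search(fields.get("name", ""))
        if s:
            result.append((m.group("ts"), m.group("host"),
                           s["old"], int(s["oldc"]), s["new"], int(s["newc"])))
    return result

# Three lines copied from the log above.
SAMPLE = '''
2020:04:14-17:19:08 xxx-2 ha_daemon[4280]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 520 08.587" name="Node 1 changed state: DEAD(2048) -> RESERVED(4096)"
2020:04:14-17:19:08 xxx-2 ha_mode[2161]: calling topology_changed
2020:04:14-17:19:22 xxx-1 ha_daemon[4234]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 35 22.516" name="state change RESERVED(4096) -> RESERVED(4097)"
'''.strip().splitlines()

for change in state_changes(SAMPLE):
    print(change)
```

Run against the full excerpt, this would list the DEAD -> RESERVED transition at 17:19:08 followed by RESERVED(4096) -> RESERVED(4097) at 17:19:22, shortly before the slave's HA daemon shuts down.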

    After that, the slave remained switched off. We tried to switch it on again, and the same thing happened.

    For a couple of seconds I could see the node as "Reserved" in the web interface, with the option to click "update now". I tried that as well.

    Result was:

    name="HA control: cmd = 'upgrade 1'"
    name="perform upgrade to 9.702001 on node 1"
    name="Node 1 changed state: RESERVED(4096) -> UP2DATE(256)"
    name="state change UP2DATE(256) -> UP2DATE(257)"
    name="HA daemon of node 1 is restarting, waiting 900 seconds before declaring node as dead"
    name="HA daemon shutting down (SIGTERM)"

    Slave stayed powered off.

    Last try: I removed the node from the master after 900 seconds (until then it was still visible as "dead").
    The result was the same as on the second try: the slave won't join and shuts down instead.

    I will now have the unit shipped to me and manually apply the update to 9.702-1. Hopefully it will then join again.
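The logs in this post report versions as `9.701006` and `9.702001`, while the release is referred to as 9.702-1. A small sketch for comparing the two notations follows. It assumes that the last three digits of the log form are the zero-padded patch number. That mapping is inferred from this thread only, not taken from Sophos documentation, and the helper names `parse_version` and `needs_update` are hypothetical.

```python
def parse_version(text):
    """Parse '9.702-1' or '9.702001' into a (major, minor, patch) tuple.

    Assumption (inferred from this thread): in the log form, the first three
    digits after the dot are the minor version and the rest is the patch.
    """
    text = text.strip()
    if "-" in text:
        main, patch = text.split("-", 1)
        major, minor = main.split(".")
        return (int(major), int(minor), int(patch))
    major, rest = text.split(".")
    return (int(major), int(rest[:3]), int(rest[3:] or 0))

def needs_update(node_version, master_version):
    """True if the node runs an older release than the master."""
    return parse_version(node_version) < parse_version(master_version)

print(parse_version("9.702-1"))             # (9, 702, 1)
print(parse_version("9.701006"))            # (9, 701, 6)
print(needs_update("9.701006", "9.702-1"))  # True
```

With the values from this thread, the slave at `9.701006` compares as older than the master at 9.702-1, which is consistent with the node being offered "update now" in the web interface.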


    Cheers,

    Philipp

  • Hi Philipp,

    sorry to hear that. Unfortunately, the small units don’t have a display, so it’s a little difficult to see the status.
    Keep us informed about what you find.

    Best regards 

    Alex 

    -

  • Hello Philipp,

    Here's my standard instruction set for doing what you're doing:

    1. If needed, do a quick, temporary install so that the new device can download Up2Dates.
    2. Apply the Up2Dates to the same version as the current unit, do a factory reset and shutdown.
    3. On the current UTM in use, on the 'Configuration' tab of 'High Availability':
       a. Disable and then enable Hot-Standby
       b. Select eth3 as the Sync NIC
       c. Configure it as Node_1
       d. Enter an encryption key (I've never found a need to remember it)
       e. Select 'Enable automatic configuration of new devices'
       f. I prefer to use 'Preferred Master: None' and 'Backup interface: Internal'
    4. Cable eth3 to eth3 on the new device.
    5. Cable all of the other NICs exactly as they are on the original UTM.
    6. Power up the new device and wait for the good news. ;)

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • Hi Bob, hi Alex,

    it turned out to be even simpler. The slave node was still booting with its old config. I was able to log in, install the renewed license, and perform the update. Afterwards I was able to rejoin it to the cluster with the old settings on the master.

    Thanks.

    Cheers

    Philipp