HA update to 9.402-7 on cluster failed on slave node, now it is dead


I started an update to 9.402-7 on my active-passive cluster, but after about 10 minutes the slave node became dead. It is a virtual appliance on an ESXi 6 host; everything worked without a problem before.

This node is not reachable through the network; I can only open a VMware console to it.

From the console I can ping only this cluster node's IP, which is 198.19.250.1, but not the neighbour, 198.19.250.2.

I've rebooted the node and it shows a timeout on the STP interface at boot time.
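
For reference, the basic connectivity checks I can run from the console look like this (eth3 as the name of the HA sync interface is an assumption; substitute your own):

    # ping the local HA address and the neighbour
    ping -c 3 198.19.250.1
    ping -c 3 198.19.250.2

    # check whether the neighbour ever answered ARP on the sync interface
    ip neigh show dev eth3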



  • Hi,

    I think the update was not successful. Disable HA and try to update the virtual appliance separately.

    Thanks

    Sachin Gurung
    Team Lead | Sophos Technical Support
    Knowledge Base  |  @SophosSupport  |  Video tutorials
    Remember to like a post.  If a post (on a question thread) solves your question use the 'This helped me' link.

  • Hi,

    I have one working node with 9.401-11 software. Do you suggest disabling HA on it and trying to update the working node?

    Not the best option in my opinion: the slave update failed and this one could fail too. It would also reboot, and in my environment downtime is not an option.

    On the slave node, the up2date log shows that the update was successful.

    The slave can be reached only through the VMware console, which is like a serial cable. I tried to add another machine (198.19.250.10) to the same network where the HA eth is located, but the slave node is not reachable; the master node works fine.

    Is there a console command to check what is wrong with the network adapters? It looks like they are blocked somehow.

    The "ip a" command shows that the status of the interface on which HA runs is UP.
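
    For deeper checks than "ip a" (UP there only means administratively up), something like this shows whether the link itself is healthy (eth3 as the HA sync interface is an assumption):

        ip -s link show eth3              # RX/TX counters, drops, errors
        ethtool eth3                      # negotiated speed, "Link detected: yes/no"
        cat /sys/class/net/eth3/carrier   # 1 = physical link present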

  • Hi,

    SSH to the master UTM with root privileges. On the command line, type "ha_utils"; this will give you access to the slave UTM. From there, monitor the up2date log file.
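
    A minimal session looks like this (the log path /var/log/up2date.log is an assumption based on the usual UTM layout):

        ssh root@<master-ip>           # root shell on the master
        ha_utils ssh                   # hop onto the slave over the HA link
        tail -f /var/log/up2date.log   # watch the update progress there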

    Thanks

    Sachin Gurung
    Team Lead | Sophos Technical Support
    Knowledge Base  |  @SophosSupport  |  Video tutorials
    Remember to like a post.  If a post (on a question thread) solves your question use the 'This helped me' link.

  • Hi,

    I've solved the problem by disabling HA on the master and reinstalling the slave from 9.355; now I have a working cluster on 9.401.

    I tried to install 9.402 on the slave and join the cluster, but it went into RESERVED state with an error message that it could not connect to the master database.

    2016:05:16-16:36:01 fw-a-1 ha_mode[6296]: 2016-05-16 16:36:01 [6318] [e] [db_connect(2071)] timeout while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2)
    2016:05:16-16:36:01 fw-a-1 ha_mode[6296]: 2016-05-16 16:36:01 [6318] [e] [master_connection(1920)] (timeout)
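
    The log shows a timeout reaching the repmgr database on the master, so a quick test of that path from the slave shell would be something like this (port 5432 and the psql client being available are assumptions):

        # is the master's PostgreSQL port reachable at all? (bash builtin)
        (echo > /dev/tcp/198.19.250.2/5432) && echo open || echo closed

        # try the database the HA code itself uses
        psql -h 198.19.250.2 -d repmgr -c 'SELECT 1;'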

    I could not find an ISO of 9.401, so I downloaded 9.355 and it joined the cluster smoothly.

    I will first build a test lab with the same config to test the update to 9.402.

    Using "ha_utils ssh" was not an option because the master saw the slave as DEAD, not as RESERVED, which means there was no link between the nodes.
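
    The node states can be checked on the master in the HA log (the path /var/log/ha.log is an assumption; the ha_mode entries above look like they come from it):

        tail -f /var/log/ha.log                                # live HA events
        grep -E 'DEAD|RESERVED|ACTIVE' /var/log/ha.log | tail  # recent state changes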

    In the slave shell at boot I saw a timeout when HA was starting.
