This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question, you can start a new discussion.

HA stuck in Syncing

Hi,

Since the last server crash (our UTM is a virtual machine) we have been getting these error messages in the ha.log, and the state has been "Syncing" for three days:

repctl[4062]: [e] db_connect(2058): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: Connection refused
repctl[4062]: [e] master_connection(1904): could not connect to server: Connection refused
repctl[4062]: [e] db_connect(2058): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: Connection refused
repctl[4062]: [e] master_connection(1904): could not connect to server: Connection refused
repctl[4062]: [e] db_connect(2058): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: Connection refused
repctl[4062]: [e] master_connection(1904): could not connect to server: Connection refused
repctl[4062]: [e] db_connect(2058): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: Connection refused
repctl[4062]: [e] master_connection(1904): could not connect to server: Connection refused
repctl[4062]: [i] execute(1627): pg_ctl: could not send stop signal (PID: 6256): No such process
repctl[4062]: [i] recover_master(2296): Using previous master 198.19.250.1 for recovery
repctl[4062]: [i] recover_master(2329): Testing SLAVE/WORKER nodes for rsyncd
repctl[4062]: [c] hasyncmsg(1468): this is a primary node
repctl[4062]: [i] recover_master(2402): MASTER: syncing folder /global/pg_control from 198.19.250.1
repctl[4062]: [i] execute(1627): rsync: failed to connect to 198.19.250.1: Connection refused (111)
repctl[4062]: [c] recover_master(2419): rsync failed on $VAR1 = {
repctl[4062]: [c] recover_master(2428): sync aborted
Is there a way for me to fix this without reinstalling the firewall?

Kind regards


  • You'll lose the reporting data, but:

    Step 1: log in to the master node, su to root
    Step 2: open a new SSH window, log in to the master again, su to root
    Step 3: in the 2nd window, enter: ha_utils ssh
    Step 4: in the 2nd window, log in to the slave as loginuser, then su to root
    Step 5: in both SSH windows, enter: killall repctl
    Step 6: in both SSH windows, enter: /etc/init.d/postgresql92 rebuild
    Step 7: after the database rebuilds, enter in both SSH windows: repctl
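Put together, the steps above amount to running the same short command sequence on both nodes (a sketch of the procedure as described; these commands are UTM-specific and must be run as root on the appliances, not on a general Linux box):

```shell
# Window 1: root shell on the master node.
# Window 2: root shell on the slave node (reached from the master via `ha_utils ssh`).
# Run the following in BOTH windows:

killall repctl                    # Step 5: stop the replication controller
/etc/init.d/postgresql92 rebuild  # Step 6: rebuild the PostgreSQL 9.2 database
                                  #         (this is where the reporting data is lost)
repctl                            # Step 7: restart replication once the rebuild completes
```

Once repctl is running again on both nodes, the HA state should move from "Syncing" back to normal.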

  • Danny,

    Thank you for this post.  Just had to use the information in it - again.  

    Worth adding that output like the following is a good indication that the database is corrupt:

    <M> utm:/home/login # tail -f /var/log/system.log

    2019:05:02-15:20:12 utm-2 ulogd[7753]: pg1: connect: could not connect to server: No such file or directory
    2019:05:02-15:20:17 utm-2 ulogd[7753]: pg1: connect: could not connect to server: No such file or directory
    2019:05:02-15:20:22 utm-2 ulogd[7753]: pg1: connect: could not connect to server: No such file or directory

    ^C
    <M> utm:/home/login # telnet 127.0.0.1 5432

    Trying 127.0.0.1...
    telnet: connect to address 127.0.0.1: Connection refused
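The telnet probe above can also be scripted; a minimal sketch (my own helper, not a UTM tool) using bash's /dev/tcp redirection to test whether PostgreSQL is accepting TCP connections:

```shell
#!/bin/bash
# check_pg: succeed (exit 0) if a TCP connection to host:port can be opened,
# fail otherwise - the same signal as the telnet test above.
check_pg() {
    local host="${1:-127.0.0.1}" port="${2:-5432}"
    # bash-only: opening /dev/tcp/<host>/<port> attempts a TCP connect
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

if check_pg 127.0.0.1 5432; then
    echo "postgres reachable"
else
    echo "postgres NOT reachable - consistent with a corrupt or down database"
fi
```

"Connection refused" (or this script reporting not reachable) on a node where PostgreSQL should be listening on 5432 is the symptom that points at the rebuild procedure above.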

    And to Cheerok's points, I think it would be reasonable to add:

    • Any customer has the right and frequently the need to access the root shell.
    • Any change at root always has the *possibility* of voiding support.
    • Nearly all customers have to use root access to effectively administer their UTMs.
    • If you administer enough UTMs, you'll find that database rebuilds are a common need.

    So as a customer you have to be careful. Be well informed. Contact support where possible, and when that's not possible, tread carefully.

    So the rule is not in place to prevent customers from using root, only to dissuade people from doing "stupid stuff" like trying to install device drivers as root - which actually happened back in the ASG V4 days and resulted in the "root changes may void support" rule.

    All the best,

    Adrien.
