This discussion has been locked.

Postgres issue?

I seem to have this happening in my UTM logs:

 

2017:04:24-11:55:30 gw01-2 repctl[14683]: [e] db_connect(2171): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: Connection refused
2017:04:24-11:55:30 gw01-2 repctl[14683]: [e] db_connect(2171): Is the server running on host "198.19.250.2" and accepting
2017:04:24-11:55:30 gw01-2 repctl[14683]: [e] db_connect(2171): TCP/IP connections on port 5432?
2017:04:24-11:55:30 gw01-2 repctl[14683]: [e] master_connection(2010): could not connect to server: Connection refused
2017:04:24-11:55:30 gw01-2 repctl[14683]: [e] master_connection(2010): Is the server running on host "198.19.250.2" and accepting
2017:04:24-11:55:30 gw01-2 repctl[14683]: [e] master_connection(2010): TCP/IP connections on port 5432?
2017:04:24-11:55:30 gw01-2 repctl[14683]: [i] main(188): cannot connect to postgres on master, retry after 1024 seconds
2017:04:24-12:04:08 gw01-1 repctl[4357]: [e] db_connect(2171): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: Connection refused
2017:04:24-12:04:08 gw01-1 repctl[4357]: [e] db_connect(2171): Is the server running on host "198.19.250.2" and accepting
2017:04:24-12:04:08 gw01-1 repctl[4357]: [e] db_connect(2171): TCP/IP connections on port 5432?
2017:04:24-12:04:08 gw01-1 repctl[4357]: [e] master_connection(2010): could not connect to server: Connection refused
2017:04:24-12:04:08 gw01-1 repctl[4357]: [e] master_connection(2010): Is the server running on host "198.19.250.2" and accepting
2017:04:24-12:04:08 gw01-1 repctl[4357]: [e] master_connection(2010): TCP/IP connections on port 5432?
2017:04:24-12:04:08 gw01-1 repctl[4357]: [i] main(188): cannot connect to postgres on master, retry after 1024 seconds
 
It's slightly concerning, to say the least. Does anybody know the solution?


  • Hi Louise,

    This looks like a connectivity issue on the High Availability link between your Master and Slave UTMs. You can check connectivity by logging in to the UTM shell and running "ping 198.19.250.2". You should see something like this:

    <M> utm-cpc:/home/login # ping 198.19.250.2
    PING 198.19.250.2 (198.19.250.2) 56(84) bytes of data.
    64 bytes from 198.19.250.2: icmp_seq=1 ttl=64 time=0.784 ms
    64 bytes from 198.19.250.2: icmp_seq=2 ttl=64 time=0.805 ms
    64 bytes from 198.19.250.2: icmp_seq=3 ttl=64 time=0.468 ms
    64 bytes from 198.19.250.2: icmp_seq=4 ttl=64 time=0.724 ms
    64 bytes from 198.19.250.2: icmp_seq=5 ttl=64 time=0.524 ms
    64 bytes from 198.19.250.2: icmp_seq=6 ttl=64 time=0.682 ms
    64 bytes from 198.19.250.2: icmp_seq=7 ttl=64 time=0.430 ms
    64 bytes from 198.19.250.2: icmp_seq=8 ttl=64 time=2.31 ms
    64 bytes from 198.19.250.2: icmp_seq=9 ttl=64 time=0.346 ms
    ^C
    --- 198.19.250.2 ping statistics ---
    9 packets transmitted, 9 received, 0% packet loss, time 8000ms
    rtt min/avg/max/mdev = 0.346/0.786/2.316/0.563 ms
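
A ping only shows that the host answers ICMP. The "Connection refused" in the log usually means the host is reachable but nothing is accepting connections on TCP port 5432, so it may be worth testing that port directly as well. The following is a minimal sketch, assuming Python 3 is available on a machine that can reach the HA address; the function name `check_tcp` and the 3-second timeout are illustrative choices, not part of the UTM tooling.

```python
import socket


def check_tcp(host, port=5432, timeout=3.0):
    """Try a TCP connection and classify the result.

    Returns "open" if the connection succeeds, "refused" if the host
    actively rejects it, and "unreachable" for timeouts or other
    network errors.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host answered, but nothing is listening on that port.
        return "refused"
    except OSError:
        # Covers timeouts, no route to host, and similar failures.
        return "unreachable"


# Example call (address taken from the log above):
#   print(check_tcp("198.19.250.2"))
```

A result of "refused" alongside a working ping would point at Postgres not running or not listening on the master, rather than at the HA link itself.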

  • Well, it's not good.

    The first symptom we had of this was users complaining that their internet connection was slow. And it was. We also struggled to log on to the UTM. The master's swap was at 27%, which we hadn't seen before on our SG310.

    So we rebooted the master to see if that had any effect, and yes, it did. The slave took over and everything ran fine.

    After the reboot, we switched back to the master and it appeared to work (swap % looked fine and everything was running, or so we thought).

    We were, and still are, getting out to the internet, but we've lost all mail filtering and Postgres is complaining and won't start at all.

    Now, we had a nightmare with Sophos support last time (we have premium support and we went through the portal).

    So this time I called them directly and, to their credit, they tried a few things to get it going, but to no avail.

    It's now been escalated to the next level, so we will see where that goes.
    Hopefully it's fairly quick, because after last time taking 8 weeks or so to semi-sort something, we aren't going to be in the mood to wait another week for a fix.

    In the meantime, I've redirected our mail to our other UTM active/passive cluster, which is behaving itself.

    So we have internet access out of the UTM that's semi-broken and email out of the one at the other site that is OK.

    Not sure what happened, as no configuration changes were being made at the time.

  • If you shut down the Slave, does the Master's performance improve?  That would indicate that there are problems with the HA synchronization, which can impact Postgres and cause those types of symptoms.
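
To compare the Master's state before and after shutting down the Slave, swap usage can be read from /proc/meminfo on the shell. This is a rough sketch, assuming Python 3 is available and the standard Linux `SwapTotal` and `SwapFree` fields are present; the function name `swap_used_percent` is hypothetical, and the figure may not match exactly how WebAdmin calculates its swap percentage.

```python
def swap_used_percent(meminfo_text):
    """Compute swap usage in percent from the text of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        key = key.strip()
        # Only the two swap fields are needed; values are reported in kB.
        if sep and key in ("SwapTotal", "SwapFree") and rest.split():
            values[key] = int(rest.split()[0])
    total = values.get("SwapTotal", 0)
    free = values.get("SwapFree", 0)
    if total == 0:
        # No swap configured (or fields missing): report zero usage.
        return 0.0
    return 100.0 * (total - free) / total


# Example call on a Linux host:
#   with open("/proc/meminfo") as f:
#       print(round(swap_used_percent(f.read()), 1))
```

Taking one reading with the Slave running and another after it has been shut down for a while gives a simple before/after comparison.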

  • Not sure what went wrong, to be honest. The master slowed down and swap went to 27%. Switching to the slave improved it. We rebooted the master, swap was 0%, we switched back over to the master, things went OK, and then performance went down again over the next hour.

    Sophos sorted it within a day although it did require a db rebuild.
