High CPU load after update to UTM 9.700-5

Since the Up2Date to 9.700-5, the CPU on our master SG210 has been running at >94%.

Top is showing this: 

Cpu(s): 41.7%us, 47.9%sy,  0.0%ni,  5.0%id,  0.0%wa,  0.0%hi,  5.3%si,  0.0%st

The HA log says:

2020:01:07-16:57:51 repctl[10837]: [c] standby_clone(936): 'module' => 'postgres-default'
2020:01:07-16:57:51 repctl[10837]: [c] standby_clone(936): };
2020:01:07-16:57:51 repctl[10837]: [c] standby_clone(936): (Attempt #:10)
2020:01:07-16:58:01 repctl[10837]: [w] master_connection(2015): check_dbh: -1
2020:01:07-16:58:03 repctl[10837]: [i] stop_backup_mode(765): stopped backup mode at 00000001000010B0000000DD
2020:01:07-16:58:03 repctl[10837]: [c] standby_clone(950): standby_clone failed: sync aborted (never executed successfully)
2020:01:07-16:58:03 repctl[10837]: [e] prepare_secondary(346): prepare_secondary: clone failed
2020:01:07-16:58:03 repctl[10837]: [c] prepare_secondary(360): failed to get database up, waiting for retry
2020:01:07-16:58:03 repctl[10837]: [c] setup_replication(274): setup_replication was not properly executed
2020:01:07-16:58:03 repctl[10837]: [i] setup_replication(278): checkinterval 300
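
A quick sanity check along these lines (generic Linux commands on the UTM shell; the 198.19.250.x addresses are our HA-link IPs, so adjust for your setup) should show whether anything is listening on the port repctl complains about and whether the peer is reachable over the HA link:

# is anything listening on the PostgreSQL port repctl uses?
netstat -ltn | grep ':5432'
# is the peer node reachable over the HA link?
ping -c 3 198.19.250.2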

Any remediation suggestions?

Regards,

Koen

  • It looks like there are HA/replication problems.

    Are the devices directly connected (without a switch between them)?

    First, I would try rebooting the slave. A quick check of the HA link is sketched below.
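
    A minimal sketch, run from the shell of either node; eth3 as the dedicated HA sync interface and the 198.19.250.1 peer address are assumptions about your setup, so substitute your own values:

    # link state and error counters on the HA NIC
    ip -s link show dev eth3
    # can this node reach the peer's HA address?
    ping -c 3 198.19.250.1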

  • In reply to dirkkotte:

    Just rebooted the slave; it stayed in the SYNCING state for at least 8 minutes before getting back to READY.

    >master_ip 198.19.250.2

    >slave_ip 198.19.250.1

    2020:01:07-17:09:10 <SLAVE > ha_daemon[4249]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 47 10.995" name="Initial synchronization finished!"
    2020:01:07-17:09:10 <SLAVE > ha_daemon[4249]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 48 10.995" name="state change SYNCING(2) -> ACTIVE(0)"
    2020:01:07-17:09:11 <MASTER> ha_daemon[4172]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 963 11.459" name="Node 1 changed state: SYNCING(2) -> ACTIVE(0)"
    2020:01:07-17:12:53 <MASTER> repctl[4309]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: Connection refused
    2020:01:07-17:12:53 <MASTER> repctl[4309]: [e] db_connect(2206): Is the server running on host "198.19.250.2" and accepting
    2020:01:07-17:12:53 <MASTER> repctl[4309]: [e] db_connect(2206): TCP/IP connections on port 5432?
    2020:01:07-17:12:53 <MASTER> repctl[4309]: [e] master_connection(2045): could not connect to server: Connection refused
    2020:01:07-17:12:53 <MASTER> repctl[4309]: [e] master_connection(2045): Is the server running on host "198.19.250.2" and accepting
    2020:01:07-17:12:53 <MASTER> repctl[4309]: [e] master_connection(2045): TCP/IP connections on port 5432?
    2020:01:07-17:12:53 <MASTER> repctl[4309]: [i] main(188): cannot connect to postgres on master, retry after 512 seconds
    2020:01:07-17:13:02 <SLAVE > repctl[10837]: [i] recheck(1057): got ALRM: replication recheck triggered Setup_replication_done = 0
    2020:01:07-17:13:02 <SLAVE > repctl[10837]: [i] execute(1768): pg_ctl: no server running
    2020:01:07-17:13:02 <SLAVE > repctl[10837]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.1): could not connect to server: Connection refused
    2020:01:07-17:13:02 <SLAVE > repctl[10837]: [e] db_connect(2206): Is the server running on host "198.19.250.1" and accepting
    2020:01:07-17:13:02 <SLAVE > repctl[10837]: [e] db_connect(2206): TCP/IP connections on port 5432?
    2020:01:07-17:13:02 <SLAVE > repctl[10837]: [e] master_connection(2045): could not connect to server: Connection refused
    2020:01:07-17:13:02 <SLAVE > repctl[10837]: [e] master_connection(2045): Is the server running on host "198.19.250.1" and accepting
    2020:01:07-17:13:02 <SLAVE > repctl[10837]: [e] master_connection(2045): TCP/IP connections on port 5432?
    2020:01:07-17:13:05 <SLAVE > repctl[10837]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.1): could not connect to server: Connection refused
    2020:01:07-17:13:05 <SLAVE > repctl[10837]: [e] db_connect(2206): Is the server running on host "198.19.250.1" and accepting
    2020:01:07-17:13:05 <SLAVE > repctl[10837]: [e] db_connect(2206): TCP/IP connections on port 5432?
    2020:01:07-17:13:05 <SLAVE > repctl[10837]: [e] master_connection(2045): could not connect to server: Connection refused
    2020:01:07-17:13:05 <SLAVE > repctl[10837]: [e] master_connection(2045): Is the server running on host "198.19.250.1" and accepting
    2020:01:07-17:13:05 <SLAVE > repctl[10837]: [e] master_connection(2045): TCP/IP connections on port 5432?
    2020:01:07-17:13:08 <SLAVE > repctl[10837]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.1): could not connect to server: Connection refused
    2020:01:07-17:13:08 <SLAVE > repctl[10837]: [e] db_connect(2206): Is the server running on host "198.19.250.1" and accepting
    2020:01:07-17:13:08 <SLAVE > repctl[10837]: [e] db_connect(2206): TCP/IP connections on port 5432?
    2020:01:07-17:13:08 <SLAVE > repctl[10837]: [e] master_connection(2045): could not connect to server: Connection refused
    2020:01:07-17:13:08 <SLAVE > repctl[10837]: [e] master_connection(2045): Is the server running on host "198.19.250.1" and accepting
    2020:01:07-17:13:08 <SLAVE > repctl[10837]: [e] master_connection(2045): TCP/IP connections on port 5432?
    2020:01:07-17:13:12 <SLAVE > repctl[10837]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.1): could not connect to server: Connection refused
    2020:01:07-17:13:12 <SLAVE > repctl[10837]: [e] db_connect(2206): Is the server running on host "198.19.250.1" and accepting
    2020:01:07-17:13:12 <SLAVE > repctl[10837]: [e] db_connect(2206): TCP/IP connections on port 5432?
    2020:01:07-17:13:12 <SLAVE > repctl[10837]: [e] master_connection(2045): could not connect to server: Connection refused
    2020:01:07-17:13:12 <SLAVE > repctl[10837]: [e] master_connection(2045): Is the server running on host "198.19.250.1" and accepting
    2020:01:07-17:13:12 <SLAVE > repctl[10837]: [e] master_connection(2045): TCP/IP connections on port 5432?
    2020:01:07-17:13:15 <SLAVE > repctl[10837]: [c] prepare_secondary(315): prepare_secondary failed because master db's status can't be determined! Maybe unreachable?
    2020:01:07-17:13:15 <SLAVE > repctl[10837]: [c] setup_replication(274): setup_replication was not properly executed
    2020:01:07-17:13:15 <SLAVE > repctl[10837]: [i] setup_replication(278): checkinterval 300
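
    Both nodes now report "Connection refused" on 5432, so the next thing to check is whether the postmaster is running at all and what gets logged when it dies. A rough sketch (the system.log path is my assumption about where UTM writes these messages):

    # any postgres processes left on this node? (the bracket trick hides the grep itself)
    ps ax | grep '[p]ostgres'
    # recent postgres-related messages in the system log
    grep -i postgres /var/log/system.log | tail -n 50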

     

  • In reply to KoenDooms:

    2020:01:07-17:13:15 <SLAVE > repctl[10837]: [c] prepare_secondary(315): prepare_secondary failed because master db's status can't be determined! Maybe unreachable?

    This looks like the PostgreSQL database on the master is defective.

    If the slave reaches READY state, I would fail over to the slave node (reboot the master)

    ... and then repair/rebuild the database or reinstall the slave. A quick way to confirm that the master DB is really broken is sketched below.
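
    A minimal sketch to confirm that first, assuming psql is available on the UTM shell; the database and host names come from the repctl log above, while the postgres superuser account is an assumption:

    # does the master's repmgr database still accept a trivial query?
    psql -h 198.19.250.2 -U postgres -d repmgr -c 'select 1;'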

     

  • Hi KoenDooms,

    I would suggest you provide the output of the top command; it will help us identify which process is taking the high CPU. Check whether one specific process shows up consuming high CPU. A one-shot capture like the sketch below is easiest to paste.
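
    A minimal sketch (plain procps top in batch mode, nothing UTM-specific; the 40-line cut-off is arbitrary):

    # one iteration, non-interactive, trimmed to the busiest processes
    top -b -n 1 | head -n 40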

    Thanks,

  • In reply to H_Patel:

    I rebooted the master that had the high CPU load, but the slave that is now running as master shows the same high CPU load.

    This is the top output:

    top - 13:27:11 up 20:23, 1 user, load average: 2.84, 2.73, 2.65
    Tasks: 146 total, 4 running, 140 sleeping, 0 stopped, 2 zombie
    Cpu(s): 48.7%us, 43.1%sy, 0.0%ni, 4.2%id, 0.0%wa, 0.0%hi, 4.0%si, 0.0%st
    Mem: 8101360k total, 2444980k used, 5656380k free, 282860k buffers
    Swap: 4194296k total, 0k used, 4194296k free, 790612k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    6259 snort 15 -5 382m 368m 4052 S 21 4.7 122:02.62 snort
    24435 root 20 0 0 0 0 Z 11 0.0 0:00.34 confd.plx <defunct>
    24316 root 20 0 18676 12m 2952 S 7 0.2 0:00.21 pfilter-reporte
    6349 root 20 0 14888 9536 2844 S 6 0.1 66:19.79 syslog-ng
    4668 root 15 -5 24400 20m 596 R 2 0.3 7:07.02 conntrackd
    13097 afcd 19 -1 102m 79m 8004 S 2 1.0 12:13.65 afcd
    3666 root 20 0 8316 4436 196 R 1 0.1 12:55.38 haveged
    3 root 20 0 0 0 0 S 0 0.0 3:18.99 ksoftirqd/0
    4446 root 20 0 13776 9924 2528 S 0 0.1 2:31.54 selfmonng.plx
    12236 root 20 0 4548 1044 872 S 0 0.0 0:01.02 waf-reporter
    24016 root 20 0 0 0 0 S 0 0.0 0:00.93 kworker/0:0
    1 root 20 0 3976 580 508 S 0 0.0 0:01.24 init
    2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
    5 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/0:0H
    7 root RT 0 0 0 0 S 0 0.0 0:21.47 migration/0
    8 root 20 0 0 0 0 S 0 0.0 0:00.00 rcu_bh
    9 root 20 0 0 0 0 S 0 0.0 2:35.41 rcu_sched
    10 root RT 0 0 0 0 S 0 0.0 0:15.80 migration/1
    11 root 20 0 0 0 0 S 0 0.0 2:15.76 ksoftirqd/1
    12 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/1:0
    13 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/1:0H
    14 root 0 -20 0 0 0 S 0 0.0 0:00.00 khelper
    15 root 20 0 0 0 0 S 0 0.0 0:01.34 kworker/u4:1
    123 root 0 -20 0 0 0 S 0 0.0 0:00.00 writeback
    126 root 0 -20 0 0 0 S 0 0.0 0:00.00 bioset
    127 root 0 -20 0 0 0 S 0 0.0 0:00.00 crypto
    129 root 0 -20 0 0 0 S 0 0.0 0:00.00 kblockd
    319 root 20 0 0 0 0 S 0 0.0 0:00.07 khubd
    327 root 0 -20 0 0 0 S 0 0.0 0:00.00 edac-poller
    446 root 20 0 0 0 0 S 0 0.0 0:00.00 kswapd0
    511 root 39 19 0 0 0 S 0 0.0 0:00.32 khugepaged
    512 root 20 0 0 0 0 S 0 0.0 0:00.00 fsnotify_mark
    1142 root 0 -20 0 0 0 S 0 0.0 0:00.00 deferwq
    1158 root 20 0 27936 21m 2972 S 0 0.3 0:00.43 admin-reporter.
    1226 root 0 -20 0 0 0 S 0 0.0 0:00.00 ata_sff
    1244 root 20 0 0 0 0 S 0 0.0 0:00.04 scsi_eh_0
    1247 root 20 0 0 0 0 S 0 0.0 0:00.00 scsi_eh_1
    1250 root 20 0 0 0 0 S 0 0.0 0:00.00 scsi_eh_2
    1253 root 20 0 0 0 0 S 0 0.0 0:00.00 scsi_eh_3
    1257 root 20 0 0 0 0 S 0 0.0 0:00.04 kworker/u4:3
    1501 root 20 0 8108 3204 2544 S 0 0.0 0:00.12 sshd
    2519 root 0 -20 0 0 0 S 0 0.0 0:02.18 kworker/0:1H
    2520 root 0 -20 0 0 0 S 0 0.0 0:00.06 kworker/1:1H
    2582 root 20 0 0 0 0 S 0 0.0 0:01.10 jbd2/sda6-8
    2583 root 0 -20 0 0 0 S 0 0.0 0:00.00 ext4-rsv-conver

  • In reply to KoenDooms:

    Hi KoenDooms,

    Thank you for providing the output of the top command. In your case the "snort" service is the one taking high CPU. Can you please confirm whether there is any recent event of a pattern update failing to install in up2date.log? If you do not see any such log entry, please check your IPS configuration and verify the pattern rule age. A quick way to pull the relevant log lines is sketched below.
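
    A minimal sketch; /var/log/up2date.log as the log location is an assumption, so adjust the path if your logs live elsewhere:

    # recent pattern-update activity and any install failures
    grep -i pattern /var/log/up2date.log | tail -n 20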

    Thanks,

  • In reply to H_Patel:

    Hi KoenDooms,

    Try disabling IPS for one minute and check whether the CPU load goes down ... I think it will not.

  • In reply to dirkkotte:

    No improvement when disabling IPS.

  • In reply to KoenDooms:

    Hi KoenDooms,

    When you disabled IPS, did you check the resource utilization at that time? Is snort still consuming high CPU, or is it some other process?

    Thanks,

  • In reply to H_Patel:

    I disabled IPS, but CPU stayed >90%.

    Seems to be a sync issue between HA devices.

    Maybe I should try to rebuild the database.

    Does anyone know how to do that, or should I involve support?

  • In reply to KoenDooms:

    Hi KoenDooms,

    I would suggest you open a ticket with the support team and PM me the case number.

    Thanks,

  • In reply to KoenDooms:

    There are some hints about rebuilding the PostgreSQL database elsewhere in the forum.
    But you would lose a lot of reporting data and possibly mails, because PostgreSQL holds mail data too ... I think.
    You should work with Sophos Support (or your partner) to solve this problem.
    If you rebuild the database yourself, you may lose your support.

    Please post a note if you need more help.