Hello
Hope to get some advice from this great community, as this problem is driving me crazy
Our specs:
ASG Software, Firmware version: 9.705-3, runing on a VMWare ESXi 6.0U3
We have like 50 RED devices, models 10/15/15w/20, most of them in Standard/Unified configuration
All deployed to remote branch offices with DHCP WAN service, some share the same ISP but with different type of service, copper ADSL or fiber ADSL
The problem we are facing:
During the past months (2 or 3), almost everday, at almost the sametime, a bunch of them start disconecting from the UTM for about a minute or so, obviously the branch office cant access anything until the RED re-establishes the link with the UTM, after this, they keep working until the next day.
Weird thing:
NOT all REDS are getting disconnecting, just a bunch of 15 or so
Everything works before and after this hour (13hrs UTC -6)
Everything else doesnt seems to get affected, LAN attached to the UTM, Remote Access (VPN), Site2Site IPSEC VPN
What we have tried:
The ISP we use for the UTM swears they dont have any kind of filter over our connection, and neither have connectivity issues, seems true as the UTM is accesible via WAN during these periods and other services on UTM, besides RED, works fine
postgresql rebuild (partner guided)
Re-create/Re-generate the affected REDs on system
Change from unified firmware to legacy (cc get red use_unified_firmware) (partner guided)
RED logs:
2021:07:29-12:56:36 XXX red_server[11299]: R20??????????98: Sending json message {"data":{"seq":5},"type":"PONG"} 2021:07:29-12:56:36 XXX red_server[17044]: A32??????????14: command 'NKEY' 2021:07:29-12:56:36 XXX red_server[12461]: A35??????????8E: seems to be still connected, exiting. 2021:07:29-12:56:36 XXX red_server[11299]: R20??????????98: command '{"data":{"message":"Failed to send keepalive frame: Trying to send PING but expecting PONG to receive first","type":"RUNTIME_ERROR_OCCURRED"},"type":"DISCONNECT"}' 2021:07:29-12:56:36 XXX red_server[11299]: R20??????????98: Disconnecting: RUNTIME_ERROR_OCCURRED, Failed to send keepalive frame: Trying to send PING but expecting PONG to receive first 2021:07:29-12:56:36 XXX red_server[12476]: SELF: New connection from 189.???.??.?? with ID A35??????????F3 (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:56:36 XXX red_server[12476]: A35??????????F3: already connected, releasing old connection. 2021:07:29-12:56:37 XXX red_server[12478]: SELF: New connection from 138.???.??.?? with ID A350?????????AF (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:56:37 XXX red_server[14066]: A35??????????E6: command 'PING 0 uplink=WAN' 2021:07:29-12:56:37 XXX red_server[14066]: A35??????????E6: PING remote_tx=0 local_rx=0 diff=0 2021:07:29-12:56:37 XXX red_server[14066]: A35??????????E6: PONG local_tx=0 2021:07:29-12:56:37 XXX red_server[12478]: A35??????????AF: already connected, releasing old connection. 2021:07:29-12:56:37 XXX red_server[11299]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="R20??????????98" forced="1" 2021:07:29-12:56:37 XXX red_server[11299]: R20??????????98 is disconnected. 2021:07:29-12:56:37 XXX red_server[14655]: A350?????????AF: No ping for 30 seconds, exiting. 2021:07:29-12:56:37 XXX red_server[20614]: A32??????????6B: command 'PING 3697 uplink=WAN' 2021:07:29-12:56:37 XXX red_server[20614]: A32??????????6B: PING remote_tx=3697 local_rx=3697 diff=0 2021:07:29-12:56:37 XXX red_server[20614]: A32??????????6B: PONG local_tx=3036 ... 2021:07:29-12:56:37 XXX redctl[12497]: key length: 32 2021:07:29-12:56:39 XXX red2ctl[5057]: Overflow happened on reds7:0 2021:07:29-12:56:39 XXX red2ctl[5057]: Overflow happened on reds8:0 2021:07:29-12:56:39 XXX red2ctl[5057]: Overflow happened on reds49:0 2021:07:29-12:56:39 XXX red2ctl[5057]: Missing keepalive from reds49:0, disabling peer 138.???.??.?? 2021:07:29-12:56:39 XXX red2ctl[5057]: Missing keepalive from reds8:0, disabling peer 200.??.??.??? 2021:07:29-12:56:39 XXX red2ctl[5057]: Missing keepalive from reds7:0, disabling peer 187.???.???.??? 2021:07:29-12:56:39 XXX red_server[12505]: SELF: New connection from 187.???.??.?? with ID A320?????????14 (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:56:40 XXX red_server[12505]: A32??????????14: already connected, releasing old connection. 2021:07:29-12:56:40 XXX red_server[14685]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A35??????????F3" forced="1" ... 2021:07:29-12:56:40 XXX red_server[14129]: R20??????????B0: command '{"data":{"message":"Failed to send keepalive frame: Trying to send PING but expecting PONG to receive first","type":"RUNTIME_ERROR_OCCURRED"},"type":"DISCONNECT"}' 2021:07:29-12:56:40 XXX red_server[14129]: R20??????????B0: Disconnecting: RUNTIME_ERROR_OCCURRED, Failed to send keepalive frame: Trying to send PING but expecting PONG to receive first 2021:07:29-12:56:40 XXX red_server[14957]: A35??????????8D is disconnected. 2021:07:29-12:56:41 XXX red_server[12522]: SELF: New connection from 189.???.???.??? with ID A35??????????17 (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:56:41 XXX red_server[14061]: A35??????????A0: command 'PING 0 uplink=WAN' ... 2021:07:29-12:56:41 XXX red_server[14113]: A35??????????42: PONG local_tx=0 2021:07:29-12:56:41 XXX red_server[14129]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="R20??????????B0" forced="1" 2021:07:29-12:56:41 XXX red_server[12522]: A35??????????17: already connected, releasing old connection. 2021:07:29-12:56:41 XXX red_server[17044]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A32??????????14" forced="1" 2021:07:29-12:56:42 XXX red_server[12529]: SELF: New connection from 187.???.??.??? with ID A35??????????EA (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:56:42 XXX red_server[20350]: A350?????????8E is disconnected. 2021:07:29-12:56:42 XXX red_server[12529]: A350?????????EA: already connected, releasing old connection. 2021:07:29-12:56:42 XXX red_server[14481]: A350?????????C6: command 'PING 0 uplink=WAN' 2021:07:29-12:56:42 XXX red_server[14655]: A35??????????AF is disconnected. 2021:07:29-12:56:42 XXX red2ctl[5057]: Overflow happened on reds15:0 2021:07:29-12:56:42 XXX red2ctl[5057]: Missing keepalive from reds39:0, disabling peer 189.???.??.??? 2021:07:29-12:56:42 XXX red2ctl[5057]: Missing keepalive from reds15:0, disabling peer 189.???.??.?? 2021:07:29-12:56:42 XXX red_server[13962]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A35??????????EA" forced="1" 2021:07:29-12:56:42 XXX red_server[12522]: A35??????????17: seems to be still connected, exiting. 2021:07:29-12:56:42 XXX red_server[13944]: A35??????????09 is disconnected. 2021:07:29-12:56:43 XXX red_server[10201]: A35??????????07: command 'PING 0 uplink=WAN' 2021:07:29-12:56:43 XXX red_server[13978]: A35??????????70: command 'PING 0 uplink=WAN' 2021:07:29-12:56:43 XXX red_server[17044]: A32??????????14 is disconnected. 2021:07:29-12:56:43 XXX red_server[12543]: SELF: Cannot do SSL handshake on socket accept from '189.???.??.???': SSL connect accept failed because of handshake problems 2021:07:29-12:56:44 XXX red_server[12544]: SELF: Cannot do SSL handshake on socket accept from '189.???.??.???': SSL connect accept failed because of handshake problems 2021:07:29-12:56:44 XXX red_server[13916]: A35??????????B7: command 'PING 0 uplink=WAN' 2021:07:29-12:56:45 XXX red_server[14446]: A32??????????2D: command 'PING 8638 uplink=WAN' 2021:07:29-12:56:45 XXX red_server[14446]: A32??????????2D: PING remote_tx=8638 local_rx=8638 diff=0 2021:07:29-12:56:45 XXX red_server[14446]: A32??????????2D: PONG local_tx=12000 2021:07:29-12:56:45 XXX red_server[14445]: A32??????????17: command 'PING 11856 uplink=WAN' ...... 2021:07:29-12:57:51 XXX red_server[13978]: A35????????????70: PING remote_tx=0 local_rx=0 diff=0 2021:07:29-12:57:51 XXX red_server[13978]: A35????????????70: PONG local_tx=0 2021:07:29-12:57:51 XXX red_server[12183]: A32????????????02: connected OK, pushing config 2021:07:29-12:57:51 XXX red_server[14113]: A35????????????42: No ping for 30 seconds, exiting. 2021:07:29-12:57:51 XXX red_server[13995]: A35????????????4D: No ping for 30 seconds, exiting. 2021:07:29-12:57:51 XXX red_server[29169]: A35????????????0C: No ping for 30 seconds, exiting. 2021:07:29-12:57:51 XXX red_server[11669]: A35????????????A3: No ping for 30 seconds, exiting. 2021:07:29-12:57:51 XXX red_server[14635]: A35????????????63: No ping for 30 seconds, exiting. 2021:07:29-12:57:51 XXX red_server[14050]: A35????????????03: No ping for 30 seconds, exiting. 2021:07:29-12:57:51 XXX red_server[13965]: A35????????????B9: No ping for 30 seconds, exiting. 2021:07:29-12:57:51 XXX red_server[18491]: A35????????????3F: No ping for 30 seconds, exiting. 2021:07:29-12:57:56 XXX red_server[12749]: SELF: Cannot do SSL handshake on socket accept from '187.???.??.??': SSL connect accept failed because of handshake problems 2021:07:29-12:57:57 XXX red_server[12739]: SELF: Cannot do SSL handshake on socket accept from '189.???.??.???': SSL connect accept failed because of handshake problems 2021:07:29-12:57:57 XXX red_server[12741]: SELF: Cannot do SSL handshake on socket accept from '187.???.??.???': SSL connect accept failed because of handshake problems 2021:07:29-12:57:57 XXX red_server[12742]: SELF: Cannot do SSL handshake on socket accept from '187.???.??.???': SSL connect accept failed because of handshake problems 2021:07:29-12:57:57 XXX red_server[13916]: A35????????????B7: command 'PING 0 uplink=WAN' 2021:07:29-12:57:57 XXX red_server[32684]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A35????????????2B" forced="0" 2021:07:29-12:57:57 XXX red_server[14635]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A35????????????63" forced="0" 2021:07:29-12:57:57 XXX red_server[14050]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A35????????????03" forced="0" 2021:07:29-12:57:58 XXX red_server[14390]: A32????????????70: command 'PING 14823 uplink=WAN' 2021:07:29-12:57:58 XXX red_server[14390]: A32????????????70: PING remote_tx=14823 local_rx=15290 diff=-467 2021:07:29-12:57:58 XXX red_server[14390]: A32????????????70: PONG local_tx=16287 2021:07:29-12:57:58 XXX red_server[14258]: A32????????????B4: command 'PING 564 uplink=WAN' 2021:07:29-12:57:58 XXX red_server[14258]: A32????????????B4: PING remote_tx=564 local_rx=790 diff=-226 2021:07:29-12:57:58 XXX red_server[14258]: A32????????????B4: PONG local_tx=406 2021:07:29-12:58:00 XXX red_server[12465]: R20????????????98: command '{"data":{"version":"0"},"type":"INIT_CONNECTION"}' 2021:07:29-12:58:00 XXX red_server[12465]: R20????????????98: Initializing connection running protocol version 0 2021:07:29-12:58:00 XXX red_server[12465]: R20????????????98: Sending json message {"data":{},"type":"WELCOME"} 2021:07:29-12:58:01 XXX red_server[29169]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A35????????????0C" forced="0" 2021:07:29-12:58:01 XXX red_server[14113]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A35????????????42" forced="0" 2021:07:29-12:58:01 XXX red_server[13995]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A35????????????4D" forced="0" 2021:07:29-12:58:01 XXX red_server[11669]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A35????????????A3" forced="0" 2021:07:29-12:58:01 XXX red_server[18491]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A35????????????3F" forced="0" 2021:07:29-12:58:04 XXX red_server[13965]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A35????????????B9" forced="0" 2021:07:29-12:58:05 XXX red_server[11741]: A32????????????68: command 'PING 4263 uplink=WAN' 2021:07:29-12:58:05 XXX red_server[14446]: A32????????????2D: command 'PING 9298 uplink=WAN' 2021:07:29-12:58:05 XXX red_server[11741]: A32????????????68: PING remote_tx=4263 local_rx=4264 diff=-1 2021:07:29-12:58:05 XXX red_server[11741]: A32????????????68: PONG local_tx=3326 2021:07:29-12:58:05 XXX red_server[14446]: A32????????????2D: PING remote_tx=9298 local_rx=9306 diff=-8 2021:07:29-12:58:05 XXX red_server[14446]: A32????????????2D: PONG local_tx=12511 2021:07:29-12:58:05 XXX red_server[3541]: A32????????????B1: command 'PING 1582 uplink=WAN' 2021:07:29-12:58:05 XXX red_server[14027]: A32????????????4B: command 'PING 2136 uplink=WAN' 2021:07:29-12:58:05 XXX red_server[3541]: A32????????????B1: PING remote_tx=1582 local_rx=1579 diff=3 2021:07:29-12:58:05 XXX red_server[3541]: A32????????????B1: PONG local_tx=1249 2021:07:29-12:58:05 XXX red_server[14027]: A32????????????4B: PING remote_tx=2136 local_rx=2134 diff=2 2021:07:29-12:58:05 XXX red_server[14027]: A32????????????4B: PONG local_tx=1720 2021:07:29-12:58:08 XXX red_server[12817]: SELF: Cannot do SSL handshake on socket accept from '186.???.??.???': SSL connect accept failed because of handshake problems 2021:07:29-12:58:08 XXX red_server[12859]: SELF: Cannot do SSL handshake on socket accept from '189.???.??.???': SSL connect accept failed because of handshake problems 2021:07:29-12:58:10 XXX red_server[12837]: SELF: Cannot do SSL handshake on socket accept from '189.???.??.???': SSL connect accept failed because of handshake problems 2021:07:29-12:58:10 XXX red_server[12847]: SELF: Cannot do SSL handshake on socket accept from '189.???.???.??': SSL connect accept failed because of handshake problems 2021:07:29-12:58:10 XXX red_server[12853]: SELF: Cannot do SSL handshake on socket accept from '189.???.??.???': SSL connect accept failed because of handshake problems 2021:07:29-12:58:13 XXX red_server[12849]: WARN ------------------------------------------------------- 2021:07:29-12:58:13 XXX red_server[12849]: Use of uninitialized value in subroutine entry at red_server.pl line 465. 2021:07:29-12:58:13 XXX red_server[12846]: SELF: Cannot do SSL handshake on socket accept from '187.???.???.???': SSL connect accept failed because of handshake problems 2021:07:29-12:58:13 XXX red_server[12849]: 1 (eval) 2021:07:29-12:58:13 XXX red_server[12849]: 2 main::run_client 2021:07:29-12:58:13 XXX red_server[12849]: 3 main::socketListener 2021:07:29-12:58:13 XXX red_server[12849]: >========================================================================= 2021:07:29-12:58:13 XXX red_server[12849]: DIE -------------------------------------------------------- 2021:07:29-12:58:13 XXX red_server[12849]: 2021:07:29-12:58:13 XXX red_server[12849]: 1. main::__ANON__:54() red_server.pl 2021:07:29-12:58:13 XXX red_server[12849]: 2. (eval):465() red_server.pl 2021:07:29-12:58:13 XXX red_server[12849]: 3. main::run_client:464() red_server.pl 2021:07:29-12:58:13 XXX red_server[12849]: 4. main::socketListener:410() red_server.pl 2021:07:29-12:58:13 XXX red_server[12849]: 5. main::top-level:348() red_server.pl 2021:07:29-12:58:13 XXX red_server[12849]: |========================================================================= 2021:07:29-12:58:13 XXX red_server[12849]: Bad arg length for Socket::inet_ntoa, length is 0, should be 4 at red_server.pl line 465. 2021:07:29-12:58:13 XXX red_server[12849]: 2021:07:29-12:58:13 XXX red_server[12849]: 1. main::__ANON__:55() red_server.pl 2021:07:29-12:58:13 XXX red_server[12849]: 2. (eval):465() red_server.pl 2021:07:29-12:58:13 XXX red_server[12849]: 3. main::run_client:464() red_server.pl 2021:07:29-12:58:13 XXX red_server[12849]: 4. main::socketListener:410() red_server.pl 2021:07:29-12:58:13 XXX red_server[12849]: 5. main::top-level:348() red_server.pl 2021:07:29-12:58:13 XXX red_server[12849]: <========================================================================= 2021:07:29-12:58:13 XXX red_server[12849]: 1 (eval) 2021:07:29-12:58:13 XXX red_server[12849]: 2 main::run_client 2021:07:29-12:58:13 XXX red_server[12849]: 3 main::socketListener 2021:07:29-12:58:13 XXX red_server[12849]: SELF: unable to get peer address or retrieve CN for 'n/a' 2021:07:29-12:58:13 XXX red_server[12858]: SELF: Cannot do SSL handshake on socket accept from '189.???.???.???': SSL connect accept failed because of handshake problems 2021:07:29-12:58:17 XXX red_server[12852]: SELF: New connection from 189.???.??.?? with ID A32???????????17 (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:58:17 XXX red_server[12854]: SELF: New connection from 189.???.??.??? with ID A32????????????B4 (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:58:17 XXX red_server[12855]: SELF: New connection from 189.???.??.?? with ID A32?????????ED (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:58:17 XXX red_server[12860]: SELF: New connection from 138.???.??.?? with ID A35?????????AF (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:58:17 XXX red_server[12856]: SELF: New connection from 201.???.???.??? with ID A32?????????BD (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:58:17 XXX red_server[12851]: SELF: New connection from 186.???.??.??? with ID A35????????????2B (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:58:17 XXX red_server[12861]: SELF: New connection from 186.??.???.?? with ID A32??????????3C (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:58:17 XXX red_server[12857]: SELF: New connection from 187.???.???.??? with ID A32?????????6E (cipher AES256-GCM-SHA384), rev1 2021:07:29-12:58:19 XXX red_server[12740]: SELF: Cannot do SSL handshake on socket accept from '189.???.??.???': SSL wants a read first 2021:07:29-12:58:19 XXX red_server[12743]: SELF: Cannot do SSL handshake on socket accept from '187.???.???.??': SSL wants a read first 2021:07:29-12:58:19 XXX red_server[12744]: SELF: Cannot do SSL handshake on socket accept from '187.???.???.???': SSL wants a read first 2021:07:29-12:58:19 XXX red_server[12745]: SELF: Cannot do SSL handshake on socket accept from '187.???.??.???': SSL wants a read first 2021:07:29-12:58:19 XXX red_server[14061]: A35????????????A0: command 'PING 0 uplink=WAN' 2021:07:29-12:58:19 XXX red_server[13916]: A35????????????B7: PING remote_tx=0 local_rx=0 diff=0 2021:07:29-12:58:19 XXX red_server[14066]: A35????????????E6: command 'PING 0 uplink=WAN' 2021:07:29-12:58:19 XXX red_server[14506]: A35????????????D6: command 'PING 0 uplink=WAN' 2021:07:29-12:58:19 XXX red_server[29169]: A35????????????0C is disconnected. 2021:07:29-12:58:19 XXX red_server[12850]: SELF: Cannot do SSL handshake on socket accept from '187.???.???.??': SSL connect accept failed because of handshake problems 2021:07:29-12:58:19 XXX red_server[12197]: A32????????????01: No ping for 30 seconds, exiting. 2021:07:29-12:58:19 XXX red_server[32684]: A35????????????2B is disconnected. 2021:07:29-12:58:19 XXX red2ctl[5057]: Overflow happened on reds48:0 2021:07:29-12:58:19 XXX red2ctl[5057]: Overflow happened on reds55:0 2021:07:29-12:58:19 XXX red2ctl[5057]: Missing keepalive from reds55:0, disabling peer 187.???.???.?? 2021:07:29-12:58:19 XXX red2ctl[5057]: Missing keepalive from reds48:0, disabling peer 187.???.???.?? 2021:07:29-12:58:19 XXX red_server[14635]: A35????????????63 is disconnected. 2021:07:29-12:58:20 XXX red_server[12557]: A35?????????F3: No ping for 30 seconds, exiting. 2021:07:29-12:58:20 XXX red_server[12465]: R20????????????98: No ping for 30 seconds, exiting. 2021:07:29-12:58:20 XXX red_server[13916]: A35????????????B7: PONG local_tx=0
Looking the hardware logs, at the time of the "problem", the usage of CPU and RAM spikes to above 90%, but we cant identify the reason for this.
Thanks for reading and hope to get some advice from the experts
This thread was automatically locked due to age.