UTM 9.601 - RED issues!

Since upgrading all our customers to 9.601, a large share of them have been complaining about REDs disconnecting and reconnecting with no discernible pattern.

It started for all of them the very night we upgraded to 9.601, and they are all on different ISPs and located in different places around the country.

I spent 2 hours with Sophos support today, and they have now escalated the case.

Will return with an update....

Suspicious entries in the log - but all connected REDs do this before connection:

2019:03:06-15:15:38 fw01-2 red_server[17509]: SELF: Cannot do SSL handshake on socket accept from 'xxx.xxx.xxx.xxx': SSL connect accept failed because of handshake problems

2019:03:06-15:15:46 fw01-2 red2ctl[12420]: Missing keepalive from reds3:0, disabling peer xxx.xxx.xxx.xxx

I know the last line is written before the tunnel disconnects, because there was no "PING/PONG" answer...

One customer has 2 x RED 50: one is 100% stable and the other drops at random intervals. We replaced the unstable one with a new RED 50, but the same thing occurs.



  • Another spontaneous "overflow". Luckily after hours...

    2019:09:05-16:52:16 neo-2 red_server[6026]: A35xxxxxxxxxxxx: Sending json message {"data":{"seq":39230},"type":"PONG"}
    2019:09:05-16:52:47 neo-2 red_server[6026]: A35xxxxxxxxxxxx: No ping for 30 seconds, exiting.
    2019:09:05-16:52:47 neo-2 red_server[6026]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="A35xxxxxxxxxxxx" forced="0"
    2019:09:05-16:52:47 neo-2 red_server[6026]: A35xxxxxxxxxxxx is disconnected.
    2019:09:05-16:52:47 neo-2 red_server[21506]: SELF: (Re-)loading device configurations
    2019:09:05-16:52:47 neo-2 red2ctl[21514]: Overflow happened on reds2:0
    2019:09:05-16:52:47 neo-2 red2ctl[21514]: Missing keepalive from reds2:0, disabling peer 195.xxxxxxxx
    2019:09:05-16:52:53 neo-2 red2ctl[21514]: Received keepalive from reds2:0, enabling peer 195.xxxxxxxx
    2019:09:05-16:56:23 neo-2 red2ctl[21514]: Missing keepalive from reds2:0, disabling peer 195.xxxxxxxx
    2019:09:05-17:04:25 neo-2 red_server[21506]: SELF: (Re-)loading device configurations
    2019:09:05-17:08:19 neo-2 red_server[2455]: SELF: Cannot do SSL handshake on socket accept from '195.xxxxxxxx': SSL connect accept failed because of handshake problems

     

    We'll see if and when it comes back up...
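For anyone who wants to quantify flaps like these, here is a minimal sketch that pairs each red2ctl "Missing keepalive" line with the next "Received keepalive" line for the same interface. The timestamp and message layout are taken from the excerpt above; the function names are hypothetical, and this is not a Sophos tool.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt above, e.g.
# 2019:09:05-16:52:47 neo-2 red2ctl[21514]: Missing keepalive from reds2:0, disabling peer ...
KEEPALIVE_RE = re.compile(
    r"^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) \S+ red2ctl\[\d+\]: "
    r"(?P<kind>Missing|Received) keepalive from (?P<iface>\S+),"
)


def parse_ts(text):
    """Parse the UTM log timestamp format seen in the excerpt."""
    return datetime.strptime(text, "%Y:%m:%d-%H:%M:%S")


def keepalive_gaps(lines):
    """Return (interface, down_at, up_at) tuples; up_at is None if still down."""
    down_since = {}
    gaps = []
    for line in lines:
        match = KEEPALIVE_RE.match(line.strip())
        if not match:
            continue  # ignore everything that is not a keepalive line
        ts = parse_ts(match["ts"])
        iface = match["iface"]
        if match["kind"] == "Missing":
            # Keep the first "Missing" timestamp if several arrive in a row.
            down_since.setdefault(iface, ts)
        elif iface in down_since:
            gaps.append((iface, down_since.pop(iface), ts))
    # Interfaces that never reported a keepalive again are still down.
    for iface, ts in down_since.items():
        gaps.append((iface, ts, None))
    return gaps


SAMPLE = [
    "2019:09:05-16:52:47 neo-2 red2ctl[21514]: Overflow happened on reds2:0",
    "2019:09:05-16:52:47 neo-2 red2ctl[21514]: Missing keepalive from reds2:0, disabling peer 195.xxxxxxxx",
    "2019:09:05-16:52:53 neo-2 red2ctl[21514]: Received keepalive from reds2:0, enabling peer 195.xxxxxxxx",
    "2019:09:05-16:56:23 neo-2 red2ctl[21514]: Missing keepalive from reds2:0, disabling peer 195.xxxxxxxx",
]

for iface, down_at, up_at in keepalive_gaps(SAMPLE):
    length = "still down" if up_at is None else f"{(up_at - down_at).total_seconds():.0f}s"
    print(iface, down_at.time(), length)
```

On the sample lines this reports one 6-second gap followed by a gap that never closes, which matches what the log shows.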

  • I assume this is not related to the firmware, because overflows and similar messages have been part of the RED protocol for a long time and are most likely related to ISP issues.

    The RED protocol relies heavily on bursts of UDP data,
    and some ISPs do not handle that well in some scenarios.

    It looks like when the RED fires off some data, it can actually crash.

    If the SG/XG receives too little UDP data for a short time and then all of that data at once, such overflows happen. Maybe you should check with the ISP (restart the router, ask the ISP to reset the line, and so on).

    This applies especially if you have only one RED connected. Try to get another RED and check whether this also happens with another RED or at another location.

    If yes, it could be related to the ISP on your local SG side; if not, it could be related to the ISP on the RED side.
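To illustrate why the same amount of data can be harmless when it arrives evenly and cause overflows when it arrives in one burst, here is a toy model of a bounded receive buffer. It is only a sketch of the general idea, not how red2ctl actually works; the function name, the capacity, and the drain rate are all made-up values.

```python
def count_overflows(arrivals, capacity, drain_per_tick):
    """Toy bounded buffer: return how many packets are dropped.

    arrivals       -- packets arriving in each tick (hypothetical numbers)
    capacity       -- maximum packets the buffer can hold (assumed)
    drain_per_tick -- packets the receiver processes per tick (assumed)
    """
    queued = 0
    dropped = 0
    for arriving in arrivals:
        space = capacity - queued
        accepted = min(arriving, space)
        dropped += arriving - accepted      # anything beyond free space is lost
        queued += accepted
        queued -= min(queued, drain_per_tick)  # receiver drains what it can
    return dropped


# Same total of 24 packets, delivered evenly versus all at once.
steady = [4, 4, 4, 4, 4, 4]
bursty = [0, 0, 0, 0, 0, 24]
print("steady drops:", count_overflows(steady, capacity=8, drain_per_tick=4))
print("bursty drops:", count_overflows(bursty, capacity=8, drain_per_tick=4))
```

With these made-up numbers the steady stream loses nothing, while the burst overruns the buffer even though the total is identical.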

     


  • Hi both,

    Would it be possible to please raise a support case (if you haven't raised one already) and PM me with your case IDs?

    It seems that both of your issues are related to RED 15s and UTM v9.605.

    I would like to follow up so that further investigation can be performed.

    Regards,


    Florentino
    Director, Global Community & Digital Support

  • Hey, no, we haven't raised one yet, because I have to go via our partner first, and I think the problem is the same no matter which RED device is involved.

    We had the issues before with a RED10 and two RED50s. One of the RED50s is bricked now; the others just won't come back online. Then we bought 3 new RED15s and replaced them all in advance, because we need a working connection to our remote offices.

    Now the same problems are back. This is very frustrating! As I have already written, the next RED15 device stopped working today after being up for 9 days. I tried the workarounds again, but it does not come back online. I checked the router in the office, and it says the RED15 is no longer connected. I rebooted the router, and later the Ethernet connection from the RED15 to the router came back, but the RED is not connecting to our UTM. Finally I deleted the RED15 in the UTM and added it again after a 10-minute break, but it is still offline. The last one, yesterday, needed 3 hours to reconnect after I added it back to our SG310 Rev. 2.

    We replaced the SG310 under the Hardware Refresh Program on 7 August 2019. The new SG310 came with 9.509-3 installed out of the box. We updated it directly to 9.605-1 and then restored our backup from the SG310 Rev. 1, because we were already running 9.605-1 there.

    I'm glad there is no one at the office until Monday, but this is truly not a situation we want to deal with, or can deal with.

    We have been using these three RED devices for more than 5 years, and we never had any issues. But now the RED devices are unreliable for us, and both the management at the remote offices and the board at headquarters are very unhappy with this situation.

    Now we need to find an alternative, maybe an SG115 with Network Protection in RED mode or an SG115 with FullGuard working as a standalone firewall. But who is paying the additional costs, and who is paying for the time we have already spent investigating all these issues and dealing with workarounds that sometimes work and sometimes don't? And what about the brand new RED15 devices? I can't return them because they are used now, so I would end up with three paperweights!

    This thread started in March 2019, and it's quite a pity that half a year later the problem is still not fixed! The update to version 9.605-1 has also been out for several weeks now, and there is no information from Sophos about what is going on here or when a fix or a new version will be available.

  • The first location was down 17 hours and 47 minutes

    The second location was down 17 hours and 55 minutes

    It came back online this morning at 08:06.

    Now at 08:20 I just received a mail from our monitoring the third location is down!!!!

    Slowly but surely I'm coming to think this is a bigger issue with the provisioning service / servers, because when I make changes the RED does not reconnect immediately. Even when I delete the RED and add it back again, there is no activity shown, just the usual upload config and staging messages.

    Let's see if the third location will be back in 17 hours....

    To be continued....

  • I would strongly suggest starting to debug your UTM connection.

    Start by capturing the traffic on ports 3400 and 3410 to a file (ring buffer) and extracting it.

    Then analyse the time frame in which the connection drops and does not come back.
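One possible way to set up such a ring-buffer capture is tcpdump with its `-C` (file size, in millions of bytes) and `-W` (number of files) options, which together rotate through a fixed set of files. The sketch below only builds the command line so it can be reviewed before it is run; the interface, file size, file count, and output path are assumptions to adapt, and the helper name is hypothetical.

```python
import shlex


def ring_buffer_capture_cmd(interface="any", ports=(3400, 3410),
                            file_mb=50, file_count=10,
                            out_path="/tmp/red_capture.pcap"):
    """Build a tcpdump argument list for a rotating capture of the RED ports.

    All defaults are assumptions: adjust interface, sizes, and path as needed.
    """
    port_filter = " or ".join(f"port {p}" for p in ports)
    return [
        "tcpdump", "-n",          # -n: do not resolve names
        "-i", interface,          # capture interface
        "-C", str(file_mb),       # rotate after this many million bytes
        "-W", str(file_count),    # keep at most this many files (ring buffer)
        "-w", out_path,           # base name; tcpdump appends a file number
        port_filter,              # matches both TCP and UDP on these ports
    ]


# Print the command for review; run it manually (as root) once it looks right.
print(shlex.join(ring_buffer_capture_cmd()))
```

Once a drop has occurred, the rotated capture files can be copied off the box and the time frame around the disconnect opened in a tool such as Wireshark.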

     

    My guess is that, as mentioned before, this is not related to the general RED issue in UTM 9.6.

    https://community.sophos.com/kb/en-us/134398

    Maybe you should open a separate thread and post your output there.


  • I can also confirm that we were able to solve overflow issues on a RED15W by disabling the unified firmware. The customer had recurring interruptions. This really has nothing to do with ISPs... It's the really bad quality of your software, which isn't tested sufficiently and causes so many headaches for so many customers... You should really think about rolling out security and feature updates separately, so that customers could still meet their security needs while installing features later...

  • And what we get now after disabling the unified firmware is this message:

     

    [INFO-184] RED server not running - restarted

     

    It appears every hour...

  • Plan a reboot; that fixed the issue here.

    If you run HA, reboot both nodes.

    -----

    Best regards
    Martin


  • Thank you, we will try that!

  • Another 10 days later, the first location with a RED15 is offline again:

    2019:09:15-21:56:48 vpn red_server[5533]: RED15_LOC1: No ping for 30 seconds, exiting.
    2019:09:15-21:56:48 vpn red_server[5533]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="RED15_LOC1" forced="0"
    2019:09:15-21:56:48 vpn red_server[5533]: RED15_LOC1 is disconnected.
    2019:09:15-21:56:48 vpn red_server[4919]: SELF: (Re-)loading device configurations
    2019:09:15-21:56:49 vpn red2ctl[4930]: Overflow happened on reds1:0
    2019:09:15-21:56:49 vpn red2ctl[4930]: Missing keepalive from reds1:0, disabling peer x.x.x.x

    And still no update from Sophos about a fix!

    I will disable the unified firmware tonight and reboot the UTM.
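To keep track of how often each RED drops, the structured `key="value"` fields on the "RED Tunnel Down" lines can be counted per `red_id`. This is a minimal sketch that assumes the field layout shown in the excerpt above; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches key="value" pairs such as name="RED Tunnel Down" or red_id="RED15_LOC1".
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')


def tunnel_down_counts(lines):
    """Count 'RED Tunnel Down' events per red_id in the given log lines."""
    counts = Counter()
    for line in lines:
        fields = dict(FIELD_RE.findall(line))
        if fields.get("name") == "RED Tunnel Down":
            counts[fields.get("red_id", "unknown")] += 1
    return counts


SAMPLE = [
    '2019:09:15-21:56:48 vpn red_server[5533]: RED15_LOC1: No ping for 30 seconds, exiting.',
    '2019:09:15-21:56:48 vpn red_server[5533]: id="4202" severity="info" sys="System" '
    'sub="RED" name="RED Tunnel Down" red_id="RED15_LOC1" forced="0"',
    '2019:09:15-21:56:48 vpn red_server[5533]: RED15_LOC1 is disconnected.',
]

for red_id, count in tunnel_down_counts(SAMPLE).items():
    print(red_id, count)
```

Fed with a full log file (for example via `open(path)`), the same function gives a per-device tally that can be compared before and after disabling the unified firmware.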
