UTM 9.601 - RED issues!

Since upgrading all our customers to 9.601, a large share of them have been complaining about REDs disconnecting and reconnecting with no discernible pattern.

It started for all of them the very night we upgraded to 9.601, and they are all on different ISPs and located in different places around the country.

I spent 2 hours with Sophos support today, and they have now escalated it to a higher level.

Will return with an update....

Suspicious entries in the log - but all connected REDs do this before connection:

2019:03:06-15:15:38 fw01-2 red_server[17509]: SELF: Cannot do SSL handshake on socket accept from 'xxx.xxx.xxx.xxx': SSL connect accept failed because of handshake problems

2019:03:06-15:15:46 fw01-2 red2ctl[12420]: Missing keepalive from reds3:0, disabling peer xxx.xxx.xxx.xxx

I know the last line is written before the tunnel disconnects, because there was no "PING/PONG" answer...
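For anyone who wants to count how often this happens, here is a minimal sketch that splits log lines like the two above into fields and tags the two message types. It assumes the log has been exported to a workstation with Python 3; `parse_line` and `classify` are hypothetical helper names, not part of any Sophos tool.

```python
import re
from datetime import datetime

# Prefix layout taken from the quoted lines: "<timestamp> <host> <process>[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<proc>[\w.-]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), "%Y:%m:%d-%H:%M:%S"),
        "host": m.group("host"),
        "proc": m.group("proc"),
        "pid": int(m.group("pid")),
        "msg": m.group("msg"),
    }


def classify(msg):
    """Tag the message types discussed in this thread; everything else is 'other'."""
    if "Cannot do SSL handshake" in msg:
        return "ssl_handshake_failed"
    if msg.startswith("Missing keepalive"):
        return "keepalive_missing"
    if msg.startswith("Received keepalive"):
        return "keepalive_received"
    return "other"
```

Feeding every line of an exported log through `parse_line` and counting the `classify` results per hour should show whether the drops really have no pattern.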

One customer has 2 x RED 50: one is 100% stable and the other fluctuates at random intervals. We replaced the unstable one with a new RED 50, but the same thing occurs.

  • In reply to Fabio Giacobbe:

    Fabio Giacobbe

    Hi Jan,

    but is it possible to start an RMA procedure without a maintenance contract?

    thanks

    fabio

    If you have a license for the REDs (Network Protection), you should be covered ;-)

  • In reply to JanWeber:

    Jan, I think your first B. should be to 9.604, not 9.605.  See my post above and my latest PM to you.

    Cheers - Bob

  • In reply to Fabio Giacobbe:

    You are correct, Fabio, that the standard rule is that there's a 1-year warranty on REDs connected to UTMs.  I think that given that the problem was most likely caused by an Up2Date, Sophos might go ahead and replace the RED.

    If it turns out that you can't get a free replacement, my recommendation is to replace a RED 50 with an SG 115 with a Network Protection subscription.  That will give you more flexibility and will cost less over time than a RED 50 with Warranty Extensions.  You can configure a RED tunnel in your main office UTM and just replace the reds# in your existing Interface definition with the new one.  Depending on your present configuration, there might be very little needed to configure the new SG 115.

    Please let us know what you tried and the results.

    Cheers - Bob

  • In reply to BAlfson:

    Hi Bob,

    I did not actually receive a PM from you, but anyway, the first B is 9.605. In this scenario, given that the REDs are not running the unified firmware prior to the update and are not connected during the update, they will not receive a faulty unified firmware, only the fixed unified firmware of 9.605, so they will not run into the problem. Setting the unified firmware to 0 is actually not necessary in this case.

    The disabling of the REDs is done to prevent them from receiving a faulty firmware in the update process; once on 9.605, that is not a problem anymore.

    Jan

  • In reply to JanWeber:

    Sorry, Jan, I don't see what I'm not understanding, but I can't reconcile your last post with:


    I just read your response to my PM, and my confusion remains.

    Cheers - Bob

  • Sorry,

     

    but this is a very annoying issue! And I think nothing has been done by Sophos. On Friday we upgraded one of our SG105s with 1 RED50 to 9.605-1. Now the RED50 doesn't boot, and no workaround helps. It is bullshit, sorry.

    Now I am doing an RMA, and one of our offices has been down since Friday.

    Greetings,

     

    Ingo 

  • Hey everyone,

     

    I am still not sure what to do to ensure that our RED will not brick.

    Currently, our UTM is running 9.603-1. We have one RED50 (in total 5 RED devices) at this UTM. We never set the use_unified_firmware to 0, we managed to get the RED working with MTU 1400 and had no problem since. 

    In the KB article, I read that the problem with destroying the RED will only occur if we want to upgrade TO 9.6 - which we are already on. 

    Here in this thread there are instructions for the update to 9.604 and 9.605 that seem to be slightly different than the KB I found (https://community.sophos.com/kb/en-us/134398).

     

    So quick question: coming from 9.603-1, are the instructions in this thread the correct way to do the upgrade? And will the use_unified_firmware value stay 0 from now on, or do I have to change it later?

     

    Thanks for clarification!

     

    Regards,

     

    Tobias

  • In reply to _Tobias_:

    Just reporting that the MTU 1400 'trick' worked today for one of my remote sites that died late last week after 9.605-1 was installed on the Head Office SG230 (no issues with RED Connectivity prior to this). I found this thread in my troubleshooting after reading the Log Messages. Keeping an eye on it for stability for the rest of today (Monday). 

    So, at this stage, the only permanent fix is to SSH in and disable the unified firmware?

  • In reply to Dread:

    Nope - back down again ...

    2019:08:12-12:09:51 FW-SG230 red_server[13121]: xxidherexx: No ping for 30 seconds, exiting.
    2019:08:12-12:09:51 FW-SG230 red_server[13121]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="xxidherexx" forced="0"
    2019:08:12-12:09:51 FW-SG230 red_server[13121]: xxidherexx is disconnected.
    2019:08:12-12:09:51 FW-SG230 red_server[4659]: SELF: (Re-)loading device configurations
    2019:08:12-12:09:51 FW-SG230 red2ctl[4671]: Overflow happened on reds1:0
    2019:08:12-12:09:51 FW-SG230 red2ctl[4671]: Missing keepalive from reds1:0, disabling peer x.x.x.x
    2019:08:12-12:09:54 FW-SG230 red2ctl[4671]: Received keepalive from reds1:0, enabling peer x.x.x.x
    2019:08:12-12:11:00 FW-SG230 red2ctl[4671]: Missing keepalive from reds1:0, disabling peer x.x.x.x
    2019:08:12-12:11:40 FW-SG230 red_server[4659]: SELF: (Re-)loading device configurations
    2019:08:12-12:16:22 FW-SG230 red_server[6516]: SELF: Cannot do SSL handshake on socket accept from 'x.x.x.x': SSL connect accept failed because of handshake problems
    2019:08:12-12:16:22 FW-SG230 red_server[6526]: SELF: Cannot do SSL handshake on socket accept from 'x.x.x.x': SSL connect accept failed because of handshake problems
    2019:08:12-12:19:32 FW-SG230 red_server[7278]: SELF: Cannot do SSL handshake on socket accept from 'x.x.x.x': SSL wants a read first
    2019:08:12-12:41:54 FW-SG230 red_server[12970]: SELF: Cannot do SSL handshake on socket accept from 'x.x.x.x': SSL connect accept failed because of handshake problems
    2019:08:12-12:41:54 FW-SG230 red_server[12971]: SELF: Cannot do SSL handshake on socket accept from 'x.x.x.x': SSL connect accept failed because of handshake problems
     
    Going to SSH in and see if disabling the unified firmware will fix it now ...
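To put numbers on the flapping in an excerpt like this, here is a minimal sketch that pairs each "Missing keepalive" with the next "Received keepalive" on the same interface. It assumes the log has been copied to a machine with Python 3; `keepalive_outages` is a hypothetical helper name, and the pattern is taken only from the red2ctl lines quoted above.

```python
import re
from datetime import datetime

# Matches the red2ctl keepalive lines quoted above, e.g.
# "... red2ctl[4671]: Missing keepalive from reds1:0, disabling peer x.x.x.x"
KEEPALIVE_RE = re.compile(
    r"^(?P<ts>\S+) \S+ red2ctl\[\d+\]: "
    r"(?P<kind>Missing|Received) keepalive from (?P<iface>[^,]+),"
)


def keepalive_outages(lines):
    """Return (iface, down_at, up_at) tuples; up_at is None if the log ends while down."""
    down_since = {}
    outages = []
    for line in lines:
        m = KEEPALIVE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y:%m:%d-%H:%M:%S")
        iface = m.group("iface")
        if m.group("kind") == "Missing":
            # Keep the first "Missing" timestamp if several arrive in a row.
            down_since.setdefault(iface, ts)
        elif iface in down_since:
            outages.append((iface, down_since.pop(iface), ts))
    # Anything still marked down never recovered within this excerpt.
    for iface, ts in down_since.items():
        outages.append((iface, ts, None))
    return outages
```

Run against the excerpt above, this would report one 3-second blip on reds1:0 at 12:09:51 and an outage starting at 12:11:00 that never recovers within the excerpt.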
  • Hi All,

     

    I just thought I would do some diagnosing after I was getting the reboot loop on a RED50. Here is what I found out.

    After replacing the customer's RED50 via an RMA, I took the 'dead' RED50 back to my office to see if I could replicate the issue and understand it better.

    I found that the source port for the initial communications was TCP/3400, where I was expecting this to be the destination port, although this may be a red herring.

    I then went through some more basic checks and also found that I was connecting using an FQDN. I changed this to a public IP address, and that was the factor that enabled it to connect every time, without problems.

    I only provide this as a way to (possibly) fix the issue.

    Of course my issue may be different to others.

     

    - Update: I have just tried this on an XG and although it does load a new firmware to the RED50, I can provision it with either FQDN or IP Address, so it looks like Sophos forgot to add DNS resolution to the UTM config. 
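If you want to rule out plain DNS trouble before switching to an IP, here is a minimal sketch that checks what an FQDN resolves to from an ordinary machine at the remote site. It only shows what that machine's resolver returns and says nothing about how the RED firmware itself resolves names. `resolve_host` and the hostname `utm.example.com` are placeholders; port 3400 is the TCP port mentioned above.

```python
import socket


def resolve_host(hostname, port=3400, resolver=socket.getaddrinfo):
    """Return the sorted list of addresses the name resolves to, or [] on failure."""
    try:
        infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Placeholder name: replace with the FQDN configured as the UTM hostname for the RED.
    print(resolve_host("utm.example.com"))
```

An empty list, or an address that is not the UTM's current public IP, would be a reason to try the IP address instead of the FQDN.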

  • In reply to Argo:

    Interesting, Argo!

    I have a client whose RED 15 was (seemingly) killed on 11 August by the 9.604-to-9.605 Up2Date.  When the replacement also wouldn't connect, I asked to work with someone onsite at the remote office 400 miles away from my usual interlocutor for this client.  I was suspicious that they had upgraded their service a month before the RED 15 stopped working and that their ISP had given instructions to another person in that office on setting a fixed public IP so that he could connect over the Internet without a functional RED.  I asked the guy to try getting a public IP on a laptop connected directly to the ISP's modem.  The laptop couldn't get an IP, so I asked the guy to call the ISP and have them enable DHCP for their connection.  Bingo!  The RED 15 came online as soon as the ISP flipped the switch.

    It turns out that a RED needs DHCP when it first downloads its configuration from the cloud, but it's not necessary after that.  This is why the original RED 15 was unaffected by the loss of DHCP on their connection.

    I'm having the original RED 15 shipped to me to examine.  My theory is that the firmware upgrade in the 9.604-to-9.605 Up2Date left the RED in an unconfigured state - making it require DHCP to get its configuration.  I expect to receive the device Monday or Tuesday and will report back here as well as to Sophos Support.

    Cheers - Bob

  • After the disastrous 9.605-1 update, we have now had to replace our 2 RED50s and 1 RED10 with RED15s.

    The RED15 that was installed first ran without any issues for more than 2 weeks on 9.605-1. Today it just stopped working:

    2019:08:27-12:27:16 vpn red_server[5074]: RED15-STOPPED-WORKING: command '{"data":{"seq":47539},"type":"PING"}'
    2019:08:27-12:27:16 vpn red_server[5074]: RED15-STOPPED-WORKING: Sending json message {"data":{"seq":47539},"type":"PONG"}
    2019:08:27-12:27:23 vpn red_server[5074]: RED15-STOPPED-WORKING: command '{"data":{"key_active":1,"key0":"OMUxKkof9EVz\/7BOjAYp7uCcsa5ybLsx9g2pZ7+jlVk="},"type":"SET_KEY_REQ"}'
    2019:08:27-12:27:23 vpn red_server[5074]: RED15-STOPPED-WORKING: Sending json message {"data":{},"type":"SET_KEY_REP"}
    2019:08:27-12:27:47 vpn red_server[5074]: RED15-STOPPED-WORKING: No ping for 30 seconds, exiting.
    2019:08:27-12:27:47 vpn red_server[5074]: id="4202" severity="info" sys="System" sub="RED" name="RED Tunnel Down" red_id="RED15-STOPPED-WORKING" forced="0"
    2019:08:27-12:27:47 vpn red_server[5074]: RED15-STOPPED-WORKING is disconnected.
    2019:08:27-12:27:47 vpn red_server[6966]: SELF: (Re-)loading device configurations
    2019:08:27-12:27:49 vpn red2ctl[4938]: Overflow happened on reds3:0
    2019:08:27-12:27:49 vpn red2ctl[4938]: Missing keepalive from reds3:0, disabling peer EXTERNAL-IP-REMOTE
    2019:08:27-12:27:52 vpn red2ctl[4938]: Received keepalive from reds3:0, enabling peer EXTERNAL-IP-REMOTE
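To see exactly when the RED went quiet, here is a minimal sketch that finds the last PING received before the "No ping for 30 seconds" line and the gap between the two. It assumes the log has been exported to a machine with Python 3; `last_ping_before_timeout` is a hypothetical helper name, and the patterns come only from the red_server lines quoted above.

```python
import json
import re
from datetime import datetime

# "<ts> <host> red_server[<pid>]: <red_id>: command '<json>'"
CMD_RE = re.compile(
    r"^(?P<ts>\S+) \S+ red_server\[\d+\]: (?P<red>[^:]+): command '(?P<json>.*)'$"
)
# "<ts> <host> red_server[<pid>]: <red_id>: No ping for 30 seconds, exiting."
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\S+) \S+ red_server\[\d+\]: (?P<red>[^:]+): No ping for (?P<secs>\d+) seconds"
)


def parse_ts(text):
    return datetime.strptime(text, "%Y:%m:%d-%H:%M:%S")


def last_ping_before_timeout(lines):
    """Return the seq of the last PING and the seconds until the timeout line, or None."""
    last_ping = None
    for line in lines:
        line = line.strip()
        m = CMD_RE.match(line)
        if m:
            payload = json.loads(m.group("json"))
            if payload.get("type") == "PING":
                last_ping = (parse_ts(m.group("ts")), payload["data"]["seq"])
            continue
        m = TIMEOUT_RE.match(line)
        if m and last_ping is not None:
            gap = (parse_ts(m.group("ts")) - last_ping[0]).total_seconds()
            return {"seq": last_ping[1], "gap_seconds": gap}
    return None
```

For the excerpt above, the last PING is seq 47539 at 12:27:16 and the timeout line follows 31 seconds later, so the RED simply stopped sending PINGs after the key exchange at 12:27:23.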


    In the meantime we have already tried everything, first with our old RED50/RED10 and now with the new RED15:

    - switching to the public IP instead of the FQDN (done 2 weeks ago already)
    - disabling tunnel compression
    - setting MTU 1400
    - disabling the RED, waiting 5 minutes and enabling it again
    - removing the RED and adding it again
    - switching from a static RED WAN IP to DHCP

    It is not reconnecting anymore. Checking the DSL modem at the office, the RED is not even asking for an IP address via DHCP??


    After adding the RED15 back to our SG310 Rev. 2, this is all that happens:

    2019:08:27-21:25:27 vpn red_server[6966]: SELF: (Re-)loading device configurations
    2019:08:27-21:25:29 vpn red_server[6966]: SELF: (Re-)loading device configurations
    2019:08:27-21:25:29 vpn red_server[6966]: RED15-STOPPED-WORKING: New device
    2019:08:27-21:25:29 vpn red_server[6966]: RED15-STOPPED-WORKING: Staging config for upload
    2019:08:27-21:25:29 vpn red_server[6966]: SELF: (Re-)loading device configurations
    2019:08:27-21:25:31 vpn red_server[7212]: RED15-STOPPED-WORKING Uploaded config to registry service

     

    I also checked the Up2Date FTP server for a possible new fix. It seems Sophos has now cancelled the rollout of 9.605-1, because you can't download it manually.

    Now we have to go to the remote office again and check the RED :-(

    I am pretty angry about these ongoing issues. These are real time wasters and we don't have time for this. I'm glad the office is currently on holiday and no one is there.

    Any news on a new firmware update for UTM?

    Regards

    Peter

  • In reply to Peter Riederer:

    This looks like the same problem I have with one of our RED15s. The device simply stops working. Only the first two LEDs are green. A reboot of the RED solves the problem for a few days, up to about 10 days.

    At the moment I am trying MTU 1400; result pending.

    It's not very good to have these problems with devices in remote locations, but who am I telling that...

    Best regards

    Alex

  • Hi all, 

     

    We have been experiencing this problem with 2 separate RED15s intermittently going bye-bye ever since 9.601. The temporary workarounds of setting the MTU to 1400, as well as removing and re-adding the RED in the clustered SG230 to force a clean restart have kept us up and running so far. 

    I have been monitoring this thread for close to six months in the hope that the problem would be cleared up in a subsequent update. Unfortunately, this does not seem to be the case so far. We have been keeping the SG230 at 9.601 to not risk any further damage or different issues. Judging by the reports here, that seems to be a wise decision, but I do not like keeping firmware this far behind, and am getting very concerned as to whether Sophos will be able to fix the problem at all. If anyone from Sophos is reading this: I am sure we would all appreciate an official update regarding the issue!

     

    Best, OliverW8

  • In reply to Peter Riederer:

    After deleting the RED15 and adding it again while setting the interface to MTU 1400, the RED came back by itself about 6 hours later!