UTM 9.601 - RED issues!

Question

Since upgrading all our customers to 9.601, a large share of them are complaining about REDs disconnecting and reconnecting with no discernible pattern. 
 It started for all of them on the very night we upgraded to 9.601, and they are all on different ISPs and located in different places around the country. 
 I spent 2 hours with Sophos Support today, and they have now escalated the case. 
 Will return with an update. 
 Suspicious entries in the log (although all connected REDs log these before connecting): 
 2019:03:06-15:15:38 fw01-2 red_server[17509]: SELF: Cannot do SSL handshake on socket accept from 'xxx.xxx.xxx.xxx': SSL connect accept failed because of handshake problems 
 2019:03:06-15:15:46 fw01-2 red2ctl[12420]: Missing keepalive from reds3:0, disabling peer xxx.xxx.xxx.xxx 
 I know the last line is written just before the tunnel disconnects, because there was no "PING/PONG" answer. 
 One customer has 2 x RED 50: one is 100% stable and the other drops at random intervals. We replaced the unstable unit with a new RED 50, but the same thing occurs.
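
To see whether the disconnects really are pattern-free, the two log messages quoted above can be counted per RED. The following is a minimal sketch, not a Sophos tool: the regular expressions are modeled only on the two lines shown in this post, and the function name `summarize_red_log` is hypothetical. Other message variants may exist and would need their own patterns.

```python
import re
from collections import Counter

# Patterns modeled on the two log lines quoted in the question.
# Assumption: the real log may contain other variants not covered here.
HANDSHAKE_RE = re.compile(
    r"red_server\[\d+\]: SELF: Cannot do SSL handshake on socket accept from '([^']+)'"
)
KEEPALIVE_RE = re.compile(
    r"red2ctl\[\d+\]: Missing keepalive from (\S+), disabling peer (\S+)"
)


def summarize_red_log(lines):
    """Count handshake failures per source address and missed keepalives per RED interface."""
    handshakes = Counter()
    keepalives = Counter()
    for line in lines:
        match = HANDSHAKE_RE.search(line)
        if match:
            handshakes[match.group(1)] += 1
            continue
        match = KEEPALIVE_RE.search(line)
        if match:
            keepalives[match.group(1)] += 1
    return handshakes, keepalives
```

Feeding it the lines of the RED log (for example `summarize_red_log(open(path))`, where `path` is wherever the log was exported to) gives one counter per message type, which makes it easier to compare a stable RED with an unstable one.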

William Fraley · Accepted Answer

My problem is resolved. There is a known issue related to unified firmware. 
 From a root shell (su -), run: 
 cc get red use_unified_firmware 
 If the value returned is 1, run: 
 cc set red use_unified_firmware 0 
 The REDs will update and reboot. 
 Confirm the value is now 0 by rerunning the get command above. 
 
 NOT A PERMANENT FIX. The issue needs to be addressed in Sophos UTM firmware permanently.

JanWeber · Answer

Hi All, 
 
 we have finally found and addressed the root cause of the RED50 failures that we have been seeing. The just-released UTM 9.702 https://community.sophos.com/products/unified-threat-management/b/blog/posts/utm-up2date-9-702-released contains a fixed firmware for RED50 that will resolve these issues. Updating to this firmware will prevent RED50 units from running into this issue in the future and can be applied online for any RED50. 
 More details are available in this KBA https://community.sophos.com/kb/en-us/135240 
 
 Jan

LuCar Toni · Answer

https://community.sophos.com/kb/en-us/134398

BAlfson · Answer

I just got this from Sophos Support: 
 1. Are any other REDs in danger of being bricked, or is it just the RED 50? - So far we have only seen RED 50, but this doesn't rule out RED 15. 
 2. In High Availability, does use_unified_firmware need to be set to 0 on all nodes? - No, it would replicate the command. 
 3. Instead of physically unplugging REDs, would it suffice to disable the RED server objects in WebAdmin before applying the Up2Date and then enable them after 9.604 has had use_unified_firmware set to 0? - In theory yes. If you disable the service in the UTM or turn off the RED devices, then they won't be able to get the firmware update, so they won't be able to contact the server, and once you re-enable the services they will not search for a new firmware update. 
 Cheers - Bob

JanWeber · Answer

Hi All, 
 The new unified RED firmware included in UTM 9.605 includes a fix for the issue which some of you have reported when upgrading the firmware on the RED 50. 
 However, you should be aware that there is still the possibility that the issue will occur during the update to 9.605 if the RED 50 has the older firmware installed. This will not occur with further firmware updates to your system in the future and unfortunately, can also not be bypassed by using the previous workaround of disabling the unified firmware. This is due to the fact that the issue is within the firmware update process of the old firmware. 
 During our tests, we were only able to reproduce this issue with RED 50 devices which are under significant load. As this is a race condition issue, we cannot guarantee that you will not run into this issue again if you have experienced it before, nor can we predict, if it will occur in any particular scenario. However, the issue is certainly less likely to occur in scenarios where the RED 50 is not under load during the update process, and so, if possible, we would advise that you disconnect the local network behind the RED during the update. 
 We apologize for the inconvenience this issue has caused. 
 Jan

_Tobias_ · Answer

Hi everyone, 
 
 we have exactly the same behaviour with one of our RED50s, and it started right after the update to 9.601-5. Strangely, we saw that the disconnects only happened during office hours; apparently the RED was stable without much traffic. We found a different and hopefully supported workaround, although I do not understand why it works. 
 
 Someone told us that sometimes the RED does not get the new firmware correctly, and you would have to force a firmware update. Either by changing some settings on the RED, forcing a reboot, or - his suggestion - changing the MTU on the RED interface to 1400 until the firmware is downloaded. 
 I did that. Our RED has two interfaces in use with different VLAN configurations (one for data, one for the VoIP network). I set the MTU on the data interface to 1400, and there were no more disconnects. I thought I could then go back to the default, so I switched the MTU back, and the disconnects started again. I then tried different MTU values down to 1450, without success. For a week now we have been at 1400, with not one disconnect since. 
 
 Maybe this is not related to the RED bug at all, but it did start with the update. I will see once the patch is applied: if I still cannot go back to the default MTU, there was another problem. I still wanted to share this, since it is easier to change the MTU than to change something on the console. 
 
 Regards, 
 
 Tobias

RedVision81 · Answer

An alternative would be an SG1xx in RED mode, but you need a Network Subscription for that, if I am right. 
 But this is more expensive than a RED15 or RED50, because currently you can get a RED15 for 250 € (gross), while an SG105, for example, starts at 370 € (gross) without subscription.

twister5800 · Answer

Hi all, 
 
 Sorry for the delay, I have been on vacation for a week - my nerves could not stand it anymore ;-) 
 I also did the "cc set red use_unified_firmware 0" before I left, and can confirm it solved ALL MY ISSUES. 
 I had one customer with two RED 50s, one very unstable and the other completely offline. We set up temporary SG115s with IPsec just to keep the customer running. 
 After I disabled the new unified firmware, both RED 50s are back and 100% stable! 
 Sophos Support claims that there are no issues with this, but please keep referring to this community thread, so they can see that there actually are problems. 
 I enabled RED debugging with support and inserted a USB key for debug logging into the RED 50, but nothing important was shown. 
 We have the unified firmware enabled at several other customers who have no issues with it, so it's odd. I think it looks like some TTL or IPS issue with the different ISPs. 
 EDIT: 
 Some other issues have been located, and it seems like Sophos is looking into it: 
 community.sophos.com/.../9-601-5-update-killed-red-50-home-site-dns-resolution

N K · Answer

Workaround for me: use just one tunnel between the RED and the SGxxx. In a redundant configuration we had the same problem. But it depends on the ISP at the RED sites. With some ISPs we can use redundant tunnels; with other ISPs we had to use only one tunnel! 
 good luck

Argo · Answer

Hi All, 
 
 I just thought I would do some diagnosing after I was getting the reboot loop on a RED50. Here is what I found out. 
 After replacing the customer's RED50 under RMA, I took the 'dead' RED50 back to my office to see if I could replicate the issue and understand it better. 
 I found that the source port for the initial communications was TCP/3400, whereas I was expecting that to be the destination port, although this may be a red herring. 
 I then went through some more basic checks and found that I had configured the connection using an FQDN. I changed this to a public IP address, and this was the factor that enabled it to connect every time, without problem. 
 I only provide this as a way to (possibly) fix the issue. 
 Of course my issue may be different to others. 
 
 - Update: I have just tried this on an XG and although it does load a new firmware to the RED50, I can provision it with either FQDN or IP Address, so it looks like Sophos forgot to add DNS resolution to the UTM config.
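
Following the FQDN observation above, it may help to audit which REDs are provisioned with a name rather than a literal address. This is a rough sketch under assumptions: the function names are hypothetical, and the lookup runs on whatever machine executes the script, so it says nothing about what the RED itself can resolve.

```python
import ipaddress
import socket


def classify_utm_host(host):
    """Return 'ip' if host is a literal IP address, otherwise 'fqdn'."""
    try:
        ipaddress.ip_address(host)
        return "ip"
    except ValueError:
        return "fqdn"


def resolve_fqdn(host):
    """Return the sorted addresses the name resolves to from this machine, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # Each entry's sockaddr tuple starts with the address string.
    return sorted({info[4][0] for info in infos})
```

For any entry classified as `'fqdn'`, calling `resolve_fqdn` from a network outside the UTM gives a first hint whether the name is publicly resolvable before swapping it for an IP address.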

FloSupport · Answer

Hi Alexander Busch 
 I followed up with the team, and the new RED unified firmware handles routing differently. As a result, customers who previously had working configurations on the legacy firmware (with the RED WAN IP overlapping with a listed split network) will experience issues on the new unified firmware. 
 Could you please confirm if you are using the RED in a split mode configuration, and if so - please check that your RED WAN IP is not overlapping with a listed split network subnet? 
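
The overlap check described above can be sketched with Python's standard `ipaddress` module. The function name and the example addresses are illustrative assumptions, not values from any Sophos configuration.

```python
import ipaddress


def overlapping_split_networks(red_wan_ip, split_networks):
    """Return the split networks (CIDR strings) that contain the RED's WAN IP."""
    wan = ipaddress.ip_address(red_wan_ip)
    hits = []
    for cidr in split_networks:
        # strict=False accepts entries written with host bits set, e.g. 192.168.1.1/24
        network = ipaddress.ip_network(cidr, strict=False)
        if wan in network:
            hits.append(cidr)
    return hits
```

An empty result means the RED WAN IP does not fall inside any listed split network. A non-empty result names the subnets to review.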
 Thanks,

Argo · Answer

Hi All, 
 Since the advisory came out, I have been having issues with a RED50, which was replaced under RMA (so now I have 2 units). 
 The first issue was an inability to configure itself to use an FQDN (and work); it would only configure and work with a public IP address. 
 The second issue I had was that I was unable to ping/communicate with the RED50 or any device beyond the RED50. 
 Basically the RED50 firmware was being an unruly teenager. 
 I spoke with Support, who were initially very good (UK side) and said they would escalate to their second (or is it third?) line. Then support fell flat. Finally I was assigned to one of the techs on the East Coast of the USA; we exchanged emails for some time and had one phone call, as the time difference was an issue. 
 
 At no point was I informed of the Advisory ( https://community.sophos.com/kb/en-us/134398 ) I had to find it on here (this post I think). 
 I also found out about the FQDN issue, which I did some testing in-house on the 'faulty' unit. 
 This issue does not happen on the XG (I performed some testing with my own XG, at which point I realised the 'faulty' unit was not faulty). 
 
 This has the ring of QC/QA not performing for the SG UTM software (similar to the Microsoft Windows update test department, which is a shadow of its former self). 
 The problem I had existed on 9.602 & 9.605 (Virtual & Hardware based units), it was only when 9.7 came out did I test further and can confirm that all my issues were fixed. 
 Although I did notice that after I ran "cc set red use_unified_firmware 1", on initial reboot it didn't work as it should (stating it was unable to configure itself), physically switching it off (using the power cable) fixed the issue. 
 - Good news - my customer (who bought this unit just prior to the advisory) can now use the RED50 (at last).

FloSupport · Answer

Hi garth1138 
 Followed up with the team to confirm the behavior. 
 
 Users who are upgrading and are currently on the legacy firmware will not be forced onto the unified firmware (at the moment). 
 
 Regards,