SOPHOS Purposefully Designs bugs into their Firewalls: Episode 2 – Email Alerts, Green Statuses, and Routes

I’m documenting my numerous issues with SOPHOS Firewalls so that others can be aware of what they are getting themselves into.

Episode #1

community.sophos.com/.../sophos-purposefully-designs-bugs-into-their-firewalls-episode-1---vpn-failover-and-wan-interfaces

 

Issue # 2 – Email Alerts, Green Statuses, and Routes

               

As an administrator, it’s impossible to check every system under our management multiple times a day, so it is common for systems to send alerts that let you know when something is amiss. Under the SG Firewall, the alerts were very robust. Not so for XG.

  • For example, on SG I was alerted when anyone signed into a firewall. Under the new XG Firewall, this is not an option.
  • On SG, if an AP went offline for some reason, the alert noted the name of the AP (if you named it), so you’d know right away which AP was offline. On XG, you can name the APs as well; however, the alert only lists the serial number of the AP that went offline. So you now need to check which AP has that serial number before you can track it down.
  • Same goes for HA appliances. You get a notification if one goes down and the other becomes primary, but instead of including the name of the device, it tells you (node1) and the serial number. So now you need to go track it down. It includes a whole host of other information you don’t need, and excludes the information you do need. It seems like Joe from shipping\receiving is the one that makes the design choices. And the sad part is they OWN a well-designed product they could steal good ideas from while they design the new OS.
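
One possible workaround for serial-only alerts is to post-process the alert text against your own inventory before it reaches a mailbox. The following is a minimal sketch only: the inventory dictionary, the device name, the serial-number pattern, and the function name are all hypothetical and are not part of any SOPHOS product or API.

```python
import re

# Hypothetical inventory mapping serial numbers to friendly names.
# In practice this would be maintained by hand or exported from your own records.
INVENTORY = {
    "P52001V": "Warehouse-North",
}

# Assumption: serial numbers are runs of 7 to 20 upper-case letters and digits.
SERIAL_RE = re.compile(r"\b[A-Z0-9]{7,20}\b")


def annotate_alert(text, inventory=INVENTORY):
    """Append the friendly name after every known serial number in an alert."""

    def add_name(match):
        serial = match.group(0)
        name = inventory.get(serial)
        # Unknown serials (or other all-caps tokens) are left untouched.
        return f"{serial} ({name})" if name else serial

    return SERIAL_RE.sub(add_name, text)


if __name__ == "__main__":
    print(annotate_alert("Access point P52001V is offline"))
```

This only helps if the alert emails pass through something you control (a mail filter or a small relay script), which is itself an assumption about your setup.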

 

Secondly…

I had a strange routing issue. We have IPsec VPN tunnels, and each tunnel has 6 routes. If you go into the VPN connection details, there is a button you can click that shows you the routes with a green light beside each, indicating their status. Green means good, red means bad.

 

When this issue happened, one of the 6 routes was not working. This VPN had been functioning flawlessly for 3 months, and then in the middle of the day one route stopped working. I confirmed the behaviour by verifying that our domain controllers could not be reached (which was also the complaint of staff). I checked the VPN route statuses and they were all green, including the route that was not working. I contacted SOPHOS immediately, as I’ve had all sorts of strange issues happen with these firewalls and now I had a live case for them to see.

 

The tech I spoke with confirmed that the firewall showed all was good (green statuses everywhere), and also confirmed that the route was definitely not working. I knew that if I bounced the VPN tunnel the issue would go away, but I didn’t want to touch it as I wanted SOPHOS to see and diagnose the issue.

 

The first thing the tech wanted to do was see the VPN config. However, when you have your VPNs configured in a failover, you have no way of seeing the VPN configs anymore. Joe from shipping\receiving (who is the designer for these firewalls) must have figured it wouldn’t be necessary. I’ve run into this issue multiple times already in 4 months when I’ve called SOPHOS for support. SOPHOS tech support wants to double-check settings and literally can’t without taking our VPN offline. I checked with SOPHOS design people on this, and they assured me it was “by design” and “working as intended”. SOPHOS tech support did not agree.

 

Next, the SOPHOS tech decided to open up a packet capture on the firewall. The second he enabled the packet capture, the broken route started working again. Very strange.

 

After that he grabbed all the logs; however, he was unable to determine the issue because the logs were not in debug mode. So I asked him to put all the logs in debug mode, and he said that the firewall would cease to function if he did that. So unless the problem recurs, you can’t diagnose it, because the logs don’t capture the necessary data in non-debug mode. I’ve had this happen on multiple calls with SOPHOS, where lack of debug mode means “problem not solved, case closed”. I’ve also never experienced this issue of insufficient logs with the SG Firewalls. Somehow, that logging could capture the necessary info, where XG logging cannot. I’m sure this is “as-designed” too.

 

So I’m working on my clairvoyance degree now, so that I can ensure we enable debug mode before problems happen. This way we’ll hopefully be able to troubleshoot issues.



This thread was automatically locked due to age.
  • So those are some stacked problems. Let me get into this: 

    I have a general problem with wording like "designs bugs into the product", because generally speaking that is not the case. 
    To quote Wikipedia (https://en.wikipedia.org/wiki/Software_bug): a software bug is an error, flaw or fault in the design, development, or operation of computer software that causes it to produce an incorrect or unexpected result, or to behave in unintended ways. 

    It is important for the discussion to be clear on the situation of SFOS vs UTM. I wrote the following in the other thread: 

    And maybe we should discuss the meaning of "bug". Just because a product with 20+ years of development (plus roughly 6-7 years of startup development) has a feature that a product under modern software development does not have, that does not make it a bug. It is something that needs to be designed and maybe adjusted or implemented.

    Now let's look at your points and comment on them inline: 

    For one example I was alerted when anyone signed into a firewall. Under the new XG Firewall, this is not an option.

    So essentially the alert system of UTM had this for a while. There is an internal feature request to do this, but it has not yet been prioritized over other features. Sophos (like all other companies) has limited resources to build features, and rebuilding everything from UTM would mean staying on the technology path of a product from the past. UTM did some great innovation back in the day, but it lacks some modern technologies that SFOS has added, like SD-WAN, a fastpath technology, a DPI engine, and a flow processor. Those things take time, and the rest of the resources are spent on "quality of life" features, which Sophos builds as well; see the last releases. 

    On SG if an AP went offline for some reason the alert noted the name of the AP (if you named it) so you’d know right away which AP was offline. On XG, you can name the APs as well, however it only lists the serial number of the AP that went offline. So you now need to go check which AP has that serial number before you can go track it down.

    I would always recommend taking a look at Central Wireless for this job: one console, with the alert system built in. The alert in Central looks like: 
    What happened: Access Point "APX320 Name" is offline, s/n: P52001V, site: -, uptime: 172 days 9 hours 16 minutes, last-seen:2023-04-26T14:21:17.018Z


    Same goes for HA Appliances. You have a notification if one goes down and the other becomes primary, and instead of including the name of the device, it tells you (node1) and the serial number. So now you need to go track it down. It includes a whole host of other information you don’t need, and excludes the information you do need. It seems like Joe from shipping\receiving is the one that makes the design choices. And the sad part is they OWN a well-designed product they could steal good ideas from while they design the new OS.

    Central Firewall Management offers this capability of HA Alerting. 

    Sophos Central Event Details for DACH SE Prod Central

    What happened: One of the HA nodes is down or in a degraded state, and high availability is not degraded.

    Where it happened: PROD Cluster

    About your second problem. Essentially this is something I am not running into anymore, as I am "pushing" most of the customers I talk to toward a modern approach, if possible. I discussed this in the other thread as well. But to be clear on this one: it could be an open bug in IPsec that is currently under investigation (or maybe already fixed). So you are saying it worked after a tcpdump was started? You don't have a packet capture from the Webadmin UI for this problem?

    By the way: the green lights are not routes. They never were on UTM either. They are SAs (SPIs): https://en.wikipedia.org/wiki/IPsec#Security_association So if a light is green, the SA could be established, but maybe the route in the kernel was missing or something else was not correct. A packet capture in Webadmin would be helpful to figure out where the appliance had this issue. 
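
To make the distinction concrete, here is a small illustrative model of why a green SA does not prove that traffic flows. It is a sketch under stated assumptions, not how the firewall works internally: the class, its fields, and the diagnosis strings are hypothetical, and the three inputs would have to be gathered by you (SA status from the UI, the route from the routing table, and a test connection to a host in the remote subnet).

```python
from dataclasses import dataclass


@dataclass
class SubnetPairStatus:
    """Hypothetical snapshot of one local/remote subnet pair of a tunnel."""
    remote_subnet: str
    sa_established: bool   # what the green light reports
    route_installed: bool  # a route toward the tunnel exists in the routing table
    probe_ok: bool         # a test connection to a host in the subnet succeeded


def diagnose(status):
    """Return a coarse diagnosis, checking the layers from the bottom up."""
    if not status.sa_established:
        return "SA down: the tunnel negotiation itself has a problem"
    if not status.route_installed:
        return "SA up but no route: traffic never enters the tunnel"
    if not status.probe_ok:
        return "SA and route present but traffic fails: take a packet capture"
    return "healthy"


if __name__ == "__main__":
    # The case described in this thread: green light, yet traffic does not flow.
    print(diagnose(SubnetPairStatus("10.1.0.0/24", True, False, False)))
```

The subnet shown is a placeholder. The point is only that "green" answers the first question and says nothing about the other two.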

    __________________________________________________________________________________________________________________

  • That's fine that you have an issue with my wording, however, you're making my case for me with your WIKI article. A software bug includes a "flaw or fault in the design", or "operation of computer software that causes it to produce an incorrect or unexpected result". Right from the WIKI article you stated.

    1) Support Technicians not able to see VPN settings for troubleshooting. I personally thought that was a flaw in the design. Are you saying that was not a flaw, and that it was intended? The support techs I speak with on the phone seem to believe it to be a problem that hinders their ability to troubleshoot.

    2) HA and Access Point Email Alerts. These alerts get sent out via email. So that's a feature SOPHOS has created in SFOS. However, these alerts are missing the ONE key piece of information that would make them useful. That is a flaw in the design. 

    3) See "Episode 1" on the VPN failover only trying once then giving up. Obviously SOPHOS didn't consider that some businesses have a backup internet connection that has a cost to it, and that their hardcoded settings take away the choices of their administrator and result in actual real-world financial costs for their customer. That's what I would call an "operation of computer software that causes it to produce an incorrect or unexpected result". Unless it's true that SOPHOS chose this setting specifically to create costs for their customers, in which case I would agree it's not a bug.

    So these are not statuses for each route, indicating they are working?

    Overall, I appreciate your zeal to bang the SOPHOS drum and be the loudest advocate for SOPHOS you can be. In that respect you're putting in effort. However, when SOPHOS implements an alert or a feature, and I complain that it's not as well designed as an in-house product you could have taken inspiration from... you tell me "I would always recommend to take a look at Central Wireless for this job". Which is effectively saying, "don't use that feature, do it this way". And that's again my point: I want to use that feature. SOPHOS creates an email alert for an AP going offline. The email alert leaves out the one piece of important information that would save me a trip of logging into the firewall or SOPHOS Central. I complain, and I'm told, do it our way or go pound sand.

    You also seem to somewhat agree with my sentiments that there are lower priority "quality-of-life" features that SOPHOS has yet to implement. However I'm here on the forums for ONE reason. My choice to be a SOPHOS customer and not be a "delta" tester of SFOS was taken from me, and I was forced onto SFOS. I would not be here if SOPHOS didn't take away the product I was paying plenty for, to replace it with a product which is not refined enough. SFOS has better things in it than UTM, yes, for example SDWAN. However, I would have been perfectly happy paying SOPHOS for UTM for another 6-7 years, and then moving onto SFOS later when it's a better refined product. But I was not allowed... so here I am, on the forums, making REAL complaints about REAL issues for the next 6 years at least. 

  • I am simply giving you a way to address your points and explain certain points. 

    If you do not want to take them, I am fine with that. I am just trying to show you ways to get to a better state of implementation and not be stuck with the old tech forever. 

    Reading something like "extending UTM for 6-7 years" gives me chills... Did you do HTTPS Scanning on UTM and are you doing it now on SFOS as well? If not, this is a red flag. 

    __________________________________________________________________________________________________________________

  • Did you see my reply regarding the Green Statuses for the routes? You mentioned the green lights were not routes. My last post had a screenshot. What are those green lights for beside the routes? 

  • Those are not routes. 

    The green lights are SAs (SPIs), same as on UTM. 

    If you click on them, you see the green SPIs, which indicate an established SPI between both networks. Generally speaking, you should see the routes as well if you look at the routing table of the firewall. 

    __________________________________________________________________________________________________________________

  • I'm not sure what SA and SPI are... however, my understanding is that those green lights indicate that a particular route's status is "green", meaning good. If I click on the green dots, nothing happens. These dots were all green when one of these 6 routes was not working while all the others were working. Tech support could not determine the issue.

  • So essentially, doing a postmortem of your issue is difficult. (It would be the same on UTM: if there was a problem with route injection and/or the SPI but it got resolved, it is hard to say what the root cause was.) 

    What to do in this case: if you have an active SPI (green), check the packet capture in Webadmin. Is the connection being routed to IPsec0, and are packets coming back from IPsec0? 
    That would be the first question. You can check this with the packet capture filter. 
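
The two questions above (is traffic routed into the tunnel interface, and do replies come back from it) can be sketched as a check over exported capture records. This is an illustration only: the record format (dictionaries with `src`, `dst`, `in_iface`, `out_iface`), the lower-case interface name `ipsec0`, and the IP addresses are assumptions, not the actual Webadmin export format.

```python
def tunnel_flow_verdict(packets, local_ip, remote_ip, tunnel_iface="ipsec0"):
    """Classify captured traffic between two hosts relative to the tunnel interface."""
    # Packets from the local host toward the remote host, and the replies.
    outbound = [p for p in packets if p["src"] == local_ip and p["dst"] == remote_ip]
    inbound = [p for p in packets if p["src"] == remote_ip and p["dst"] == local_ip]

    # Which of those actually left through, or arrived from, the tunnel interface?
    into_tunnel = [p for p in outbound if p.get("out_iface") == tunnel_iface]
    from_tunnel = [p for p in inbound if p.get("in_iface") == tunnel_iface]

    if not outbound:
        return "no matching traffic captured"
    if not into_tunnel:
        return "not routed into tunnel"
    if not from_tunnel:
        return "routed into tunnel, no replies"
    return "traffic flows both ways"


if __name__ == "__main__":
    # Hypothetical capture: the request leaves via a WAN port instead of the tunnel.
    capture = [
        {"src": "10.0.0.5", "dst": "10.1.0.10", "in_iface": "Port1", "out_iface": "Port2"},
    ]
    print(tunnel_flow_verdict(capture, "10.0.0.5", "10.1.0.10"))
```

A "not routed into tunnel" result would point at routing, while "routed into tunnel, no replies" would point at the remote side or the return path.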

    __________________________________________________________________________________________________________________