

SOPHOS Purposefully Designs bugs into their Firewalls: Episode 2 – Email Alerts, Green Statuses, and Routes

I’m documenting my numerous issues with SOPHOS Firewalls so that others can be aware of what they are getting themselves into.

Episode #1

community.sophos.com/.../sophos-purposefully-designs-bugs-into-their-firewalls-episode-1---vpn-failover-and-wan-interfaces

 

Issue # 2 – Email Alerts, Green Statuses, and Routes

               

As an administrator, it’s impossible to check every system under our management multiple times a day, so we rely on alerts to tell us when something is amiss. On the SG Firewall the alerts were very robust; not so on XG.

  • For example, on SG I was alerted whenever anyone signed into a firewall. On the new XG Firewall, this is not an option.
  • On SG, if an AP went offline, the alert included the name of the AP (if you had named it), so you’d know right away which AP was offline. On XG you can name the APs as well, but the alert lists only the serial number of the AP that went offline. So you now need to look up which AP has that serial number before you can track it down.
  • The same goes for HA appliances. You get a notification if one goes down and the other becomes primary, but instead of including the name of the device, it tells you (node1) and the serial number. So again you need to track it down. It includes a whole host of information you don’t need and excludes the information you do need. It seems like Joe from shipping\receiving is the one who makes the design choices. And the sad part is they OWN a well-designed product they could steal good ideas from while they design the new OS.
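
One possible workaround for serial-only alerts is to post-process the alert text against your own inventory before it reaches a human. The following is a minimal sketch, not anything SOPHOS ships: the inventory contents, the serial number format, and the function name `annotate_alert` are all assumptions made for illustration.

```python
import re

# Hypothetical inventory mapping serial numbers to friendly names.
# These values are placeholders, not real device serials.
INVENTORY = {
    "SERIAL-AP-0001": "AP Warehouse East",
    "SERIAL-FW-0001": "HQ Firewall A",
}

# Assumed serial format for illustration only: at least 8 characters
# made up of uppercase letters, digits, and hyphens.
SERIAL_PATTERN = re.compile(r"\b[A-Z0-9][A-Z0-9-]{7,}\b")


def annotate_alert(text, inventory=INVENTORY):
    """Append the friendly name after every known serial number in text."""

    def add_name(match):
        serial = match.group(0)
        name = inventory.get(serial)
        # Unknown serials are left exactly as they were.
        return f"{serial} ({name})" if name else serial

    return SERIAL_PATTERN.sub(add_name, text)
```

Something like this could run in a mail filter or a small script that reads the alert mailbox. It does not fix the alert itself; it only saves the lookup step.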

 

Secondly…

                I had a strange routing issue. We have IPSEC VPN Tunnels, and each tunnel has 6 routes. If you go into the VPN connection details, there is a button you can click on and it will show you the routes and a green light beside each, indicating their status. Green means good, Red means bad.

 

                When this issue happened, one of the 6 routes was not working. This VPN had been functioning flawlessly for 3 months, and then in the middle of the day one route stopped working. I confirmed the behaviour by verifying that our domain controllers could not be reached (which was also the complaint from staff). I checked the VPN route statuses and they were all green, including the route that was not working. I contacted SOPHOS immediately, as I’ve had all sorts of strange issues happen with these firewalls and now I had a live case for them to see.

 

                The tech I spoke with confirmed that the firewall showed all was good (green statuses everywhere), and also confirmed that the route was definitely not working. I knew if I bounced the VPN tunnel the issue would go away, but I didn’t want to touch it as I wanted SOPHOS to see and diagnose the issue.
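
A status light that stays green while traffic fails suggests checking reachability from outside the firewall as well. Below is a minimal sketch of such a probe, assuming you can name one known-good host and TCP port behind each tunnel network. The addresses are documentation placeholders, and the names `TARGETS`, `tcp_probe`, and `unreachable_subnets` are hypothetical.

```python
import socket

# Hypothetical targets: one known-good host and TCP port behind each
# tunnel network. Addresses come from the documentation ranges and are
# placeholders only.
TARGETS = {
    "192.0.2.0/24": ("192.0.2.10", 389),
    "198.51.100.0/24": ("198.51.100.10", 445),
}


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False


def unreachable_subnets(targets, probe=tcp_probe):
    """Return the sorted list of subnets whose probe host did not answer."""
    return sorted(
        subnet
        for subnet, (host, port) in targets.items()
        if not probe(host, port)
    )
```

Run on a schedule from a machine on the local side of the tunnel, `unreachable_subnets(TARGETS)` returns the networks that stopped answering, whatever the firewall UI shows. The `probe` parameter is injectable so the logic can be tested without a network.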

 

                The first thing the tech wanted to do was see the VPN config. However, when your VPNs are configured for failover, you can no longer view the VPN configs. Joe from shipping\receiving (who is the designer for these firewalls) must have figured it wouldn’t be necessary. I’ve already run into this multiple times in 4 months when calling SOPHOS for support: the support techs want to double-check settings and literally can’t without taking our VPN offline. I checked with SOPHOS design people on this, and they assured me it was “by design” and “working as intended”. SOPHOS tech support did not agree.

 

                Next, the SOPHOS tech decided to open a packet capture on the firewall. The second he enabled the packet capture, the broken route started working again. Very strange.

 

                After that he grabbed all the logs, but he was unable to determine the cause because the logs were not in debug mode. So I asked him to put all the logs in debug mode, and he said that the firewall would cease to function if he did that. So unless the problem recurs repeatedly, you can’t diagnose it, because the logs don’t capture the necessary data in non-debug mode. I’ve had this happen on multiple calls with SOPHOS, where lack of debug mode means “problem not solved, case closed”. I also never experienced this issue of insufficient logs with the SG Firewalls. Somehow, that logging could capture the necessary info, where XG logging cannot. I’m sure this is “as-designed” too.

 

                So I’m working on my clairvoyance degree now, so that I can ensure we enable debug mode before problems happen. This way we’ll hopefully be able to troubleshoot issues.



This thread was automatically locked due to age.
  • Hi Steve,

    I would like to suggest that you post a link to your original thread under the title "background", so that new readers can concentrate on your latest issue(s).

    Ian

    XG115W - v20.0.1 MR-1 - Home

    XG on VM 8 - v20 GA

    If a post solves your question please use the 'Verify Answer' button.

  • I think that's fair. I've updated it.

  • Hi Steve.

    I had an issue with a specific firmware version where the firewall stopped forwarding ICMP messages and, as in your case, ICMP packets were forwarded normally again immediately after I started a packet capture. I found out that starting a packet capture stops the fast path, so all traffic is handled by the CPU. That may be the case for you. I know for a fact that IPsec connections are handled by the NPU/Xstream processor, and that may be the root cause of this issue.

    I suggest you disable IPsec acceleration with the following command:

    system ipsec-acceleration disable

    It may also be a good idea to temporarily disable firewall acceleration:

    system firewall-acceleration disable

  • Yes, it’s frustrating. Every time “something strange” happens in the network, in probably 8 of 10 cases it’s because of a hard-to-troubleshoot, hard-to-nail-down issue with SFOS. Opening support cases for root-cause analysis would take too much of your time, require multiple downtimes to diagnose, and so on… When working for a customer, who’s going to pay for this time? So in most cases a single downtime and reboot “solves” most issues. That’s not what you expect from an enterprise product this expensive. Even HA is not “HA” as on SG/UTM: a failover usually costs more than 1-2 dropped pings, and WebAdmin only becomes available a few minutes after the HA failover, not immediately.

    SFOS has some good new features, like Central Connection, SD-WAN and more, but still not the reliability SG had, which should be the key feature. Sophos should focus on logging and stability first, before looking at the new features marketing suggests.

  • So those are some stacked problems. Let me get into this: 

    I have a general problem with your wording of the situation, like "designs bugs into the product", because generally speaking that is not the case. 
    To rephrase Wikipedia (https://en.wikipedia.org/wiki/Software_bug): a software bug is an error, flaw or fault in the design, development, or operation of computer software that causes it to produce an incorrect or unexpected result, or to behave in unintended ways. 

    It is important for the discussion to be clear on the situation of SFOS vs UTM. I wrote the following in the other thread: 

    And maybe we should discuss the meaning of "bug". Simply because a product with 20+ years of development (and roughly 6-7 years of startup development) has a feature which a product in modern software development does not have, that does not mean it is a bug. It is something which needs to be designed and maybe adjusted or implemented.

    Now let's look at your points and comment on them inline: 

    For one example I was alerted when anyone signed into a firewall. Under the new XG Firewall, this is not an option.

    So essentially the alert system of UTM had this for a while. There is an internal feature request to do this, but it has not yet been prioritized over other features. Sophos (like all other companies) has limited resources to build features, and rebuilding everything from UTM would mean staying on the technology path of a product from the past. UTM did some great innovation back in the day, but it lacks some modern technologies that SFOS has added, like SD-WAN, a fastpath technology, a DPI engine and a flow processor. Those things take time, and the rest of the resources are spent on "Quality of Life" features, which Sophos builds as well; see the last releases. 

    On SG if an AP went offline for some reason the alert noted the name of the AP (if you named it) so you’d know right away which AP was offline. On XG, you can name the APs as well, however it only lists the serial number of the AP that went offline. So you now need to go check which AP has that serial number before you can go track it down.

    I would always recommend taking a look at Central Wireless for this job. It is one console, and the alert system is built in. The alert in Central looks like this: 
    What happened: Access Point "APX320 Name" is offline, s/n: P52001V, site: -, uptime: 172 days 9 hours 16 minutes, last-seen:2023-04-26T14:21:17.018Z


    Same goes for HA Appliances. You have a notification if one goes down and the other becomes primary, and instead of including the name of the device, it tells you (node1) and the serial number. So now you need to go track it down. It includes a whole host of other information you don’t need, and excludes the information you do need. It seems like Joe from shipping\receiving is the one that makes the design choices. And the sad part is they OWN a well-designed product they could steal good ideas from while they design the new OS.

    Central Firewall Management offers this capability of HA Alerting. 

    Sophos Central Event Details for DACH SE Prod Central

    What happened: One of the HA nodes is down or in a degraded state, and high availability is not degraded.

    Where it happened: PROD Cluster

    About your second problem: essentially this is something I am not running into anymore, as I am "pushing" most of the customers I talk to towards a modern approach where possible. I discussed this in the other thread as well. But to be clear on this one: it could be an open bug in IPsec, which is currently under investigation (or maybe already fixed). So you are saying it worked after a tcpdump was started? You don't have a packet capture from the Webadmin UI for this problem?

    By the way: the green lights are not routes. They never were in UTM either. They are SAs (SPIs): https://en.wikipedia.org/wiki/IPsec#Security_association So if one is green, the SA could be built, but maybe the route in the kernel was missing or something else was not correct. A packet capture in Webadmin would be helpful to figure out where the appliance had this issue. 


  • That's fine that you have an issue with my wording, however, you're making my case for me with your WIKI article. A software bug includes a "flaw or fault in the design", or "operation of computer software that causes it to produce an incorrect or unexpected result". Right from the WIKI article you stated.

    1) Support Technicians not able to see VPN settings for troubleshooting. I personally thought that was a flaw in the design. Are you saying that was not a flaw, and that it was intended? The support techs I speak with on the phone seem to believe it to be a problem that hinders their ability to troubleshoot.

    2) HA and Access Point Emails Alerts. These alerts get sent out via email. So that's a feature SOPHOS has created in SFOS. However, these alerts are missing the ONE key piece of information that would make them useful. That is a flaw in the design. 

    3) See "Episode 1" on the VPN failover only trying once and then giving up. Obviously SOPHOS didn't consider that some businesses have a backup internet connection that has a cost to it, and that their hardcoded settings take away the choices of the administrator and result in actual real-world financial costs for their customer. That's what I would call an "operation of computer software that causes it to produce an incorrect or unexpected result". Unless it's true that SOPHOS chose this setting specifically to create costs for their customers, in which case I would agree it's not a bug.

    So these are not statuses for each route, indicating they are working?

    Overall, I appreciate your zeal to bang the SOPHOS drum and be the loudest candidate for SOPHOS you can be. In that respect you're putting in effort. However when SOPHOS implements an alert, or a feature, and I complain that it's not as well designed as an in house product you could have taken inspiration from... you tell me "I would always recommend to take a look at Central Wireless for this job". Which is effectively saying, "don't use that feature, do it this way". And that's again my point, I want to use that feature. But, SOPHOS creates an email alert for an AP going offline. The email alert leaves out the one piece of important information that would save me a trip of logging into the firewall or SOPHOS Central. I complain, and I'm told, do it our way or go pound sand.

    You also seem to somewhat agree with my sentiments that there are lower priority "quality-of-life" features that SOPHOS has yet to implement. However I'm here on the forums for ONE reason. My choice to be a SOPHOS customer and not be a "delta" tester of SFOS was taken from me, and I was forced onto SFOS. I would not be here if SOPHOS didn't take away the product I was paying plenty for, to replace it with a product which is not refined enough. SFOS has better things in it than UTM, yes, for example SDWAN. However, I would have been perfectly happy paying SOPHOS for UTM for another 6-7 years, and then moving onto SFOS later when it's a better refined product. But I was not allowed... so here I am, on the forums, making REAL complaints about REAL issues for the next 6 years at least. 

  • I am simply giving you a way to address your points, and explaining certain points. 

    If you do not want to take them, I am fine with that. I am just trying to give you a way to do certain things, to bring you to a better state of implementation and not be stuck with the old tech forever. 

    Reading something like "extending UTM for 6-7 years" gives me chills... Did you do HTTPS scanning on UTM, and are you doing it now on SFOS as well? If not, this is a red flag. 


  • Did you see my reply regarding the green statuses for the routes? You mentioned the green lights were not routes. My last post had a screenshot. What are those green lights beside the routes for? 

  • Those are not routes. 

    The green lights are SAs (SPIs), the same as on UTM. 

    If you click on them, you see the green SPIs, each of which indicates an established SPI between the two networks. Generally speaking, you should see the routes as well if you look at the routing table of the firewall. 


  • After reading your 2 posts here, I thought about my last 3 years on the SFOS track with multiple devices, while still administering two UTM clusters.

    If you've worked with UTM for years and switch over to XG, it's a long and hard road, yes.

    UTM is like a huge workshop, well equipped with loads of robust, old-fashioned tools. You can do solid work. And I agree: the way you want to.

    With XG/S I feel like I have a small box with some screwdrivers, all of the same colour and shape (you don't find what you're searching for). But next to the box you will notice some of the latest electronic diagnostic devices; unfortunately, you'll only rarely use them.

    The bad alert mails of XG have always been a pain. I just ignore most of them because they are useless: they contain no helpful information. You have no choice and need to check directly on the firewall. I feel I need to mention that SFOS has no mail throttling capability. One morning you will find your mailbox flooded with 10k similar mails. UTM mails are different. I read them because they do help: they contain the information you need to decide whether you need to react or whether it can wait. Not to mention the search capabilities of UTM. SFOS now has a search function, but it is so basic that, even though I was excited about the new search, I almost never use it. HA failover taking 5-10 minutes, always... OK, I learned and got used to it.

    I'm not sure if all our writing in threads like this will change anything. The community is really a great source of help, and the guys around Luca and Emmanuel (only to name them because they are actively writing in these threads) are doing what they can to help you with cases and know-how. They link to guys in the background and so on. That is REALLY great, and I like my time reading and writing in the various communities.

    I just wish Sophos Dev would some day equip the XG with the cool tools for everyday use that they have in the UTM and that make admins happy. But it seems the old Cyberoam code framework makes this impossible. I think that, in terms of security, XG/SFOS is levels above UTM, and that compensates for some lacking features. I don't like your words stating that Sophos designs bugs into their code. I think they are having a hard time making big changes due to too many small issues and limitations. I'm always shocked by how much higher the number of a new support case ID is two weeks after the previous support case.

    Still, looking forward: things have evolved and improved with Sophos and most of their products. I hope that some day, after a new firmware release, I may find myself thinking "wow, that firewall is now great" and forget about the old UTM.