
How are web content filters created and maintained?

I am wondering how Sophos generates\manages the web content filters used in their products.  I am specifically interested in the process for the XG firewall and endpoint products like Sophos Home.

My guess is that Sophos has a dedicated team that researches and manages all content-related aspects of the Internet and then uses that data to maintain the web content filters.  However, I have no idea if this is true or even close to accurate.

Thanks in advance for your assistance and feedback.



This thread was automatically locked due to age.
  • Different products work differently.  I know the XG, UTM, and SWA quite well, and the web part of all three of those products is developed by the same team.  Sophos Home and other endpoints are done by other teams, but I'm familiar with some of them.

    By "web content related filters" do you mean the categorization of websites?  Or do you mean the functionality that we provide so that administrators can build policies?

    Feel free to ask questions and I'll try to answer.

    • Hi Michael - thanks for replying...

      So what I'm referring to is the backend data that defines the stuff in Protect --> Web --> Categories.  As you know, the Categories are what make up the different User Activities which are in turn used to craft Policies.  But the heart of this definitely seems to be Categories and given the scope of the Internet coupled with how important it is to get the Category content right, I would think it takes a lot of work and effort to stay on top of what goes into them.

      Is there anything that shows exactly how each Category is constructed to filter\apply to a given set of work or action - something akin to regex, Snort policy rules, or other script (FYI - I know Snort is for IPS but just using it as an example here)?  What is the process within Sophos that makes all this happen?  Is there a process to request modifications or even new Categories?


      • I think you are asking two different questions: how do we determine the correct category (which is a data question), and how does the XG apply categorization policy to web traffic?  I'll spend some time on it because I know people are interested and I've somewhat answered this before.
        Let's answer the second question first, for firewalls like XG or SFOS.  Endpoints do something similar for policy, but how they receive the data is different.
        Basic process:
        Firewall looks at all traffic passing through the firewall on port 80 and sends the packets to awarrenhttp (the web proxy) for inspection.
        The proxy looks at the GET to determine the destination domain and path.
        The proxy does a cloud lookup of the URL.
        The cloud responds with the category.
        The proxy caches the response.
        The proxy looks at the policy rules and enforces any categorization block policy.
        The proxy looks at many other policy rules to determine if it should block at this time.
        If there are no blocks (at this time) the proxy makes a connection to the far server and forwards the GET request to it.
        The proxy looks at the response, and again runs through all the policy rules (like virus scanning the data content) to determine whether to block or allow.
        This is essentially how all web proxies work.  Simple ones don't have any policy enforcement.  If configured, our web proxy sees all traffic on ports 80 and 443 and makes policy decisions based on the headers and content that it sees.  There are also times when it will change headers.  HTTPS is more complex.
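
        As a rough illustration of the lookup, cache, and policy steps above, here is a minimal Python sketch.  It is not Sophos code: the `CategoryProxy` class, the `fake_cloud` function, and the example domains are all made up for illustration, and a real proxy applies many more policy checks (and cache expiry) than this.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set
from urllib.parse import urlsplit


@dataclass
class CategoryProxy:
    """Toy model of the categorize-then-enforce flow (hypothetical, not Sophos code)."""
    cloud_lookup: Callable[[str], str]   # stand-in for the cloud categorization service
    blocked_categories: Set[str]         # stand-in for the admin's block policy
    cache: Dict[str, str] = field(default_factory=dict)
    cloud_calls: int = 0

    def categorize(self, url: str) -> str:
        parts = urlsplit(url)
        key = parts.netloc.lower() + parts.path   # destination domain and path
        if key not in self.cache:                 # only ask the cloud on a cache miss
            self.cloud_calls += 1
            self.cache[key] = self.cloud_lookup(key)
        return self.cache[key]

    def handle_request(self, url: str) -> str:
        category = self.categorize(url)
        if category in self.blocked_categories:
            return "block"
        return "forward"   # a real proxy would now contact the far server


def fake_cloud(key: str) -> str:
    # Invented category data; anything unknown comes back as Uncategorized.
    table = {"poker.example/": "Gambling", "news.example/": "News"}
    return table.get(key, "Uncategorized")


proxy = CategoryProxy(cloud_lookup=fake_cloud, blocked_categories={"Gambling"})
```

        Calling `proxy.handle_request(...)` twice for the same URL only increments `cloud_calls` once, which is the point of the cache step.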

        SFOS categorization data is built and maintained by Sophos; the data team was originally from Cyberoam.  The data is stored in a cloud server and XG requests "what is the category for domain.com/somepath".  It caches the answer along with some details about the cache validity.  The cloud service mechanism changed between 16.5 (WINGc) and 17 (SXL) but the data is the same.

        The cloud service knows about every request.  It knows which URLs it replied "uncategorized" to.  The uncategorized sites that were hit the most get visited by a process that automatically tries to determine the categorization (presence of the word "poker" on the page, links to known sites, etc).  Sometimes it goes to a human to categorize.  The database is continuously monitored and updated, and since it is a cloud lookup all customers will get the new data when it changes.
        The automatic categorizer can sometimes have trouble because it cannot see the context of the web request in the overall loading of a page.  Maybe it gets a request for "what is the category of some_new_cdn.com/files/1.png", which it does not know the category of.  So it returns uncategorized.  Later it tries to automatically categorize that site and finds...  that it is a 1x1 pixel png.  It can't really categorize that, so it tries to just go to some_new_cdn.com/ and that comes back with a blank page.  So what are we to do?  It remains uncategorized.  It may turn out that shoppingsite.com/buynow loaded the png from some_new_cdn.com.  Does that mean some_new_cdn.com is a shopping site?  Or an advertising site?  Or a cdn (content delivery network) for a lot of different unrelated websites that have many different categories?
        Therefore the majority of the websites that are uncategorized are: brand new, not commonly visited by Sophos customers in the office, or hard to categorize based on just trying to go to the URL.
        If an admin or user thinks a categorization is wrong, they can submit a recategorization request here:
        secure2.sophos.com/.../contact-support.aspx
        As far as I know these are always processed by humans to confirm the correct category.
        In general, once a site has a category it is not checked or recategorized unless we hear from a customer.  There are, however, a few other things that can cause us to recheck.
        In general, the popularity of a site (how often someone goes through the proxy to get to that site) makes a difference to the priority.  Therefore we sometimes get cases where there is, for example, an Italian-language pornography site that we do not categorize.  That is because we need to have Italian customers attempt to visit the site while at work (and going through the proxy) before we even know that website exists.  And then we need a high enough number of requests so that it rises above the noise of the worldwide number of people going to other random websites.  On a pure number-of-requests basis, if we see 200 requests for one site (which turns out to be a local hardware store in Argentina) it will be categorized before the 10 requests for another site (which turns out to be Italian porn).
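
        The popularity ordering described above can be sketched as a simple count-and-sort.  This is only an assumption about the general idea, not the actual Sophos pipeline; `review_queue`, the `min_requests` threshold, and the site names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def review_queue(uncategorized_hits: Iterable[str],
                 min_requests: int = 1) -> List[Tuple[str, int]]:
    """Order uncategorized sites by how often they were requested.

    min_requests models the "rise above the noise" threshold (hypothetical value).
    """
    counts = Counter(uncategorized_hits)
    eligible = [(site, n) for site, n in counts.items() if n >= min_requests]
    # Most-requested first; ties broken alphabetically for a stable order.
    return sorted(eligible, key=lambda item: (-item[1], item[0]))


# The 200-versus-10 example from the text, with made-up site names.
hits = ["hardware-store.example"] * 200 + ["adult-site.example"] * 10
queue = review_queue(hits)
```

        With a threshold such as `min_requests=50`, the site with only 10 requests never enters the queue at all, which matches the "we don't even know it matters yet" case.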

        Customers can add their own internal custom categories for their box only, based on URL.  This takes precedence over the cloud lookup.  There is no process for customers to recommend new categories for Sophos to support.  We technically can do it, but it complicates things because it affects all customers' policies.
        More information on custom categories can be found here: https://community.sophos.com/kb/en-us/127270
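
        To show what "takes precedence over the cloud lookup" means in practice, here is a small Python sketch.  The matching rule (exact URL first, then bare domain) is an assumption made for the example, and `resolve_category` and the sample data are hypothetical, not the XG's actual matching logic.

```python
from typing import Callable, Dict


def resolve_category(key: str,
                     custom: Dict[str, str],
                     cloud_lookup: Callable[[str], str]) -> str:
    """Return the local custom category if one matches, else ask the cloud.

    key is "domain/path"; the exact-then-domain matching is an assumption.
    """
    domain = key.split("/", 1)[0]
    for candidate in (key, domain):
        if candidate in custom:
            return custom[candidate]     # local override wins, no cloud lookup
    return cloud_lookup(key)


# Made-up data: one locally defined category and a cloud that knows nothing.
custom_categories = {"intranet.example": "Internal Tools"}


def cloud(key: str) -> str:
    return "Uncategorized"
```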
         
         
        • Hi Michael - EXCELLENT write-up.  Thanks so much for this and it definitely does help.  I do have a few immediate questions re: the categorization lookups:

          • Does the communication between the XG and the Sophos cloud bypass the FW rules?
            • Part of the reason I ask is because I have done nothing either on the XG or anywhere in my network to ensure my XG can talk to the Sophos cloud yet my web policing seems to work as expected.
            • I also checked in the Web --> Exceptions and there is one active entry for "Sophos Services" but I'm not sure if that is applicable to what we're discussing here.
          • Aside from the categorization lookups, is there any traffic from the XG to the Sophos "mothership" that bypasses any filtering on the XG itself?
          • Do any security devices upstream of the XG need to be configured to allow this traffic in\out of the network?

          Thanks again for your time on this - very much appreciated.

          • Traffic that originates from the XG itself does not go "through" the firewall and therefore does not follow firewall rules.

            This includes:

            - Categorization (uses HTTPS)

            - DNS, NTP, DHCP, Active Directory, and possibly several other services that are needed for general networking

            - Up2Date checking for and downloading product revisions (like an MR) eg code (Backup and Firmware, Firmware)

            - Up2Date checking for virus definitions, application definitions, etc. eg data (Backup and Firmware, Pattern Updates)

            Possibly licensing, not sure how that works.

            Off the top of my head I cannot think of anything else but there could be.

            There are a few other automatic things that are always allowed through the web proxy, mostly around allowing Sophos Endpoint traffic, but also may allow XG behind another XG.  This is under the Sophos Services web exception.  There might be other more hidden exceptions as well (I work on multiple products, I cannot recall XG specifically).

             

            There is an added complexity if you need to talk to an upstream web proxy in order to talk to the internet (Routing, Upstream Proxy).  In that case all web traffic the system itself needs to send is routed through the on-board web proxy with a special Allow All rule, which will forward it to the upstream.

             

            I don't happen to know what needs to be opened up if you have another firewall, aside from the above.  Possibly this is documented in a KB, I assume Sales and Support would know.

             

            I recall seeing something where in an upcoming release (sorry I have no details) they were looking at supporting "air gap" environments which have no connection to the internet.

            • Hi Michael - another great set of information - Thanks...

              • Hi Michael, great step by step explanation. It cleared a few doubts I had about the process.

                I wanted to follow up with a very specific scenario. In a setup with a really limited data quota (satellite connections), is there any way to force the download (caching) of the full set of web categorizations, or at least those of the most popular sites, so the XG won't constantly spend that limited quota and can do all the updates only when connected to an unlimited WAN?

                Thanks in advance for your answer.

                • No.

                  In v17 the web categorization is (AFAIK) fairly efficient in caching.  I haven't seen any stats for the new SXL4 method that XG uses, but with the SXL3.1 method that UTM uses, less than 1% of web requests require cloud lookups for categorization.  Because it is caching, larger installations with more users get a better cache hit rate, while a small office with only a dozen people would have a lower cache hit rate.
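
                  The effect of installation size on cache hit rate can be illustrated with a toy simulation.  Everything here is an assumption for illustration: the Zipf-like popularity model, the site and request counts, and the lack of cache expiry.  It says nothing about real SXL numbers.

```python
import random
from typing import List


def simulated_hit_rate(num_requests: int, num_sites: int = 5000, seed: int = 0) -> float:
    """Fraction of requests answered from cache in a toy model.

    Assumes site popularity follows a 1/rank (Zipf-like) curve and that
    cached answers never expire. Both are simplifications.
    """
    rng = random.Random(seed)
    sites: List[int] = list(range(num_sites))
    weights = [1.0 / (rank + 1) for rank in sites]
    requests = rng.choices(sites, weights=weights, k=num_requests)

    cache = set()
    hits = 0
    for site in requests:
        if site in cache:
            hits += 1          # answered locally, no cloud lookup
        else:
            cache.add(site)    # first visit costs one cloud lookup
    return hits / num_requests


small_office = simulated_hit_rate(num_requests=500)      # few users, few requests
large_site = simulated_hit_rate(num_requests=50_000)     # many users, many requests
```

                  In this model the larger request volume always ends up with the higher hit rate, because the number of distinct sites (and so the number of misses) is capped while the number of requests keeps growing.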

                  UTM also has a little-used and generally avoided way of downloading a local copy of the categorization database, which then gets updates a few times a day.  AFAIK the most common users of this are sites with satellite uplinks that have poor round-trip times but reasonable total bandwidth caps (AFAIK this method actually uses more MB/month).

                  Another possible option is using a firewall rule with Allow All (or at least with no categorization rules) while you are on satellite.  Under the right circumstances if categorization is not needed then it is not performed.  However you would still get malware scanning on your web traffic.

                  • I understand. This client in particular needs to cut the XG-originated traffic to a minimum, so I wanted to know if he could download the full database (or most of it) when he is on a cable modem connection, and then disable categorization updates while the devices are travelling on satellite connections until they are back on cable.

                    The main reason they need the device is actually for web filtering and traffic shaping based on web and app categories, so the allow all option wouldn't be useful.

                    Unfortunately from what you said, I assume the UTM method is not available for XG, and therefore no specific solution for this case.

                    Thanks for your help anyway.