How are web content filters created and maintained?

Question

I am wondering how Sophos generates and manages the web content filters used in their products. I am specifically interested in the process for the XG firewall and endpoint products like Sophos Home. 
 My guess is that Sophos has a dedicated team that researches and manages all content-related aspects of the Internet and then uses that data to manage all of the web content filters. However, I have no idea if this is true or even close to accurate. 
 Thanks in advance for your assistance and feedback.

Michael Dunn · Accepted Answer

I think you are asking two different questions: how do we determine the correct category (which is a data question), and how does the XG apply categorization policy to web traffic? I'll spend some time on it because I know people are interested and I've partly answered this before. 
 Let's answer the second question first, for firewalls like XG or SFOS. Endpoints do something similar for policy, but how they receive the data is different. 
 Basic process: 
 - The firewall looks at all traffic passing through it on port 80 and sends the packets to awarrenhttp (the web proxy) for inspection. 
 - The proxy looks at the GET to determine the destination domain and path. 
 - The proxy does a cloud lookup of the URL, and the cloud responds with the category. 
 - The proxy caches the response. 
 - The proxy looks at the policy rules and enforces any categorization block policy. 
 - The proxy looks at many other policy rules to determine whether it should block at this time. 
 - If there are no blocks (at this time), the proxy makes a connection to the far server and sends the GET request on to its destination. 
 - The proxy looks at the response and again runs through all the policy rules (like virus scanning the data content) to determine whether to block or allow. 
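
The basic process above can be sketched roughly as follows. This is only an illustration of the order of operations, not how awarrenhttp is actually written: `ProxySketch`, `cloud_lookup`, `fetch`, `scan`, and the set-of-blocked-categories policy model are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set
from urllib.parse import urlsplit


@dataclass
class ProxySketch:
    # Hypothetical stand-in for the cloud categorization service.
    cloud_lookup: Callable[[str], str]
    # Hypothetical policy model: categories the admin has chosen to block.
    blocked_categories: Set[str]
    cache: Dict[str, str] = field(default_factory=dict)

    def categorize(self, url: str) -> str:
        parts = urlsplit(url)
        key = parts.netloc + parts.path  # destination domain and path from the GET
        if key not in self.cache:
            self.cache[key] = self.cloud_lookup(key)  # cloud lookup, then cache
        return self.cache[key]

    def handle_get(
        self,
        url: str,
        fetch: Callable[[str], bytes],
        scan: Callable[[bytes], bool],
    ) -> str:
        # Request-side policy: block by category before contacting the far server.
        if self.categorize(url) in self.blocked_categories:
            return "blocked: category"
        body = fetch(url)  # forward the GET to the far server
        # Response-side policy: for example a content scan of the returned data.
        if not scan(body):
            return "blocked: content"
        return "allowed"
```

Note that the response is checked separately from the request, so a site in an allowed category can still be blocked once its content has been inspected.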
 This is essentially how all web proxies work. Simple ones don't have any policy enforcement. If configured, our web proxy sees all traffic on port 80 and 443 and makes policy decisions based on the headers and content that it sees. There are also times that it will change headers. HTTPS is more complex. 
 SFOS categorization data is built and maintained by Sophos; the data team was originally from Cyberoam. The data is stored on a cloud server, and XG asks it "what is the category for domain.com/somepath". It caches the answer along with some details about the cache validity. The cloud service mechanism changed between 16.5 (WINGc) and 17 (SXL) but the data is the same. 
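
As an illustration of caching an answer together with its validity, here is a minimal sketch. It assumes the cloud reply carries a time-to-live in seconds, which is my simplification; the actual WINGc/SXL reply format is not described here. `CategoryCache` and `lookup` are hypothetical names.

```python
import time
from typing import Callable, Dict, Tuple


class CategoryCache:
    """Caches category answers until their validity period runs out."""

    def __init__(
        self,
        lookup: Callable[[str], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # Hypothetical cloud call: returns (category, ttl_seconds).
        self._lookup = lookup
        self._clock = clock
        # key -> (category, expires_at)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # still valid, no cloud round trip needed
        category, ttl = self._lookup(key)
        self._entries[key] = (category, now + ttl)
        return category
```

Because entries expire, a recategorization in the cloud reaches the box once the cached answer is no longer valid.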
 The cloud service knows about every request, including which URLs it answered with "uncategorized". The uncategorized sites that were hit the most get visited by a process that automatically tries to determine the categorization (presence of the word "poker" on the page, links to known sites, etc). Sometimes it goes to a human to categorize. The database is continuously monitored and updated, and since it is a cloud lookup all customers will get the new data when it changes. 
 The automatic categorizer can sometimes have trouble because it cannot see the context of the web request in the overall loading of a page. Maybe it gets a request for "what is the category of some_new_cdn.com/files/1.png" which it does not know the category of. So it returns uncategorized. Later it tries to automatically categorize that site and finds... that it is a 1x1 pixel PNG. It can't really categorize that, so it tries to just go to some_new_cdn.com/ and that comes back with a blank page. So what are we to do? It remains uncategorized. It may turn out that shoppingsite.com/buynow loaded the PNG from some_new_cdn.com. Does that mean some_new_cdn.com is a shopping site? Or an advertising site? Or a CDN (content delivery network) for a lot of different unrelated websites that have many different categories? 
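
To give a feel for the kind of signals mentioned above (a keyword on the page, links to known sites), here is a toy sketch. The keyword table, the known-site table, and the `auto_categorize` function are all invented for illustration; the real categorizer is certainly more involved.

```python
import re
from typing import Dict, Optional

# Hypothetical hint tables, made up for this example.
KEYWORD_HINTS: Dict[str, str] = {"poker": "Gambling"}
KNOWN_SITES: Dict[str, str] = {"knowncasino.example": "Gambling"}


def auto_categorize(html: str) -> Optional[str]:
    """Guess a category from page content, or return None to hand off to a human."""
    text = html.lower()
    # Signal 1: presence of a telltale word on the page.
    for keyword, category in KEYWORD_HINTS.items():
        if keyword in text:
            return category
    # Signal 2: links that point at sites whose category is already known.
    for host in re.findall(r'href="https?://([^/"]+)', text):
        if host in KNOWN_SITES:
            return KNOWN_SITES[host]
    return None
```

Returning `None` is the "goes to a human" case: nothing on the page was conclusive.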
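
The CDN example can be sketched like this: try the requested URL, fall back to the site root, and give up if neither yields anything to classify. `categorize_with_fallback`, `fetch_text`, and `classify` are hypothetical names, and treating a PNG or a blank page as "no text" is an assumption made for the example.

```python
from typing import Callable, Optional


def categorize_with_fallback(
    url_key: str,
    fetch_text: Callable[[str], Optional[str]],
    classify: Callable[[str], Optional[str]],
) -> str:
    """Try the exact URL first, then the site root; otherwise stay uncategorized."""
    host = url_key.split("/", 1)[0]
    for candidate in (url_key, host + "/"):
        text = fetch_text(candidate)  # None or "" for images and blank pages
        if text:
            category = classify(text)
            if category is not None:
                return category
    return "Uncategorized"
```

Nothing in this flow knows which page embedded the image, which is exactly the missing context described above.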
 Therefore the majority of the websites that are uncategorized are: brand new, not commonly visited by Sophos customers in the office, or hard to categorize based on just trying to go to the URL. 
 If an admin or user thinks a categorization is wrong, they can submit a recategorization request at secure2.sophos.com/.../contact-support.aspx and, as far as I know, these are always processed by humans to confirm the correct category. 
 In general, once a site has a category it is not checked or recategorized unless we hear from a customer. There are, however, a few other things that can cause us to recheck. 
 In general, the popularity of a site (how often someone goes through the proxy to get to that site) makes a difference to the priority. Therefore we sometimes get cases where there is, for example, an Italian-language pornography site that we do not categorize. That is because we need Italian customers to attempt to visit the site while at work (and going through the proxy) before we even know that website exists. And then we need a high enough number of requests so that it rises above the noise of the worldwide number of people going to other random websites. On a pure number-of-requests basis, if we see 200 requests for one site (which turns out to be a local hardware store in Argentina) it will be categorized before the 10 requests for another site (which turns out to be Italian porn). 
 Customers can add their own internal custom categories for their box only, based on URL. This takes precedence over the cloud lookup. There is no process for customers to recommend new categories for Sophos to support. We technically can do it, but it complicates things because it affects all customers' policies. More information on custom categories can be found here: https://community.sophos.com/kb/en-us/127270
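
The 200-versus-10 example boils down to ordering the review queue by request count alone. A minimal sketch, with `review_order` and the host names being made up for the illustration:

```python
from collections import Counter
from typing import Iterable, List


def review_order(uncategorized_requests: Iterable[str]) -> List[str]:
    """Order uncategorized sites by how often they were requested, busiest first."""
    counts = Counter(uncategorized_requests)
    return [site for site, _ in counts.most_common()]


# 200 requests for one site, 10 for another (hypothetical host names).
log = ["hardware-store.example"] * 200 + ["adult-site.example"] * 10
queue = review_order(log)
```

The ordering looks only at volume, not at what the site turns out to be, so the low-traffic site waits regardless of its content.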
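
The precedence rule (custom category first, cloud lookup second) can be sketched as below. The naive prefix match and the names `resolve_category`, `custom_categories`, and `cloud_lookup` are assumptions for the example; the KB above describes how custom categories are actually configured and matched.

```python
from typing import Callable, Dict


def resolve_category(
    url_key: str,
    custom_categories: Dict[str, str],
    cloud_lookup: Callable[[str], str],
) -> str:
    """Return the box-local custom category if one matches, else ask the cloud."""
    # Assumption: custom entries are matched as simple URL prefixes.
    for prefix, category in custom_categories.items():
        if url_key.startswith(prefix):
            return category  # local override wins; the cloud is not consulted
    return cloud_lookup(url_key)
```

Since the override lives only on the box, it changes nothing for other customers.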

Michael Dunn · Answer

Traffic that originates from the XG itself does not go "through" the firewall and therefore does not follow firewall rules. 
 This includes: 
 - Categorization (uses HTTPS) 
 - DNS, NTP, DHCP, Active Directory, and possibly several other services that are needed for general networking 
 - Up2Date checking for and downloading product revisions (like an MR), that is, code (Backup and Firmware, Firmware) 
 - Up2Date checking for virus definitions, application definitions, etc., that is, data (Backup and Firmware, Pattern Updates) 
 Possibly licensing, not sure how that works. 
 Off the top of my head I cannot think of anything else but there could be. 
 There are a few other automatic things that are always allowed through the web proxy, mostly around allowing Sophos Endpoint traffic, but also may allow XG behind another XG. This is under the Sophos Services web exception. There might be other more hidden exceptions as well (I work on multiple products, I cannot recall XG specifically). 
 
 There is an added complexity if you need to go through an upstream web proxy in order to talk to the internet (Routing, Upstream Proxy). In that case, all web traffic the system itself needs to send is routed through the on-board web proxy with a special Allow All rule, which forwards it to the upstream. 
 
 I don't happen to know what needs to be opened up if you have another firewall, aside from the above. Possibly this is documented in a KB, I assume Sales and Support would know. 
 
 I recall seeing something where in an upcoming release (sorry I have no details) they were looking at supporting "air gap" environments which have no connection to the internet.