This article covers the Root Cause Analysis (RCA) for the issues experienced with the Sophos Central Admin US-West region from July 11 to 13, 2017.
Applies to the following Sophos products and versions: Sophos Central Admin
From July 11 through July 13, Sophos Central customers hosted in our US-West region were unable to install new endpoints and may have experienced slow performance when applying new policies, along with other minor issues. Other management and reporting functionality within Sophos Central continued to function normally, with the exception of a handful of minor issues that continued until July 17: namely, the “last updated” date for endpoints was displayed incorrectly, and policies did not switch correctly between users on shared machines. Existing endpoints remained protected throughout the entire duration of the event. There was also no material impact on non-Endpoint and non-Server products, and no impact on customers using other hosting regions.
This issue started when we released an updated endpoint and server client. While Sophos Central is designed for scalability and resiliency, this update exposed an inefficiency in the communication protocol that Sophos endpoints use to report health status to Sophos Central. On its own, this would have caused only a period of slow response. Unfortunately, an error in the endpoint communication logic also caused endpoints to communicate with Sophos Central at a high frequency, which resulted in an unexpected surge in traffic.
While our monitoring identified the issue immediately, it took a few days for our engineers to narrow down the exact cause and then build, test, and publish the appropriate fixes. In the interim, we added a large amount of capacity to our cloud systems so that the system could process as much traffic as possible. By Thursday night, the system was back to successfully processing installs and applying policies.
We are in the process of carrying out a detailed analysis of our testing processes, our incident response approach, our communication, our design for resiliency, and all other aspects of the tools we have to prevent future incidents and to optimize our response if we do have one. We are making improvements based on this incident, some of which we’ve already implemented. We understand how much you rely on Sophos Central and we apologize for the challenges this issue has caused for you and your team.