Issue: Sophos Central Admin – US-West region - Delays with the enforcement of Central policies on managed endpoints.

**Update 9** Root cause analysis KBA has been published: see knowledge base article for the latest.

**Update 8** As part of a routine database maintenance task, customers may notice a few intermittent install and policy rendering failures. Please retry before contacting support. 7/17/2017 8:00 AM PST

**UPDATE 7** Some customers may notice a few intermittent install failures. Please retry before contacting Sophos Support. 7/14/2017 2:00 PM PST

**UPDATE 6** Installations are being processed normally, service is restored. Please re-download installer from Central. 7/14/2017 9:00 AM PST

**UPDATE 5** Installations are now working as of July 13, 2017 19:00 UTC-5. See knowledge base article for the latest.

**UPDATE 4** New installs likely to still fail. http://centralstatus.sophos.com/#!/ has latest update. 

**UPDATE 3** System is now processing backlogs. Please see last updates here.

**UPDATE 2** Issue is ongoing, apologies. Impacts all areas within Central that rely on MCS communication between client and Central. 7/13/2017 8:00 AM PST

**UPDATE** Development has identified root cause and is working on a fix. 

Hello,

We are seeing delays with policy changes and enforcement in Sophos Central (US-West region), as well as installation failures because new endpoint installations are unable to register initially. Our engineers are working to restore normal latency. Please note your endpoints remain protected. Updates will be provided on this thread.

KBA: https://community.sophos.com/kb/en-us/126477

Thank you,

Bob

  • In reply to Christopher Curwood:

    I can't agree more with you Christopher. The bottom line is that this is a business. When I have my higher-ups come to us with questions, we aren't being effectively equipped to answer them from Sophos. Which makes us as IT professionals look bad, which is then affecting Sophos' reputation.

    The slow policy pushing has been affecting us since July 3rd, 2017... This issue has been going on for 5 weeks for me, and I now have old problems that are snowballing into bigger ones because the console will inexplicably be down. The fix (from Sophos) is to reinstall Sophos, yet I currently sit with 20 machines totally unprotected because I can't install due to console issues. Also, you say in many of the status updates on the console that machines are still protected and this is no security risk. Not being able to effectively manage our computers is a security risk. For example, I was caught in the middle of an uninstall for Sophos on a high-level branch manager's machine when BitLocker kicked off. Since we utilize Sophos to manage our BitLocker keys, I was stuck until your systems came back online... This is unacceptable in a business and causes me to not be able to rely on and trust Sophos.

    Central status is good for perhaps a quick note saying that services are down. But when we purchased Sophos, we did so based on their amazing support and reputation. The support is lacking, but to be fair it is getting better. Part of support, however, is effective communication. I don't think it's too much to ask for a full technical follow-up email detailing the who, what, where and why questions, as well as what you're doing to prevent the issue from happening again. Something that we can present as an explanation to managers, directors and C-level management.

    Also, why do the status page and Twitter currently say two different things? (The status page notes that the console is back up and running, and Twitter said 4 hours ago that you're still working to restore service for US-West clients.)

    Christopher Curwood
     
    Sure Win

    Twitter is just one form of communication. 

    The Central team is working on a more proactive communications channel through the Central platform. Today, Central Status page has the latest info.

     

     

     
    Sure Win,
     
    I think I speak for many of your customers here when I say we don't just want to see the basic "Outage affecting X started at Y-time/Outage was resolved at Y-time" information, but we want a proper, full, articulate response from senior management on what is happening to the Sophos Central USA platform that is causing these literal weeks of outages. An email from management, sent to us directly and via our account reps, as well as posted online (not Twitter haha). And it should include why only the USA service is affected and not Germany or Ireland.
     

  • In reply to Trevor Karppi:

    Thanks. I didn't get a chance to update the Twitter post. Still a very generic message for now.

     I'm definitely working with the Central team to get that message to everyone here on the community. I'm doing my best to be your voice and champion for these Central issues and how we communicate updates/issues to customers like yourself. Keep the comments coming because my colleagues do see these posts.

  • In reply to Trevor Karppi:

    I agree with Trevor 100%. At an enterprise level, and given the type of company we are, we cannot afford to let any PHI slip because Sophos Central is having issues. Proper communication is key. Management is currently looking back at Palo Alto Traps since Sophos Central cannot seem to fix the issue. This is not good, because it now looks like I made a bad choice in selecting Sophos. Sophos Central is a great product, but you have to communicate better and resolve the issues quicker.

  • In reply to Trevor Karppi:

    Trevor, I couldn't agree more.

    I've had the same experience and I don't think Sophos understands how costly this is to its customers.

    I've lost countless hours of productivity due to these unacknowledged outages and generic failed installation errors.

    I've driven three hours to a remote office to do after hours upgrades from in-house Sophos to the cloud and couldn't get it done due to the service being down.

    Spent many hours trying to deploy new machines and troubleshooting installation by renaming machines, removing from the domain, reimaging, adding back to the domain.

    Thought maybe it's my firewall so more time changing firewall rules, trying different ISP, different firewall etc.

    Also seems like these issues/updates are being worked 9-5 and not 24/7 when many of us are trying to deploy.

    Almost feels like it's all outsourced and Sophos is the middle man with no control over the outages.

  • We are still experiencing issues and delays with logging in to the Sophos Central Admin Console. I am unable to update the desktops that already have it installed; it keeps failing. I have tried to reinstall and it fails. I have restarted and it still fails.

    Is there another solution? Our Security Team is becoming really concerned about the stability of Sophos.

    Thank you

    Aisha Smith

  • In reply to Aisha Smith:

    Thanks for all of your feedback on this subject.  We've heard you. 

    Our Sophos Central team has committed to providing more current and detailed technical information on our Central Status page.  The page has been recently updated and reflects the status of the last two days, including the incident that occurred earlier today. 

    As always, we will provide a complete Root Cause Analysis (RCA) when we are fully confident that we have completely addressed the issues affecting performance.  Until then, please refer to the Sophos Central status page for the most current and accurate information.  

    We are happy to speak with you, your customers or your management teams if you still need more information. 

    Michael Anderson, SVP Global Services

    michael.anderson@sophos.com   +1 408 334 7300  

  • In reply to MichaelAnderson:

    Status page is showing green, but I'm still getting the same old issues...

    - UI is very slow, lots of spinning circles

    - Policy enforcement seems randomly applied

    - Installing the encryption module seems to be hanging (logs show failure to connect to US-West)

    - Also, correct me if I'm wrong, but I am fairly certain that in the past, when users tried to unsuspend BitLocker while the encryption policy was applied, BitLocker would re-suspend itself. Well, apparently not anymore! I can unsuspend my BitLocker at the moment, no problem.

  • In reply to Lance Bertram:

    Slow policy enforcement and UI again; the "last activity" fields haven't updated in 11 hours.

    Can the status page remain yellow or red until it's actually fixed permanently? It shouldn't be green until the issues are actually resolved correctly.

  • In reply to Lance Bertram:

    Yeah I pushed an encryption policy yesterday and the machine still hasn't picked it up.

  • In reply to Christopher Curwood:

    OK, I can now confirm that there are still massive delays with US-West policies being pushed. A computer that I applied an encryption policy to yesterday (8/14) only just now, at 2:30pm on 8/15, triggered a Medium Event 'device that should be encrypted is not' alert and then prompted the endpoint to reboot for the encryption process to begin.

  • In reply to MichaelAnderson:

    For the last three days your clients here in the forum have said that they cannot push policies, yet the console shows 3 days of normal operation. Why is that? Are delayed policy updates now considered normal operation? For example, the policy on my computer hasn't been updated since 8/13/17 despite me making changes yesterday to the global policy.

    Are you still working on resolving this? If so, can you make a note in the console and clear it only once we have a product that has its basic functionality back? We are coming up on 6 weeks of constant problems with policies not pushing and install problems...

  • In reply to Trevor Karppi:

    I have a theory. This is an over-provisioning issue. They're adding too many customers and not adding (paying more for) additional AWS/IaaS resources quickly enough. They won't acknowledge there is an issue until enough customers complain via the regular ticket/phone support channels. This is why Germany/Ireland didn't have the problem even though they're running the same Sophos software. The difference is the infrastructure.

     

    Again, just a theory as to why this affected one region but not others.

  • In reply to Christopher Curwood:

    This seems very plausible. 

    I'm curious: regarding the uptick in US customers, do you think this is related to April's announcement of Sophos Central, overall market penetration, Gartner/Forrester/NSS Labs reports, relationships with more US channel partners, some kind of combination, etc.? 

  • In reply to Philip Anderson:

    I think the general growth and therefore load in the US region on the back of the ransomware outbreaks is certainly part of it.  

    If I was a gambling man I would bet on a new region being made available in the US. Germany was added in addition to Ireland pretty early on, mainly as a result of German data storage laws, so it can be done, but it would probably require quite a reasonable amount of infrastructure work for monitoring and deployment, and then there is the test effort. There are probably more things to consider now than there were then, as these systems grow in complexity.

    The act of moving existing data would also be work so I would imagine all new accounts choosing a NA region would land on a new region.  

    I'm sure any new region would most likely require a Central release, and if those happen every 3 weeks, a new region is likely to arrive on that kind of timeframe rather than through a few config changes in AWS.

    In the meantime, reducing message processing seems like the most obvious solution, which would probably require an endpoint release of the MCS component.

    I'm sure once it's "fixed" from all sides the issue will be gone for good.

    Just my 2 cents!

     

  • In reply to MichaelAnderson:

    I had a phone call with the Director of Product Management for Sophos Central a few days ago and can confirm most of your suspicions. The issues are on their end, due to poor database logic and a bad installer they released. This caused the endpoints to call out to Sophos Central and, when they got no response, instead of backing off they called out more frequently, effectively DDoS'ing the service as others have mentioned. What compounded this issue recently (and forgive me for probably not using the correct wording, as I do not work with databases/SQL at all) is that they were trying to upgrade the database communications to asynchronous messaging, database querying updates, etc. For Germany and Ireland this went fine because not many clients are on there. But US-WEST had the major issues that we are seeing today, and they are apparently trying to roll back now.
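For anyone wondering what "backing off" would look like on the client side, here is a minimal sketch of exponential backoff with jitter. To be clear, this is not Sophos's MCS code: the function names, the 1-second base, and the 300-second cap are assumptions made up purely for illustration.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random.random):
    """Return seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each failed attempt up to `cap`; the random
    factor ("full jitter") spreads clients out so they do not retry in sync.
    `base` and `cap` are illustrative values, not vendor settings.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_backoff(send, max_attempts=6, sleep=time.sleep, rng=random.random):
    """Call the hypothetical `send()` until it succeeds, waiting longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up and surface the error to the caller
            sleep(backoff_delay(attempt, rng=rng))
```

The point of the design is that each failure makes a client quieter rather than louder, so a struggling server sees less traffic while it recovers instead of more.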

    What bothers me most is that he then went on to say that they knew 8 months ago that the database logic/infrastructure wouldn't be sustainable for a high load of clients. With a chuckle he said that it was a great problem for them and started bragging about "triple digit percent growth in such a short period" and how "this is a great problem for us (Sophos)". I responded that this is a terrible problem for your clients and not funny. Meanwhile we bear the brunt of this by not being able to effectively push policies to our machines for 6 weeks. 6 weeks and still no resolution... He assured me they're working around the clock to get this fixed and it's their top priority. 6 weeks... The Sophos status page reports "normal operations" and we've heard nothing otherwise from Sophos, yet there is a banner across the top of this screen saying there are still issues at US-WEST. Why can the status page not accurately reflect your issues? This is not fixed, so stop saying it is. We aren't asking for much, just a functioning service that we pay a lot for and better communication.

    They're opening a new datacenter in the Midwest, but are only putting new clients there to "ensure the best experience possible", which I won't even comment further on. I was also told that they're investing in better notifications for their clients. It will be a welcome addition to have Sophos be able to email us when they're down.

    Is anyone else still having issues? I still cannot push policy updates to anyone's machine since August 13, 2017. Now I'm getting high alerts daily for our file servers, print servers, terminal servers and several user computers. I keep getting alerts saying that "one or more services is missing", yet they are all there and through my own figuring out it is pointing to a communication error with their cloud. Anyone else seeing that?