This discussion has been locked.

Issue: Cloud Web Gateway unable to establish a connection with the cloud

**UPDATE 6** Statement from Product Management in KBA: https://community.sophos.com/kb/en-us/126926 

**UPDATE 5** ChromeOS/Chrome browser agent performance should be back to normal, though there might still be some delay in event reporting during peak hours. Ongoing issues with CWG agents (delays or gaps in event reporting) are still being investigated.

**UPDATE 4** Incoming reports indicate the issue is still present.

**UPDATE 3** As of this morning, the outage is confirmed as resolved. Backlog of events should now be processed and operation should be at 100%. Please let us know below if you are still seeing this issue.

**UPDATE 2** The backlog of queued events is finishing synchronization; once this is complete, service should be restored.

**UPDATE** Chromebooks with the extension enabled are unable to browse the web.

Hello,

Currently, Cloud Web Gateway agents are unable to establish a connection to the cloud and may report a status of “Security Enabled Activity Logs Delayed”. Work is under way to restore service. Updates will be provided on this thread.

Thank you,

Bob



  • Our logs are about 24 hours behind again.  The last update to my ticket told me that the delay is during peak hours:

    "Right now we are seeing that the peak hours seem to be East coast business hours, 5:30am to 3:30pm PDT / 8:30am to 6:30pm EDT / 2:30pm to 11:30am CEST."

    If that is the case, why do the logs never catch up during "off-hours"?  I also received this, but it led to more questions than answers:

    "This is the info I have gotten from our L3/Dev team; please contact us if you have any more questions.

    Recently, Sophos experienced an outage impacting our Central Web Gateway infrastructure. This outage affected the ability of Central Web Gateway agents to communicate with Sophos Central Admin, resulting in an initial period where event logging and reporting was disrupted. This disruption did not affect the operation of Windows and MacOS clients, as they were still able to filter and block web traffic using local copies of web policies. However, the ChromeOS and Chrome browser agents were impacted because they require a connection to the cloud at all times. Chrome agents were unable to retrieve and filter web traffic while the outage was ongoing.

    The initial outage was caused by a bug in the Central Web Gateway services that manifests itself only under high load. Although Sophos services have redundant capacity, using multiple servers within each of multiple data centers, when one server became unresponsive due to the bug, additional load was put on other servers and a domino effect occurred. As a result of this outage, Sophos will be reviewing service deployment processes to ensure that this type of incident does not occur again in the future.

    Resolution of the initial outage led to a follow-on period of outage as our systems recovered. The initial outage meant that most Central Web Gateway agents had built up a large backlog of queued event reports. Agents are designed to queue up events in the event of an outage or loss of internet connectivity to prevent loss of reporting data whenever possible. After the initial outage was resolved, our cloud services saw events coming in at five times the normal rate. This load caused further communications problems for agents. In addition, we believe that it also led to some loss of event or report data here and there, but this should be minimal.

    At the current time, processing of agent events and reports has returned to normal, although there may be delays in reporting and event logging during peak times due to current infrastructure limitations. Sophos engineers are working to improve the efficiency of communication and processing between the Central Web Gateway agents and cloud services.  This work will take a few weeks to complete and fully test. Once it has been rolled out, following our improved deployment processes, performance will improve and delays during peak times will be significantly reduced."

    This would imply that the policy issues only affected Chrome, but I can say first-hand that this is false.  I am very concerned that the work to upgrade/repair the system is still projected to be weeks out.  I suppose we are very lucky it wasn't the AV definition servers that failed, but this is still a major impact.  This has been ongoing for about a month now.  What is the delay in improving the infrastructure?
  • I also just discovered that the CWG is not functioning normally.  I tried various categories under the test site and found about 50% do not BLOCK/WARN as specified.  It is particularly alarming that 'Phishing and Fraud' is allowed.  I strongly warn anyone considering this product to look very closely at issues such as this before making a decision.  We are stuck in a 3-year contract with this broken and under-supported software.

  • Our logs are still delayed by 24 hours and we have been noticing that sections of logs are missing altogether.  It would appear the problem is far worse than we first thought.

  • Hey Keith, I'm with you on things looking worse than what we thought was just a hiccup back in May. I'm in the middle of rolling out both Sophos Central Endpoint and Web Gateway to our company, and with the backlog of reports on the gateway it's way too hard to troubleshoot with our end users days later, so we have them on bypass to allow them to perform their job duties.

    Here is what our account rep provided to me yesterday. Looks like we will be stuck with this issue for a little while longer, unfortunately. Once this is all said and done, I hope we are provided some assurance it will not happen again, and maybe some type of restitution.

    Quick update.

    I sent this up to our global escalations team earlier today and received a response. The current guidance from our development team regarding the resolution to CWEB-617 is that the code refactoring will take a few weeks to complete and fully test. They are aware of the issue and are focusing resources to tackle the problem urgently.

  • So, now we are in July and the logs are still delayed.  Has there been any progress on this?  When should we expect this to be fixed?

  • Hi Keith,

    I've looked at your ticket and the investigation is ongoing. The KBA will be updated as soon as new information is available. Apologies for the wait.

    Regards,

    Bob

  • Hi Bob.  Will the fix for the Endpoint communication failure fix these problems as well?  There has been no update on this problem for a long time.

  • Hi Keith,

    I can't say at this point. Once the endpoint communication fix is in, I will look into it. I certainly hope so.

    Bob

  • Logs are still massively delayed and missing huge sections.  What is the ETA on fixing this?  It has been months!

  • Hi Keith,

    What does the engineer assigned to your case have to say? 

    Thank you,

    Bob