Hello,
For some time now, my organization has been trying to determine the cause of an issue between our Sophos endpoints and message relay servers. At intervals of roughly 3-5 hours, endpoints start appearing as disconnected in the Enterprise Console in timestamped waves (e.g. 200 endpoints disconnect together and all show last updated at 5:21:15 AM).
After going through the Sophos Remote Management System router log files, we found that the endpoints are receiving CORBA exception errors like the following:
06.04.2016 00:11:02 1ADC E ParentLogon::RegisterParent: Caught CORBA system exception, ID 'IDL:omg.org/CORBA/TRANSIENT:1.0'
OMG minor code (2), described as '*unknown description*', completed = NO
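To check whether these exceptions line up with the disconnect waves, the router logs can be bucketed by hour. The following is only a rough sketch: it assumes every entry starts with a timestamp, thread ID and level as in the excerpt above, and that the date is in dd.mm.yyyy order (an assumption; pass a different `date_format` if it is not). The function name is made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Layout assumed from the excerpt: "<date> <time> <thread> <level> <message>"
LINE_RE = re.compile(r"^(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}) (\w+) (\w) (.*)$")
EXC_RE = re.compile(r"IDL:omg\.org/CORBA/(\w+):1\.0")


def corba_exceptions_per_hour(lines, date_format="%d.%m.%Y %H:%M:%S"):
    """Count CORBA system exceptions per (hour, exception name)."""
    counts = Counter()
    for line in lines:
        entry = LINE_RE.match(line.strip())
        if not entry:
            continue  # continuation lines such as "OMG minor code ..." are skipped
        exc = EXC_RE.search(entry.group(4))
        if not exc:
            continue
        stamp = datetime.strptime(entry.group(1), date_format)
        counts[(stamp.strftime("%Y-%m-%d %H:00"), exc.group(1))] += 1
    return counts
```

Running this over the router logs from several endpoints and comparing the busiest hours with the "last updated" times in the console would show whether the two are actually correlated.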
The endpoints are able to authenticate to the message relay over port 8192 and validate the IOR string. They then begin exchanging messages via the agent before receiving the CORBA error above.
Our analysis shows that the relay servers are sending the endpoints TCP resets. These are either a result of the CORBA errors or their cause, but I'm unable to validate which, or what the underlying cause is. We also see the relay servers holding sometimes 30+ established TCP connections for each endpoint. That quickly leads the relay servers to port exhaustion, at which point they begin refusing connections. This bottlenecks the traffic to the relay and keeps disconnecting machines that can't get their messages acknowledged.
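The per-endpoint connection counts can be pulled from saved `netstat -an` output on a relay. This is a minimal sketch that assumes the usual Windows column layout (Proto, Local Address, Foreign Address, State) and that the relay listens on 8192 and 8194; the function name is hypothetical.

```python
from collections import Counter


def established_per_remote_host(netstat_output, local_ports=(8192, 8194)):
    """Count ESTABLISHED TCP connections to the relay ports, per remote host."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        if state != "ESTABLISHED":
            continue
        local_port = local.rsplit(":", 1)[-1]
        if not local_port.isdigit() or int(local_port) not in local_ports:
            continue
        # Drop the remote port so all sessions from one endpoint group together.
        counts[remote.rsplit(":", 1)[0]] += 1
    return counts
```

Calling `established_per_remote_host(text).most_common(10)` on a capture taken just before a disconnect wave would list the endpoints holding the most sessions.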
We've moved the traffic back and forth between additional relays to offset the port exhaustion, by changing the mrinit.conf files for relay traffic on the SUMs and then running ConfigCID.exe to point the endpoints to another location. That usually brings everything back to a stable state for a while, until the next message relay experiences the same issue.
We've even placed a RHEL 6 Linux box between the endpoints and the relays as a load balancer, and changed the mrinit.conf for the endpoints so that they try to talk to the load balancer directly. The load balancer can validate the IOR string, but because the IOR contains the IP/port of the relay, the 8194 traffic still goes directly to the relay instead of being forced through the load balancer.
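To confirm which address the endpoints are being told to use, the host and port embedded in the IOR can be decoded. The sketch below assumes the relay hands out a standard stringified CORBA IOR ("IOR:" followed by hex-encoded CDR data) containing an IIOP profile; I have not verified this against what RMS actually sends, and the names `CdrReader` and `iiop_endpoints` are made up for illustration.

```python
import struct


class CdrReader:
    """Minimal CDR reader: just the types needed to walk an IOR."""

    def __init__(self, data):
        self.data = data
        # Octet 0 of an encapsulation is the byte-order flag (0 = big-endian).
        self.endian = "<" if data[0] else ">"
        self.pos = 1

    def _unpack(self, fmt, size):
        # CDR aligns each primitive to its own size, counted from octet 0.
        self.pos += -self.pos % size
        (value,) = struct.unpack_from(self.endian + fmt, self.data, self.pos)
        self.pos += size
        return value

    def ulong(self):
        return self._unpack("I", 4)

    def ushort(self):
        return self._unpack("H", 2)

    def octets(self, count):
        chunk = self.data[self.pos:self.pos + count]
        self.pos += count
        return chunk

    def string(self):
        # The length prefix includes the trailing NUL.
        return self.octets(self.ulong()).rstrip(b"\x00").decode("latin-1")


def iiop_endpoints(ior):
    """Return (type_id, [(host, port), ...]) for the IIOP profiles in an IOR."""
    if not ior.startswith("IOR:"):
        raise ValueError("not a stringified IOR")
    outer = CdrReader(bytes.fromhex(ior[4:].strip()))
    type_id = outer.string()
    endpoints = []
    for _ in range(outer.ulong()):          # sequence of tagged profiles
        tag = outer.ulong()
        body = outer.octets(outer.ulong())
        if tag != 0:                        # 0 is TAG_INTERNET_IOP
            continue
        profile = CdrReader(body)
        profile.octets(2)                   # skip IIOP major/minor version
        endpoints.append((profile.string(), profile.ushort()))
    return type_id, endpoints
```

If the decoded endpoint is the relay's own IP and port 8194, that would explain why the endpoints bypass the load balancer once they have the IOR.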
Sophos support has unfortunately been unhelpful. This ticket has been open since December, and we've had little contact from them beyond pointing at every other application on our network as the problem. At their request, we've gone through and disabled every other component (firewalls, vulnerability scanners, WAN optimizers, GPOs, etc.) with no positive effect. We've also adjusted nearly every TCP registry key and setting we could find in the hope of alleviating the issue. Some examples are TcpTimedWaitDelay (set to 200 seconds), TcpMaxDataRetransmissions (set to 5), and MaxUserPort (set to 65535). We thought the issue was resolved for a few days, and when I advised Sophos support of the TCP changes we had made, they sent this article:
https://www.sophos.com/en-us/support/knowledgebase/14243.aspx
But then the issue came back in the same fashion as before. If anyone here can offer any suggestions, we would be incredibly grateful, as this issue has been plaguing our organization for months.
Thank you.