Sophos Message Relay creating TCP resets and DOS-like behavior on endpoints

Hello,

For some time now, my organization has been trying to determine the cause of an issue between our Sophos endpoints and message relay servers.  At what appear to be 3-5 hour intervals, the Sophos endpoints start appearing as disconnected in the Enterprise Console in time-stamped waves (e.g. 200 endpoints disconnect and all show a last-updated time of 5:21:15 AM).

After going through the Sophos Remote Management System (RMS) router log files, we've found that the endpoints are receiving CORBA exceptions like the following:

06.04.2016 00:11:02 1ADC E ParentLogon::RegisterParent: Caught CORBA system exception, ID 'IDL:omg.org/CORBA/TRANSIENT:1.0'
OMG minor code (2), described as '*unknown description*', completed = NO

The endpoints are able to authenticate to the message relay over port 8192 and validate the IOR string, and they begin exchanging messages via the agent before receiving the CORBA error above.

What we've found through analysis is that the relay servers are sending the endpoints TCP resets, either as a result of the CORBA errors or leading to them; I'm unable to validate which, or what the underlying cause is.  We also see the relay servers holding onto 30+ established TCP sessions for a single endpoint, which quickly drives the relay servers to port exhaustion and into refusing connections.  That bottlenecks traffic to the relay and keeps disconnecting machines that can't get their messages acknowledged.
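
For anyone who wants to reproduce what we're seeing, this is roughly how we count the stuck sessions on a relay - a quick Python sketch around netstat, nothing official from Sophos; the 8194 port and the netstat column layout are the only assumptions:

import subprocess
from collections import Counter

# Rough sketch: count ESTABLISHED connections to the relay's RMS message port,
# grouped by remote endpoint, to spot hosts holding dozens of sessions.
RMS_PORT = ":8194"   # local RMS message port on the relay

out = subprocess.check_output(["netstat", "-ano"], text=True)

per_peer = Counter()
for line in out.splitlines():
    fields = line.split()
    # Expected netstat -ano columns: Proto, Local Address, Foreign Address, State, PID
    if len(fields) >= 5 and fields[0] == "TCP" and fields[3] == "ESTABLISHED":
        local, remote = fields[1], fields[2]
        if local.endswith(RMS_PORT):
            per_peer[remote.rsplit(":", 1)[0]] += 1

# Endpoints with the most concurrent sessions first
for peer, count in per_peer.most_common(20):
    print(f"{peer}\t{count}")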

We've routed the traffic back and forth between relays to offset the port exhaustion by changing the mrinit.conf files for relay traffic on the SUMs and then running ConfigCID.exe to point the endpoints to another location.  That usually brings everything back to a stable point for a while, until the next message relay experiences the same issue.

We've even implemented an RHEL 6 box as a load balancer between the endpoints and the relays and changed the mrinit.conf for the endpoints so that they talk to the load balancer directly.  The load balancer can validate the IOR string, but because the IOR contains the IP/port of the relay, the 8194 traffic still goes directly to the relay instead of being handled by the load balancer.
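
For what it's worth, this is the kind of quick-and-dirty check we used to convince ourselves that the relay's own IP is embedded in the IOR.  It just hex-decodes the IOR body and greps for a dotted-quad; it is not a proper CORBA/CDR parser, and the example call is a placeholder:

import re

# Quick-and-dirty heuristic, NOT a real CORBA IOR/CDR parser: decode the hex body
# of the IOR and look for ASCII dotted-quads, which is how the relay's address
# showed up in ours.
def guess_ior_hosts(ior: str) -> list[str]:
    if not ior.startswith("IOR:"):
        raise ValueError("not an IOR string")
    raw = bytes.fromhex(ior[4:]).decode("latin-1", errors="replace")
    return re.findall(r"\d{1,3}(?:\.\d{1,3}){3}", raw)

# Example (placeholder, not a real IOR):
# print(guess_ior_hosts("IOR:<hex string copied from the relay>"))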

Sophos support has been unhelpful, unfortunately.  This ticket has been open since December and we've gotten little contact from them beyond pointing at every other application on our network as the problem.  At their request we've gone through and disabled every other component (firewalls, vulnerability scanners, WAN optimizers, GPOs, etc.) with no positive effect.  We've also tuned nearly every TCP registry key and setting we can find, in the hope that it would alleviate the issue.  Some examples are TcpTimedWaitDelay (set to 200 seconds), TcpMaxDataRetransmissions (set to 5), and MaxUserPort (65535).  We thought the issue was resolved for a few days, and when I advised Sophos support of the TCP changes we had made, they sent this article:

https://www.sophos.com/en-us/support/knowledgebase/14243.aspx 

But then it came back in the same fashion as before.  If anyone here can offer any suggestions, we would be incredibly grateful as this issue has been plaguing our organization for months.  
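
For completeness, here is roughly how we applied those TCP registry values mentioned above - a minimal Python/winreg sketch; the value names and the Tcpip\Parameters location are Microsoft's standard ones, the numbers are simply what we chose, and a reboot is still needed for them to take effect:

import winreg

# Sketch of the TCP tuning we applied (the numbers are our choices, not recommendations).
# All of these live under the standard Tcpip Parameters key and require a reboot.
TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
values = {
    "TcpTimedWaitDelay": 200,          # seconds a closed connection sits in TIME_WAIT
    "TcpMaxDataRetransmissions": 5,    # retransmit attempts before a connection is aborted
    "MaxUserPort": 65535,              # highest ephemeral port made available
}

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS, 0, winreg.KEY_SET_VALUE) as key:
    for name, data in values.items():
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, data)
        print(f"set {name} = {data}")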

Thank you. 



  • Hello JSchaff,

    can't say what's causing the RSTs; as for the CORBA exceptions, I think they are the result rather than the cause.

    In no particular order some thoughts:
    I too see multiple ESTABLISHED connections (to the server's port 8194), in some cases several hundred. Actually there is no other end even though they are in the ESTABLISHED state. Restarting the Message Router gets rid of these extra connections, and active endpoints reconnect within minutes, so this might be an option (if it works around your problem). I haven't found the cause for these held-up connections; in some cases they came over WiFi.

    the IOR contains the IP/port of the relay
    You can control what's returned in the IOR, please see Using Sophos message relays in a public WAN.

    Just curious: How many endpoints connect through one of the relays, which Windows version is running on the relays, and how many endpoints in total?

    3-5 hour intervals
    it's not clear from your detailed (thanks) post what you do to resolve the immediate issue. Do you already restart the service or the server? Or have you perhaps thought about restarting the Message Router but dismissed the idea? If so, why?

    Christian

  • Christian,

    Thanks for the prompt response here.  I see you are an active and valuable member of the community.  To speak more specifically to your questions:  

    As to the multiple 8194 connections - we have restarted the Message Router service on the relay and/or endpoint, but usually we find the connections quickly re-establish and the port exhaustion behavior comes back, sometimes within minutes, sometimes hours.

    I will certainly review your IOR article so that we may be able to successfully introduce the load balancer in front of the relays and offset the TCP session load.

    On a given day we probably have 1500 to 2500 endpoints connecting to each relay.  Our topology has 1 primary console server, 4 SUMs by region, and 2 relays, 1 for each pair of SUMs.  In total, we have approx. 4000 endpoints connecting each day, with approx. 2000 going to each relay.  We're running MS Server 2012 R2 on 80% of the Sophos infrastructure; the exceptions are our Enterprise Console server and our separate clustered SQL DB server, which are both running MS Server 2008 R2.  The entire environment is on VMware virtual machines.

    For the immediate issue, we employ two different solutions.  In isolated cases, if the relay that an endpoint is using is not currently in a full TCP session refusal state (meaning it won't accept ANY additional connection requests), we can restart the Sophos Message Router service on the endpoint and it will reconnect.  If the relay has reached complete port exhaustion and is refusing all TCP connection requests, we have to juggle the load over to the other relay server, which otherwise sits idle while the first relay is functioning normally.

    We will change the mrinit.conf on the SUMs, run ConfigCID.exe to sync the CIDs with the .conf files, and the endpoints will start redirecting to the other relay, the one that isn't in a TCP refusal state.  The TCP sessions start being acknowledged by the healthy relay and the endpoints all start to reconnect.  The exhausted relay can then begin clearing its existing connection requests, as the denial-of-service-like behavior stops once the bombardment of connection requests has slowed.
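
    To keep track of which relay each CID is pointing at during these swaps, I threw together a small check along these lines.  The share paths and the "ParentRouterAddress" key are simply how our mrinit.conf files look - adjust both for your own environment:

    from pathlib import Path

    # Report which relay each SUM's CID mrinit.conf currently points to.
    # Paths and the "ParentRouterAddress" key reflect our files; yours may differ.
    CID_MRINIT_FILES = [
        Path(r"\\SUM1\SophosUpdate\CIDs\S000\SAVSCFXP\mrinit.conf"),
        Path(r"\\SUM2\SophosUpdate\CIDs\S000\SAVSCFXP\mrinit.conf"),
    ]

    def relay_address(mrinit: Path) -> str:
        for line in mrinit.read_text(errors="replace").splitlines():
            if line.strip().startswith('"ParentRouterAddress"'):
                # Lines look like: "ParentRouterAddress"="relay1.example.com,relay2.example.com"
                return line.split("=", 1)[1].strip().strip('"')
        return "<not found>"

    for f in CID_MRINIT_FILES:
        print(f, "->", relay_address(f))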

    Then inevitably the same process occurs again on the other relay, and we just keep switching the traffic back and forth between the relay servers every 6-12 hours.

    The problem with just restarting the Message Router service as a solution is that it kills the existing TCP session requests, which (to my understanding) then also kills any policy change that the endpoint is due to receive or request.  While it's a short-term fix to clear the congestion, it doesn't address the overall issue.

    Any further info is hugely appreciated.  

    Please let me know.  

    Thanks very much!

    -Joe

  • Hello Joe,

    restarting the Message Router [...] doesn't address the overall issue
    agreed, that's why I said work around [:)].
    [...] kills any policy change
    a message (whether downstream or upstream) is either delivered immediately or stored (xxxxxxxx.msg) in the \Envelopes folder, to be forwarded when the connection is up again. Therefore a graceful Message Router restart (on the relay) shouldn't cause any loss (except when a message's TTL expires, but that's usually days).
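
    If you want to be sure before restarting, something like this shows whether messages are actually piling up (the path is only an example - point it at wherever your Router's \Envelopes folder lives):

    from datetime import datetime, timedelta
    from pathlib import Path

    # Count pending RMS envelope files and flag old ones before a Message Router restart.
    # The path below is only an example; substitute your own Router's \Envelopes location.
    ENVELOPES = Path(r"C:\ProgramData\Sophos\Remote Management System\3\Router\Envelopes")
    OLD = timedelta(hours=24)

    pending = list(ENVELOPES.glob("*.msg"))
    old = [p for p in pending
           if datetime.now() - datetime.fromtimestamp(p.stat().st_mtime) > OLD]

    print(f"{len(pending)} pending message(s), {len(old)} older than {OLD}")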

    other components
    You didn't mention VMware (unless it's among the etc.). I'm not a VMware expert and can't allege that it is (or could be) the culprit, but I've overheard laments from our admins suggesting that the networking and adapter business is not always easy-peasy. As to the stale connections: these are no longer actual connections but "cut-off server ends" of the upstream (endpoint to server/relay) connection. RMS attempts to send on an upstream connection only when it receives a message and has no downstream connection to the endpoint, thus it doesn't notice the connection is dead (apparently there's no idle-connection detection).
    Just thinking aloud ... I wonder whether the RSTs you observe actually originate from the IP stack on the relay. It's as if the connection is dropped by some component on the path but only the endpoint side receives the RST ...

    Christian

  • Christian,

    Thanks for the response.  I believe you could be correct that the resets are coming from the relay server to the endpoint on the TCP session requests, rather than the other way around, most of the time.  Our Palo Alto firewall analysis shows the relay server receiving the TCP requests on 8194 and sending resets to the endpoint, effectively killing the session, but that behavior only starts once the port exhaustion threshold has been reached.
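
    To pin down which side actually emits the resets, we're planning to leave something like this running on a relay and on an affected endpoint - a scapy sketch, assuming scapy and a capture driver such as Npcap are installed:

    from scapy.all import IP, TCP, sniff

    # Log every TCP reset seen on the RMS message port, so we can tell whether the
    # relay, the endpoint, or something in between is generating them.
    def log_rst(pkt):
        if pkt.haslayer(IP) and pkt.haslayer(TCP):
            print(f"{pkt[IP].src}:{pkt[TCP].sport} -> {pkt[IP].dst}:{pkt[TCP].dport}  RST")

    # BPF filter: port 8194 and the RST bit set in the TCP flags byte
    sniff(filter="tcp port 8194 and tcp[13] & 0x04 != 0", prn=log_rst, store=False)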

    In terms of VMware, I've done some basic troubleshooting with our VM infrastructure team but have been unable to establish any kind of correlation between their activity and the issues that we're experiencing.  

    What we need to determine is what is causing the endpoints to open so many concurrent TCP sessions to the relays.  There's no reason the endpoints should have more than 10 concurrent TCP sessions, yet we're holding upwards of 20-30 per host.  I've talked to Sophos support about this and they are at a loss.

    We hope to have a session with their development team in the future but any scheduling has been slow to occur.  

  • Hello Joe,

    So, if the RSTs are responses to SYNs only after the threshold has been reached, they are probably "legitimate" and not (connected to) the cause.

    should have [no] more than 10 concurrent TCP session requests
    there should be at most one (upstream) connection, and only when it goes down (for whatever reason) another request. You've perhaps already done the following; if so, please excuse my wasting your time. When there are concurrent connections seen on the relay, do you see more than one connection on the endpoints? The Router log should show when a new connection has been established, and actually there should be a preceding error (but not the ParentLogon one).
    I assume the endpoint does not initiate an additional connection unless it has noticed that the former connection is gone. A packet trace (Wireshark) should give some insight. Note: while the client RMS uses heartbeats, it seems there is no dead-connection detection (application/RMS heartbeat or TCP keepalive) on the server side. It doesn't seem to be an issue for most installations, though.

    As said, I assume it should be possible to work around the exhaustion by restarting the Message Router on the relays at regular intervals. This wouldn't prevent you from obtaining a suitable packet trace on the endpoints. BTW, I've found that sleep might not take down a connection (in the sense that the server is notified); upon resuming, the RMS heartbeat fails. In the log I see Host IP Addresses have changed ... hmm, the router restarts ... some errors, but eventually it uses the same address ... We are using DHCP and the computer apparently has to obtain a new lease. It's quite obvious what happens - this seems to explain "my" buildup but not yours. Anyway, I thought I should mention it.

    Christian

  • QC,

    Thank you for the continued responses here; apologies for not responding sooner.  I stood up an additional relay to accompany each Sophos Update Manager, and we went nearly 2 weeks without seeing the connection drops that we had been seeing nearly daily.  I have confirmed that we continue to see multiple established connections on the relay servers that never clear and close, but the endpoints don't appear to exhibit the same behavior.  Using my endpoint as an example, I reviewed the -anob netstat report to confirm that the RouterNT.exe connections are limited to either loopback addresses or single connections to the relays themselves.

    We've done quite a few packet captures via Wireshark and provided those to Sophos support, but none have revealed a smoking gun as of yet.  If the relay service restart does temporarily resolve the issue, I wonder whether we could just set up a Windows scheduled task to kick the service periodically and flush out the failed, persistent connections (rough sketch below).  The sleep behavior you noted in your own environment does line up with what we've seen, where the endpoint will not immediately report to the console as shut down, but will drop off and then reconnect after the device wakes back up from the sleep state, or during a system reboot as well.
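
    The periodic restart I have in mind would be nothing fancier than a scheduled task wrapping something like this (the service name is whatever "sc query" reports for the Message Router on your relay - ours shows up as "Sophos Message Router"):

    import subprocess

    # Restart the Message Router service to flush stale ESTABLISHED sessions.
    # Intended to run from Task Scheduler on the relay; verify the service name
    # with "sc query" on your own server first.
    SERVICE = "Sophos Message Router"

    subprocess.run(["net", "stop", SERVICE], check=False)   # waits for the service to stop
    subprocess.run(["net", "start", SERVICE], check=True)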

    Any additional ideas are appreciated.

    Thanks again.  

  • Hello Joe,

    thank you for the follow-up.

    the -anob netstat report
    shows the expected data. When an endpoint hits an error while trying to send an RMS status message because there is (temporarily) no network connection, the socket is dropped, but the other side (the Message Router on the server) naturally isn't informed. Wireshark - I've just checked: I traced traffic on port 8194, elicited a status message to confirm the capture, unplugged the cable, and elicited another message but didn't see a packet. The Router log showed two informational messages:
    <date> <time> xxxx I Host IP Addresses have changed
    <date> <time> xxxx I Shutting down...
    Subsequently, when the Message Router could eventually connect again, the server "saw" two connections from this endpoint.

    There might be a historical reason why RMS doesn't use downstream heartbeats on the upstream connection (that it doesn't drop a connection from a certain IP when it receives another one from the same address is reasonable; these might be endpoints behind a NATting router). Anyway, restarting the service should alleviate the problem. Please note that RMS tries to reestablish the previous downstream connections (to the endpoints' port 8194) after the restart, so there will be a short surge of SYNs.

    Christian