Most Clients Shown As 'Disconnected' in SEC 5.5.0

Hi folks,

We are running Sophos Enterprise Console (SEC) 5.5.0 on a Windows 2008 R2 Enterprise (64-bit) Server.

I have recently noticed that more than 50% of our client PCs to which Sophos Endpoint Security & Control has been deployed are shown as 'disconnected' in SEC. I have carried out a ping-sweep of the network and can confirm that most, if not all, of these PCs are actually powered on, connected to the network and working fine.

Only after I restart the Sophos Message Router Service on the client PCs do they then change their status to 'connected' in SEC. I have no wish to carry this task out on several hundred client PCs individually as you can imagine, so I'm hoping someone can possibly shed some light on what may be happening here and suggest a solution to this issue?

Many thanks,

John P

  • In reply to QC:

    Either I am looking in the wrong place or I can't see 10.63.*.* in the log available to me. The loopback suggestion was one of three suggestions passed to the customer on how to get around the issue; however, the customer was also advised that it was probably not the best one to use.

    So the customer has given us a router log and in this we can see the below.

    25.09.2017 08:56:49 0D88 T IPAddressSet::InitialiseWithHost() called
    25.09.2017 08:56:49 0D88 T Added host network address:172.30.*.*:0
    25.09.2017 08:56:49 0D88 T Added host network address:127.0.0.1:0
    25.09.2017 08:56:49 0D88 T IPAddressSet::InitialiseWithHost() returns
    25.09.2017 08:56:49 0D88 I Local IP addresses: 172.30.*.*
    25.09.2017 08:56:49 0D88 I Resolved name: scopat-*.*.local
    25.09.2017 08:56:49 0D88 I Resolved alias/es:
    25.09.2017 08:56:49 0D88 I Resolved IP addresses: 127.0.0.1

    So scopat-*.*.local is resolving to loopback rather than 172.30.*.*. As far as I can see, this causes the IOR to advertise loopback as well, which is why they are seeing issues.
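
For anyone who wants to repeat this check on an endpoint, here is a minimal Python sketch. It is not how the Router itself resolves names (the thread does not establish which API it uses); it only asks the operating system resolver what the machine's own name maps to and flags the case where every answer is loopback. The function names are made up for illustration.

```python
import ipaddress
import socket


def resolved_addresses(hostname: str) -> list[str]:
    """Return the IPv4 addresses the OS resolver reports for hostname."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return addresses


def only_loopback(addresses: list[str]) -> bool:
    """True when at least one address was returned and all are loopback."""
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_loopback for address in addresses
    )


# Example use on the endpoint (performs a real lookup, so not run here):
#   name = socket.getfqdn()
#   addrs = resolved_addresses(name)
#   print(name, addrs, "loopback only" if only_loopback(addrs) else "ok")
```

As noted elsewhere in the thread, the condition appears to be short-lived after boot, so a lookup run later by hand may well look normal.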

    IOR:010000002600000049444c3a536f70686f734d6573736167696e672f4d657373616765526f757465723a312e300000000100000000000000a0000000010102000a0000003132372e302e302e310001204100000014010f004e5550000000210000000001000000526f6f74504f4100526f7574657250657273697374656e740003000000010000004d657373616765526f757465720000000300000000000000080000000100e001004f415401000000180000000100e001010001000100000001000105090101000000000014000000080000000100a60086000220

    If we parse this we see:

    _IIOP_ParseCDR: byte order LittleEndian, repository id <IDL:SophosMessaging/MessageRouter:1.0>, 1 profile
    _IIOP_ParseCDR: profile 1 is 160 bytes, tag 0 (INTERNET), LittleEndian byte order
    (iiop.c:parse_IIOP_Profile): bo=LittleEndian, version=1.2, hostname=127.0.0.1, port=8193, object_key=<....NUP...!........RootPOA.RouterPersistent.........MessageRouter>
    (iiop.c:parse_IIOP_Profile): encoded object key is <%14%01%0F%00NUP%00%00%00%21%00%00%00%00%01%00%00%00RootPOA%00RouterPersistent%00%03%00%00%00%01%00%00%00MessageRouter>
    (iiop.c:parse_IIOP_Profile): non-native cinfo is <iiop_1_2_1_%2514%2501%250F%2500NUP%2500%2500%2500%2521%2500%2500%2500%2500%2501%2500%2500%2500RootPOA%2500RouterPersistent%2500%2503%2500%2500%2500%2501%2500%2500%2500MessageRouter@tcp_127.0.0.1_8193>
    object key is <#14#01#0F#00NUP#00#00#00!#00#00#00#00#01#00#00#00RootPOA#00RouterPersistent#00#03#00#00#00#01#00#00#00MessageRouter>;
    no trustworthy most-specific-type info; unrecognized ORB type;
    reachable with IIOP 1.2 at host "127.0.0.1", port 8193

    The IOR should be giving us the IP address 172.30.*.*, not 127.0.0.1.
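
If no CORBA tooling is at hand, the host and port can be pulled out of a stringified IOR with a few lines of Python. The sketch below assumes the standard layout (hex-encoded CDR data, a type id, then tagged profiles, with tag 0 being the IIOP profile) and reads only the first IIOP profile. The class and function names are my own, and it ignores the object key and tagged components.

```python
import struct


class CdrReader:
    """Tiny reader for a CDR encapsulation (first octet is the byte-order flag)."""

    def __init__(self, data: bytes):
        self.data = data
        self.endian = "<" if data[0] == 1 else ">"
        self.pos = 1

    def _align(self, size: int) -> None:
        # CDR aligns primitives relative to the start of the encapsulation.
        self.pos += (-self.pos) % size

    def _unpack(self, code: str, size: int) -> int:
        self._align(size)
        (value,) = struct.unpack_from(self.endian + code, self.data, self.pos)
        self.pos += size
        return value

    def ulong(self) -> int:
        return self._unpack("I", 4)

    def ushort(self) -> int:
        return self._unpack("H", 2)

    def octets(self, count: int) -> bytes:
        chunk = self.data[self.pos:self.pos + count]
        self.pos += count
        return chunk

    def string(self) -> str:
        # Length-prefixed and NUL-terminated; drop the trailing NUL.
        return self.octets(self.ulong())[:-1].decode("ascii")


def parse_ior_endpoint(ior: str) -> tuple[str, str, int]:
    """Return (type_id, host, port) from the first IIOP profile of an IOR."""
    ior = ior.strip()
    if not ior.startswith("IOR:"):
        raise ValueError("not a stringified IOR")
    outer = CdrReader(bytes.fromhex(ior[4:]))
    type_id = outer.string()
    for _ in range(outer.ulong()):
        tag = outer.ulong()
        body = outer.octets(outer.ulong())
        if tag == 0:  # TAG_INTERNET_IOP
            profile = CdrReader(body)
            profile.octets(2)  # IIOP version (major, minor)
            return type_id, profile.string(), profile.ushort()
    raise ValueError("no IIOP profile found")
```

Feeding it the IOR quoted above returns the type id IDL:SophosMessaging/MessageRouter:1.0 with host 127.0.0.1 and port 8193, which agrees with the parsed output.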

    I believe this is what Support are getting at here, and it is also why this is believed to be an environmental issue, as covered in https://community.sophos.com/kb/en-us/17268

  • In reply to WomboCombo:

    Hello WomboCombo,

    I'm not directly involved, just know-it-all.

    "an environmental issue": perhaps. That it seems to happen after certain Windows updates suggests that it's not RMS' fault in the first place.

    "Resolved IP addresses: 127.0.0.1" is "obviously" not what it should be. But we (the customers) have no idea what to check. I bet everything looks normal when you inspect it on the endpoint (even if you could run a trace at this early stage). We don't know what API the Router uses or which methods it calls. So we can't check if this is perhaps noticeable somewhere else.
    The IOR is perhaps built by some Windows function, but the process is not under control of the customer.

    But then comes the part where Sophos should be able to help (even if it's Microsoft's fault in the first place):
    As everything works correctly "a little bit later" and the problem occurs only after boot, it's likely related to the initialization of the networking stack. The Router recognizes the invalid IOR and waits for an adapter change notification, which it doesn't get though. Could be a race condition. But none of the "What to do" points in 17268 apply. It would resolve itself were the Router to check again after, say, a minute.
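
To make the "check again after a minute" idea concrete, here is a small Python sketch of a polling loop. It is purely illustrative and says nothing about how the Router is actually written: `resolve` is a hypothetical callable that returns the currently resolved addresses, and the attempt count and delay are arbitrary defaults.

```python
import ipaddress
import time
from typing import Callable, Iterable, Optional


def wait_for_real_address(
    resolve: Callable[[], Iterable[str]],
    attempts: int = 6,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll resolve() until it yields a non-loopback address, else return None."""
    for attempt in range(attempts):
        for address in resolve():
            if not ipaddress.ip_address(address).is_loopback:
                return address
        if attempt < attempts - 1:
            sleep(delay_seconds)  # give the network stack time to settle
    return None
```

With the defaults this rechecks for roughly a minute. Passing `sleep` in as a parameter keeps the loop easy to exercise without real waiting.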

    We (no, they) have already tested some workarounds like Delayed Start, adapter dis-/en-able, Router restart - all this shows it's a rather short-lived situation. Admittedly it's likely something that hasn't been seen before and this is not covered in the Router's logic. Sophos should at least acknowledge this issue and comment on the workarounds - and not suggest actions that can't work.

    Just my two cents
    Christian

  • In reply to WomboCombo:

    Hello WomboCombo,

    Your post states, "The loopback suggestion was 1 of 3 suggestions to be passed to the customer on what he could do to get around the issue however was also advised it was probably not the best suggestion to use."  JohnP initiated this thread.  I don't know whether he was the recipient of the three suggestions of what to do and that the loopback solution was probably not the best solution.  The email we received included:

    As a test could you to try the following please.
    1. Add an entry in the hosts file for 127.0.0.1 to translate to the IP of this server.
    2. Then see if this resolve the issue.

    Are you, or John perhaps, aware of what the other two proposed solutions comprised?

    Ian.

  • In reply to WorEen:

    Hi guys,

    Thank you all for your continued input. I appear to be having more of a return on the forum than from Sophos Support. As of yet I have received no recommendations from them on how this issue may be resolved.

    As for the '3 suggestions', I was working under the impression that they were recommendations made by Christian earlier in this discussion and were: Delayed Start on Sophos Message Router service, Restart Sophos Message Router service or adapter disable/enable.

    We amended Group Policy to delay starting the Sophos Message Router service as it was the path of least resistance and easily implemented. However, it has proven not to be the cure for our current ill. We are still seeing PCs as 'disconnected' in SEC when they are actually online.

    Ultimately, I admit, this issue may not lie at the feet of Sophos but I would expect that their Support Team would (as Christian quite rightly states) "acknowledge this issue and comment on the workarounds - and not suggest actions that can't work".

    I'm also looking into possible network issues, cabling etc. to see if there may be an issue at that end which may have contributed to this current situation. I'm making slight headway, but it's too early to say if our network is at fault here. Suffice to say that it appears (so far) that only Windows 10 Enterprise 2015 LTSB PCs with a Broadcom NetXtreme Gigabit Ethernet adapter (driver version 16.8.1.0) are affected by this issue. I will update this post if I find anything awry.

    Many thanks,

    John

  • In reply to isoffice:

    Hello John,

    "still seeing PCs as 'disconnected' in SEC": from the descriptions and logs posted here it seems that "this issue" manifests itself at Router startup. The telltale 127.0.0.1 with the invalid IOR should either be in the first lines of the current Router log or not there at all. And of course - did the service actually start delayed? Endpoints might not yet have applied the GPO.
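
Checking a batch of Router logs for the telltale entry can be scripted. The Python sketch below assumes the line layout seen in the excerpt posted earlier (date, time, thread id, one-letter level, message). It also assumes that several addresses would be separated by spaces or commas, which the thread does not show. The function name is made up.

```python
import re

# Matches lines such as:
#   25.09.2017 08:56:49 0D88 I Resolved IP addresses: 127.0.0.1
LOG_LINE = re.compile(
    r"^\s*(?P<date>\d{2}\.\d{2}\.\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<thread>[0-9A-Fa-f]+) (?P<level>[A-Z]) (?P<message>.*)$"
)
PREFIX = "Resolved IP addresses:"


def loopback_resolutions(lines):
    """Yield (timestamp, message) for entries that resolved only to 127.x.x.x."""
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        message = match["message"].strip()
        if not message.startswith(PREFIX):
            continue
        # Assumption: multiple addresses are space- or comma-separated.
        addresses = [a for a in re.split(r"[,\s]+", message[len(PREFIX):]) if a]
        if addresses and all(a.startswith("127.") for a in addresses):
            yield f'{match["date"]} {match["time"]}', message
```

Run over the first lines of a freshly started Router log, any hit means the Router came up while the name still resolved to loopback.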

    Thanks for the soft- and hardware details.

    Christian

  • In reply to QC:

    Hi Christian,

    I selected a few 'problem' PCs and can confirm that the Group Policy update did apply. The service did indeed start, approximately 2 minutes after the PC booted up. Needless to say, the 'invalid IOR' entry was still in the log files and the PCs were still shown as 'disconnected'.

    The quest goes on!!

    Many thanks,

    John

  • In reply to WorEen:

    Hi Ian,

    Have you had any further feedback from Sophos on this issue?

    Despite submitting outputs from the Sophos Diagnostic Utility last week, I haven't heard anything back from Support on this problem.

    Many thanks,

    John

  • In reply to isoffice:

    Hello John,

    I am currently working on this issue, and one thing we have noticed is that 'telnet <hostname> 8192' returns an IOR containing the loopback address until the Sophos Message Router service has been restarted. We are looking deeper into ORB logging to try to identify the root cause. If you could message me your case reference, I will ensure both cases are added to the open development ticket I am creating.
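
The telnet check can be scripted when many endpoints need testing. This Python sketch assumes, as the telnet test implies, that the Router sends its IOR as plain text as soon as a client connects to port 8192. It reads until the peer closes the connection or the timeout expires. The function name is made up for illustration.

```python
import socket


def fetch_ior(host: str, port: int = 8192, timeout: float = 5.0) -> str:
    """Connect to the Router's IOR port and return the IOR string it sends."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            try:
                data = conn.recv(4096)
            except socket.timeout:
                break  # peer kept the connection open; use what we have
            if not data:
                break  # peer closed the connection
            chunks.append(data)
    text = b"".join(chunks).decode("ascii", errors="replace")
    start = text.find("IOR:")
    if start < 0:
        raise ValueError(f"no IOR received from {host}:{port}")
    return text[start:].split()[0]
```

The returned string can then be decoded to see whether it advertises 127.0.0.1 or the endpoint's real address.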

    If you could enable the logging described below and then send us new router logs, it would help our investigations.

    On rare occasions it may be necessary to enable additional "orb debug logging" for the 'Sophos Message Router' service. To do so:

    1. Stop the 'Sophos Message Router' service.
    2. Browse to: 
      • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Sophos Message Router
    3. Modify the ImagePath value by adding the following switches to the end of the image path value:
      -ORBDebug -ORBDebugLevel 10 -ORBVerboseLogging 2
    4. Start the 'Sophos Message Router' service.
    5. Reboot the machine.
    6. Collect the logs located in C:\ProgramData\Sophos\Remote Management System\3\Router\Logs\
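
For more than a handful of machines, the ImagePath change in step 3 could be scripted. The Python sketch below is only an illustration of the steps above and is not a Sophos-provided tool. The helper names are made up, the registry function is Windows-only, and it still needs the service stopped first and an elevated prompt.

```python
ORB_SWITCHES = "-ORBDebug -ORBDebugLevel 10 -ORBVerboseLogging 2"
SERVICE_KEY = r"SYSTEM\CurrentControlSet\Services\Sophos Message Router"


def with_orb_debug(image_path: str) -> str:
    """Append the ORB debug switches unless they are already present."""
    if "-ORBDebug" in image_path:
        return image_path
    return f"{image_path.rstrip()} {ORB_SWITCHES}"


def without_orb_debug(image_path: str) -> str:
    """Remove the switches again once the logs have been collected."""
    return image_path.replace(ORB_SWITCHES, "").rstrip()


def enable_orb_debug_in_registry() -> str:
    """Windows only: rewrite ImagePath in place and return the new value."""
    import winreg  # imported here so the helpers above work on any platform

    access = winreg.KEY_READ | winreg.KEY_SET_VALUE
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVICE_KEY, 0, access) as key:
        current, value_type = winreg.QueryValueEx(key, "ImagePath")
        updated = with_orb_debug(current)
        winreg.SetValueEx(key, "ImagePath", 0, value_type, updated)
    return updated
```

Keeping the string handling separate from the registry access means the edit can be checked on a copy of the value before anything is written.
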
  • In reply to WomboCombo:

    Hi WomboCombo,

    I have messaged you with the details you requested.

    Many thanks for your assistance in this matter.

    Best regards,

    John P

  • In reply to isoffice:

    Hello John & WomboCombo,

    Just a quick update... an obviously competent Sophos engineer logged onto one of our affected laptops yesterday afternoon.  Over a two hour period he performed a wide and thorough range of tests that included telnet connectivity to the SEC and packet inspection using RawCap along with an extensive review of log files.  All to no avail I'm afraid.  He is going to contact the 'Corba' people.  I believe this is a result of the router logs, which report: Initializing CORBA; Creating an ORB runner with 4 threads; and ultimately the router's IOR, the correct local IP address of the workstation and the incorrect resolved IP of 127.0.0.1.

    More to this than meets the eye.  I shall keep you posted.

    Kind regards,

    Ian.

  • In reply to WorEen:

    Hi Ian,

    Many thanks for that update. It does indeed look like the issue goes much deeper than originally thought.

    I, at last, heard back from Sophos Support, who informed me that my call has been escalated and will be addressed by a next level engineer... when one becomes available.

    Many thanks,

    John

  • In reply to isoffice:

    Carrying on carrying on.  Earlier this week we received an email informing us that the engineer was still awaiting a response and that the JIRA ticket had not been updated.

    Today we were requested to create a new software subscription updating policy (subscribing to a previous recommended subscription) together with associated groups, updating and anti-virus and HIPS policies. We applied this to two affected workstations.

    The original version was 10.7.2 VE3.69.2.  The two affected workstations are now running 10.6.4 VE3.67.3. After the policies were applied and the workstations rebooted when prompted, they are showing as connected.

    I'll keep you posted.

    Ian.

  • In reply to WorEen:

    Hello Ian,

    I assume you've rebooted more than once.
    As far as I can see there were some changes in RMS, it's 4.1.0.140 vs. 4.1.1.127.

    Thanks for the update
    Christian