Intermittently disconnected servers in Enterprise Console

Hello All,

Our environment contains one SEC 5.5.1 installation on Windows Server 2016, which is mainly used to protect Windows Servers (2008 R2, 2012 R2, 2016).

We are having trouble keeping all servers connected to the console: the number of connected servers goes up, suddenly drops, and then starts climbing again.

We have about 900 endpoints, of which I would say at most 300 are connected at any time; all the rest show as disconnected.

In the SEC router log file we have many entries like this:

error code: 336462231 - error:140E0197:SSL routines:SSL_shutdown:shutdown while in init

I am assuming that all these errors are bringing down the router service, and that this is why the numbers never climb and then stay stable.

On the client side I can find various errors:

19.05.2020 10:15:54 166C E ACE_SSL (5664|5740) error code: 336027804 - error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request

and some like this:

19.05.2020 15:41:38 AED8 E ParentLogon::RegisterParent: Caught CORBA system exception, ID 'IDL:omg.org/CORBA/TRANSIENT:1.0'
OMG minor code (2), described as '*unknown description*', completed = NO
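The decimal "error code" values in these log lines are the same numbers as the hexadecimal codes in the OpenSSL error strings next to them. As a minimal sketch, assuming the packing used by OpenSSL 1.0.x/1.1.x (8 bits library, 12 bits function, 12 bits reason), they can be split into their fields like this. The function name `decode_openssl_error` is made up for illustration.

```python
def decode_openssl_error(code: int) -> dict:
    """Split a packed OpenSSL 1.0.x/1.1.x error code into its fields.

    Assumed layout: bits 24-31 library, bits 12-23 function, bits 0-11 reason.
    """
    return {
        "hex": f"{code:08X}",
        "lib": (code >> 24) & 0xFF,
        "func": (code >> 12) & 0xFFF,
        "reason": code & 0xFFF,
    }


# The two decimal codes quoted from the router and client logs above.
for code in (336462231, 336027804):
    print(code, decode_openssl_error(code))
```

Both codes decode to library 20 (0x14), the SSL library, which matches the "SSL routines" text in the log. The reason text "http request" usually indicates that the listener received bytes that looked like plain HTTP where it expected a TLS ClientHello, so it may be worth checking what else is talking to that port.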

Clients get connected, are then suddenly disconnected for some hours, and then come back online.

Help and guidance on where to start looking would be appreciated.
I am not sure all servers are affected in the same way.

Thanks in advance
Carlos

This thread was automatically locked due to age.

0 Shweta over 4 years ago

Hi Carlos Fernandez1

A Wireshark capture would be more helpful here. Please check this article and see if it helps you get started.

Shweta

Community Support Engineer | Sophos Technical Support

0 Carlos Fernandez1 over 4 years ago in reply to Shweta

Hi, thanks for your answer.

I had already checked this with Wireshark, but unfortunately I did not capture any client with the filter given in this article.

Regards,

0 Carlos Fernandez1 over 4 years ago in reply to QC

client_hello.txt

I have attached the capture from a client that is showing as not connected in the console.

There is an interesting line with the Server Hello, where it says "Malformed Packet: TLS".

Regards,

Carlos

0 QC over 4 years ago in reply to Carlos Fernandez1

Hello Carlos,

obtaining the IOR from port 8192 succeeds. But this client doesn't get as far as establishing a TCP connection on 8194, let alone initiating the handshake. But on to your next post.

Christian

0 QC over 4 years ago in reply to Carlos Fernandez1

Hello Carlos,

the capture doesn't show the details of the TLS traffic, thus it's not clear what happens and when (and whether this session was abnormally taken down, or rather not initiated).

There are some interesting points: First of all, there seem to be hot standby routers between endpoint and server, as the packets from the server are addressed to an HSRPv1 virtual MAC. Most of the time the endpoint advertises a rather small TCP receive window (256 bytes or less). Later in the trace there are SACKs that suggest lost packets. The Malformed Packet is a red herring; I see it as well and think it's Wireshark's dissector.

If possible please run Wireshark on both ends at the same time. With a specific capture filter (host 10.xxx.xxx.xxx and port 8194) it shouldn't have significant impact. Sometimes it helps to see both sides.
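To keep the two simultaneous captures comparable, it can help to build the capture command from the same parameters on both machines. The following is only a sketch: it assumes Wireshark's command-line tool `tshark` is installed and uses its `-i` (interface), `-f` (capture filter) and `-w` (output file) options. The interface name, the peer address and the file name are placeholders, not values from this thread.

```python
import shlex
from typing import List


def tshark_capture_command(interface: str, peer_ip: str, out_file: str,
                           port: int = 8194) -> List[str]:
    """Build a tshark command line that captures only traffic to or from
    one peer on the RMS port, using the capture filter suggested above."""
    capture_filter = f"host {peer_ip} and port {port}"
    return ["tshark", "-i", interface, "-f", capture_filter, "-w", out_file]


# Placeholder values: "Ethernet0" and 192.0.2.10 (a documentation address)
# stand in for the real interface and the address of the other side.
cmd = tshark_capture_command("Ethernet0", "192.0.2.10", "rms_8194.pcapng")
print(shlex.join(cmd))
```

On the server the peer address would be the endpoint's, and on the endpoint it would be the server's, so both files contain the same conversation as seen from each end.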

Christian

0 Carlos Fernandez1 over 4 years ago in reply to QC

Christian,

Thanks for your answer

Attached are the two captures, one from each side (I had to use a different server from the one above, but this one has the same symptoms and is not PROD, so I could reboot it and test different things).

sec_to_client_8194.txt

client_to_SEC_8194.txt

Regards,

0 QC over 4 years ago in reply to Carlos Fernandez1

Hello Carlos,

this one looks like a complete fail as the server responds to every SYN with an RST. You said that there's no port exhaustion, didn't you? Anything in the Event logs on the server? Both servers (the server and the client) are VMs, aren't they?

There seem to be three scenarios: the server refusing the connection with an RST, the TLS handshake failing, and some connections working (with maybe some that are stable and some that intermittently fail). If I were to encounter this I'd blame it on HSRP, Cisco, and the network guys. No, seriously, if there is an infrastructure problem it should surface with other TLS connections as well, not only RMS.
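When comparing many captures, it may help to sort each connection attempt into one of these scenarios mechanically. The sketch below is purely illustrative: the event model (a list of `(sender, kind)` tuples with kinds such as `"SYN"`, `"SYN-ACK"`, `"RST"` and `"TLS_APPDATA"`) is a hypothetical simplification that would have to be filled in by hand or from an export of the capture; it is not a Wireshark format.

```python
from typing import List, Tuple

# Hypothetical, simplified model of one connection attempt on port 8194:
# each event is (sender, kind), e.g. ("client", "SYN") or ("server", "RST").
Event = Tuple[str, str]


def classify_attempt(events: List[Event]) -> str:
    """Sort one connection attempt into the scenarios described above."""
    kinds = [kind for _, kind in events]
    if ("server", "SYN-ACK") not in events:
        # The three-way handshake never completed.
        if ("server", "RST") in events:
            return "refused: RST in response to SYN"
        return "no response to SYN"
    if "TLS_APPDATA" not in kinds:
        # TCP is up, but no encrypted application data was ever exchanged.
        return "TCP connected, TLS handshake did not complete"
    return "connected"


print(classify_attempt([("client", "SYN"), ("server", "RST")]))
```

Note that "server" here only means "the packet carried the server's address". A capture taken on one side cannot show whether the RST really came from the server or from a device in between, which is why captures from both ends are needed.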

I might be wrong (had a hard week - hold it, it's just half a week but tomorrow's a holiday and it's a false Friday today) but there seems to be no problem with RMS (or its use of TLS). I'd suspect some arcane network problem and at my site I'd collect Wireshark captures from the server and several select endpoints, maybe some Sniffer captures on the network devices as well, and try to find a pattern. In addition I'd try to find out when it started (if it wasn't there from the beginning) and what has been changed around this time.
Troubleshooting was already much too easy with physical machines and plain network devices, almost any idiot could do it. So all this fancy virtualization stuff and nifty network equipment have been invented ;).
The bright side is, such problems can be solved and the solution is often rather simple. It just requires some targeted effort.

Christian

0 Carlos Fernandez1 over 4 years ago in reply to QC

Hi Christian

Thanks for all the investigation you have done so far

Both servers are VMs on VMware infrastructure, and most of the endpoints are on the same infrastructure.

The full TCP dynamic port range is open, about 16,000+ ports, so I do not believe it is port exhaustion.

In the Event log, I do have a lot of Schannel errors:

The certificate received from the remote server has not validated correctly. The error code is 0x80092013. The TLS connection request has failed. The attached data contains the server certificate.
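For readers unfamiliar with the error code format, here is a minimal sketch that splits the Schannel code above into the standard HRESULT fields (failure bit, facility, code). The small lookup table is an assumption based on the Windows SDK headers, where, to the best of my knowledge, 0x80092013 is `CRYPT_E_REVOCATION_OFFLINE` (the revocation server could not be reached). The function and table names are made up for illustration.

```python
# Assumed names from the Windows SDK headers (winerror.h); not taken from
# this thread. Extend as needed.
KNOWN_CODES = {
    0x80092012: "CRYPT_E_NO_REVOCATION_CHECK",
    0x80092013: "CRYPT_E_REVOCATION_OFFLINE",
}


def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into failure bit, facility and code."""
    return {
        "failure": bool((value >> 31) & 1),   # top bit set means failure
        "facility": (value >> 16) & 0x7FF,    # 11-bit facility field
        "code": value & 0xFFFF,               # low 16 bits
        "name": KNOWN_CODES.get(value, "unknown"),
    }


print(decode_hresult(0x80092013))
```

If that reading is right, the Schannel events point to a revocation check that could not reach its server, which would itself be a connectivity symptom. Since the RMS log lines earlier in the thread come from OpenSSL routines rather than Schannel, these events may well belong to other TLS connections on the machine.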

I am not sure it is related; I have this error on endpoints as well.

After reading your post and some thinking, I checked the servers that are stable (about 40 VMs) and noticed they are always green and connected.

The funny thing is that the stable VMs are on the same VLAN as the SEC server, so there is no firewall in between.

I am starting to suspect the firewalls. What do you think?

I will need to investigate with our Network team...

I wish you a nice long weekend, and talk to you soon

Regards,

Carlos

0 jak over 4 years ago in reply to Carlos Fernandez1

Hi Carlos,

Thanks for adding the router key exports. I really just wanted to check the connection cache and thread values were correct for the server.

I've seen configurations where the router on a relay or management server only had a connection cache of 10, the same as a client. As a result, the connections have to be recycled between all the managed endpoints and could create a similar scenario. You are OK there.

You mention VMware for the management server; does it have vMotion set up? I've heard of weird connection issues if the management server is "moved".

Regards,
Jak

0 QC over 4 years ago in reply to Carlos Fernandez1

Hello Carlos,

not a long weekend, just a Wednesday that feels like Friday and a Friday that's hopefully more quiet but a Friday nevertheless :).

RMS uses self-signed certificates: the SEC installer creates the necessary material so that each management server is its own unique CA (unless you reuse the certificates from an existing installation), and the CA certificate is distributed to the clients OOB by means of cac.pem (guess why it's called that way).
Apparently there is no issue with the certificates on the server's "home" VLAN thus they must be in principle valid. It might be that the firewall deems self-signed certificates as high risk and interferes - that's why I suggested to capture the traffic at both ends. What arrives at one end is not necessarily what has been sent from the other.

Christian

0 Carlos Fernandez1 over 4 years ago in reply to jak

Hi Jak

Interesting point about vMotion, so I have disabled vMotion for this server.

However, I believe this is not our problem, as endpoints in the same VLAN are never disconnected and work perfectly at all times.

Regards,

0 Carlos Fernandez1 over 4 years ago in reply to QC

Hi QC

It was a long weekend for me ;)

I am back on this problem today and trying to continue our investigation. As a test I will move a couple of VMs into the same VLAN as our SEC server to see if this solves the communication issue. I will come back once I have some results.

I will do a longer capture on both sides to see if we can get some better data.

One point that came up after discussing with our network team: it seems that our firewall drops a connection if there has been no FIN in the session, and I am not able to see any FIN in the traffic with Wireshark. How does this communication work? Is the session open forever?

We do have an SCCM setup in the same zone with the same clients, configured to use HTTPS with PKI certificates, and this works for us with no issues: all VMs connect properly and show as online. Of course they go through the same firewall for communications.

Waiting on your feedback

Carlos

0 Carlos Fernandez1 over 4 years ago in reply to Carlos Fernandez1

Client_connection.txt

Hi, I have attached a Wireshark capture from one client that tries to connect, fails for a while after trying multiple dynamic ports, then suddenly connects, and after a while fails to connect again.

I hope with this capture we can have more detailed information

Regards

Carlos

0 QC over 4 years ago in reply to Carlos Fernandez1

Hello Carlos,

as to FIN: The connection on port 8194 is "never ending", i.e. it's only taken down when one of the Message Routers stops (unless there is an RMS version upgrade, the Router stops only at system shutdown). The Router sends an RMS Logoff, the TLS session is ended, the stopping side sends a FIN and, upon receiving the ACK, an RST. The connection is low traffic; heartbeats are sent by the endpoint and acknowledged by the server three or four times an hour IIRC.
The connection on port 8192 is one-shot, the server sends the IOR when the connection is established, follows with a FIN, the endpoint responds with FIN,ACK.
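The two-port behaviour described here can be checked from an affected endpoint without Wireshark. The following is a rough sketch, not a Sophos tool: it assumes the IOR arrives as plain ASCII text on port 8192 before the server closes the connection, and it only tests whether a plain TCP connection to port 8194 can be opened (it does not attempt the TLS handshake or speak RMS). The function names are invented for this example.

```python
import socket


def fetch_ior(host: str, port: int = 8192, timeout: float = 5.0) -> str:
    """Connect, read until the server closes the connection (FIN),
    and return whatever text was sent."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server has closed its side
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace").strip()


def can_connect(host: str, port: int = 8194, timeout: float = 5.0) -> bool:
    """Return True if a plain TCP connection can be established.
    A refused connection (RST) or a timeout both yield False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Called with the management server's name, `fetch_ior(...)` succeeding while `can_connect(...)` returns False would reproduce the pattern seen in the first capture: the IOR is obtained, but no TCP connection on 8194 is established.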

As to certificates: I've already said that I suppose the firewall doesn't like the self-signed certificates.

Last but not least: You've mentioned that the number of connected endpoints fluctuates. This could be explained by the firewall deeming a connection that has had no traffic for a certain amount of time dead, if you see these drops in numbers at intervals of 15 minutes or less. But this would require successful connections in the first place.
I'd say this is not impossible if the connections are made through different devices and their configuration is not identical. I've once had a case (not related to SEC) where clients on the same VLAN exhibited quite different behaviour when trying to access an unavailable resource. Turned out that they connected via different routers, one dropped the packets and the other returned an ICMP message ...

Christian

0 QC over 4 years ago in reply to Carlos Fernandez1

Hello Carlos,

didn't see this post of yours before finishing my reply.

Q&D assessment of this trace: I can't explain where the many RSTs before the connection succeeds come from - if it's really the server. But the TLS handshake is apparently performed without problems (no certificate errors, the malformed packet is, as said, "normal"). It seems that later the Router on the endpoint is restarted, at least it looks like an orderly RMS shutdown followed by a new connection attempt a few seconds later.
The server's POV would be interesting: whether it really sends the RSTs (and if so, why).

This is not a case of alleged certificate issues though. I stand by my previous post - whatever is between server and endpoints takes a hand in it.

Christian

0 Carlos Fernandez1 over 4 years ago in reply to QC

Hi Christian

Thanks for all this helpful information

My network team, who are looking at the issue, are asking me whether the client/server communication for RMS sends a keep-alive for the session.

Do you know how often?

Regards,

0 QC over 4 years ago in reply to Carlos Fernandez1

Hello Carlos,

RMS does send a keep alive for the session ?
In my next to last post I wrote: heartbeats are sent by the endpoint and acknowledged by the server three or four times an hour IIRC (that'd be every 15 to 20 minutes or so).
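The arithmetic the network team needs is simple: an idle connection survives only if the next heartbeat arrives before the firewall's idle timer expires. A small sketch, using the 15 to 20 minute heartbeat interval recalled above; the firewall timeout values are made-up examples, not settings from this environment.

```python
def survives_idle_timeout(heartbeat_interval_min: float,
                          firewall_idle_timeout_min: float) -> bool:
    """True if a heartbeat arrives before the firewall's idle timer expires."""
    return heartbeat_interval_min < firewall_idle_timeout_min


# Hypothetical firewall idle timeouts, checked against both ends of the
# recalled heartbeat range (15 and 20 minutes).
for timeout in (10, 15, 30, 60):
    for heartbeat in (15, 20):
        state = "kept" if survives_idle_timeout(heartbeat, timeout) else "dropped"
        print(f"idle timeout {timeout:>2} min, heartbeat every {heartbeat} min: {state}")
```

With these example numbers, any idle timeout of 20 minutes or less could cut an otherwise healthy session between two heartbeats.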

Connection takedown by the firewall due to observed inactivity addresses only one of the - as it seems - three problems: There are still the RSTs in response to the SYN from the endpoint, and the TLS handshake issue. Both occur at session start, where keep-alives don't come into play.

Christian