We're running UTM 9.5 on AWS. One queen, two swarm nodes. Sunday evening, we began experiencing issues with the VPN service on the Queen. On Monday, the issues continued, allowing users to connect for approximately 10-30 seconds, and then all traffic stopped. They still showed as connected, the UTM status logs still showed them as connected, but no traffic would pass. During this time, I noticed FATAL: terminating connection due to administrator command
in our logs, as well as ntpd exiting on signal 15 (Terminated)
and the dns-resolver subsystem spinning up. All of these repeatedly.
This morning, I started thinking about the fact that maybe the system is restoring configuration (though I don't think it should be) repeatedly, coinciding with all of the above. A little digging led me to the ha_aws service, since it's underlying script talks about syslog, postgres, confd, etc. Stopping the ha_aws service, all of the VPN issues, postgres restarting, ntpd restarting, dns-resolver restarting, etc, were all magically fixed.
My question is: what does the ha_aws service actually do, and is it required to be running for synchronization between the queen and swarm nodes, as well as for configuration backups to be running? Is there a way to troubleshoot what's causing ha_aws to kill these processes? I have an strace of the ha_aws process and it's children.
This thread was automatically locked due to age.