Load Balancing HTTP Check

Hello forum,

I've bumped into an issue that puzzles me.

I have a client that has had a load balancing rule active for years, distributing traffic to 4 backend servers. It uses a TCP health check.

This ran fine until they upgraded the backend last night; the new backend doesn't seem to like the TCP connects from the UTM. So they've installed lighttpd on the backend servers and hacked together a CGI script that checks the status of the application and returns an HTTP 200 or 503 depending on its operational status.
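
For anyone wanting to reproduce the setup, a health-check CGI along these lines would do the job. This is only a minimal sketch, not the client's actual script: the status file path and the `app_is_operational` helper are assumptions made for illustration. Only the 200/503 behaviour and the "operational" response body are taken from this post.

```python
#!/usr/bin/env python3
"""Minimal CGI health check: 200 when the application is up, 503 otherwise."""
import sys

# Assumption: the application writes the word "operational" to this file.
# Replace with whatever status test fits the real application.
STATUS_FILE = "/var/run/app/status"


def app_is_operational(status_file=STATUS_FILE):
    """Hypothetical status test: True if the status file says 'operational'."""
    try:
        with open(status_file) as fh:
            return fh.read().strip() == "operational"
    except OSError:
        return False


def build_response(operational):
    """Build the CGI response (headers, blank line, body) as one string."""
    if operational:
        status = "200 OK"
        body = "<html><body>Application Cluster Node is operational</body></html>"
    else:
        status = "503 Service Unavailable"
        body = "<html><body>Application Cluster Node is down</body></html>"
    headers = [
        "Status: " + status,  # CGI 'Status' header, turned into the HTTP status line by the web server
        "Content-Type: text/html",
        "Content-Length: " + str(len(body.encode("utf-8"))),
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + body


if __name__ == "__main__":
    sys.stdout.write(build_response(app_is_operational()))
```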

Checking the status manually works fine:

[server]$ curl -i http://127.0.0.1
HTTP/1.1 200 Ok
Content-Type: text/html
Content-Length: 65

<html><body>Application Cluster Node is operational</body></html>

[server]$ 

It also works fine from other servers in the subnet, so no server based firewall rules are in the way.

However, as soon as I change the load balancer rule from TCP to HTTP (either leaving the URL field empty or entering "index.php"), all nodes go down.

The server monitor logs:

2022:07:31-16:43:53 firewall-1 service_monitor[29121]: id="4003" severity="error" sys="System" sub="loadbalancing" name="error reading HTTP response: 1/-1"
2022:07:31-16:43:53 firewall-1 service_monitor[29121]: id="4003" severity="error" sys="System" sub="loadbalancing" name="error reading HTTP response: 1/-1"
2022:07:31-16:43:53 firewall-1 service_monitor[29121]: id="4003" severity="error" sys="System" sub="loadbalancing" name="error reading HTTP response: 1/-1"
2022:07:31-16:43:53 firewall-1 service_monitor[29121]: id="4003" severity="error" sys="System" sub="loadbalancing" name="error reading HTTP response: 1/-1"

but there are no requests logged in the lighttpd logs on the 4 backend servers.

I must be missing something obvious here, but I've been staring at it for 2 hours and getting nowhere.

Any tips on where I'm going wrong?

  • Hey Bob,

    Working:

    If I change it to

    all nodes go down (obviously I haven't saved this, otherwise I will get a ^$**&^$* again ;) ).

    If I create a new rule, identical to this one but for a different service (to avoid a conflict), all nodes stay up. It doesn't matter if I leave the URL field empty, or use "/index.php", both work.

  • I have just deleted the old rule, and created a new rule. All real servers went down immediately. If I create a rule for the service HTTP, the real servers remain up. I now wonder if it is so stupid as to run the HTTP check on the service port, instead of the HTTP port?

    Specifying "/index.php:80" doesn't work either.

    As a test I created the rule as above, but used the Citrix service (port not in use) in combination with the HTTP host check. According to the log, though, it uses a TCP check on the Citrix port?

    2022:07:31-23:46:04 firewall-1 service_monitor[2234]: id="4000" severity="info" sys="System" sub="loadbalancing" name="REF_PacLoaCitriIcaToDatab 2 TCP 172.18.5.13:1494 changed state to OFFLINE"
    2022:07:31-23:46:04 firewall-1 service_monitor[2234]: id="4000" severity="info" sys="System" sub="loadbalancing" name="REF_PacLoaCitriIcaToDatab 0 TCP 172.18.5.11:1494 changed state to OFFLINE"
    2022:07:31-23:46:04 firewall-1 service_monitor[2234]: id="4000" severity="info" sys="System" sub="loadbalancing" name="REF_PacLoaCitriIcaToDatab 1 TCP 172.18.5.12:1494 changed state to OFFLINE"
    2022:07:31-23:46:04 firewall-1 service_monitor[2234]: id="4000" severity="info" sys="System" sub="loadbalancing" name="REF_PacLoaCitriIcaToDatab 3 TCP 172.18.5.14:1494 changed state to OFFLINE"
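
One way to test the "wrong port" suspicion from any host in the subnet is to send the same HTTP GET to both ports and compare. The following is a rough sketch under assumptions: the `probe` helper is made up for illustration and is not how service_monitor works internally; it only mimics a plain HTTP check with a 3 second timeout.

```python
import http.client


def probe(host, port, path="/", timeout=3.0):
    """Send one HTTP GET and return the status code, or an error string
    if the port does not answer with a valid HTTP response in time."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException) as exc:
        # Refused connection, timeout, or a non-HTTP reply all end up here.
        return f"{type(exc).__name__}: {exc}"
    finally:
        conn.close()
```

For example, `probe("172.18.5.11", 80, "/index.php")` should return the status code from lighttpd, while `probe("172.18.5.11", 1494, "/index.php")` should return an error string if nothing speaks HTTP on that port.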

  • /etc/service_monitor.conf contains:

    [REF_PacLoaCitriIcaTo 0]
      #REF_NetHosDatabNode1
      service http://172.18.5.11:1494 /"index.php"
      interval 5
      timeout 3
    
      action proc REF_PacLoaCitriIcaTo 0
      action confd_status REF_PacLoaCitriIcaTo REF_NetHosDatabNode1
    
    
    [REF_PacLoaCitriIcaTo 1]
      #REF_NetHosDatabNode2
      service http://172.18.5.12:1494 /"index.php"
      interval 5
      timeout 3
    
      action proc REF_PacLoaCitriIcaTo 1
      action confd_status REF_PacLoaCitriIcaTo REF_NetHosDatabNode2
    
    
    [REF_PacLoaCitriIcaTo 2]
      #REF_NetHosDatabNode3
      service http://172.18.5.13:1494 /"index.php"
      interval 5
      timeout 3
    
      action proc REF_PacLoaCitriIcaTo 2
      action confd_status REF_PacLoaCitriIcaTo REF_NetHosDatabNode3
    
    
    [REF_PacLoaCitriIcaTo 3]
      #REF_NetHosDatabNode4
      service http://172.18.5.14:1494 /"index.php"
      interval 5
      timeout 3
    
      action proc REF_PacLoaCitriIcaTo 3
      action confd_status REF_PacLoaCitriIcaTo REF_NetHosDatabNode4

    So we can conclude that my assumption was right: the HTTP check sends its HTTP request to the port of the defined service, so it is only useful when the balanced service itself is a web server, not for any other service that you provide a separate health check page for.

    Bummer, as that only leaves a ping check, which says precisely zero about the availability of the service....
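
To check this on another box, the check targets can be pulled out of the config excerpt above with a few lines of Python. This is a sketch only: the file format is inferred purely from the excerpt in this post (a `[name index]` header followed by a `service <url> <path>` line), and `check_targets` is a made-up helper name.

```python
import re
from urllib.parse import urlsplit

# First section of the excerpt above, copied verbatim.
SAMPLE = """\
[REF_PacLoaCitriIcaTo 0]
  #REF_NetHosDatabNode1
  service http://172.18.5.11:1494 /"index.php"
  interval 5
  timeout 3
"""


def check_targets(conf_text):
    """Yield (section, scheme, host, port, path) for every 'service' line."""
    section = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"\[(.+)\]", line)
        if header:
            section = header.group(1)
        elif line.startswith("service "):
            parts = line.split(None, 2)
            url = urlsplit(parts[1])
            path = parts[2] if len(parts) > 2 else ""
            yield section, url.scheme, url.hostname, url.port, path


if __name__ == "__main__":
    for target in check_targets(SAMPLE):
        print(target)
```

Run against the excerpt, the port that comes out is 1494, the port of the balanced service, not the port lighttpd listens on.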

  • That makes sense.  Thanks for sharing your result as I don't think this issue has ever been discussed here.

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • Given that service_monitor is the UTM's own code, Sophos could easily fix this by allowing a port number in the HTTP check URL, which wouldn't break existing functionality.

    However, given the speed at which the UTM evolves, I'm not holding my breath...
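
To illustrate why an optional port would be backwards compatible, here is a sketch of how the URL field could be interpreted. Everything in it is hypothetical: the `:<port>` suffix syntax and the `resolve_check_target` name are my own invention and not anything the UTM supports today.

```python
import re


def resolve_check_target(service_port, url_field):
    """Return (port, path) for the HTTP check.

    Hypothetical syntax: an optional ':<port>' suffix on the URL field
    overrides the port. Without a suffix the service port is used, which
    matches the behaviour observed in this thread.
    """
    match = re.fullmatch(r"(.*?):(\d{1,5})", url_field)
    if match:
        return int(match.group(2)), match.group(1)
    return service_port, url_field
```

With this rule, existing values such as an empty field or "/index.php" keep targeting the service port, and only a value like "/index.php:80" would redirect the check to port 80.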