
ips.log filling at high rate - normal and good for the SSD lifetime?

Today we've had a partial outage due to high /var partition usage.

It was flapping between 70% and over 90% in a short time.

/dev/var                179.3G    138.6G     40.7G  77% /var
/dev/var                179.3G    138.8G     40.5G  77% /var
/dev/var                179.3G    138.9G     40.4G  77% /var
/dev/var                179.3G    139.0G     40.3G  77% /var
/dev/var                179.3G    131.0G     48.3G  73% /var
/dev/var                179.3G    135.7G     43.6G  76% /var
/dev/var                179.3G    141.5G     37.8G  79% /var
/dev/var                179.3G    141.5G     37.8G  79% /var
/dev/var                179.3G    141.6G     37.7G  79% /var
/dev/var                179.3G    141.6G     37.7G  79% /var
/dev/var                179.3G    141.7G     37.6G  79% /var
/dev/var                179.3G    141.7G     37.6G  79% /var
/dev/var                179.3G    141.8G     37.5G  79% /var
/dev/var                179.3G    144.4G     34.8G  81% /var
/dev/var                179.3G    137.8G     41.5G  77% /var
/dev/var                179.3G    137.9G     41.4G  77% /var
/dev/var                179.3G    137.9G     41.4G  77% /var
/dev/var                179.3G    138.0G     41.3G  77% /var
/dev/var                179.3G    139.5G     39.8G  78% /var
/dev/var                179.3G    146.0G     33.3G  81% /var
/dev/var                179.3G    143.0G     36.3G  80% /var
/dev/var                179.3G    144.3G     35.0G  80% /var
/dev/var                179.3G    124.5G     54.8G  69% /var
/dev/var                179.3G    124.6G     54.7G  69% /var
/dev/var                179.3G    124.7G     54.6G  70% /var
/dev/var                179.3G    124.8G     54.5G  70% /var
/dev/var                179.3G    124.9G     54.4G  70% /var
/dev/var                179.3G    125.1G     54.2G  70% /var
/dev/var                179.3G    125.4G     53.9G  70% /var
/dev/var                179.3G    171.1G      8.1G  95% /var
/dev/var                179.3G    148.8G     30.5G  83% /var
/dev/var                179.3G    155.9G     23.4G  87% /var
/dev/var                179.3G    146.0G     33.3G  81% /var
/dev/var                179.3G    144.5G     34.8G  81% /var
/dev/var                179.3G     55.2G    124.1G  31% /var
/dev/var                179.3G     55.3G    124.0G  31% /var
/dev/var                179.3G     56.4G    122.9G  31% /var
/dev/var                179.3G     56.4G    122.9G  31% /var
/dev/var                179.3G     56.8G    122.5G  32% /var
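For future monitoring, a check like the `df` sampling above could be scripted. The sketch below classifies a use% figure against a threshold; the threshold, function name, and messages are my own illustration, not an SFOS built-in:

```shell
# Classify a /var use% figure against a warning threshold.
# Sketch only: THRESHOLD and the messages are illustrative, not SFOS built-ins.
THRESHOLD=85
classify_usage() {
    # $1 is the use% value, e.g. from: df -P /var | awk 'NR==2 {gsub("%","",$5); print $5}'
    if [ "$1" -ge "$THRESHOLD" ]; then
        echo "WARNING: /var at $1%"
    else
        echo "OK: /var at $1%"
    fi
}

classify_usage 77    # prints: OK: /var at 77%
classify_usage 95    # prints: WARNING: /var at 95%
```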

The firewall was rotating the IPS logs at a high rate, and compressing them caused additional CPU load.

The heartbeat service failed as a secondary issue, causing heartbeat rules to stop working.

We also had a huge amount of these IPS detections:

20583
BROWSER-FIREFOX Mozilla multiple location headers malicious redirect attempt

Together with Support on the phone, we flushed the report partition and rebooted the firewall after stopping IPS.

/var was at 31% then.

IPS was manually started again after the reboot.

Now it seems to me that IPS is still logging at a high rate: about 1 GB per minute to the text file /log/ips.log.
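A rate like that can be double-checked by sampling the file size twice. A minimal sketch, assuming a standard `wc`; the helper name is mine, not an SFOS command:

```shell
# Convert two byte-size samples of a log file into an MB-per-minute rate.
# Hypothetical helper (not an SFOS command); divide first to avoid overflow.
growth_mb_per_min() {
    # $1 = bytes before, $2 = bytes after, $3 = seconds between samples
    echo $(( ($2 - $1) / 1048576 * 60 / $3 ))
}

# Sampling on the firewall would look roughly like:
#   s1=$(wc -c < /log/ips.log); sleep 60; s2=$(wc -c < /log/ips.log)
#   echo "$(growth_mb_per_min "$s1" "$s2" 60) MB/min"
growth_mb_per_min 0 1073741824 60    # 1 GiB in one minute -> prints 1024
```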

Support could not find issues related to IPS.

I'm unsure.

I wonder whether such a high log rate is normal for IPS, and whether it is OK for the SSD.

XGS4500_AM02_SFOS 21.0.0 GA-Build169 HA-Primary# ls -lhS ips.*
-rw-r--r--    1 root     root       24.4G Nov 21 11:22 ips.log-20241121_111359
-rw-r--r--    1 root     root        1.0G Nov 21 13:16 ips.log
-rw-r--r--    1 root     root      123.4M Nov 21 13:06 ips.log-20241121_130655.gz
-rw-r--r--    1 root     root      116.9M Nov 21 13:11 ips.log-20241121_131102.gz
-rw-r--r--    1 root     root      109.3M Nov 21 13:14 ips.log-20241121_131454.gz
-rw-r--r--    1 root     root      105.8M Nov 21 13:13 ips.log-20241121_131303.gz
-rw-r--r--    1 root     root      102.8M Nov 21 13:09 ips.log-20241121_130901.gz
XGS4500_AM02_SFOS 21.0.0 GA-Build169 HA-Primary# head /log/ips.log
2024-11-21T12:15:07.934223Z [25339] daq_metadata session id 5158 rev 36361 appid 106  hbappid 0 idp 14 appfltid 0
2024-11-21T12:15:07.935265Z [25340] daq_metadata session id 19208 rev 51831 appid 106  hbappid 0 idp 14 appfltid 0

The log keeps filling even while this is still disabled (disabled during case debugging today).

Support case is 02034074 



  • We will track this, and will also reduce the visibility of those debug commands that are too "harsh" on the system, so they are only available when Sophos Support is involved.
    Thanks for reporting!

    __________________________________________________________________________________________________________________

  • Just for reference after the issue is fixed:

    This is what a regular ips.log looks like:

    2024-11-25T08:06:52.172001Z [25865/0x0] [nsg_nse_policy.c:1570:__nsg_error] xxx.xxx.xxx.xxx:55909 to xxx.xxx.xxx.xxx:443: Error from nse: NSE:Stream [0xbe00683e;code:62;sub:104] Flow reset
    2024-11-25T08:06:52.172116Z [25865/0x0] [nsg_tcphold.c:1568:tcp_hold_process_control] Failed to get tcp_hold_session ssnptr 0x1dcb7b10 (hold_state: 3).
    2024-11-25T08:06:52.189184Z [25863/0x0] [nsg_nse_policy.c:1570:__nsg_error] xxx.xxx.xxx.xxx:55951 to xxx.xxx.xxx.xxx:443: Error from nse: NSE:Stream [0xbe00683e;code:62;sub:104] Flow reset
    2024-11-25T08:06:52.354643Z [25862/0x0] [nsg_nse_policy.c:1570:__nsg_error] xxx.xxx.xxx.xxx:55911 to xxx.xxx.xxx.xxx:443: Error from nse: NSE:Stream [0xbe00683e;code:62;sub:104] Flow reset
    2024-11-25T08:06:52.354802Z [25862/0x0] [nsg_tcphold.c:1568:tcp_hold_process_control] Failed to get tcp_hold_session ssnptr 0x1de148f0 (hold_state: 3).
    2024-11-25T08:06:52.422138Z [25863/0x0] [nsg_nse_policy.c:1570:__nsg_error] xxx.xxx.xxx.xxx:57702 to xxx.xxx.xxx.xxx:443: Error from nse: NSE:Internal [0xb000058f;code:143;sub:5] Flow timeout
    UST sessiontbl_get_tuple API returned -1
    UST sessiontbl_get_tuple API returned -1
    2024-11-25T08:06:52.654064Z [25865/0x0] [nsg_nse_policy.c:1570:__nsg_error] xxx.xxx.xxx.xxx:57704 to xxx.xxx.xxx.xxx:443: Error from nse: NSE:Internal [0xb000058f;code:143;sub:5] Flow timeout
    UST sessiontbl_get_tuple API returned -1
    UST sessiontbl_get_tuple API returned -1
    2024-11-25T08:06:55.510853Z [25865/0x0] [nsg_nse_policy.c:1570:__nsg_error] xxx.xxx.xxx.xxx:57725 to xxx.xxx.xxx.xxx:443: Error from nse: NSE:Internal [0xb000058f;code:143;sub:5] Flow timeout
    UST sessiontbl_get_tuple API returned -1
    UST sessiontbl_get_tuple API returned -1
    2024-11-25T08:06:55.558649Z [25865/0x0] [nsg_nse_policy.c:1570:__nsg_error] xxx.xxx.xxx.xxx:57723 to xxx.xxx.xxx.xxx:443: Error from nse: NSE:Internal [0xb000058f;code:143;sub:5] Flow timeout
    UST sessiontbl_get_tuple API returned -1
    UST sessiontbl_get_tuple API returned -1
    2024-11-25T08:06:55.559531Z [25862/0x0] [nsg_nse_policy.c:1570:__nsg_error] xxx.xxx.xxx.xxx:57722 to xxx.xxx.xxx.xxx:443: Error from nse: NSE:Internal [0xb000058f;code:143;sub:5] Flow timeout
    UST sessiontbl_get_tuple API returned -1
    UST sessiontbl_get_tuple API returned -1
    2024-11-25T08:06:55.618653Z [25862/0x0] [nsg_nse_policy.c:1570:__nsg_error] xxx.xxx.xxx.xxx:57721 to xxx.xxx.xxx.xxx:443: Error from nse: NSE:Internal [0xb000058f;code:143;sub:5] Flow timeout
    UST sessiontbl_get_tuple API returned -1
    UST sessiontbl_get_tuple API returned -1
    2024-11-25T08:06:55.683107Z [25863/0x0] [nsg_nse_policy.c:1570:__nsg_error] xxx.xxx.xxx.xxx:57720 to xxx.xxx.xxx.xxx:443: Error from nse: NSE:Internal [0xb000058f;code:143;sub:5] Flow timeout
    UST sessiontbl_get_tuple API returned -1
    UST sessiontbl_get_tuple API returned -1
    

    File growth is minimal: less than 10 MB per minute.
    92.5M 09:14:37 /log/ips.log
    92.5M 09:15:37 /log/ips.log

  • This has occurred in a few cases recently where support/GES/partner/??? did not delete the debug flag files.

    awarrenhttp:
    /sdisk/tmp/debug.cfg

    ips:
    /var/tmp/debugp.conf


    Both awarrenhttp and ips have special debug flag files that allow very fine-tuned levels of logging to be enabled and disabled (a mask of bits that turns areas of logging on and off). When awarrenhttp and ips start, they look for the presence of the file; if found, they turn on DEBUG logging. However, there is no way of telling csc/service that DEBUG is on, so the output of the "service -S" command is incorrect.

    It is critical to do two things:
    1) Delete the flag files, ideally immediately after enabling debugging, so you do not forget.
    2) Do not rely on "service -S" to tell you whether DEBUG is on. Rely on whether the log file is filling up with debug lines.

    I did not know that we are publishing GES MER publicly.
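    Based on the two flag-file paths named above, a quick leftover check could look like this (a sketch; the function name is my own helper, not an SFOS command):

```shell
# Report whether the debug flag files mentioned in this thread are present.
# Sketch only; check_flag_files is a hypothetical helper name.
check_flag_files() {
    for f in "$@"; do
        if [ -e "$f" ]; then
            echo "DEBUG FLAG PRESENT: $f"
        else
            echo "clean: $f"
        fi
    done
}

check_flag_files /sdisk/tmp/debug.cfg /var/tmp/debugp.conf
```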

  • Confirmed - Logs back to normal! Great and Thank you very much!

    And was on the right track from the beginning!

    I created /var/tmp/debugp.conf for the case mentioned here  Webfilter HTTPS decryption breaks ChatGPT: HTTP parsing error encountered

    That file contained my own IP address for the ChatGPT debug.

    LuCar mentioned https://community.sophos.com/sophos-xg-firewall/f/discussions/145539/ips-log-filling-up-disk, where it was suggested to search for and delete the debugp file. I missed that.

    Debug was enabled/disabled 3 times using service -ds nosync ips:debugp

      I had communicated 4 times, but found only 6 status-code-200 responses.

    service -ds nosync ips:debugp
    200 OK

    I expected the file to be ignored when service -ds nosync ips:debugp was used to disable debug again.

    Could it enable itself again when the firewall is rebooted? That could be checked. Nilesh wrote to me that when IPS starts, it loads the debugp file. That was the case here, because the firewall was rebooted one week ago for the v21 upgrade.

    Sooner or later this would have happened anyway.

    So it's very important to delete the debugp file after the tests, or at least to have it contain only

    mask=0

    Maybe that is something for the Doc Team @R  for the article https://support.sophos.com/support/s/article/KBA-000006577?language=en_US

    Thanks a lot everyone involved here! It was an exciting experience!

  • Checking SSH session logs, I remembered that I had enabled a debug myself. More below, to keep it readable here.

  • The debugp mask was configured in /var/tmp/debugp.conf, which enabled debugging. To disable it, you might have run the command service ips:debugp -ds nosync. However, existing connections will still use the mask that was set when they first arrived.

    I have updated the mask to 0 in /var/tmp/debugp.conf and restarted the IPS, which has fully disabled the WIS debug mode. The IPS restart also resolved the CSC status issue.

    No debug logs are being generated after the IPS restart. Please confirm.

  • case 02001662 / Webfilter HTTPS decryption breaks ChatGPT: HTTP parsing error encountered

    was about IPS, but I got no notification that debug or hidden debug was enabled.

  • service -S |grep -i ips is showing the opposite status; hence you are seeing fewer logs when you enable debug (which actually disables debug). We are looking into this issue.

  • We observed that IPS in debug mode logs far less than when running in non-debug mode.

    The factor was about 1:4.

    In non-debug mode, all the connection information was in the logs.

    GES believes some kind of low-level default debug output is enabled, which is suppressed when the service is put into debug mode.

    NON-DEBUG:
    2024-11-22T15:03:20.480455Z [27128] daq_metadata session id 5694 rev 6056 appid 0  hbappid 0 idp 18 appfltid 0
    2024-11-22T15:03:20.480461Z [27128] verdict 0 pktnum 21011930 flowoff [IPS|WEB|AV] appcat [0:0] dsize 388 [xxx.xxx.xxx.xxx:46620 -> xxx.xxx.xxx.xxx:514]
    2024-11-22T15:03:20.480464Z [27128] daq_metadata session id 5694 rev 6056 appid 0  hbappid 0 idp 18 appfltid 0
    2024-11-22T15:03:20.480468Z [27128] verdict 0 pktnum 21011931 flowoff [IPS|WEB|AV] appcat [0:0] dsize 401 [xxx.xxx.xxx.xxx:46620 -> xxx.xxx.xxx.xxx:514]
    2024-11-22T15:03:20.480470Z [27128] daq_metadata session id 5694 rev 6056 appid 0  hbappid 0 idp 18 appfltid 0
    2024-11-22T15:03:20.480475Z [27128] verdict 0 pktnum 21011932 flowoff [IPS|WEB|AV] appcat [0:0] dsize 400 [xxx.xxx.xxx.xxx:46620 -> xxx.xxx.xxx.xxx:514]
    2024-11-22T15:03:20.480477Z [27128] daq_metadata session id 5694 rev 6056 appid 0  hbappid 0 idp 18 appfltid 0
    2024-11-22T15:03:20.480482Z [27128] verdict 0 pktnum 21011933 flowoff [IPS|WEB|AV] appcat [0:0] dsize 389 [xxx.xxx.xxx.xxx:46620 -> xxx.xxx.xxx.xxx:514]
    2024-11-22T15:03:20.480523Z [27128] daq_metadata session id 12985 rev 11175 appid 2712  hbappid 0 idp 5 appfltid 4
    2024-11-22T15:03:20.480533Z [27128] verdict 0 pktnum 7752 flowoff [APP|WEB|AV] appcat [2712:5] dsize 25 [xxx.xxx.xxx.xxx:443 -> xxx.xxx.xxx.xxx:42054]
    2024-11-22T15:03:20.480604Z [27128] daq_metadata session id 12985 rev 11175 appid 2712  hbappid 0 idp 5 appfltid 4
    2024-11-22T15:03:20.480612Z [27128] verdict 0 pktnum 7753 flowoff [APP|WEB|AV] appcat [2712:5] dsize 1250 [xxx.xxx.xxx.xxx:443 -> xxx.xxx.xxx.xxx:42054]
    2024-11-22T15:03:20.480650Z [27128] daq_metadata session id 12985 rev 11175 appid 2712  hbappid 0 idp 5 appfltid 4
    2024-11-22T15:03:20.480656Z [27128] verdict 0 pktnum 7754 flowoff [APP|WEB|AV] appcat [2712:5] dsize 1250 [xxx.xxx.xxx.xxx:443 -> xxx.xxx.xxx.xxx:42054]
    
    
    DEBUG:
    2024-11-22T15:00:19.086120Z [27127/0x57ea000049e5] [nsg_request_fsm.c:542:request_fsm_body_bytes] Event body_bytes in state RESP_PARSE_BODY_SKIP
    2024-11-22T15:00:19.086192Z [27127/0x57ea000049e5] [acl/acl_lookup.c:608:check_access_list] ACL check: USER matched for WEB rule 0 (user-id=24)
    2024-11-22T15:00:19.086196Z [27127/0x57ea000049e5] [acl/acl_lookup.c:611:check_access_list] checking policy [000], if matched, allow
    2024-11-22T15:00:19.086199Z [27127/0x57ea000049e5] [acl/acl_lookup.c:392:__check_acl] checking 'all type 1'
    2024-11-22T15:00:19.086202Z [27127/0x57ea000049e5] [acl/acl_lookup.c:623:check_access_list] All ACLs matched: rule action = allow
    2024-11-22T15:00:19.086206Z [27127/0x57ea000049e5] [nsg_request_fsm.c:542:request_fsm_body_bytes] Event body_bytes in state RESP_PARSE_BODY_SKIP
    2024-11-22T15:00:19.086209Z [27127/0x57ea000049e5] [acl/acl_lookup.c:608:check_access_list] ACL check: USER matched for WEB rule 0 (user-id=24)
    2024-11-22T15:00:19.086211Z [27127/0x57ea000049e5] [acl/acl_lookup.c:611:check_access_list] checking policy [000], if matched, allow
    2024-11-22T15:00:19.086213Z [27127/0x57ea000049e5] [acl/acl_lookup.c:392:__check_acl] checking 'all type 1'
    2024-11-22T15:00:19.086216Z [27127/0x57ea000049e5] [acl/acl_lookup.c:623:check_access_list] All ACLs matched: rule action = allow
    2024-11-22T15:00:19.086218Z [27127/0x57ea000049e5] [nsg_request_fsm.c:542:request_fsm_body_bytes] Event body_bytes in state RESP_PARSE_BODY_SKIP
    2024-11-22T15:00:19.086230Z [27127/0x57ea000049e5] [acl/acl_lookup.c:608:check_access_list] ACL check: USER matched for WEB rule 0 (user-id=24)
    2024-11-22T15:00:19.086232Z [27127/0x57ea000049e5] [acl/acl_lookup.c:611:check_access_list] checking policy [000], if matched, allow
    
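    Given samples like these, a rough line classifier can estimate the ratio between the two styles (a sketch; the patterns are my own heuristic drawn from the excerpts above, not an official breakdown):

```shell
# Count "metadata-style" vs "debug-style" lines on stdin.
# The patterns are a rough heuristic based on the log excerpts in this thread.
classify_ips_log() {
    awk '/daq_metadata|verdict/ { meta++ }
         /nsg_|acl_lookup/      { dbg++ }
         END { printf "metadata-style:%d debug-style:%d\n", meta+0, dbg+0 }'
}

# Usage (sketch): head -n 100000 /log/ips.log | classify_ips_log
```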

  • No, sorry, the service is not running in debug; this was triple-confirmed by Sophos Support and me.

    Meanwhile, even the zipped .gz files are getting bigger and bigger, now reaching about 2 GB while they were at 300 MB yesterday:

    /dev/var                179.3G    112.3G     67.0G  63% /var
    ls -lhS --full-time /log/ips.* | awk '{print $5,$7,$9}'
    41.2G 14:48:56 /log/ips.log-20241122_143858
    28.5G 15:09:24 /log/ips.log
    2.3G 12:36:43 /log/ips.log-20241122_123643.gz
    1.9G 13:06:46 /log/ips.log-20241122_130646.gz
    1.8G 15:09:24 /log/ips.log-20241122_143858.gz
    1.7G 14:13:53 /log/ips.log-20241122_141353.gz
    1.7G 13:30:50 /log/ips.log-20241122_133050.gz
    1.5G 13:52:49 /log/ips.log-20241122_135249.gz
    XGS4500_AM02_SFOS 21.0.0 GA-Build169 HA-Primary# service -S |grep -i ips
    ips                  RUNNING
    ipsec-monitor        RUNNING

    Debug mode would show:


    service -S |grep -i ips
    ips                  RUNNING,DEBUG
    ipsec-monitor        RUNNING

    GES is working on the issue. They confirmed that this logging is not normal.