This discussion has been locked.

Weird UTM freezes randomly approximately once a day ...

I have experienced a strange lockup on my "new" UTM box, but I checked log files and they don't reveal anything, just a bunch of weird characters ...

2023:03:16-01:32:01 escape75 /usr/sbin/cron[25494]: (root) CMD (  nice -n19 /usr/local/bin/gen_inline_reporting_data.plx)
2023:03:16-01:35:01 escape75 /usr/sbin/cron[25649]: (root) CMD (   /usr/local/bin/reporter/system-reporter.pl)
�����������������������������������������������������������������������������������������������������������
2023:03:16-09:03:10 escape75 syslog-ng[4942]: syslog-ng starting up; version='3.4.7'
2023:03:16-09:03:12 escape75 ddclient[5361]: WARNING: cannot connect to checkip.dyndns.org:80 socket: IO::Socket::INET: Bad hostname 'checkip.dyndns.org'
2023:03:16-09:03:24 escape75 system: System was restarted
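
One way to bracket when the box stopped logging is to scan the log for large jumps between consecutive timestamps. The following is only a minimal sketch, assuming the `YYYY:MM:DD-HH:MM:SS` prefix seen in the excerpt above; the function names and the 30-minute threshold are illustrative choices, not anything UTM provides.

```python
import re
from datetime import datetime, timedelta

# Timestamp as seen in the excerpt, e.g. "2023:03:16-01:35:01 ".
# search() is used instead of match() so a timestamp preceded by
# unreadable bytes on the same line is still picked up.
TS_RE = re.compile(r"(\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2})\s")


def parse_ts(line):
    """Return the first timestamp found in a log line, or None if there is none."""
    m = TS_RE.search(line)
    if m is None:
        return None  # e.g. a line made up only of unreadable bytes
    return datetime.strptime(m.group(1), "%Y:%m:%d-%H:%M:%S")


def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (last_entry, next_entry) pairs separated by more than threshold."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps
```

Passing an open file works as well as a list of lines, for example `find_gaps(open(path, errors="replace"))`. On the excerpt above this would report one gap, from 01:35:01 to 09:03:10. The last entry before a gap is only a lower bound on when the freeze happened, since the system may have run for a while without logging anything.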



So, I've been running the software version of UTM (9.714) on my old unit (an XG115 r2) for a couple of years without any issues.
I recently migrated my saved config over to a new unit (XG115 r3), and a few hours after setting it up (at night)
it froze up and the interfaces (LAN) were not pingable, so I powered it down and rebooted. It's working again ...

Just wondering if there's something more I can look at to see what the issue was .. I have a hunch maybe it was DHCP related,
as my devices on the LAN were renewing the IP addresses and they were not in the table on the new unit, but it's a wild guess,
so if this doesn't happen again then maybe it's nothing to worry about.

I don't know if there would be an issue moving the config file (and license) from the old unit, but I wouldn't think so.

The new unit was installed the same way as the old unit: using the ssi-9.714-4.1.iso file and removing /etc/asg so it runs with a software license.
The old unit hasn't experienced any weird issues in years, and the Ethernet ports and devices are set up identically; nothing changed.

Just looking for thoughts and ideas ...

Stats from top:

top - 11:32:20 up 2:31, 1 user, load average: 0.09, 0.29, 0.25
Tasks: 163 total, 1 running, 160 sleeping, 0 stopped, 2 zombie
Cpu(s): 0.6%us, 0.5%sy, 0.0%ni, 98.5%id, 0.1%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 3898468k total, 3558768k used, 339700k free, 111124k buffers
Swap: 4194300k total, 112k used, 4194188k free, 1352808k cached

Zombies:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18256 0.0 0.0 0 0 ? Z 11:30 0:00 [aua.bin] <defunct>
root 18595 0.6 0.0 0 0 ? Z 11:32 0:00 [confd.plx] <defunct>



  • I don't believe this is a configuration issue, and it could be either a software and/or hardware issue.

    Did you by chance run a smart test on the disk via SSH?

    smartctl -a /dev/sda

    OPNSense 64-bit | Intel Xeon 4-core v3 1225 3.20Ghz
    16GB Memory | 500GB SSD HDD | ATT Fiber 1GB
    (Former Sophos UTM Veteran, Former XG Rookie)

  • Yes, the SMART tests (both short and long) pass, as does the memory test, and temperatures seem OK.
    I've been monitoring temperatures while the box has been up for over an hour, so I'm just checking what I can.

  • Can you paste the SMART test information here? 

  • Of course, here it is:

    escape75:/home/login # smartctl -d ata --all /dev/sda
    smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.12.74-0.424574463.ge309b77.rb7-smp64] (SUSE RPM)
    Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Device Model:     ADATA_IM2S3134N-064GM
    Serial Number:    2I3920032044
    LU WWN Device Id: 5 707c18 1006e2fd0
    Firmware Version: 6.8E
    User Capacity:    64,023,257,088 bytes [64.0 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    Solid State Device
    Form Factor:      2.5 inches
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   ACS-2 (minor revision not indicated)
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Tue Mar 21 06:31:32 2023 CET
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x00) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                (   32) seconds.
    Offline data collection
    capabilities:                    (0x5b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        (   1) minutes.
    SCT capabilities:              (0x0039) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000a   100   100   000    Old_age   Always       -       0
      2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
      3 Spin_Up_Time            0x0007   100   100   050    Pre-fail  Always       -       0
      5 Reallocated_Sector_Ct   0x0013   100   100   050    Pre-fail  Always       -       0
      7 Unknown_SSD_Attribute   0x000b   100   100   050    Pre-fail  Always       -       0
      8 Unknown_SSD_Attribute   0x0005   100   100   050    Pre-fail  Offline      -       0
      9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       20706
     10 Unknown_SSD_Attribute   0x0013   100   100   050    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       147
    167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       0
    168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
    169 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       196611
    170 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       0
    173 Unknown_Attribute       0x0012   128   128   000    Old_age   Always       -       4365747135
    175 Program_Fail_Count_Chip 0x0013   100   100   010    Pre-fail  Always       -       0
    180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   020    Pre-fail  Always       -       553
    192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       116
    194 Temperature_Celsius     0x0022   052   052   030    Old_age   Always       -       48 (Min/Max 44/49)
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    231 Temperature_Celsius     0x0033   069   069   005    Pre-fail  Always       -       31
    233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       127659751936
    234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       192211142656
    240 Unknown_SSD_Attribute   0x0013   100   100   050    Pre-fail  Always       -       0
    241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       23934369819
    242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       5005309732
    
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error       00%     20667         -
    # 2  Short offline       Completed without error       00%     20633         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

  • Oh God, thank you for formatting that, lol. 

    Yeah that's a very healthy disk.  I was wanting the values specifically, as most people just read the pass/fail line and don't worry about anything else.  All the values look really good.
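
Reading the attribute values rather than just the pass/fail line can be partly automated. This is a rough sketch only, assuming the column layout of the `smartctl` attribute table pasted above; `parse_attributes`, `suspicious`, and the choice of which raw counters to check are illustrative, and vendors do not always report raw values as plain counts.

```python
import re

# One row of the attribute table:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\S.*)$"
)

# Assumption: these raw values are simple counters where non-zero is worth a look.
RAW_COUNTERS = {"Reallocated_Sector_Ct", "Current_Pending_Sector"}


def parse_attributes(text):
    """Extract attribute rows from smartctl output, ignoring all other lines."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(4)),
                "worst": int(m.group(5)),
                "thresh": int(m.group(6)),
                "type": m.group(7),
                "when_failed": m.group(9),
                "raw": m.group(10).strip(),
            })
    return rows


def suspicious(rows):
    """Return (name, reason) pairs for attributes that deserve a closer look."""
    flagged = []
    for r in rows:
        if r["when_failed"] != "-":
            flagged.append((r["name"], "WHEN_FAILED is " + r["when_failed"]))
        elif r["thresh"] > 0 and r["value"] <= r["thresh"]:
            flagged.append((r["name"], "VALUE at or below THRESH"))
        elif r["name"] in RAW_COUNTERS and r["raw"].split()[0] != "0":
            flagged.append((r["name"], "non-zero raw count " + r["raw"]))
    return flagged
```

Feeding it the text of the output above and getting an empty list back from `suspicious` would be consistent with the "healthy disk" reading, though it says nothing about attributes smartctl could not name for this drive.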

  • I think you've put in a good effort to try to get it going. What was the source of this device? If you can return it, I would. No point in wasting more time troubleshooting.

    I've bought used PC equipment on eBay before. If it works, great; if I can't get it going within an evening or 3, it goes back.

  • Yes, it was eBay in fact.

    I previously purchased an XG115 r2 and it's still running great, but this one has had issues from the start ...

    I was just trying to figure out if I'm running into some weird bug, possibly with one of my LAN devices causing a crash.
    I know it would be strange, but I've seen strange things, and the seller claims it was running just fine before it was replaced.

    You know, when I see stuff like this it makes me wonder if it was another bad unit or:

     XG115 Rev 3 freezing sometimes on SFOS 18.5.2 MR-2-Build380 

  • Put it to the test. Attach a single client on the LAN side and see what happens. Looking up the specs, it appears the device uses pretty standard i211 NICs. These are generally very well supported.

    Doesn't matter what the seller claims. Just put defective and call it a day :).

    One never really knows what they're buying on eBay. New, open box, chances are if it's a volume seller, the item is some kind of return. Sellers buy pallets of this crap, then turn around and sell it if it passes basic tests (i.e., it POSTs).
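
For a single-client test like this, it helps to record exactly when the LAN interface stops answering. Below is a small sketch of a ping watcher to run from that one client. It assumes a Linux client whose `ping` accepts `-c` and `-W`; the `watch` helper, its parameters, and the address in the usage note are hypothetical.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """True if a single ICMP echo is answered (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(host, probe=ping_once, interval_s=10, max_checks=None,
          sleep=time.sleep, now=datetime.now):
    """Probe host repeatedly and record each change between up and down."""
    events = []
    state = None
    checks = 0
    while max_checks is None or checks < max_checks:
        up = probe(host)
        if up != state:
            # Only transitions are recorded, so the output stays short.
            event = (now(), "up" if up else "DOWN")
            events.append(event)
            print(f"{event[0]:%Y-%m-%d %H:%M:%S} {host} {event[1]}", flush=True)
            state = up
        checks += 1
        sleep(interval_s)
    return events
```

Left running as `watch("192.168.0.1")` (substitute the unit's real LAN address), the first `DOWN` line gives a timestamp to compare against the last entries in the UTM logs.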

  • I was just trying to figure out if I'm running into some weird bug, possibly with one of my LAN devices causing a crash

    The chances of that are slim to none, leaning more to none.  I think the unit may be faulty in places we can't test or don't want to bother with. Personally, I don't buy these units, as they are usually underperforming for my taste.  The seller can preach that all day long; they aren't looking out for you, they are making money.

    You could buy your own machine that isn't hardware specific for what you paid or less and it will last for years and years.  I have had a SuperMicro 1U forever, and finally just replaced the hard drive because it was failing (old 5400 RPM disk to a new SSD), and I updated to a Xeon quad core processor just because I had a dual core running in it before.

  • I buy and sell (under different IDs) on eBay too.  I dread selling anything electronic because it may turn into a headache/nightmare.  But that's how the game goes. Don't sell anything there you're not prepared to lose your shirt on.  I have an old Toshiba HD DVD player, still in the box, never opened. Do I feel lucky..............