
High CPU leads to packet loss + kernel spinning in nf_conntrack_tuple_taken

We currently have a problem that occurs routinely between 06:20 and 07:00 in the morning. What we have found out is that the SG310 appliance starts losing packets as the load rises. That in itself is somewhat understandable, but we cannot find the source of the problem: there is no clear indication that a particularly large number of packets is actually going through the firewall.


We can see very high software interrupt load on both network cards, completely saturating the corresponding CPU cores. During the day we have about 90,000 connections without any problems. The conntrack count (conntrackd -s) shows lower values than during our peaks. Traffic on the firewall (iftop) and on our switches does not show anything unusual.
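
For reference, this is roughly how we check the softirq distribution and the conntrack table size from the shell (just a sketch; mpstat comes from the sysstat package and may not be present on the appliance):

  # per-CPU softirq counters; take two snapshots a few seconds apart and compare the NET_RX row
  cat /proc/softirqs

  # per-CPU utilisation including %soft (requires sysstat / mpstat)
  mpstat -P ALL 1

  # current and maximum number of conntrack entries
  cat /proc/sys/net/netfilter/nf_conntrack_count
  cat /proc/sys/net/netfilter/nf_conntrack_max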

Perf shows:

+  25.46%      ksoftirqd/0  [kernel.kallsyms]                [k] nf_conntrack_tuple_taken
+  23.68%      kworker/1:1  [kernel.kallsyms]                [k] nf_conntrack_tuple_taken
+  10.17%      ksoftirqd/1  [kernel.kallsyms]                [k] nf_conntrack_tuple_taken
+   9.69%       conntrackd  [kernel.kallsyms]                [k] nf_conntrack_tuple_taken
+   2.26%        confd.plx  [kernel.kallsyms]                [k] nf_conntrack_tuple_taken
+   2.14%          swapper  [kernel.kallsyms]                [k] nf_conntrack_tuple_taken
+   2.01%          mdw.plx  [kernel.kallsyms]                [k] nf_conntrack_tuple_taken
+   1.62%        confd.plx  libperl.so                       [.] Perl_hv_common
+   1.31%      ksoftirqd/0  [kernel.kallsyms]                [k] hash_conntrack_raw
+   1.15%      kworker/1:1  [kernel.kallsyms]                [k] hash_conntrack_raw
+   0.69%      ksoftirqd/0  [kernel.kallsyms]                [k] nf_ct_tuple_equal
+   0.67%      kworker/1:1  [kernel.kallsyms]                [k] nf_ct_tuple_equal
+   0.62%          mdw.plx  libperl.so                       [.] Perl_hv_common
+   0.57%      ksoftirqd/0  [kernel.kallsyms]                [k] nf_ct_invert_tuple
+   0.56%      ksoftirqd/1  [kernel.kallsyms]                [k] hash_conntrack_raw
+   0.51%      kworker/1:1  [kernel.kallsyms]                [k] nf_ct_invert_tuple
+   0.50%         postgres  [kernel.kallsyms]                [k] nf_conntrack_tuple_taken
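
For anyone who wants to reproduce the profile during the spike, something along these lines should work, assuming perf is available on the appliance:

  # system-wide profile with call graphs for ~30 seconds
  perf record -a -g -- sleep 30

  # interactive report, sorted by overhead
  perf report

  # or live, without writing a perf.data file
  perf top -g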

We have disabled all reporting and reduced retention.


There is nothing strange in the logs. mdw and the service monitor go crazy at that time, but only because the appliance then has general network problems (reaching the internet and/or the internal network).


The packet filter log does not show any irregular traffic hitting the appliance.
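
To double-check independently of the packet filter log, we plan to sample the raw traffic during the next spike and count new TCP SYNs per source address (the interface eth0 is only an example and needs to be adjusted):

  # capture 1000 SYN-only packets and show the top source addresses
  tcpdump -ni eth0 -c 1000 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0' \
    | awk '{ print $3 }' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head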


When we fail over, the second unit (which had no load before) immediately starts spiking as well, so this rules out a hardware defect.


It is either an internal cron job that makes the load explode via some bug in the conntrack handling (it happens before the admbs maintenance!), or there are packets hitting the firewall (or empty/dangling connections), but we cannot find the source.
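
To rule the cron theory in or out, we intend to list everything that is scheduled around that time; the paths below are the usual Linux locations and may differ on the appliance:

  # system-wide cron entries
  cat /etc/crontab
  ls -l /etc/cron.d /etc/cron.hourly /etc/cron.daily 2>/dev/null

  # per-user crontabs
  for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u $u 2>/dev/null | sed "s/^/$u: /"; done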

It is as if it were completely hidden from us.

1. Any ideas for debugging?

2. Where would we see whether this is a "new connections" issue? (See the conntrack sketch below.)
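
Regarding question 2, our current idea is to watch the conntrack statistics and the event stream for NEW entries while the spike happens; a sketch, assuming the conntrack-tools CLI and coreutils timeout are available:

  # per-CPU conntrack counters: insert, insert_failed, drop, search_restart
  conntrack -S

  # live stream of NEW conntrack entries, with timestamps
  conntrack -E -e NEW -o timestamp

  # aggregate over ~30 seconds: which source creates the most new entries
  timeout 30 conntrack -E -e NEW \
    | awk '{ for (i=1;i<=NF;i++) if ($i ~ /^src=/) { print $i; next } }' \
    | sort | uniq -c | sort -rn | head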


When we have the problem again, is there a safe way to deactivate conntrack?
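
We realise that on a stateful firewall this is probably never truly "safe", since untracked packets no longer match ESTABLISHED/RELATED rules and NAT breaks. As an emergency diagnostic we were thinking of exempting only a single suspect flow in the raw table; the interface and port below are purely hypothetical examples:

  # exempt one suspect flow from connection tracking
  iptables -t raw -I PREROUTING -i eth0 -p udp --dport 12345 -j NOTRACK
  iptables -t raw -I OUTPUT -o eth0 -p udp --sport 12345 -j NOTRACK

  # remove the exemption again afterwards
  iptables -t raw -D PREROUTING -i eth0 -p udp --dport 12345 -j NOTRACK
  iptables -t raw -D OUTPUT -o eth0 -p udp --sport 12345 -j NOTRACK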



  • Hi MaxMueller

    I would start "top" from the shell at 06:00 and take a look at the CPU load to find out which process is consuming so much CPU.

    It may be the IPS or something else, but with this you can investigate further.
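
    To make sure the data is captured even if nobody is at the console that early, a temporary cron entry could run top in batch mode every minute during that hour (just a sketch; paths may differ on the UTM):

      # add to /etc/crontab (or a file in /etc/cron.d) and remove it after the incident
      # runs top once per minute between 06:00 and 06:59 and appends the output to a log
      0-59 6 * * * root top -b -n 1 >> /var/log/top-incident.log 2>&1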

    Regards,

    Michael
