High CPU utilization after upgrading to HiveOS 6.5r3a Honolulu.2530

  • 4
  • Question
  • Updated 1 year ago
  • Answered
After upgrading our AP121 access points to HiveOS 6.5r3a Honolulu.2530, we have a site where all APs utulize 100% of CPU. The weird thing is that this only occurs at one site. This site uses the same network policy as 5 other sites. One AP which didn't have the firmware upgrade only utilizes 65-85% CPU, the other ones keep utilising 100%. 

Here's a little diagnostic information:
--------------------------------------------------------------------------------
show cpu detail
--------------------------------------------------------------------------------
CPU total utilization:                100.000%
CPU user utilization:                 5.641%
CPU system utilization:               9.230%
Number of interrupt in last second:   2842
Interrupt utilization:                0.000%
Soft interrupt utilization:           85.128%


--------------------------------------------------------------------------------
show system processes state
--------------------------------------------------------------------------------
Mem: 97436K used, 24468K free, 0K shrd, 1996K buff, 27264K cachedCPU:  2.5% usr  4.2% sys  0.0% nic  0.1% idle  0.0% io  0.0% irq 93.0% sirq
Load average: 37.44 63.29 63.14 4/235 7559
  PID  PPID USER     STAT   VSZ %MEM CPU %CPU COMMAND
 1636  1576 root     S     103m 86.7   0 34.5 /opt/ah/bin/ah_auth
    3     2 root     RW       0  0.0   0 29.9 [ksoftirqd/0]
 1607  1576 root     S     122m102.6   0  5.2 /opt/ah/bin/ah_dcd
 1695  1576 root     S    62576 51.1   0  3.8 /opt/ah/bin/amrp2
 2313  1576 root     S    53512 43.7   0  3.6 /opt/ah/bin/ah_nbr
 7549  7491 root     R     3656  2.9   0  3.0 top
 1576     1 root     S    73076 59.7   0  1.5 /opt/ah/bin/ah_top
 1913     1 root     S    82584 67.5   0  1.1 /opt/ah/bin/ah_vpn
 7486  7316 admin    S    10276  8.4   0  1.1 sshd: admin@pts/0
 2311  1576 root     S     116m 97.3   0  0.7 /opt/ah/bin/capwap
 1853  1576 root     S    43716 35.7   0  0.7 /opt/ah/bin/ah_capture
    4     2 root     SW       0  0.0   0  0.5 [events/0]
 1609  1576 root     S    93052 76.1   0  0.1 /opt/ah/bin/ah_scd
 1639  1576 root     S    62304 50.9   0  0.1 /opt/ah/bin/fed
 1915     1 root     S    61156 50.0   0  0.1 /opt/ah/bin/ah_lcs
 1912     1 root     S    55856 45.6   0  0.1 /opt/ah/bin/ah_bgd
 1870     1 root     S    30108 24.6   0  0.1 /opt/ah/bin/ntpclient
 2585     1 admin    S     7828  6.4   0  0.1 /usr/local/sbin/hiawatha
  846     1 root     S     6620  5.4   0  0.1 /opt/ah/bin/ah_syslogd
 1632  1576 root     S     153m128.7   0  0.0 /opt/ah/bin/ah_brd
 1611  1576 root     S    94688 77.4   0  0.0 /opt/ah/bin/l7d
 2310  1576 admin    S    92124 75.3   0  0.0 /opt/ah/bin/ah_webui
 1906  1576 root     S    62324 50.9   0  0.0 /opt/ah/bin/ah_ipfw
 2312  1576 root     S    61788 50.5   0  0.0 /opt/ah/bin/capwap_srv
 1630  1576 root     S    54168 44.3   0  0.0 /opt/ah/bin/ah_dcm
 1914     1 root     S    53628 43.8   0  0.0 /opt/ah/bin/ah_ltr
 1759  1576 root     S    44552 36.4   0  0.0 /opt/ah/bin/ah_ibeacon
 1905  1576 root     S    38796 31.7   0  0.0 /opt/ah/bin/dhcpc
 1540     1 root     S    36368 29.7   0  0.0 /opt/ah/bin/ah_monitor
 2305  1576 root     S    35988 29.4   0  0.0 /opt/ah/bin/ah_cli
 1827     1 root     S    30868 25.2   0  0.0 /sbin/mdnsd
 7491  7486 root     S    29840 24.4   0  0.0 -ah_cli_ui
 2670     1 root     S    11520  9.4   0  0.0 /usr/local/bin/php-cgi -b 127.0.0.1:2008
 7316  2735 root     S    10276  8.4   0  0.0 sshd: admin [priv]
 2735     1 root     S     7628  6.2   0  0.0 /sbin/sshd -f /etc/sshd_config
 2748     1 root     S     3540  2.9   0  0.0 /sbin/getty -L ttyS0 9600 vt100
    1     0 root     S     3328  2.7   0  0.0 init
 1581     1 root     S     2424  1.9   0  0.0 /opt/ah/bin/guard
  847     1 root     S     1984  1.6   0  0.0 /opt/ah/bin/ah_klogd
   74     2 root     SW       0  0.0   0  0.0 [sync_supers]
  819     2 root     SWN      0  0.0   0  0.0 [jffs2_gcd_mtd6]
   76     2 root     SW       0  0.0   0  0.0 [bdi-default]
    5     2 root     SW       0  0.0   0  0.0 [khelper]
    2     0 root     SW       0  0.0   0  0.0 [kthreadd]
    8     2 root     SW       0  0.0   0  0.0 [async/mgr]
   77     2 root     SW       0  0.0   0  0.0 [kblockd/0]
   86     2 root     SW       0  0.0   0  0.0 [khubd]
  107     2 root     SW       0  0.0   0  0.0 [kswapd0]
  108     2 root     SW       0  0.0   0  0.0 [aio/0]




Log snippet (it keeps going on like that):

2016-02-24 11:29:00 info    last message repeated 2 times2016-02-24 11:29:00 info    capwap: SSL:unknown version (0)
2016-02-24 11:29:00 info    ah_cli: admin:<show tech >
2016-02-24 11:29:00 info    last message repeated 5 times
2016-02-24 11:28:59 info    capwap: SSL:unknown version (0)
2016-02-24 11:28:59 info    ah_bgd: bonjour gateway cancel browse service type:_workstation._tcp.
2016-02-24 11:28:59 info    ah_bgd: bonjour gateway cancel browse service type:_ssh._tcp.
2016-02-24 11:28:58 info    ah_bgd: bonjour gateway cancel browse service type:_apple-mobdev2._tcp.
2016-02-24 11:28:58 info    ah_bgd: bonjour gateway cancel browse service type:_rfb._tcp.
2016-02-24 11:28:58 info    ah_bgd: bonjour gateway cancel browse service type:_teamviewer._tcp.
2016-02-24 11:28:58 info    ah_bgd: bonjour gateway cancel browse service type:_pblipc._tcp.
2016-02-24 11:28:58 info    ah_bgd: bonjour gateway cancel browse service type:_airdroid._tcp.
2016-02-24 11:28:58 info    ah_bgd: bonjour gateway cancel browse service type:_ni-logos._tcp.
2016-02-24 11:28:58 info    kernel: [fe]: access interface counter 4, portal interface counter 1
2016-02-24 11:28:58 info    kernel: [mpi]: wifi0.2 notify kevent KEVT_IF_CHG, type IF_DOWN(2)
2016-02-24 11:28:58 info    kernel: [mac]: MAC learning: recv netdev notifier event 2 on wifi0.2
2016-02-24 11:28:58 info    amrp2: amrp delete ifp wifi0.2(4018:b14e:4dd5)
2016-02-24 11:28:58 info    capwap: CAPWAP receive kevent KEVT_IF_CHG, eventid = 17, size = 36
2016-02-24 11:28:58 info    amrp2: receive kevent KEVT_IF_CHG type IF_DOWN from wifi0.2 index 13
2016-02-24 11:28:58 warn    mDNSResponder: MDNS:add virtual interface VLAN1
2016-02-24 11:28:57 info    last message repeated 3 times
2016-02-24 11:28:57 info    capwap: SSL:unknown version (0)
2016-02-24 11:28:57 debug   kernel: ah_spectral_scan_vdetach: vap wifi0.2 DOWN, spectral scan module re-attached to vap 81642000, name=wifi0.3
2016-02-24 11:28:57 info    kernel: [mpi]: wifi0.2 notify kevent KEVT_IF_CHG, type IF_GOING_DOWN(7)
2016-02-24 11:28:57 info    capwap: CAPWAP receive kevent KEVT_IF_CHG, eventid = 17, size = 36
2016-02-24 11:28:57 info    amrp2: receive kevent KEVT_IF_CHG type IF_GOING_DOWN from wifi0.2 index 13
2016-02-24 11:28:57 warn    mDNSResponder: interface change, mdnsd refresh interface list and service!!!
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_jenkins._tcp.
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_airplay._tcp.
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_sftp-ssh._tcp.
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_ipps._tcp.
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_hudson._tcp.
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_workstation._tcp.
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_ssh._tcp.
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_apple-mobdev2._tcp.
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_rfb._tcp.
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_teamviewer._tcp.
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_pblipc._tcp.
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_airdroid._tcp.
2016-02-24 11:28:57 info    ah_bgd: bonjour gateway cancel browse service type:_ni-logos._tcp.
2016-02-24 11:28:57 info    kernel: [fe]: access interface counter 5, portal interface counter 1
2016-02-24 11:28:57 info    kernel: [mpi]: wifi0.1 notify kevent KEVT_IF_CHG, type IF_DOWN(2)
2016-02-24 11:28:57 info    kernel: [mac]: MAC learning: recv netdev notifier event 2 on wifi0.1
2016-02-24 11:28:57 info    amrp2: amrp delete ifp wifi0.1(4018:b14e:4dd4)
2016-02-24 11:28:57 info    capwap: CAPWAP receive kevent KEVT_IF_CHG, eventid = 17, size = 36
2016-02-24 11:28:57 warn    ah_dcd: Interface wifi0.1 is down.
2016-02-24 11:28:57 info    amrp2: receive kevent KEVT_IF_CHG type IF_DOWN from wifi0.1 index 10
2016-02-24 11:28:57 warn    mDNSResponder: MDNS:add virtual interface VLAN1
2016-02-24 11:28:56 debug   kernel: ah_spectral_scan_vdetach: vap wifi0.1 DOWN, spectral scan module re-attached to vap 8168c000, name=wifi0.2
2016-02-24 11:28:56 info    kernel: [mpi]: wifi0.1 notify kevent KEVT_IF_CHG, type IF_GOING_DOWN(7)
2016-02-24 11:28:56 info    capwap: CAPWAP receive kevent KEVT_IF_CHG, eventid = 17, size = 36
2016-02-24 11:28:56 info    amrp2: receive kevent KEVT_IF_CHG type IF_GOING_DOWN from wifi0.1 index 10
2016-02-24 11:28:56 warn    mDNSResponder: interface change, mdnsd refresh interface list and service!!!
2016-02-24 11:28:56 info    ah_dcd: access interface wifi1.3 is up now
2016-02-24 11:28:56 warn    ah_dcd: Interface wifi1.3 is up.
2016-02-24 11:28:56 info    ah_dcd: access interface wifi1.2 is up now
2016-02-24 11:28:56 warn    ah_dcd: Interface wifi1.2 is up.
2016-02-24 11:28:56 info    ah_dcd: access interface wifi1.1 is up now
2016-02-24 11:28:56 warn    ah_dcd: Interface wifi1.1 is up.
2016-02-24 11:28:56 info    ah_dcd: access interface wifi1.3 is down now
2016-02-24 11:28:56 warn    ah_dcd: Interface wifi1.3 is down.
2016-02-24 11:28:56 info    ah_dcd: access interface wifi1.2 is down now
2016-02-24 11:28:56 warn    ah_dcd: Interface wifi1.2 is down.
2016-02-24 11:28:56 info    ah_dcd: access interface wifi1.1 is down now
2016-02-24 11:28:56 info    ah_bgd: bonjour gateway cancel browse service type:_jenkins._tcp.
2016-02-24 11:28:56 info    ah_bgd: bonjour gateway cancel browse service type:_airplay._tcp.
2016-02-24 11:28:56 info    ah_bgd: bonjour gateway cancel browse service type:_sftp-ssh._tcp.
2016-02-24 11:28:56 info    ah_bgd: bonjour gateway cancel browse service type:_ipps._tcp.
2016-02-24 11:28:56 info    ah_bgd: bonjour gateway cancel browse service type:_hudson._tcp.
2016-02-24 11:28:56 info    ah_bgd: bonjour gateway cancel browse service type:_workstation._tcp.
2016-02-24 11:28:56 info    ah_bgd: bonjour gateway cancel browse service type:_ssh._tcp.
2016-02-24 11:28:56 info    ah_bgd: bonjour gateway cancel browse service type:_apple-mobdev2._tcp.
Photo of YMEsko

YMEsko

  • 11 Posts
  • 1 Reply Like

Posted 2 years ago

  • 4
Photo of Gary Smith

Gary Smith, Official Rep

  • 299 Posts
  • 61 Reply Likes
Hi,

We can see that the CPU is not high because of a particular process but rather because of a SIRQ function.

It would be interesting to see your config if possible. It might be that we can give some pointers on what might be causing the high CPU.

Can you also give some insight into the network configuration?
This AP is connected to a switch with;
What size broadcast domain?
How many VLANs are allowed to Trunk?
What are the traffic levels going through the ethernet port?
what are the traffic types?
What other devices or services share the VLANs?
 

Kind Regards,
Gary Smith
Photo of Nick Lowe

Nick Lowe, Official Rep

  • 2491 Posts
  • 451 Reply Likes
Do you have a packet capture that you would be prepared to share for analysis purposes?
(Edited)
Photo of YMEsko

YMEsko

  • 11 Posts
  • 1 Reply Like
Sure no problem. I filtered out the other traffic for security purposes. The packets listed in the trace were causing the issue. You can find the pcapng file on http://we.tl/XfHBcU53AD
Photo of Carsten Loemker

Carsten Loemker

  • 8 Posts
  • 2 Reply Likes
Hi,
had the same problem here, the device population are mostly Apple devices, 700+ iPads, 500+ OSX devices, Apple Tvs & 100 - 150 Windows machines.
With HiveOS 6.5r3a Honolulu.2530 running on an AP121 the CPU usage seemed to climb quickly to 100% and the AP became unresponsive.
With HiveOS 6.4r1g the problem disappeared and the CPU usage is usually below 30%.
Have rolled all existing AP121s back to HiveOS 6.4r1g using the following method
1. Reboot AP, mainly because at starting point it is unresponsive
2. Start ssh session to AP within a minute
3. Deploy image using tftp: # save image tftp://hostip/AP121-141-6.4r1g.img.S
4. Push complete config
To me it looked that once the AP runs itself up to 100% CPU the activity on the switch went through the roof.
Photo of Gary Smith

Gary Smith, Official Rep

  • 299 Posts
  • 61 Reply Likes

I wanted to add some detail about the steps we took to troubleshoot this issue. Hopefully it might be helpful for others that are seeing High CPU issues.


Firstly we made access to an AP which was showing high CPU via an SSH connection. Looking at a "show cpu detail" we could see that the CPU was at 100% and that the SIRQ was taking up the majority of this.

show cpu detail

--------------------------------------------------------------------------------
CPU total utilization:                100.000%
CPU user utilization:                 5.641%
CPU system utilization:               9.230%
Number of interrupt in last second:   2842
Interrupt utilization:                0.000%
Soft interrupt utilization:           85.128%

--------------------------------------------------------------------------------
show system processes state
--------------------------------------------------------------------------------
Mem: 97436K used, 24468K free, 0K shrd, 1996K buff, 27264K cachedCPU:  2.5% usr  4.2% sys  0.0% nic  0.1% idle  0.0% io  0.0% irq 93.0% sirq

My assumption at this point, because of the SIRQ levels, was that the traffic from the network was likely to be causing an issue rather than a particular process on the AP.

As a process of elimination and to try and pinpoint the issue; We removed all SSIDs from the interfaces. After doing this we saw that the CPU immediately dropped to 10-20%

We then added the SSIDs back one at a time. We saw that the CPU went up immediately after enabling just one SSID on one interface.

We looked at the Rx bytes on the Eth0 interface to see how fast the count was incrementing. A “Show int eth0” suggested that the "Rx bytes” count was incrementing at about 1Mb/10 seconds

At this point I was confident that the issue was network traffic related. We tried a couple of other tests to throttle Broadcast and Multicast traffic which made no difference to the CPU levels.

For reference we tried; “interface eth0 rate-limit multicast 2000” and “interface eth0 rate-limit broadcast 2000”.

The next step was to collect a wired packet capture to see what was actually being sent to the AP. This is where the IPv6 Multicast traffic and source was identified. 


The results are posted in this thread by YMEsko.


Some thoughts;

The issue was not so apparent before the upgrade to 6.5r3. The previous version was 5.1r5. Why?

- IPv6 support was added in 6.X and while the AP is running 5.1r5, the AP probably just ignores the traffic entirely. I need to verify this.


Why did throttling the multicast traffic not work even though the issue was found to be multicast?

 - My assumption (and I will do some research) is that the “rate-limit” we applied only applies to IPv4 traffic and therefore the IPv6 traffic was being ignored. The command to rate-limit has been in the HiveOS since well before IPv6 support was added. To me this means that the command is only relevant to IPv4 related multicast.


Please bare in mind that the AP CPU will have to process the traffic coming in from the wired side even if that traffic is not destined for the AP or the wireless clients. This is why it is suggested to segregate the traffic using VLANs. My preference and the preference of others is to;

  • Configure the AP Management VLAN to be seperate from other network functions
  • Configure the wireless user VLANs to be seperate from wired users
  • Configure switchports to only trunk relevant VLANs
  • Configure the AP eth0 to allow VLAN "Auto"

Whilst we see High CPU on the AP’s in certain conditions, there is usually a reason and something that can be done to prevent the issue. The initial steps taken in this thread; “show CPU detail” and “show system processes state” are correct. This will help identify which process (e.g. L7D, ah_auth, bgd, etc) might be related to the high CPU. If SIRQ appears to be the culprit, it is then worth looking at the wired traffic coming in to the AP.


Kind Regards,

Gary Smith

Photo of Johan H

Johan H

  • 2 Posts
  • 0 Reply Likes
  • Configure switchports to only trunk relevant VLANs
+1! This was the case on ours ap121!
Photo of Marek Szymonski

Marek Szymonski

  • 44 Posts
  • 4 Reply Likes

Hi,

How did you stop the IPv6 multicast traffic?

Photo of Joe Nesci

Joe Nesci

  • 6 Posts
  • 0 Reply Likes
Did you ever get a resolution to this on the AP230's?  I have the same issue.  Packet trace shows MDNS hitting the eth port connected to my switch.
Photo of Bill W.

Bill W.

  • 222 Posts
  • 35 Reply Likes
Joe, what version of HiveOS are you using? Try either 6.5r6 or 7.1r1 if you are not already running either of those.
Photo of Joe Nesci

Joe Nesci

  • 6 Posts
  • 0 Reply Likes
I'm currently running 6.5r6. Just backed it down from 6.8r1 per support.
Photo of Gary Smith

Gary Smith, Official Rep

  • 299 Posts
  • 61 Reply Likes
Hi Joe,

Can you let me know the support ticket number you have for this issue? I would like to take a look at the data to see if I can make recommendations.

Kind Regards,
Gary Smith
Photo of Joe Nesci

Joe Nesci

  • 6 Posts
  • 0 Reply Likes
Gary,
Support case # is 00186952. Thanks.