Periodic Reboot required?

We have worked through our access point crash/reboot issues and are currently working through high CPU issues. Now we are noticing that weird behavior will happen on an AP or group of APs; we reboot the AP(s) and everything is good for a while. Some of the things we have noticed:
1) High CPU, which seems to be tied to dynamic airtime fairness and/or the amount of time the access point has been online.
2) The 5 GHz radio not responding to new client requests.
3) Clients unable to connect to APs on a floor, even though those APs show connected clients with good IP addresses passing traffic.
It has become standard practice around here to pull a tech data file prior to rebooting an access point, so we can get the users going again quickly and still have something for tech support to troubleshoot.

The problem I have with this situation is that in most cases everything looks OK from the HiveManager point of view, and I have to wait for our students to complain and then go out to the locations to verify that there is actually a problem. I'm not advocating rebooting access points regularly, but I will in the short term if it gives our students a better experience.

Am I the only one experiencing these kinds of issues? BTW, all of our access points are AP330s.

Van Jones

Posted 4 years ago


Andrew MacTaggart, Champ

I have not seen this behavior with 330s.

How many clients are connected to each radio?

I have dynamic airtime fairness enabled as well.

Do a

"show system processes state" to see if you can spot what is eating all the CPU.
The output looks like top:

Mem: 282628K used, 232444K free, 0K shrd, 6148K buff, 50560K cached
CPU:  0.2% usr  0.4% sys  0.0% nic 98.8% idle  0.0% io  0.0% irq  0.2% sirq
Load average: 0.02 0.01 0.00 2/240 1491
  PID  PPID USER     STAT   VSZ %MEM CPU %CPU COMMAND
  905   395 root     R     3496  0.6   0  0.1 top
    6     2 root     SW       0  0.0   1  0.1 [ksoftirqd/1]
 1354     1 root     S     168m 33.4   0  0.0 /opt/ah/bin/ah_brd
 ....

Unless you have already figured it out.

I have WIPS disabled, by the way.

Van Jones

Through our testing with support, our high CPU utilization over time seems to be related to DAS, not WIPS. We tested with WIPS off for a while and that didn't help, but turning off DAS system-wide helped tremendously with the CPU issue. We are currently running special code on an access point that will capture detailed information relating to the CPU. That AP is currently running at 50% CPU with zero clients:

Mem: 180492K used, 71032K free, 0K shrd, 8932K buff, 59272K cached
CPU:  0.6% usr  1.4% sys  0.0% nic 48.9% idle  0.0% io  0.0% irq 48.9% sirq
Load average: 2.48 2.18 2.06 2/243 6460
  PID  PPID USER     STAT   VSZ %MEM CPU %CPU COMMAND
    7     2 root     SW       0  0.0   1 12.3 [ksoftirqd/1]
    4     2 root     SW       0  0.0   0  7.8 [ksoftirqd/0]
 1149     1 root     S     121m 49.2   1  0.9 /opt/ah/bin/ah_dcd
Support did mention a previously squashed bug that has reappeared concerning the transmit buffer getting stuck. He said this would cause the access point to look OK in HiveManager, but not allow it to service additional clients. The CLI command he gave us to verify is: "show interface wifiX _counter | inc buff", where wifiX is either wifi0 for 2.4 GHz or wifi1 for 5 GHz. Repeat this command every few seconds over a couple of minutes. If the output doesn't change, you are experiencing this bug.
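
To save anyone retyping that check by hand, here is a minimal sketch (assuming SSH CLI access to the AP and the Python paramiko library; the host, credentials, and poll interval are placeholders, and some CLIs require an interactive shell rather than exec-style commands):

import time
import paramiko

AP_HOST = "10.0.0.10"          # placeholder AP management address
CMD = "show interface wifi1 _counter | inc buff"   # wifi0 for 2.4 GHz, wifi1 for 5 GHz

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(AP_HOST, username="admin", password="changeme")

samples = []
for _ in range(6):              # roughly two minutes at 20-second intervals
    _, stdout, _ = client.exec_command(CMD)
    samples.append(stdout.read().decode())
    time.sleep(20)

client.close()

if all(s == samples[0] for s in samples):
    print("Buffer counters never changed - possibly the stuck transmit buffer bug.")
else:
    print("Buffer counters are incrementing - the transmit path looks alive.")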

Andrew MacTaggart, Champ

Mr. Van Jones,

You have sparked my curiosity.

What power levels are you using for the radios?
What b/g and a data rates are disabled?
What sort of client mix is in the environment?
How many clients are connecting to AP radios on average?
Any other QoS features enabled?
Any firewall rules applied?

From the above output:

If you press the number "1",

stats should show up for both CPUs.

CPU
0.6 usr - user processes are using 0.6% of the CPU
1.4 sys - system processes are using 1.4% of the CPU
0% nic -
NI  --  Nice value
	  The nice value of the task. A negative nice value means higher
	  priority, whereas a positive nice value means lower priority. Zero in
	  this field simply means priority will not be adjusted in determining
	  a task's dispatchability.
48.9 idle - percentage of available CPU
0% io would equate to wa - the percentage of time the CPU is waiting for I/O to complete
0% irq would equate to the percentage of time the CPU is servicing hardware interrupts
48.9% sirq would equate to the percentage of time the CPU is servicing software interrupts
***Your values are higher than mine***

But it is Easter so no one is around, though I do have 28 idle clients connected.


STAT - S means sleeping; not sure what W stands for, though.
S  --  Process Status
	  The status of the task, which can be one of:
	     'D' = uninterruptible sleep
	     'R' = running
	     'S' = sleeping
	     'T' = traced or stopped
	     'Z' = zombie

	  Tasks shown as running should be more properly thought of as 'ready
	  to run' -- their task_struct is simply represented on the Linux
	  run-queue. Even without a true SMP machine, you may see numerous
	  tasks in this state depending on top's delay interval and nice value.
%MEM  --  Memory usage (RES)
	  A task's currently used share of available physical memory.

%CPU  --  CPU usage
	  The task's share of the elapsed CPU time since the last screen
	  update, expressed as a percentage of total CPU time. In a true SMP
	  environment, if 'Irix mode' is Off, top will operate in 'Solaris mode'
	  where a task's cpu usage will be divided by the total number of CPUs.
	  You toggle 'Irix/Solaris' modes with the 'I' interactive command.

VSZ - I don't know what this is; maybe virtual image:

VIRT -- Virtual Image (kb)
	  The total amount of virtual memory used by the task. It includes all
	  code, data and shared libraries plus pages that have been swapped out.
	  (Note: you can define the STATSIZE=1 environment variable and the
	  VIRT will be calculated from the /proc/#/state VmSize field.)
	  VIRT = SWAP + RES.

So did support tell you that ah_dcd was equal to DAS?


PID  PPID USER     STAT   VSZ %MEM CPU %CPU COMMAND

1744     1 root          S     121m    49.0   1  0.1 /opt/ah/bin/ah_dcd

What happens to your software IRQ levels when DAS is disabled?

Cheers
A



Tim Ruda, Official Rep

Hi Andrew,

I've been working closely with Van on these issues so I thought I would chime in on this post as well.

The issue with DAS hasn't been attributed to client load, bandwidth contention, or hostile RF (that would be too easy! :) ); it's been seen with 0 clients connected in a clean RF spectrum.
The problem seems to be within the QoS engine that DAS is built on top of. We are currently performing some advanced debugging for our developers to isolate what particular process may be showing the condition over time, but unfortunately nothing in the top output will specifically tell you.

So in short, we are working to identify and correct the condition with this particular feature.


Andrew MacTaggart, Champ

Hi Tim

Since I have 80 330s as well and I am not seeing this issue, I was trying to figure out what might be different.

Since a reboot fixes the stuck CPU, there must be an event that triggers it.

Since it involves DAS, my first thought was that maybe too many airtime-hungry clients joined at one time, causing some sort of overflow.

So that is why I was interested in # of clients, disabled rates, power levels, types of clients, etc.

By any chance does Van have 802.11w enabled as well?
If it helps:
I have DAS enabled, but QoS is not in use
WIPS disabled
802.11w disabled

It's a quiet week here with all the children on holiday, but if you need me to gather some CPU info when it gets busy, I would be glad to help out. My CPUs seem to be healthy:

show cpu detail
CPU total utilization:                1.990%
CPU user utilization:                 0.497%
CPU system utilization:               0.995%
Number of interrupt in last second:   199
Interrupt utilization:                0.000%
Soft interrupt utilization:           0.497%
------------------------------------------------------------
CPU0 utilization:
CPU total utilization:                2.941%
CPU user utilization:                 0.980%
CPU system utilization:               1.960%
Interrupt utilization:                0.000%
Soft interrupt utilization:           0.000%
------------------------------------------------------------
CPU1 utilization:
CPU total utilization:                1.980%
CPU user utilization:                 0.990%
CPU system utilization:               0.000%
Interrupt utilization:                0.000%
Soft interrupt utilization:           0.990%

Cheers
A


Tim Ruda, Official Rep

Hi Andrew,

Thanks for your interest :) 
To compare: we had WIPS disabled, QoS disabled, and no 802.11w management frames.

The odd thing about this issue is that it didn't seem to happen during a specific event, as I would hope. We have done a heavy amount of log analysis and not come up with any particular crashing or "heavy"-usage system service around the development of the issue.

What we've seen is also not an immediate impact. The debugging work done so far shows that after a reboot of the AP to recover, it will take 5-10 days before the CPU slowly climbs back up to a high state. We saw the first 5 days after the recovery reboot operate completely normally.

I'm glad to hear you aren't having the problem on your end with DAS enabled, although this is one of the most unusual conditions I've encountered and doesn't seem to have any commonalities beyond the feature being enabled.

Rest assured, our support department is working closely with the development team on this one and we do plan to resolve the condition as quickly as possible.

Roberto Casula, Champ

Just to chime in here as well. Yes, I have seen the reported symptoms intermittently across multiple customers for years now. Despite numerous support calls, we have not completely eliminated issues at all customers.

At some customers, we find that they have no problems if they downgrade to 5.1r5a. At other customers, we find they have a better experience if they upgrade to 6.1. There is no definitive "5.1 is better than 6.1" or vice versa; it seems to vary from customer to customer.

We have had issues that have been resolved by replacing an AP even though that same AP when redeployed to a different site works fine. (This is a COMPLETE mystery).

At some customers high CPU has been resolved variously by:
- Moving away from using GRE tunnels (for example by deploying a "WiFi guest VRF" to separate guest traffic rather than using the GRE function of the APs)
- Disabling the "safety net" feature
- Disabling band-steering and load-balancing
- Disabling some of the other high-density features

My general thoughts on the above are that the APs do not cope well when there is just one "badly behaved" client (which could be a hardware issue or a driver issue) - anything that causes the AP to have to retry tx or rx frequently seems to trigger a problem and impact everyone using that AP. There doesn't seem to be much protection against this situation. Having said that, there have definitely been situations where we have had high CPU and it not be associated with a bad client - I think there are multiple issues.

In terms of random reboots, that seems to be primarily associated with WIPS, Location Server and the QoS engine/DAS, but again not consistently across multiple customer configurations.

As far as clients not being able to connect reliably while other clients already connected seem to be working fine, we have seen this numerous times. Very often the problems seem to be specifically related to the inability of the client to receive a DHCP response. All the logs and debug traces on the AP suggest the DHCP packet is being sent but the client does not seem to receive it. We even had a situation where we had two APs on a floor working absolutely fine. We then upgraded one of the APs from an AP121 to an AP330 and then found that users could associate and authenticate to either AP but when connecting to the AP330, DHCP would fail. However if the user went to the other end of the office, associated to the AP121, then roamed back to the AP330 they would then carry on working absolutely fine for the rest of the day. This problem would sometimes go away after a reboot for a day or so, and then come back. We spent a long time verifying there was no problem from a LAN/DHCP perspective as everything "checked out" on the AP from a logging/debug perspective. We tried a complete factory default, upgrade/downgrade to different 6.1x releases to no avail. The problem was fully resolved simply by downgrading the AP to 5.1r5a.

So my general perception is that there are a number of issues which may have a common or completely separate root cause, and it seems very difficult to get to the bottom of it.

Hopefully, the more people report these issues and submit debug data to Aerohive, the more chance we have of finally getting a resolution.

Roberto Casula, Champ

Oh, and just to say that in the cases of high CPU, it is always software interrupts that are causing the high utilisation rather than a HiveOS process. This suggests the problem may be in the interaction with the driver provided by the chipset vendor (i.e. Atheros). That certainly seems to be the consensus regarding the random reboots too - they seem to be correlated to functions which ask the chipset to do "two things at once".

Helpnet ULG

Those are exactly the symptoms we are experiencing, thanks for the post.
- No DHCP received by clients connected to an AP (sometimes 5 GHz, sometimes 2.4 GHz); this appeared with version 6.1x.
- Random reboots of some APs with no log explaining why.
- High CPU due to interrupts.
- Clients experiencing slow access.

WIPS is deactivated, no rogue detection; the location server is on, and so is QoS for voice.


Van Jones

Andrew,

What power levels are you using for the radios?

- We generally let the access points choose their own power levels and channels unless there is something that is obviously wrong

What b/g and a data rates are disabled?
- b/g - 1 through 9 are N/A, 11 is basic and 12-54 are optional
- a - 6, 12 and 24 are basic and 9, 18, 36, 48 and 54 are optional

What sort of client mix is in the environment?
- Academic buildings - mostly iPhone/Android phones, tablets and laptops
- Residence halls - everything in academic plus gaming consoles, wireless printers, and anything else that you can think of or can buy at a Walmart or Best Buy

How many clients are connecting to AP radios on average?
- As few as 0, as many as 60, depending on the area and time of day

Any other QoS features enabled?
- No

Any firewall rules applied?
- No

...more answers to come

Andrew MacTaggart, Champ

I have a dense deployment.
My power settings are statically set: 2.4 GHz at 5 to 7 dBm and 5 GHz at 7 to 10 dBm.
My channels are manually selected as well.

b/g - data rates: everything disabled up to 24 Mbps (basic)
a - data rates: everything disabled up to 24 Mbps (basic)
b clients denied under client selection

I strive for no more than 35 clients on a 5 GHz radio and 25 on 2.4 GHz.
When you say 60, is that on one radio? I would say that may be too many.

I had a staging room that complained about connectivity issues; they had 55 clients connected to the 5 GHz radio, so I created a radio policy just for that area to load-balance between 2.4 GHz and 5 GHz instead of urging clients to 5 GHz.

Cheers

Van Jones

Andrew, what type of environment are you serving?  Academic, residence hall, business?

Andrew MacTaggart, Champ

Academic
Lower Primary
500+ clients daily
50 AppleTVs
150 iPads
50 iPods
150 MBAs and MBPs
A slew of BYOD

3 other schools are currently using Cisco wireless: Upper Primary, Middle School and High School. I use similar settings for the other schools; they all have 1:1 programs.


Helpnet ULG

Any news regarding this?

Van Jones

The last update I received from AH support on the reboots was on 4/7, and the last update on the CPU was 4/24. I'm getting ready to stir the pot again.

Andrew MacTaggart, Champ

Any chance you are doing SNMP monitoring? I have seen Cisco device CPUs spike when using SNMP.

A

Helpnet ULG

Thank you for the info, Van.
Hopefully our support ticket will help.

Kyle Heading

Just like to add a +1 for this from us as well.

I'm not sure if it's related to the high CPU or not, but we definitely seem to have issues every 4-5 weeks with our APs.

It's not every AP, and so far I haven't been able to find a pattern, but it seems to either cause some clients no end of trouble connecting, or allow clients to connect but not get a DHCP address. There are other clients which can connect, but then their Bonjour devices don't get advertised.

The most common problem seems to be that devices connect but just don't get a DHCP address.

This has been happening for about 6-12 months now; it seems to have gotten worse with 6.1r3, but I can't really confirm that.

Simply rebooting the APs does the trick for a while, then it needs doing again.

We are currently on 6.1r3 HiveManager Online, but I'm thinking of switching to an on-site HiveManager, as I saw that it has API access in 6.1r5. I'm hoping I will be able to trigger a scheduled reboot of the devices using the API (see the sketch below).
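
As a rough illustration only (not an Aerohive-documented API), a scheduled reboot script could look something like the sketch below once the real endpoint and authentication details are taken from the HiveManager API documentation. Everything here - host, token, serial numbers, and the endpoint path - is a hypothetical placeholder:

import requests

HM_HOST = "https://hivemanager.example.local"   # hypothetical on-site HiveManager address
API_TOKEN = "replace-with-real-token"            # hypothetical auth token
AP_SERIALS = ["AH01234567890123"]                # placeholder serial numbers

def reboot_ap(serial: str) -> None:
    # Hypothetical endpoint path; check the HiveManager API docs for the real one.
    resp = requests.post(
        f"{HM_HOST}/api/devices/{serial}/reboot",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
        verify=False,  # many on-site managers use self-signed certificates
    )
    resp.raise_for_status()

for serial in AP_SERIALS:
    reboot_ap(serial)

Run from cron (or Task Scheduler) during a maintenance window, something like this would give the periodic reboot without anyone having to do it by hand.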

We have been very happy with the Aerohive systems so far; they do seem very solid, and we are very happy with the support we have gotten from Aerohive.

This is obviously an issue we would like resolved, and we are happy to help in any way possible, even if that means testing early-release software.

Thanks!




Joel Brooks

We are on 6.1r6a, and since the 6.x update I've noticed a solid trend: APs not accepting any sort of delta update. I can try many things, but the only fix is a reboot. 40 days seems to be the threshold. Coincidence? 1,203 APs are currently active.

Nick Lowe, Official Rep

Perhaps try and reproduce with the recently released 6.2r1?

Joel Brooks

Yes, I have not updated yet, so I will repost if the problem persists. Was this issue, or something like it, addressed in the latest update? Sorry, I have not read the release notes yet.

Joel Brooks

A Google search revealed this thread. Sorry to resurrect!

Joel Brooks

Resurrecting the thread again. We now have ~400 AP230s running in our high school locations, on HiveOS 6.2r1a.1931.

Last week (Friday) we started experiencing client complaints about connectivity. Uptime on all ~400 APs was at 25 days. Due to MAC filtering changes we periodically make, we typically get a full config/reboot every 2 weeks or so; in this case it had been 25 days of uptime. Sporadically, across 4 high schools, clients were complaining about connectivity. Looking at the APs through HiveManager (we have a VA), clients show healthy, with good SNR, RSSI, etc. In some cases, I cannot ping the client in question. Rebooting the AP takes care of the issue regardless.

Fast forward to Monday (yesterday), with AP uptime at 28 days: widespread complaints at our high schools, with the same exact symptoms. Again, rebooting the AP takes care of the issue.

I will now start to take note of the APs' CPU usage, assuming this happens again in 25 days.

Clearly this is a problem. We've seen this with our 330s in the past; we have roughly 1000 of those deployed. Typically after ~30 days of uptime the 330s start having weirdness, and a reboot takes care of it.

I'd love to hear anybody else's experiences. Also, is there any way possible to schedule recurring reboots?

Sean Mulligan

I too work for a school district, in which we have about 250 AP230s.  It appears we may be having a similar experience. 

In our case, the CPU utilization of the access point slowly creeps up, eventually reaching 100%, and users are unable to connect at that point. The only way to fix the problem, which has been plaguing us for months, is a reboot. It takes approximately 25-45 days for the CPU to become maxed out again after a reboot.

If you are experiencing a similar issue, check your CPU status after 25 days: connect to your AP via the command line and run "show cpu detail". The 230s actually have two CPUs. The command will display CPU total utilization, but that is deceiving: it is actually the average of the two CPUs. In our particular situation, CPU0 is the one that always gets maxed out. It is natural for both CPUs to fluctuate, sometimes reaching near 100%, but the problem occurs when one gets stuck at 90 to 100%.
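
If it helps anyone watching for the same condition, here is a minimal sketch (Python, with an arbitrary example threshold) that pulls the per-core totals out of "show cpu detail" output in the format pasted earlier in this thread and flags a CPU0 that stays pegged:

import re

STUCK_THRESHOLD = 90.0  # arbitrary example value

def core_utilization(output: str, core: str) -> float:
    """Return the 'CPU total utilization' percentage for a given core, e.g. 'CPU0'."""
    match = re.search(
        core + r" utilization:\s*\n\s*CPU total utilization:\s*([\d.]+)%",
        output,
    )
    if not match:
        raise ValueError(f"Could not find {core} in the output")
    return float(match.group(1))

def looks_stuck(samples: list) -> bool:
    """True if CPU0 stayed at or above the threshold in every collected sample."""
    return all(core_utilization(s, "CPU0") >= STUCK_THRESHOLD for s in samples)

Collect a "show cpu detail" capture once a day for a week or two, feed the list of captures to looks_stuck(), and you have a crude early warning before users start complaining.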

This has been happening on a regular basis with our access points, every 30 days or so, and is very frustrating to deal with.  We are currently working with Aerohive to find a fix.  There is no way to schedule a reboot unless you use HiveManager to push a configuration update and trick the access point into a reboot.

Roberto Casula, Champ

Do any of your SSIDs have captive web portal enabled? If so, there is an issue which is resolved in 6.4r1a (and should also be in 6.2r1c for the AP110/120) which gives you these symptoms after a period of uptime. The problem relates to the web server daemon and the way it interacts with other processes in the system over time.

The symptoms start with sporadic client connectivity issues which get progressively more pronounced, then at some point later the AP will lose its connection to HiveManager and it will usually not be possible to SSH onto the AP (though it does still respond to PING).

There is a workaround (of sorts) in the older releases to increase the CWP idle timeout to its maximum value, but a proper fix is in these newer releases.

Our customer base is seeing very few issues on the 6.4r1a code compared with all earlier releases.

Joel Brooks

Thanks for the info. I will update all my 230s to 6.4r1a code and cross my fingers. We actually do utilize a CWP for our guest access, so that could be the issue.

Van Jones

Our AP230s are currently running special code 6.4r1b Hongkong.E2139, and what you described was one of the issues that got us to this version. We do not run CWP. Our issues seem to have gone away with this version (knock on wood). We have been around long enough to know that even though issues may seem similar, they don't always have the same root cause, so you would need to speak with support regarding your specific issue to see if this code would help.

rob.butterworth

I have AP121 units, and the most loaded of them has needed repeated reboots since the last update (6.5r1). I will open a ticket, but I have not raised one since the portal changed to NetSuite, so I have to wait up to 48 hours for my portal account before I can do so!

I've had to reboot twice in 48 hours, which isn't really acceptable.

Nick Lowe, Official Rep

HiveOS 6.5r1 was only released for the AP130.

6.4r1b is the latest version of HiveOS currently available for AP121. What version do you have installed on the APs?

rob.butterworth

Sorry, you're quite right - I was reading the HiveManager version. The APs are running HiveOS 6.4r1a.2103 (which was the release that came with HiveManager 6.5r1).

Roberto Casula, Champ

6.4r1d is the latest version for the AP121 (released in early May). Although there are only a few specific fixes listed between 6.4r1a and 6.4r1d in the release notes, I am sure there are more things that are fixed but not documented (for reasons best known to Aerohive, though they aren't the only vendor that does this).

Our customers that upgraded to this release seem to be stable, including those that were having APs "locking up" in earlier releases.

rob.butterworth

Roberto, I'll give that a try.

Nick Lowe, Official Rep

Sorry, I meant to type 6.4r1d. Had a brain fart moment!

Mitchell Erblich

HiveOS tends to run a full set of CLI commands. This full set may be excessive in some environments. The knowledgeable admin will reduce this set to the minimum number of CLI commands needed for the environment.


Also, and only for expert admins: this may seem strange, but if the commands are rotated (enabled, then disabled) on a two- to three-week basis, depending on the usage of the AP or BR, I have seen APs achieve much longer uptimes, as much as 2x to 3x what you are currently experiencing.

Rotating commands allows any delayed frees, aging entries, cruft, and other structures/memory to be freed from the individual processes/tasks and then re-allocated.

Note: More than one process may be needed to do this at a time, and you should verify that the process is back up after this procedure and that the system is back to normal.