AP121s drop connections randomly.

  • Question
  • Updated 2 years ago
  • Answered
We are having connectivity issues with our AP121s. They randomly drop connections, or laptops can't connect even though they associate with an AP and pick up an IP address. Clients that drop still have an IP address and appear to be connected, but can't browse the internal network or the Internet. The APs often spike to 100% CPU usage. We are running both the 2.4 and 5 GHz bands, with 802.11g/n and a/n, because of our device mix. A typical AP will have up to 30 devices connected. Last year was fine. Over the summer we upgraded to the latest firmware, and we started the year with it and with the connectivity problems. Working with tech support, we tried changing channels, bands, and classifier maps; nothing worked. This past weekend we back-revved to 5.1r5, and this week the wifi experience was much better. However, we still have random laptops or Chromebooks that can't connect or that drop the connection. We have been working with tech support, but they appear stumped at this point. Our users are frustrated and turning away from wifi. Does anyone have any ideas for where we can look?

Mike

  • 7 Posts
  • 0 Reply Likes

Posted 4 years ago


Bill W.

  • 222 Posts
  • 35 Reply Likes
Are you using Bonjour Gateway?  And do you have multiple sites?  If you do, check your BG settings.  We had a similar issue in which users were complaining that they were dropping off randomly, and the CPU utilization on the AP121s would sit at 100% a lot of the time.  This all occurred out of the blue for us.

With help from an Aerohive engineer, we saw that the issue looked to be related to Bonjour.  After checking our BG settings, I saw that the Max Wireless Hop setting had been changed from 1 (which is what I had it set at) to 0 (which is unlimited and the default setting).  After changing the setting back to 1, the issue was resolved.

I believe this occurred as a result of the upgrade of HM from 6.1r6 to 6.2r1.  The complaints started to come in around the time after the upgrade.  The upgrade from 6.2r1 to 6.2r1a did not change the setting.

Mike

  • 7 Posts
  • 0 Reply Likes
We aren't using the Bonjour Gateway.  We have back rev'ed the AP firmware to 5.1R5 and things are much better.  So far we have had two days of good wifi experiences in the classrooms.  Thanks for the help!

Travis Kaufman, Champ

  • 113 Posts
  • 30 Reply Likes
I would take an AP and make it a remote sniffer: install Wireshark on a Windows PC and run a packet capture on the remote interface. From there, filter for mDNS traffic (224.0.0.251) and see how many packets are hitting that AP. A flood of them can cause massive CPU usage.

Also, how many SSIDs are you broadcasting?

In the end, the goal is to reduce network overhead and airtime utilization.
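
As a rough illustration of the mDNS check above, here is a minimal Python sketch that counts mDNS packets per second. It assumes the capture has already been exported as (timestamp, destination IP) pairs; that record format and the sample data are made up for the example and are not something Wireshark or HiveOS produces in this shape. In Wireshark itself, the display filter `ip.dst == 224.0.0.251` narrows the view to the same traffic.

```python
from collections import Counter

MDNS_GROUP = "224.0.0.251"  # IPv4 mDNS multicast address mentioned above


def mdns_packets_per_second(packets):
    """Count mDNS packets in each one-second bucket.

    `packets` is an iterable of (timestamp_seconds, destination_ip) tuples.
    This record format is an assumption made for the sketch.
    """
    buckets = Counter()
    for timestamp, dst_ip in packets:
        if dst_ip == MDNS_GROUP:
            buckets[int(timestamp)] += 1
    return dict(buckets)


# Made-up sample records, for illustration only.
sample = [
    (0.10, "224.0.0.251"),
    (0.55, "224.0.0.251"),
    (0.90, "10.0.0.7"),
    (1.20, "224.0.0.251"),
]
print(mdns_packets_per_second(sample))
```

A steady, high count in every bucket would support the idea that multicast traffic is what is loading the AP.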

Kevin Rodgers

  • 11 Posts
  • 0 Reply Likes
I had a similar issue going from 6.1r6a to 6.2r1 on APs at our high school. CPUs were hitting 100% and there were major connectivity problems as a result. I downgraded back to 6.1r6a and that fixed the issue, but I'm still working with support to find out what else is going on. I don't have any CPU issues at our other schools and they're running 6.2r1.

Mike

  • 7 Posts
  • 0 Reply Likes
That's the funny thing here.  We have our other schools at 6.2R1 without any problems.  We dropped back to 6.1r6a at the school with problems with no improvement so went back to 5.1r5.  Are you using 121's? (The other schools have 121's too.)

Kevin Rodgers

  • 11 Posts
  • 0 Reply Likes
Yes, we have mostly 121s and a few 330s.

Manoah Coenraad, Champ

  • 72 Posts
  • 67 Reply Likes
When you look at the SSIDs with a program like inSSIDer, do you also see some SSIDs fluctuating?
How many SSIDs are you broadcasting?

Kevin Rodgers

  • 11 Posts
  • 0 Reply Likes
I'm broadcasting 3 SSIDs. When I had the CPU problem, the APs also broadcast their virtual access console SSIDs.

Manoah Coenraad, Champ

  • 72 Posts
  • 67 Reply Likes
When looking at high CPU issues on an AP, the following might be a useful checklist:

a. "show sys proc" and "show sys proc state"

- Do this every time, before and after trying different commands and options. It gives more detail on CPU usage than "show cpu" and "show cpu detail".

b. If the customer has DAS enabled, do "no qos airtime enable"

- May or may not be the reason for the high CPU. Useful to try, to rule out one possible cause.

c. If the customer has AVC running, do "application reporting disable"

- May or may not be the reason for the high CPU. Useful to try, to rule out one possible cause.

d. Understand the number of packets/sec coming into eth0 with "show int eth0"

- Find the delta in ingress packets into eth0 over a few seconds. The purpose is to see whether high traffic may be responsible for the high CPU. Always note egress packets too, just in case.

e. Unicast/multicast/broadcast packets/sec on w0 and w1, using "show int w0 _count" and "show int w1 _count"

- Multicast/broadcast packets are multiplied at the AP. Enough of these may contribute to high CPU because of the amount of processing involved. Understanding the packets/sec of unicast/multicast/broadcast will help find or eliminate a possible cause.

f. CRC error rate and retries on w0 and w1, with "show int w0 _count" and "show int w1 _count"

- Too many CRC errors and retries may increase the CPU load when the AP processes them.

g. Number of BSSIDs

- This is a multiplication factor for the multicast/broadcast packets/sec.

h. Number of clients, both local and neighbor, with "show amrp client" and "show auth"

- When the AP exhibits high CPU, is it related to a high number of clients? If so, how are the clients authenticated?

i. Client traffic pattern

- The type of traffic clients are running (e.g. lots of video streaming, teleconferencing, voice) may give clues as to why the CPU is high.

j. AP density and roaming, with "show amrp client", "show roaming-cache", "show acsp neighbor", "show acsp", "show amrp"

- If APs are relatively close together (good enough RSSI seen by clients), are clients roaming a lot? If clients are constantly roaming, it may add to CPU load.

k. "show log buffer", "show log flash"

- Are there any events before or during the high CPU log events? Note the time stamps when these happen. Do they correspond to a particular time (e.g. lunch time, 3rd class period, Friday night gaming, etc.)?

l. "show qos counter user", "show qos counter user-profile"

- Another good way to get packet and byte information from the AP, plus packets dropped by QoS. Turning this information into packets/sec and bytes/sec may help.

m. Traffic counters from the edge switch

- The counters from the edge switch can often give a good idea of traffic levels by showing ingress/egress packets at the switch port. Get counters/stats or use port mirroring.
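
To make the arithmetic in steps d, e and g concrete, here is a minimal Python sketch. It assumes you have read a packet counter twice, a few seconds apart, and typed the two values in by hand; the function names and the sample numbers are made up for illustration and are not HiveOS output. The multiplication by BSSID count is only the rough "multiplication factor" described in step g, not a measured figure.

```python
def packets_per_second(count_start, count_end, interval_seconds):
    """Rate from two readings of the same counter taken interval_seconds apart."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (count_end - count_start) / interval_seconds


def replicated_multicast_rate(multicast_pps, bssid_count):
    """Rough estimate: each multicast/broadcast packet is sent once per BSSID."""
    return multicast_pps * bssid_count


# Made-up counter readings taken 5 seconds apart.
ingress_pps = packets_per_second(1_200_000, 1_260_000, 5)
mcast_pps = packets_per_second(40_000, 41_500, 5)

print(ingress_pps)                              # eth0 ingress packets/sec
print(replicated_multicast_rate(mcast_pps, 6))  # with 6 BSSIDs on the AP
```

Comparing these rates between a healthy AP and one stuck at 100% CPU is one way to see whether traffic volume is the difference.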


Mike Kouri, Official Rep

  • 1030 Posts
  • 271 Reply Likes
Kevin,
Seeing the Virtual Access Console being advertised indicates that the AP has lost its backhaul connection somehow. That may be another area to explore. What is the upstream device?

Kevin Rodgers

  • 11 Posts
  • 0 Reply Likes
The VACs are no longer advertised since I downgraded to 6.1r6a and the CPUs are not running at 100% anymore either. The upstream devices are HP ProCurve switches.

Travis Kaufman, Champ

  • 113 Posts
  • 30 Reply Likes
Good to hear. 

Manoah Coenraad, Champ

  • 72 Posts
  • 67 Reply Likes
Strange behaviour. We have a lot of customers with AP121s running 6.2r1 without any problems.

Jeff Haydel

  • 6 Posts
  • 4 Reply Likes
This thread came to the attention of the Aerohive PSE group, and we have been reviewing both Mike's case and the thread to see how we could help. I believe that, as I type this, Mike is in talks with his local SE. Our hope is to get some extra attention on his problem and see whether a few of the recommendations made in this thread by Manoah (really a VERY good troubleshooting process post), along with a few ideas that the PSEs came up with, could isolate and eventually resolve Mike's issue with 6.2r1 on his AP121s.

To talk to another point in Mike's post, I had discussions with the various support engineers that worked with Mike on this case as to what was and was not tried and the logic behind these decisions.  While keeping our internal processes to ourselves please be assured that we will use this case, like all cases, to improve our support to you, our users. 

I am confident that we will get Mike's Aerohive installation humming along in short order.  I hope that he decides to post the results of how his problem is solved. 

Thank you,

Jeff Haydel
PSE, Aerohive

Mike

  • 7 Posts
  • 0 Reply Likes
Just to update: we returned all of the APs in the school to 5.1r5 and had immediate improvement. We put 330s in a couple of areas that had a particularly large number of users, with good success. Roaming is still an issue. We are collecting more information about this.

Peter Powell

  • 6 Posts
  • 3 Reply Likes
Did you ever get a definitive resolution to this that allowed upgrading to current release code levels? We have been seeing similar issues with the AP120/121: high CPU and high soft interrupt usage.

Mike

  • 7 Posts
  • 0 Reply Likes
No, nothing definitive.  We back rev'ed the firmware to 5.1R5 and everything has been working without any problems.

Dennis Stellern

  • 14 Posts
  • 3 Reply Likes
Mike, are you still running 5.1R5? Or was a resolution found?

Mike

  • 7 Posts
  • 0 Reply Likes
No resolution other than keeping the AP121's at 5.1R5.

Dennis Stellern

  • 14 Posts
  • 3 Reply Likes
With 30 devices connected, what is your CPU at on the 121? I just moved one of ours back to 5.1R5 and it's better but not great: 30 to 60% utilization.

Mike

  • 7 Posts
  • 0 Reply Likes
Our utilization will drop to around 20% or so.  We can easily support 50+ devices on an AP121 at 5.1R5 without any issues.  If you have neighboring AP's you may have to downgrade those also.  We wound up downgrading 40 121's to get things working well.

Julian Daniel

  • 2 Posts
  • 0 Reply Likes
Mike: what about the switches to which the APs are connected? We had similar symptoms with our 1700 APs from day one with HP A-series (H3C/ComWare) switches, until we upgraded the firmware to 2221p02.

The client would show as connected, and sniffing the network showed DHCP packets were going out from the server...but the client didn't see them. The firmware that we used when we deployed our switches had a bug that would drop DHCP packets, and this drove us crazy until we upgraded all of them.

Now our clients connect flawlessly, and roam with ease.

Paul Messias

  • 1 Post
  • 0 Reply Likes
We have been having the same problem at our High School between classes when all the students and their devices are in motion. During that time, all the APs we can see lose connection to the hive and show their virtual access console.

When it is happening, there are no errors or dropped packets on the switches on the ports the APs are connected to.

We downgraded from 6.4r1a.2103 to 6.1r6a.1794 on the APs and the problem has gone away - hasn't been seen for weeks with the old software version.

Time for a ticket of my own.

Julian Daniel

  • 1 Post
  • 0 Reply Likes
It should be noted that in my case the switches show no errors - it was a packet capture that showed the DHCP packets going out from the server and client, but neither side seemed to be aware of the response. The firmware update on the HP switches addressed this issue, and our Aerohive connectivity issues pretty much disappeared.

Kevin Whelan

  • 53 Posts
  • 2 Reply Likes
Can someone link me to the 5.1r5 image? We have the same problem: 100% CPU with very high soft interrupt usage. We tried disabling all WIPS and Bonjour completely.

Bill W.

  • 222 Posts
  • 35 Reply Likes
Log in to the Support site and go to Software Downloads. Click on Previous Software Downloads. Then click on HiveOS and HiveManager Version 5.1. Then click on 5.1r5. Then click on HiveOS Firmware. Then finally click on AP121-141-5.1r5 to download it.

Kevin Whelan

  • 53 Posts
  • 2 Reply Likes
Thanks. I installed 5.1r5 on two APs for testing and noticed an immediate drop in CPU from 60% to 39%, plus a smaller improvement in RAM, so I am now considering deploying it to all the 121s.
Question: is this safe? Wasn't there some SSL fix after 5.1?

As an aside:
My original problem was that the 121s were all at 100% CPU. After going through the suggestions above, disabling dynamic airtime in the network policy dropped all the 121s from 100% to the current levels of 60% or 39%, depending on firmware. So I suspect the DAS feature is just too much for the 121. Should I just leave it off? Does anyone use it successfully? Are my settings wrong somewhere else? It looks like a good idea on paper and, to be fair, the 121s were all still working without user complaints at 100% CPU for months with up to 25 users, so is the 100% a red herring?

Test results (averages over 24 hours, a couple of APs with 0 users, WIPS removed, Bonjour removed):

With DAS on, all 121s were at 100% CPU regardless of firmware version.

DAS off, 6.41g: 60% CPU, 65% RAM
DAS off, 6.1r6: 60% CPU, 60% RAM (no noticeable difference)
DAS off, 5.1r5: 39% CPU, 50% RAM

Mike Kouri, Official Rep

  • 1030 Posts
  • 271 Reply Likes
Kevin,
HiveOS in general is designed as an embedded system OS, and should be able to operate successfully for long periods of time at near the CPU and memory limits. However, like all human endeavors, it's not perfect, and there ARE issues with freeing up memory under certain circumstances when it is no longer needed. Adding to that, the AP121 is one of our older platforms, designed when 802.11n was relatively new, and as such it has the smallest memory footprint and the least powerful CPU in the current AP product portfolio. 

I am very glad that you were able to resolve your current problems by reverting to an older version of code, but I hope that you will give HiveOS 6.5r3 a chance when we release it next week. We have spent the last several months trying to catch and clean up those memory issues, and have also found and fixed a few bugs that contribute to excessive CPU load across an entire network.

Kevin Whelan

  • 53 Posts
  • 2 Reply Likes
Sorry, I seem to have misled you.
I wasn't able to resolve it by reverting, but by turning off DAS.
I did see some differences between previous versions, so I thought I would post them as a guide to what I experienced in testing, because there is a lot of discussion of previous versions by some pretty knowledgeable people on these forums.
I don't think that cleaning up memory issues will help: as soon as DAS is turned on, even with no clients, the CPU shoots instantly to 99% and stays there regardless of client count, which doesn't seem related to freeing up memory.
DAS seems to be too much for the small footprint.

So are you saying I should turn DAS back on and ignore the CPU, since it is designed to operate at that high level? My real question is: can the 121 handle DAS, and if so, should I be seeing those CPU levels, or do I have other issues I need to look at? Is it just me and our setup?
There is a warning of sorts that DAS is only added to admin profiles, not default?

Sorry, I know nothing about these systems, hence the question. I am hoping for some helpful feedback from others. These units were installed education-wide across the whole country not that long ago, so it seems we have inherited an almost redundant product for very high density usage, which is not very inspiring.

Nick Lowe, Official Rep

  • 2491 Posts
  • 451 Reply Likes
Hi Kevin,

DAS is of increasingly limited benefit these days where you have a well designed cell plan and good, appropriate coverage that facilitates the use of high data rates with modern clients.

(DAS is of greatest benefit where clients are straggling on with the lowest data rates, consuming a disproportionate amount of air time in the process, and where legacy clients are in use in an environment - particularly 802.11b clients.)

On balance, you may well find it better left disabled in your environment.

Have you disabled all the 802.11b data rates? If not and if you can, you should do this.

Cheers,

Nick

Kevin Whelan

  • 53 Posts
  • 2 Reply Likes
Thanks, that eases my worries a bit.
Yes, 802.11b is disabled.
I am trying to change my setup to a better high-density model. I was only following the Aerohive high density K12 education guide PDF, which stipulated that DAS be turned on.

Dennis Stellern

  • 14 Posts
  • 3 Reply Likes
Be careful following those guides. They are good to look at for ideas, but using their settings could make the network worse. It's not a one-size-fits-all thing. I would make changes one at a time and see if things get better or worse.

Kevin Whelan

  • 53 Posts
  • 2 Reply Likes
Yes, thanks, noted. I applied all the settings and it broke the network completely, so now I've got to figure out what actually broke it. Some clients were unable to connect at all unless the AP was rebooted, others were getting disconnects every 5 minutes, and the maximum was 2-3 clients per AP. I spent a disastrous 3 days trying to get it to work until I gave up and went back to the old profiles, which work perfectly.

Johan H

  • 2 Posts
  • 0 Reply Likes

We have also noticed this problem on the AP121, and may have found a cause.

We use multiple VLANs (802.1Q) on the AP backhaul.

We noticed that the switch had a VLAN (802.1Q) on the port connected to the AP that was not used in any of the profiles on the AP. This VLAN hosted a large L2 broadcast domain with a fair bit of broadcast and multicast traffic. Even though the AP doesn't have the VLAN in its configuration, it still causes the CPU in the AP to peak at up to 100% from time to time.

I believe this can be caused by 802.1Q-tagged frames being handled in software on the CPU, rather than in "hardware" by the AP's NIC.

Others can test this by disabling VLAN after VLAN on the switch port until you see the CPU go down.

We are running HiveOS 6.5r5
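
The one-VLAN-at-a-time test described above can be sketched as a small Python loop. Everything here is hypothetical: `set_vlan_enabled` and `read_ap_cpu` stand in for whatever you use to change the switch port and read the AP's CPU (they are not real switch or HiveOS calls), the 80% threshold is arbitrary, and the demo at the bottom simulates a switch port where VLAN 99 is the noisy one.

```python
def find_noisy_vlans(vlans, set_vlan_enabled, read_ap_cpu, threshold=80.0):
    """Disable one VLAN at a time and record which ones bring AP CPU down.

    set_vlan_enabled(vlan, state) and read_ap_cpu() are hypothetical hooks
    supplied by the caller; threshold is an arbitrary CPU percentage.
    """
    suspects = []
    for vlan in vlans:
        set_vlan_enabled(vlan, False)
        try:
            cpu = read_ap_cpu()
        finally:
            # Always put the VLAN back before testing the next one.
            set_vlan_enabled(vlan, True)
        if cpu < threshold:
            suspects.append(vlan)
    return suspects


# Simulated switch port for illustration: VLAN 99 carries the noisy traffic.
enabled = {10: True, 20: True, 99: True}


def set_vlan_enabled(vlan, state):
    enabled[vlan] = state


def read_ap_cpu():
    return 100.0 if enabled[99] else 35.0


print(find_noisy_vlans([10, 20, 99], set_vlan_enabled, read_ap_cpu))
```

On a real network you would want to wait a while after each change before reading the CPU, since the load peaks only from time to time.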


Gary Smith, Official Rep

  • 299 Posts
  • 61 Reply Likes
Hi Johan,

It's good info. I made a point about switch port settings and AP port setting in this thread;
https://community.aerohive.com/aerohive/topics/high-cpu-utilization-after-upgrading-to-hiveos-6-5r3a...

I hope it's useful.

Kind Regards,
Gary Smith