HiveManager timeouts: service becoming unusable. What is my next step, please?

We are having so many issues with HiveManager 6.6r3 (and it has been happening for 12-24 months now, with all previous versions) that the system is flawed to the point of being unreliable. Left alone, the system works perfectly day to day. We have 78 APs, a mix of AP121s and AP330s running 6.4, but the problem has occurred with many previous OS versions, so I don't think the firmware is relevant.

The problem is that whenever we want to make a change, such as adding a temporary SSID or a PPSK user, the whole system just struggles, so any change is now met with complete dread. Pushing a new config for a new PPSK user to just half of the APs has turned into a 3-4 hour, very manual ordeal. Items are added to a queue that is many times longer than the number of devices we have.

I get multiple AP timeouts and failures that seem to have no association with current user loads, nor with the time since the last reboot or the last complete config push to the AP. The device update results page gets so confused that it eventually times out and shows nothing, even though I haven't deleted the entries. When I check which APs have taken the update and gone green, trying to redo the red entries results in "AP is currently under control by another admin", which can take 30 minutes to time out before I regain control and can retry. The same issues occur if I'm just trying to reboot the devices.

The issue seems to be overloading at the cloud end, not the state of the APs here, as I have tested them with no users on the system, with fresh reboots and with fresh configs, and I still get the same results. I do understand that the AP121s struggle, being so old and having memory issues, but the AP330s don't seem to be any better.

Basically, I spend the next 3-4 hours waiting and retrying, with some APs needing up to 10 retries before they go through, until I eventually get a stable, current system. Some APs tend to be repeat offenders, but this is certainly not constant; some days are better than others, never good or usable, but certainly better. On a good day I might get 35 updates successful on the first push; on a bad day, 12.

I'm not sure if this is because we are at the bottom end of the world in NZ and are getting bad queues or priorities, or because our APs are old, but I really need to sort this out, having grown more and more frustrated over the last 12 months while hoping something would magically improve. Would an upgrade to NG help? Should I be looking at a HiveVM on site? Does anyone else, especially in my part of the world, have similar issues, or is it just us?
Kevin Whelan
Posted 2 years ago
Crowdie, Champ
We had huge issues with the firmware from 6.4r1 to 6.5r3; it was almost unusable at some sites. The 6.5r4 firmware resolved our issues, so I would strongly recommend moving to that release.
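If you are unsure exactly what each AP is running, it can be checked from the AP CLI before you upgrade (a minimal check; the prompt name is just an example):

AP330#show version
(reports the HiveOS version and build the AP is running)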

Have you looked at the CPU and memory utilisation on your AP121s? I found that in firmware versions before 6.5r4 the CPU utilisation was commonly 80% plus.
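Both can be checked from the AP CLI ("show cpu" is demonstrated later in this thread; I am assuming "show memory" is also available on your build):

AP121#show cpu
CPU total utilization: ...
AP121#show memory
(total, used and free memory)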

We have a number of HMOL sites here in NZ and, using the Australian servers, we haven't had an issue.

Whereabouts in NZ are you?
Kevin Whelan
We're in Hawke's Bay.

At your suggestion I've just tried to update 30 AP330s from 6.5r3 to 6.5r4; they sit in a separate network policy, all with no clients attached at present. I checked a handful of APs first: all sitting at 52% CPU and 40% memory with 2 days' uptime, running 6.5r3. The whole network had maybe 30 users tops at the time and a very light load on a 500 Mb internet pipe. The APs all sit in a separate VLAN on a fibre/gigabit backbone.

14 were successful; the push started at 2.17pm and finished at 2.48pm. There was a 10-minute period at the start where basically nothing happened at all, then some APs started to receive the upload. 3 still say "preparing" and the rest timed out at the first stage ("Upload captive web portal files") after about 20 minutes, but as I say, I get similar results just trying to restart the devices.

On retrying the 16 failures, the 2nd attempt got much better response times: 15 successful, all under 10 minutes start to finish. 1 is stuck on "preparing", so I will wait for that to time out and retry. So nearly 1 hour has gone and I have managed to coax one third of our APs into updating, with one still stuck in limbo.

So when I am trying to add a new user urgently during the working day, you can begin to see my problem.
Crowdie, Champ
Ah, the sunny Hawke's Bay. No wonder your access points don't want to do any work.

Which server is housing your HMOL instance?

What delay are you getting from your access points to the HMOL server? In my experience it should be less than 17 ms.
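You can measure this from the AP CLI with a CAPWAP ping to your HMOL server (substitute your server's hostname):

AP#capwap ping <your-hmol-server>.aerohive.com

This sends UDP probes to the CAPWAP port (12222) and reports the min/avg/max round-trip times.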

If you change your configuration update type to "Delta Upload (Compare with the running Aerohive device config)", does that improve your configuration upgrade performance?
Kevin Whelan
hm-aus-056, does that sound right?
AP_PC-Lab#capwap ping redirector.aerohive.com

CAPWAP ping parameters:
Destination server: redirector.aerohive.com (54.172.0.252)
Destination port: 12222
Count: 5
Size: 56(82) bytes
Timeout: 5 seconds
--------------------------------------------------
CAPWAP ping result:
82 bytes from 54.172.0.252 udp port 12222: seq=1 time=222.254 ms
2016-09-05 16:37:53 alert kernel: [screen]: wifi0.2 IP spoof attack from a89f:bac8:544a is detected.
82 bytes from 54.172.0.252 udp port 12222: seq=2 time=209.128 ms
82 bytes from 54.172.0.252 udp port 12222: seq=3 time=214.985 ms
82 bytes from 54.172.0.252 udp port 12222: seq=4 time=208.155 ms
82 bytes from 54.172.0.252 udp port 12222: seq=5 time=207.758 ms
------- redirector.aerohive.com CAPWAP ping statistics -------
5 packets transmitted, 5 received, 0.00% packet loss, time 6368.366ms
rtt min/avg/max = 207.758/212.456/222.254 ms



AP330 with no clients, latest 6.5r4, no network load whatsoever, 23 minutes' uptime.
Ping results to 52.65.20.196: either 4003 ms or complete timeouts.
All APs seem the same.
Crowdie, Champ
Interesting that you can't PING the HMOL server.  Do you have ICMP blocked on a firewall?
Crowdie, Champ
The following is from an AP330 access point in Mount Maunganui running the 6.5r4 firmware.

MountAP#capwap ping redirector.aerohive.com
CAPWAP ping parameters:
Destination server: redirector.aerohive.com (54.172.0.252)
Destination port: 12222
Count: 5
Size: 56(82) bytes
Timeout: 5 seconds
--------------------------------------------------
CAPWAP ping result:
82 bytes from 54.172.0.252 udp port 12222: seq=1 time=197.868 ms
82 bytes from 54.172.0.252 udp port 12222: seq=2 time=197.386 ms
82 bytes from 54.172.0.252 udp port 12222: seq=3 time=197.278 ms
82 bytes from 54.172.0.252 udp port 12222: seq=4 time=197.852 ms
82 bytes from 54.172.0.252 udp port 12222: seq=5 time=197.169 ms
------- redirector.aerohive.com CAPWAP ping statistics -------
5 packets transmitted, 5 received, 0.00% packet loss, time 5995.36ms
rtt min/avg/max = 197.169/197.510/197.868 ms

MountAP#capwap ping hm-aus-056.aerohive.com
CAPWAP ping parameters:
Destination server: hm-aus-056.aerohive.com (52.65.20.196)
Destination port: 12222
Count: 5
Size: 56(82) bytes
Timeout: 5 seconds
--------------------------------------------------
CAPWAP ping result:
82 bytes from 52.65.20.196 udp port 12222: seq=1 time=29.700 ms
82 bytes from 52.65.20.196 udp port 12222: seq=2 time=28.889 ms
82 bytes from 52.65.20.196 udp port 12222: seq=3 time=28.807 ms
82 bytes from 52.65.20.196 udp port 12222: seq=4 time=28.921 ms
82 bytes from 52.65.20.196 udp port 12222: seq=5 time=28.817 ms
------- hm-aus-056.aerohive.com CAPWAP ping statistics -------
5 packets transmitted, 5 received, 0.00% packet loss, time 5296.829ms
rtt min/avg/max = 28.807/29.26/29.700 ms
If you execute a "show capwap client" command, what do you get? I am wondering if your access points have fallen back to the HTTP transport.
Kevin Whelan
Thanks for the support and for showing some interest; much appreciated.


CAPWAP client: Enabled
CAPWAP transport mode: UDP
RUN state: Connected securely to the CAPWAP server
CAPWAP client IP: 10.0.239.168
CAPWAP server IP: 52.65.20.196
HiveManager Primary Name:hm-aus-056.aerohive.com
HiveManager Backup Name:
CAPWAP Default Server Name: redirector.aerohive.com
Virtual HiveManager Name: lindisfarne.school.nz
Server destination Port: 12222
CAPWAP send event: Enabled
CAPWAP DTLS state: Enabled
CAPWAP DTLS negotiation: Disabled
DTLS next connect status: Enable
DTLS always accept bootstrap passphrase: Enabled
DTLS session status: Connected
DTLS key type: passphrase
DTLS session cut interval: 5 seconds
DTLS handshake wait interval: 60 seconds
DTLS Max retry count: 3
DTLS authorize failed: 0
DTLS reconnect count: 0
Discovery interval: 5 seconds
Heartbeat interval: 30 seconds
Max discovery interval: 10 seconds
Neighbor dead interval:105 seconds
Silent interval: 15 seconds
Wait join interval: 60 seconds
Discovery count: 0
Max discovery count: 3
Retransmit count: 0
Max retransmit count: 2
Primary server tries: 0
Backup server tries: 0
Keepalives lost/sent: 3/956
Event packet drop due to buffer shortage: 0
Event packet drop due to loss connection: 11
Crowdie, Champ
That looks fine. Going by that, you certainly shouldn't be experiencing what you are.

Could you take one of the access points (a spare, perhaps) to another site to test it? Even taking it home and connecting it to your residential router (if you have PoE) could be an option.

What this would test is whether the cause of the issue is the school environment or any proxy/filtering/firewall services applying to your WAN link.
Gary Smith, Official Rep
Hi Kevin,

I took the liberty of looking at your VHM instance. My observations suggest that the issue here is more than likely traffic related. The CPU numbers that you report are high: although you say 52%, the AP has two CPU cores, and one of them is currently maxed out. That is going to cause problems. I see the CPU levels are high even with no clients connected, so I do not believe this is client-traffic related.

My guess at this point, without debugging further, is that this is network-traffic related. I would suggest that you look at how the switchports are configured, what types of traffic are hitting the AP, and so on.
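As an example of what to look for on the switch side, pruning the trunk so that only the VLANs the AP actually needs can reach it cuts the broadcast and multicast load the AP has to process (a hypothetical Cisco-style snippet; adapt it to your switch vendor, and the VLAN IDs are examples only):

interface GigabitEthernet1/0/10
 description AP uplink
 switchport mode trunk
 switchport trunk allowed vlan 1,10,20
! carry only the management and user VLANs, rather than every VLAN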

As a best practice, I recommend setting the "allowed VLAN" to auto. This seems to help the CPU in your case:
AP_Plantation#int eth0 allowed-vlan auto

AP_Plantation#
AP_Plantation#show cpu
CPU total utilization: 6.930%
CPU user utilization: 1.485%
CPU system utilization: 1.485%
AP_Plantation#
AP_Plantation#
AP_Plantation#int eth0 allowed-vlan all
AP_Plantation#show cpu
CPU total utilization: 51.652%
CPU user utilization: 1.652%
CPU system utilization: 2.066%
AP_Plantation#
AP_Plantation#


Allowed VLAN: Set a list of VLAN IDs that the AP can use to filter traffic allowed to cross the Ethernet interface. By default, an AP allows traffic tagged with any VLAN ID to traverse its Ethernet interface. To allow traffic tagged with any VLAN ID, enter all. You can alter this default behavior by entering the keyword "auto" or one or more specific VLAN IDs, as described below:

Enter auto to allow traffic whose VLAN ID matches that of the management interface, virtual management interface, native VLAN, or the default VLAN configured in user profiles.

Enter one or more individual VLAN IDs between 1 and 4094. You can enter a single VLAN ID or multiple VLAN IDs by separating them with commas; for example, 1,10,20,30. You can also enter a range of VLAN IDs by separating the start and end by a dash; for example, 1-20.
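Putting that syntax together, restricting an AP to specific VLANs and then re-checking the CPU would look like this (the VLAN IDs are examples only):

AP_Plantation#int eth0 allowed-vlan 1,10,20-30
AP_Plantation#show cpu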


I also wrote about high CPU in another thread; you may find it useful:
https://community.aerohive.com/aerohive/topics/high-cpu-utilization-after-upgrading-to-hiveos-6-5r3a...

It might be that the "allowed VLAN" change is enough to resolve your issue; please let us know. It would also be worth engaging with your support provider so that they can loop in Aerohive support for additional help if required.

Kind Regards,
Gary Smith
Kevin Whelan
Thanks, Gary.
I read through your thread. We were on 6.4r1 at the time and have had the same issues with earlier builds; that thread was specifically about a 6.5r3 bug that stopped once reverted. We also have no IPv6 on the VLANs and were only at 47% sirq, so I suspect that wasn't the issue.
I have made the VLAN changes you suggested on all APs and saw the drastic CPU drop you described, but it appears that wasn't enough to resolve our issue.

It took all day to apply the changes, and after packet-capturing Ethernet on many APs we were seeing 12 packets in 10 minutes on APs with no clients, yet we still get the same problems. Reboot commands time out multiple times; a packet capture shows 1-2 packets in either direction, then the console just stalls until the timeout. An upload can sit for 10 minutes before we see some data packets flow. I have tried multiple reboots and updates and still get maybe a 25% success rate. APs with no clients can still take multiple timeouts before the SSH HiveOS tool connects. Ping times from the APs to Aerohive using the HiveOS commands show 4000 ms, but I suspect that is some Java bug because it has to go through Australia and back; I've yet to test SSH from a local client. Ping times from anything else on the VLAN are 30-40 ms, so I suspect local SSH will be fine too.
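When I do test locally, it will be something like this from a workstation on the same VLAN (standard tools; the AP IP is the CAPWAP client IP from the show capwap client output above, and admin is the usual HiveOS login):

ping 10.0.239.168
ssh admin@10.0.239.168

which bypasses the HiveManager SSH proxy entirely.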

Finally, after fully updating every AP today, rebooting every AP in the last few hours, checking many for CPU usage (basically nil) and testing a HiveOS bee update, it stalls for 20 minutes and tells me 9 devices are offline, which they obviously are not.
I'm at a loss to see why things are so unresponsive from HiveOS; it makes operations frustrating and time-consuming, and it doesn't seem to matter which OS version, or whether it's an AP121 or an AP330, so I think CPU and memory can be ruled out.
Long term, what are my options? Would upgrading the APs fix it? A local VM?
Gary Smith, Official Rep
Hi Kevin,

I'm taking another look now.

I see that an ICMP PING fails altogether:
AP_Staff-Workroom#ping hm-aus-056.aerohive.com
PING hm-aus-056.aerohive.com (52.65.20.196) 56(84) bytes of data.
--- hm-aus-056.aerohive.com ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4000ms
AP_Staff-Workroom#

AP_Staff-Workroom#ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
--- 8.8.8.8 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4001ms

I notice that a UDP PING does work:
AP_Staff-Workroom#capwap ping hm-aus-056.aerohive.com
CAPWAP ping parameters:
Destination server: hm-aus-056.aerohive.com (52.65.20.196)
Destination port: 12222
Count: 5
Size: 56(82) bytes
Timeout: 5 seconds
--------------------------------------------------
CAPWAP ping result:
82 bytes from 52.65.20.196 udp port 12222: seq=1 time=40.45 ms
82 bytes from 52.65.20.196 udp port 12222: seq=2 time=41.134 ms
82 bytes from 52.65.20.196 udp port 12222: seq=3 time=57.392 ms
82 bytes from 52.65.20.196 udp port 12222: seq=4 time=48.807 ms
82 bytes from 52.65.20.196 udp port 12222: seq=5 time=48.172 ms
------- hm-aus-056.aerohive.com CAPWAP ping statistics -------
5 packets transmitted, 5 received, 0.00% packet loss, time 5247.693ms
rtt min/avg/max = 40.45/47.110/57.392 ms
AP_Staff-Workroom#

Crowdie asked: "Interesting that you can't PING the HMOL server. Do you have ICMP blocked on a firewall?"
Are you blocking ICMP PING or anything else? DNS is working fine.

Tracert seems to get as far as the default gateway and then stops. To me, this suggests that something is being blocked or controlled by your WAN link/firewall.
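If plain tracert is filtered, protocol-specific traces from a client on the same VLAN can sometimes get further (a sketch using Linux traceroute; -I usually needs root):

traceroute -U -p 12222 hm-aus-056.aerohive.com
(UDP probes toward the CAPWAP port, mirroring the APs' traffic)
traceroute -I hm-aus-056.aerohive.com
(ICMP echo probes, for comparison)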


I did an update on the same AP (connected to HM via UDP) and it seems to take around two minutes for a delta update.

I then changed the CAPWAP transport to HTTP. At the time the update was successful, the AP had changed transport mode back to UDP. Crowdie also asked: "If you execute a show capwap client command what do you get? I am wondering if your access points have fallen back to the HTTP transport."

My suggestion at this point is to look again at the network. It would be an idea to connect another device to what would be an AP switchport and see if you still have the same PING issues. Also, see if you can pass traffic to the internet.
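For that test, from a laptop patched into the AP's switchport, something like this (standard tools; I am assuming the HMOL web UI answers on TCP 443):

ping 8.8.8.8
ping hm-aus-056.aerohive.com
curl -v https://hm-aus-056.aerohive.com/

would confirm ICMP and TCP egress separately, which matters here because your CAPWAP UDP traffic clearly gets through while ICMP does not.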

I still do not believe that this is an Aerohive software or hardware issue, but I am open to investigating the possibilities.

Kind Regards,
Gary Smith


Kevin Whelan
Thanks, Gary.
I don't think it's a hardware issue either; more a cloud response issue. Our firewall had a sudden issue with ICMP which unfortunately coincided with your testing, and that could still be a factor, but we only switched to a new fibre provider and a new firewall at the beginning of the year and the problems precede that. It's possible they were more of a CPU/memory issue back then, which has been fixed in recent builds. I'm told tracert is blocked by our fibre provider.

Anyway, the firewall ICMP was fixed today and I tried to upgrade from Bee again; still various devices down according to the test. I spent all day forcing them through anyway, so we are now on the new CAPWAP.

Here is an example of an SSH connection at the same time as an upload has stalled. The upload started at 2.42 and failed at 2.57; this was attempt 7 at the upload.
Gary Smith, Official Rep
Hi Kevin,

Without doing more live debugging, it's going to be difficult to progress this issue via the community. My suspicion is still that this is a network issue and not an Aerohive-specific issue. To prove or disprove that, I would ask you to open a case with your distributor, and then we can work together via a support ticket.

Kind Regards,
Gary Smith
Kevin Whelan
Thanks for all your help.