SLA compliance over time both device and client

  • Question
  • Updated 2 years ago
  • (Edited)
Hi guys,
I implemented a simple wireless network at a school. We have only one SSID with PSK. The wireless itself works without problems, except for SLA compliance on both the Device and Client side.
Most of the time the Compliant Devices figure is 99-100%. But after lunch time, when 500 devices connect and use the wireless, Compliant Devices drops to around 30-50%, sometimes even 10%, and our network slows down and lags.
Has anyone faced this issue before? If so, how did you handle it?

I searched through the community and the documentation and found that we can change the SLA in the User Profile. Here is what I did:


But we still see no difference.

The school has 25 APs (HiveAP230) and HMOL firmware 6.2.1a.

Any ideas or suggestions are welcome. Thanks in advance.

Hoang Tung

  • 31 Posts
  • 0 Reply Likes

Posted 3 years ago

Hoang Tung

Hi, 
After I changed the SLA in the User Profile from 1000 to 5000 (with Log and Boost), I started getting warnings on devices. Here are the screenshots:


Is there something not right?

Thanks

Eastman Rivai, Official Rep

  • 146 Posts
  • 17 Reply Likes
Hoang Tung,

The SLA information tells you that your network cannot cope with the load. You may need to investigate the number of access points, co-channel interference, data rate settings, power, etc.

I do not have visibility into your network, but I would change the data rate settings to
6 9 12-basic 18 24 36 54 for both radios and monitor the result. A passive site survey will help determine the best settings.

It may also be best to enable band steering so that clients prefer the 5 GHz radio.



Eastman
Hoang Tung

Hi Eastman,
I enabled band steering on the 2.4 GHz radio and changed my data rates as well, but the results are no better.
I did the site survey with AirMagnet Pro, and the coverage was pretty good.

In the meantime, one of the school's teachers noticed that AFP traffic consumes a lot of bandwidth. Does this affect the wireless?
Hoang Tung

Hi Eastman,
Thanks for your reply.
I will check on that and keep you updated.

Crowdie, Champ

  • 972 Posts
  • 272 Reply Likes
In HiveManager, click on the Dashboard link followed by the Troubleshooting tab. The "Top 10 APs by Channel Utilization", "Top 10 Aerohive Devices by Errors", "Top 10 APs by Retries" and "Top 10 APs by Airtime Utilization" graphs will give a pretty good description of the wireless network's performance.

If you have retries or errors above 10% or your channel/airtime utilization is above 30% you have issues.
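
As a rough illustration of that rule of thumb, here is a minimal Python sketch that flags access points exceeding the 10% and 30% limits. The metric names and sample numbers are hypothetical; HiveManager does not export data in this shape as far as this thread shows, so you would have to copy the values from the dashboard graphs yourself.

```python
# Limits taken from the rule of thumb above (percent values).
RETRY_ERROR_LIMIT = 10.0
UTILIZATION_LIMIT = 30.0

# Hypothetical metric names, chosen only for this sketch.
LIMITS = {
    "retries": RETRY_ERROR_LIMIT,
    "errors": RETRY_ERROR_LIMIT,
    "channel_utilization": UTILIZATION_LIMIT,
    "airtime_utilization": UTILIZATION_LIMIT,
}


def find_issues(ap_stats):
    """Return (ap_name, metric, value) for every metric above its limit."""
    issues = []
    for ap_name, metrics in ap_stats.items():
        for metric, limit in LIMITS.items():
            value = metrics.get(metric)
            if value is not None and value > limit:
                issues.append((ap_name, metric, value))
    return issues


# Made-up sample values read off the four dashboard graphs.
sample = {
    "AP-01": {"retries": 9.0, "errors": 4.0,
              "channel_utilization": 42.0, "airtime_utilization": 25.0},
    "AP-02": {"retries": 12.5, "errors": 3.0,
              "channel_utilization": 20.0, "airtime_utilization": 18.0},
}

for ap, metric, value in find_issues(sample):
    print(f"{ap}: {metric} at {value}% is above the limit")
```

With the sample above, AP-01 is flagged for channel utilization and AP-02 for retries.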

Patvinder Singh

  • 1 Post
  • 0 Reply Likes
Hi Crowdie, what issues might I face if this is what it shows for my errors? Please advise me on this matter.
Crowdie, Champ

OK to "borrow" from the Aerohive help:

CRC Error Violations: CRC (cyclic redundancy check) error violations are listed here when the number of CRC errors detected in received wireless frames (divided by the total number of frames received in the reporting period) exceeds the CRC error rate threshold for a device, which is 30% by default.

Tx Drop Violations: An Aerohive device considers a transmitted wireless frame as dropped when it retries transmitting the same unicast frame the maximum number of times without receiving an acknowledgment from the intended recipient. The device tracks all the dropped Tx unicast frames and all the Tx unicast frames during the report period. (Only unicast frames can be tracked this way because multicast and broadcast frames do not require an acknowledgment.) It then calculates the Tx drop rate by dividing the total number of dropped Tx unicast frames by the total number of Tx unicast frames. Tx drop violations appear here when the rate of dropped Tx unicast frames exceeds the threshold, which is 40% by default.

Rx Drop Violations: An Aerohive device might drop a wireless frame on its ingress wifi interface for several reasons such as the arrival of a duplicate frame or a frame that cannot be decrypted. The device tracks all the dropped Rx frames and all the Rx frames during the report period. It then calculates the Rx drop rate by dividing the total number of dropped Rx frames by the total number of Rx frames. Rx drop violations appear here when the rate of dropped Rx frames exceeds the threshold, which is 40% by default.

So again in English:

A CRC error occurs when a frame arrives corrupted or can't be understood.

A Tx Drop occurs when the access point gives up trying to transmit a piece of data.

An Rx Drop occurs when the access point gives up trying to understand a piece of data intended for it.
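
To make the arithmetic in the help text concrete, here is a small Python sketch of the three rate calculations and their default thresholds (30% for CRC errors, 40% for Tx and Rx drops). The counter names and sample numbers are made up for illustration; they are not names used by HiveOS or HiveManager.

```python
# Default thresholds quoted from the help text, as fractions.
CRC_ERROR_THRESHOLD = 0.30
TX_DROP_THRESHOLD = 0.40
RX_DROP_THRESHOLD = 0.40


def rate(part, total):
    """Fraction of 'part' in 'total'; 0.0 when nothing was counted."""
    return part / total if total else 0.0


def violations(counters):
    """Return the names of the violation types a radio would be listed under."""
    checks = [
        # (name, numerator, denominator, threshold)
        ("CRC error", counters["crc_errors"], counters["rx_frames"],
         CRC_ERROR_THRESHOLD),
        ("Tx drop", counters["tx_unicast_dropped"],
         counters["tx_unicast_frames"], TX_DROP_THRESHOLD),
        ("Rx drop", counters["rx_dropped"], counters["rx_frames"],
         RX_DROP_THRESHOLD),
    ]
    return [name for name, part, total, limit in checks
            if rate(part, total) > limit]


# Hypothetical counters for one radio over one reporting period.
radio = {
    "crc_errors": 350,
    "rx_frames": 1000,
    "rx_dropped": 50,
    "tx_unicast_dropped": 100,
    "tx_unicast_frames": 1000,
}

print(violations(radio))
```

In this made-up example only the CRC error rate (35%) is over its threshold; the Tx drop rate (10%) and Rx drop rate (5%) are not.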
 
As a rule you want to keep your errors below 10% if you have a data only wireless network and 2% (good luck with that) for a wireless network carrying voice data.  If you have a high density access point deployment you can expect your error counts to be higher.

Looking at your error counts:

Your 2.4 GHz error counts exceed 10% and some look closer to 20%.  This is most likely because your access points are physically too close or you have too many 2.4 GHz radios enabled.  Sometimes you need to have your access points close for the 5 GHz coverage and this causes your 2.4 GHz error counts to rise (all things being equal a 2.4 GHz signal will travel further than a 5 GHz signal).  In this case you can disable some 2.4 GHz radios.

Your 5 GHz error counts look good, although the B1-Z4-AP04 access point's 5 GHz error count is starting to get high.  As this is a transmit (from the point of view of the access point) error, possible causes include the access point being too close to other access points, clients being able to go into a marginal coverage area without roaming to another access point, or clients being able to leave the coverage area (is the access point on the perimeter of your coverage area?).
Hoang Tung

Hi Crowdie,
Thanks for your reply,
I checked as you said. There are no errors above 10%; the maximum was 9% for one AP (2.4 GHz Tx retry). Nothing else was above 30%.
Here are the screenshots of Top 10 Devices by Errors:


Is this related to the SLA in any way?

Thanks
Crowdie, Champ


The SLA settings are located in the radio profile assigned to the access point:

Select the correct density for the access point and then click on the "Customize" button.  You can now advise the access points how you expect them to operate congestion wise:

From Aerohive Help:

To counter traffic congestion from clients with otherwise healthy Tx/Rx bit rates, APs can monitor client throughput and report their SLA (service level assurance) status to HiveManager. The APs can also dynamically increase the amount of airtime for clients with a significant backlog of queued packets so that the AP can send them out faster and thereby improve their throughput.

In this section, you set the throughput level for the SLA: High-density, performance-oriented network, Normal-density network, or Low-density, coverage-oriented network. The level you select reflects the needs or objectives of the deployment. For example, if the main goal of the deployment is to provide coverage rather than throughput, then you might select the normal- or low-density network option. On the other hand, if the objective of the deployment is high throughput, then you probably want to set the level for normal- or high-density networks. The default throughput level is normal.

To see the default settings that constitute each throughput level, click Customize. For each radio mode (or phymode)—11a, 11b, 11g, 11n, 11ac—there is a pair of settings for bit rate, success rate, and usage. In most cases, the AP and client use several different rates for transmitting and receiving packets, changing rates as factors such as RSSI and packet loss change. Therefore, to determine a common "middle" point to which various client scores can be compared, HiveManager provides a pair of settings for each phymode:

Rate: The rate setting defines the transmission bit rate that you expect clients with healthy connectivity to use. For 11a/b/g modes, the rates are in Mbps speed. For 11n mode, the rates are Mbps and MCS (modulation coding scheme). If you want to modify the rate for a phymode, choose a different rate from the drop-down list.

Success: The success setting defines the percent of packets that you expect clients with healthy connectivity to transmit successfully—that is, packets transmitted without retries—at the defined rate. If you want to modify the success percent for a phymode rate, enter a different number from 1 to 100 % in the Success field.

Usage: The usage setting defines the percent of time that you expect clients with healthy connectivity to transmit at the defined rate. If you want to modify the usage percent, enter a different number from 1 to 100% in the Usage field. Note that the aggregated usage for the two bit rates must be equal to or less than 100%.

If you have selected the wrong density type (say normal when the access point is actually part of a low density deployment) then you will get SLA violations.
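
The Success and Usage rules from the help text can be sketched as a small Python check. This is only an illustration of the stated constraints (1 to 100% for each field, and an aggregated usage of at most 100% for the pair of rates); the class name, field names and example rates are assumptions, not HiveManager's actual data model or defaults.

```python
from dataclasses import dataclass


@dataclass
class RateSetting:
    rate: str      # e.g. "54 Mbps" or an MCS label; free text in this sketch
    success: int   # percent of packets expected to be sent without retries
    usage: int     # percent of time clients are expected to use this rate


def validate_phymode(settings):
    """Return a list of problems for one phymode's pair of rate settings."""
    problems = []
    if len(settings) != 2:
        problems.append("expected a pair of settings")
    for s in settings:
        if not 1 <= s.success <= 100:
            problems.append(f"{s.rate}: success must be 1-100")
        if not 1 <= s.usage <= 100:
            problems.append(f"{s.rate}: usage must be 1-100")
    if sum(s.usage for s in settings) > 100:
        problems.append("aggregated usage exceeds 100%")
    return problems


# Hypothetical values, not the defaults of any density level.
ok_pair = [RateSetting("54 Mbps", 90, 60), RateSetting("24 Mbps", 90, 40)]
bad_pair = [RateSetting("54 Mbps", 90, 60), RateSetting("24 Mbps", 90, 50)]

print(validate_phymode(ok_pair))
print(validate_phymode(bad_pair))
```

The first pair passes with no problems; the second is rejected because 60% plus 50% usage is more than 100%.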

Hoang Tung

Thanks Crowdie, I will give it a shot and update you later.

glenstorey .

  • 5 Posts
  • 0 Reply Likes
re: AFP. 
One of the teachers (and the I.T. Director) here! 
Sorry AFP might be a secondary issue - it's a bit of a bandwidth hog, but I suggested that setting its priority to 'low priority' might help mitigate the load issues we're having. 
Eastman Rivai, Official Rep

Can you run the top applications by usage report for the period between 12 and 14?  Can you also check whether dynamic airtime scheduling is enabled? Please disable it if it is.

You may need to assign AFP to a lower priority queue and apply a rate limit to it. Please let me know if you need help with this.

Eastman
Hoang Tung

Hi Eastman, 
Are you able to help me with this case?

Thanks,
Tung
Hoang Tung

Hi Eastman,
This is the report of the top 10 applications by usage between 12 and 14:


Dynamic airtime scheduling is disabled, as shown here:


I assigned AFP to a lower priority, as shown here:

This is the rate control setting:


Is there anything else I should check?
Thanks.
Hoang Tung

Here is the update for SLA on 24 June:



Sometimes the share of SLA noncompliant devices reaches 98.80%. That is really bad.