Unable to create a AAA User Directory object because the AP is unable to JOIN the Active Directory (AD) domain

Ok, more specifically, we are unable to modify existing or create new 'AAA User Directory Settings' due to the inability to join a new AD domain.
Unknown error: HiveAP ACUSTOMER-RADIUS-AP was unable to join the Active Directory domain CUSTOMER-AD.NEWDOMAIN.tld. .
[NOTE:  I have a support case open with Aerohive.  This post is a copy/paste with some edits of that case, as here I can format things nicely and try to keep things more generic.  I've scoured this forum for any/all mentions of this topic--and I've learned a few tidbits so thanks to everyone who does post here--and I've Googled like mad, but so far, I can't find any answers.  So I'm posting here in hopes to trigger thoughts/etc.]

  • On-prem virtual HiveManager v6.8r3a (upgraded to this from v6.8r3 in an attempt to resolve this issue but to no avail)
  • AP330s running HiveOS v6.6r2a

Quick context:  We run an HM VA providing MSP type services to our customers.  One of those customers is having this issue in their VHM.  We'll call them the 'CUSTOMER'.

Now we had a working network profile set up with 3 SSIDs:  a guest network, a voip network, and a staff network.  The issue here revolves around the staff network, as it was the only one set up to use an AP as a RADIUS server to authenticate against their MS Active Directory server.

But recently the CUSTOMER began a process to migrate from their old AD servers to new AD servers, and in the process they changed their AD domain from 'CUSTOMER.OLDDOMAIN.tld' to 'CUSTOMER-AD.NEWDOMAIN.tld'.  (Mind you, we had nothing to do with this migration, but we were informed all accounts were brought over, etc.  More on this later).

We figured, "No problem.  We'll just adjust the 'AAA User Directory Settings' object currently configured for their old AD server domain and IP address, set those to the new domain and IP, and have the unit join that domain."  But that's as far as we got.

Sure, we first modified the DNS setting to the new AD server, set the domain to the new domain, and clicked [Retrieve Directory Information], at which point it populated the 'Active Directory Server:' and 'BaseDN:' just fine.

HOWEVER, when we entered the 'Domain Admin:' username and password and clicked [Join], all we got was an error message saying:
Unknown error: HiveAP ACUSTOMER-RADIUS-AP was unable to join the Active Directory domain CUSTOMER-AD.NEWDOMAIN.tld. .
And this is pretty much as far as we got.

Mind you, I spent the next half DAY trying to get this working, but to no avail.
  • We verified that the domain admin and domain user accounts were set up properly, were not disabled, and had the right permissions, and we even reset the passwords just to make sure.
  • We ran Wireshark ON the new AD server (our guys have admin access to their servers) and verified that ACUSTOMER-RADIUS-AP was, in fact, doing LDAP queries against the AD server when we clicked on [Retrieve Directory Information] and when we clicked on [Join].
  • What we did NOT see were any rejection messages, just regular queries and responses.
  • The only oddity we noticed was that the AP made a DNS query of its hostname but using the OLD domain (e.g., ACUSTOMER-RADIUS-AP.CUSTOMER.OLDDOMAIN.tld instead of ACUSTOMER-RADIUS-AP.CUSTOMER-AD.NEWDOMAIN.tld).
  • SSHing into the RADIUS AP, I found the config showed settings for the old domain, which makes perfect sense since we were unable to modify the settings to the point where HM would let us save, ergo we never got to the point of pushing the new domain info to the APs.
  • I tried creating entirely new 'AAA User Directory Settings' objects using different APs in the hopes whatever this was was only local to the one AP.  No dice.
I got so frustrated with the situation that after upgrading HM to v6.8r3a didn't help (which I didn't expect), I did the following:
  • I rebooted ALL the APs and tried again.  No dice.
  • I removed the staff SSID from the network profile, did a complete push of the config, and after the APs rebooted, verified that the one we wanted to use as the RADIUS server had no vestige of the old domain in its config.  Yet trying again, same error.
  • I tried to delete the entire RADIUS/AD setup, but eventually ran into an issue where HM would not let me delete the staff SSID, even though it was no longer referenced in the network policy, there were no other network policies, and by searching I could find no reference to that SSID except in the audit log.  Yet trying to delete it I got the usual:
    The removal failed because "CUSTOMER-Staff-AD" is still in use by another configuration item. Please disassociate references to this item from other configuration items before removing it.
  • The same thing happened when I tried to delete the 'AAA Client Settings' object:
    The removal failed because "AP-RADIUS-CUSTOMER" is still in use by another configuration item. Please disassociate references to this item from other configuration items before removing it.
Seriously, if you're going to deny me the ability to delete something, how about telling me what the heck is referencing it.  Is that too much to ask?  CLEARLY HM knows or why is it saying this?

At this point I'm at a complete loss.  I've even attempted to simply create a 'AAA User Directory Settings' object with the old AD IP and domain (they left the old setup running in tandem as they migrate things over), but trying to do so results in the following error message:
Unable to retrieve Active Directory Information. The Aerohive RADIUS server is currently disconnected from HiveManager. Please restore the connection and try again.
This, of course, makes absolutely no sense, since the whole POINT of this step is to CREATE that Aerohive RADIUS server!

Oh, and allow me to add something here.  On top of everything else, the reason this has been so frustrating is that earlier this week with another customer (so yes, another VHM within our HiveManager setup), I began working with them to set up this exact same feature, and the setup of the 'AAA User Directory Settings' object went smooth as glass.  And as we successfully set up the CUSTOMER's original AD configuration over a year ago, it's even more surprising to have this happening.

Anyway, at this point I could really use some help.  Thanks for taking the time to read all this, and thanks in advance to any and all who are able to provide some insight.

Posted by Frank, 2 years ago

Reply from Frank:
Problem solved!

Short answer:  If you get "Unknown error" when attempting to join the domain, it could be due to a connectivity issue between the AP and AD server.  But to be sure, SSH into the AP and run some diagnostic commands to get enough info to work with.

This afternoon I was on the phone with Aerohive Tech Support.  After I described the problem in detail, the technician went off to talk to a senior engineer, then came back and we began doing some testing while we waited.  That time proved invaluable, as it finally gave me more info to work with.

The HiveManager GUI did not provide useful feedback.  It only had the "Unknown error" message.  As the tech explained it to me, he has come to consider these not as unknown errors so much as unexpected ones.  That is, the programmers did not anticipate certain sequences of events and therefore did not set up a specific error message for this event.  The "catch all" error is the "Unknown error" message.

Anyway, to the point, what he had me do was SSH to the RADIUS AP.  Once there, we used this command:
   exec aaa net-join domain <string> fullname <string> server <string> username <string> password <string> [ computer-ou <string> ] [ sasl-wrapping {sign} ]
to attempt to join the domain via the AP's CLI directly.

In the end, we got back
Exec-Program output:
Exec net failed for timeout(###0xFFFFFFFF###)

Ok, not very helpful.  But even so, "timeout" is better than nothing.  So next, he had me turn on debugging by typing in
   _debug radiusd excessive
   _debug radiusd ldap-libs
There are many options just for this subgroup of debug commands:
   _debug radiusd {basic|info|excessive|verbose|samba-tools|ldap-libs|sip-lib|cmlib}
We then tried to join the domain again.

Once it timed out, we then executed
   sh log buffered | i debug
And here's where the fun comes in.  I won't bother to copy/paste the entire thing, but here's a relevant piece of the logs (sanitized, of course):
2016-09-02 17:38:11 debug   net: net: restart winbindd.
2016-09-02 17:38:11 debug   net: return code = -1
2016-09-02 17:38:11 debug   net: cli_start_connection: failed to connect to ADservername.CUS<20> ( Error NT_STATUS_ACCESS_DENIED
2016-09-02 17:38:11 debug   net: Error connecting to (Operation already in progress)
2016-09-02 17:38:11 debug   net: timeout connecting to
2016-09-02 17:38:10 debug   last message repeated 8 times
2016-09-02 17:38:02 debug   kernel: Skip bg-scan since power-save client is exist
2016-09-02 17:38:02 debug   net: Connecting to at port 139
2016-09-02 17:38:02 debug   net: timeout connecting to
2016-09-02 17:38:01 debug   last message repeated 7 times
2016-09-02 17:37:54 debug   kernel: Skip bg-scan since power-save client is exist
2016-09-02 17:37:53 debug   net: Connecting to at port 445
2016-09-02 17:37:53 debug   net: resolve_hosts: Attempting host lookup for name ADservername.CUSTOMER.tld<0x20>
2016-09-02 17:37:53 debug   net: resolve_wins: WINS server resolution selected and no WINS servers listed.
2016-09-02 17:37:53 debug   net: resolve_wins: Attempting wins lookup for name ADservername.CUSTOMER.tld<0x20>
2016-09-02 17:37:53 debug   net: resolve_lmhosts: Attempting lmhosts lookup for name ADservername.CUSTOMER.tld<0x20>
2016-09-02 17:37:53 debug   net: Connecting to host=ADservername.CUSTOMER.tld
2016-09-02 17:37:53 debug   net: netmask=
2016-09-02 17:37:53 debug   net: bcast=
2016-09-02 17:37:53 debug   net: added interface mgt0 ip=
2016-09-02 17:37:53 debug   net: netmask=ffff:ffff:ffff:ffff::
2016-09-02 17:37:53 debug   net:
2016-09-02 17:37:53 debug   net:
2016-09-02 17:37:53 debug   net: creating default valid table
2016-09-02 17:37:53 debug   net: Processing section "[global]"
2016-09-02 17:37:53 debug   net: params.c:pm_process() - Processing configuration file "/usr/local/etc/smb/lib/smb.conf"
2016-09-02 17:37:53 debug   net: Initialising global parameters
2016-09-02 17:37:53 debug   net: lp_load_ex: refreshing parameters
2016-09-02 17:37:53 debug   last message repeated 5 times

The log is listed newest-first, so you read it from the bottom up time-wise.  The key things to note, in the order they happened, are that the AP logs:

1. "Connecting to at port 445"
2. "timeout connecting to"
3. "Connecting to at port 139"
4. "timeout connecting to"
5. "Error connecting to (Operation already in progress)"
6. "cli_start_connection: failed to connect to ADservername.CUS<20> ( Error NT_STATUS_ACCESS_DENIED"
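
For anyone who would rather not read a newest-first log by eye, the reordering above can be sketched in a few lines of Python.  This is only an illustration, not an Aerohive tool: the function name `connection_events` and the keyword list are assumptions of mine, and the sample text is just a few of the sanitized lines from the log above.

```python
# A few of the sanitized log lines from above, newest first as the AP prints them.
SAMPLE = """\
2016-09-02 17:38:11 debug   net: cli_start_connection: failed to connect to ADservername.CUS<20> ( Error NT_STATUS_ACCESS_DENIED
2016-09-02 17:38:11 debug   net: Error connecting to (Operation already in progress)
2016-09-02 17:38:11 debug   net: timeout connecting to
2016-09-02 17:38:10 debug   last message repeated 8 times
2016-09-02 17:38:02 debug   kernel: Skip bg-scan since power-save client is exist
2016-09-02 17:38:02 debug   net: Connecting to at port 139
2016-09-02 17:38:02 debug   net: timeout connecting to
2016-09-02 17:38:01 debug   last message repeated 7 times
2016-09-02 17:37:54 debug   kernel: Skip bg-scan since power-save client is exist
2016-09-02 17:37:53 debug   net: Connecting to at port 445
"""

# Hypothetical filter: substrings that mark a connection attempt or failure.
KEYWORDS = ("connecting to", "failed to connect")


def connection_events(log_text):
    """Return connection-related 'net:' messages, oldest first, from a newest-first log."""
    events = []
    for line in reversed(log_text.splitlines()):
        # Everything after the " net: " tag is the message itself.
        _, sep, message = line.partition(" net: ")
        if sep and any(k in message.lower() for k in KEYWORDS):
            events.append(message.strip())
    return events


for number, event in enumerate(connection_events(SAMPLE), start=1):
    print(number, event)
```

Run against the sample, this prints the same six events as the list above, in the same order, starting with the port 445 attempt and ending with the NT_STATUS_ACCESS_DENIED line.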

Ah HA!  So when attempting to join, the AP first tries port 445 (SMB directly over TCP, which Windows has used since Windows 2000) and, failing that, falls back on port 139, the original NetBIOS session service port.

So now, just to confirm, we used one more command which I was able to get out of the tech:
   exec _test tcp-service host <ip_addr> port <number> [ timeout <number> ]
So, typing in
   exec _test tcp-service host port 389
I got immediate success.  HOWEVER, when I entered
   exec _test tcp-service host port 445
it timed out.
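
If you don't have CLI access to the AP, roughly the same check can be run from any machine sitting on the AP's VLAN.  The following is a minimal Python sketch under that assumption, not an equivalent of the Aerohive command: the names `tcp_port_open` and `check_ad_ports` are made up for illustration, and the port list is simply LDAP plus the two ports seen in the debug log.

```python
import socket

# LDAP plus the two ports the join attempt tried in the debug log.
AD_PORTS = {389: "LDAP", 445: "SMB over TCP", 139: "NetBIOS session"}


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable or unresolvable hosts.
        return False


def check_ad_ports(host, timeout=3.0):
    """Map each port in AD_PORTS to True (reachable) or False (not reachable)."""
    return {port: tcp_port_open(host, port, timeout) for port in AD_PORTS}
```

Calling `check_ad_ports("192.0.2.10")` (a placeholder address, substitute your AD server's IP) returns a dict keyed by port.  A result where 389 is True but 445 and 139 are False would match the pattern described here.  Keep in mind that a test from another host only approximates what the AP sees if that host follows the same path and ACLs.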

So now I had something to work with.  In our network, standard policy is to block all NetBIOS traffic between VLANs, making exceptions only when needed.  Now I THOUGHT that the AP VLAN had full direct access to the CUSTOMER staff VLAN, but with this data, I now knew that either

1.  We were bumping up against an ACL, or
2.  the AD server's firewall was preventing NetBIOS access

I was inclined to go with #1.  And sure enough, after carefully tracing the path, I found an ACL where there was a PERMIT statement to allow all traffic from the AP VLAN to the AD VLAN, only it was located BELOW the usual DENY statements we use to block NetBIOS.  Since the ACL is evaluated top-down and the first match wins, the DENY statements caught the traffic before the PERMIT was ever reached.  This meant that while the routing was intact and we could ping the AD server and connect to it on port 389 for LDAP queries and responses, we could not have the AP connect to either port 445 or 139.

Moving the PERMIT statement just above those DENY statements fixed everything.
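
To illustrate why the order mattered, here is a toy Python model of top-down, first-match ACL evaluation.  It is a simplified sketch, not the actual ACL: real rules also match on source and destination addresses, the `Rule` type and `evaluate` function are hypothetical, and the DENY list is reduced to the two ports seen in the debug log.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    action: str           # "permit" or "deny"
    port: Optional[int]   # None matches any port


def evaluate(acl, port):
    """Return the action of the first rule matching port; implicit deny if none match."""
    for rule in acl:
        if rule.port is None or rule.port == port:
            return rule.action
    return "deny"


# Original order: the DENY statements sit above the broad PERMIT.
broken_acl = [Rule("deny", 139), Rule("deny", 445), Rule("permit", None)]

# Order after the fix: the PERMIT for AP VLAN to AD VLAN traffic comes first.
fixed_acl = [Rule("permit", None), Rule("deny", 139), Rule("deny", 445)]

for port in (389, 445, 139):
    print(port, evaluate(broken_acl, port), evaluate(fixed_acl, port))
```

With the original order, port 389 is permitted while 445 and 139 are denied, which is the behavior the TCP tests showed.  With the PERMIT moved to the top, all three are permitted.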

So in case anyone else ever runs across an issue like this, I hope these instructions are of some help.