delayed client connectivity on guest network.

  • Updated 1 year ago
OK, weird thing. Here's what we know:

  1. I have an Internet-only guest SSID running on a dedicated VLAN (on a Catalyst 2960XR stack that is barely above idle with the traffic I'm pushing through it).
  2. Clients try to connect to that SSID. Some connect normally right away, but others can't for anywhere from 2 minutes to several hours, then suddenly can for no obvious reason. It's all or nothing: once they're on, browsing is fine, the little network warning icon goes away, and data-consuming apps start working.
  3. These clients can be iOS 9 & 10, Win 7-10, MacOSX.x, and even some Androids.
  4. When these clients connect, the client monitor shows that everything is fine according to whatever AP they happen to connect to. The corporate edge firewall also shows their traffic passing as if everything were fine, so much so that I dismissed it the first time someone complained about not being able to connect. Firewall logging, Cisco IOS monitoring and the client monitor all showed normal connected behaviour.
  5. DNS, DHCP and VLAN trunking are fine. The few wired clients I've tried as a test didn't get this, and it's not even consistent for all wireless guests on this SSID. These days it's about 50/50 whether a client can connect right away or not. Once connected, the client gets full speed, but zilch before: apps that need data don't work, there's no browsing, and no outbound ICMP (although you can ping the devices _from_ the gateway device of that VLAN while this is going on).

I've been working with an Aerohive agent for a few weeks now, but I think we've exhausted the list of plausible-sounding possibilities. So step up, HN community! Please tell me someone has run into this before and has something, anything useful to say about it. I'm open to any wild and crazy hypothesis at this point. Aliens, Elvis did it, what have you, let's hear it.

Dale.

Dale Green

  • 2 Posts
  • 0 Reply Likes
  • confounded, mostly.

Posted 2 years ago


Kent

  • 5 Posts
  • 0 Reply Likes

Hi, I have had something similar, except I had a small switch/router before my firewall.

That router had a small ARP cache. When we reached its limit, new clients that connected to the network got an IP address but no internet connection, and on a Windows client we could see the small yellow warning sign.

When space in the ARP cache frees up again, the client starts working!

//Kent
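
To illustrate the failure mode Kent describes, here is a toy Python model of a fixed-size ARP cache with entry aging. It is only a sketch: the class name, the capacity, the timeout and the addresses are all made up for illustration, and real routers differ in how they age or evict entries.

```python
class BoundedArpTable:
    """Toy model of a fixed-size ARP cache with entry aging (hypothetical behaviour)."""

    def __init__(self, capacity, max_age):
        self.capacity = capacity  # maximum number of entries the device can hold
        self.max_age = max_age    # seconds before an entry ages out
        self.entries = {}         # ip -> (mac, time the entry was learned)

    def expire(self, now):
        # Drop entries that are max_age seconds old or older.
        self.entries = {
            ip: (mac, learned)
            for ip, (mac, learned) in self.entries.items()
            if now - learned < self.max_age
        }

    def learn(self, ip, mac, now):
        """Return True if the entry was stored, False if the table was full."""
        self.expire(now)
        if ip not in self.entries and len(self.entries) >= self.capacity:
            return False
        self.entries[ip] = (mac, now)
        return True


# Made-up addresses; a table of two entries with a 300 second timeout.
table = BoundedArpTable(capacity=2, max_age=300)
first = table.learn("10.0.0.11", "aa:aa:aa:aa:aa:01", now=0)
second = table.learn("10.0.0.12", "aa:aa:aa:aa:aa:02", now=10)
third_early = table.learn("10.0.0.13", "aa:aa:aa:aa:aa:03", now=20)   # table is full
third_later = table.learn("10.0.0.13", "aa:aa:aa:aa:aa:03", now=305)  # first entry has aged out
print(first, second, third_early, third_later)
```

In this model the third client has a valid address the whole time but is only "seen" once an older entry expires, which matches the symptom of a client that suddenly starts working for no visible reason.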

Dianne Dunlap

  • 75 Posts
  • 15 Reply Likes
Interesting. Does nslookup work? Depending on the switch, you could do 'clear arp' to rule that out. Also depending on the switch, you could run 'debug ip icmp' to see if it's getting the pings, and 'debug arp'. You might also want to run HiveManager debugs against the PC's MAC.

Dale Green

  • 2 Posts
  • 0 Reply Likes
Thanks Kent and Dianne for sharing your ideas. This might still be an ARP problem, but it's proving tricky and I'm thinking it may not even be an Aerohive problem. Maybe. Here's why.

I've observed that when a client is in the non-browsing state, the corporate edge firewall (a Checkpoint) also can't ping or traceroute to the client's DHCP-assigned address. A traceroute fails before it even gets as far as the main switching stack, which should be the next hop. This might be happening because the client MAC isn't successfully registering with either the Checkpoint or the switching stack, which is a Cisco Catalyst 2960. I've checked the ARP caching capacity of both of those devices and neither is anywhere near capacity. It just seems like MAC registration isn't happening on one or both of those devices, or at least not immediately as you'd expect.

Interestingly though, connecting to the staff SSID on the problem device works immediately, every time. And if we connect the client to the guest SSID right after connecting to the staff SSID, it connects with no problem. So it seems like the MAC of the client is just not known to either the Checkpoint or the Cisco, but it only seems to happen with the guest SSID.

Not sure how to go about diagnosing which device is at fault just yet, but I'll pass along any progress should I make any. So weird.

Dianne Dunlap

  • 75 Posts
  • 15 Reply Likes
You should be able to do 'debug arp', 'show arp | inc <mac>' or 'debug ip icmp' on the Cisco switch to troubleshoot. I know you're saying staff is OK but not guest. Are there more SSIDs, with only guest having the problem? If you set up a new SSID, does it have the problem?
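
One small gotcha with 'show arp | inc <mac>': Cisco IOS prints MAC addresses in lowercase dotted form (aabb.ccdd.eeff), while most clients show them with colons or dashes. A minimal Python sketch for building the filter string follows. The helper names and the sample MAC are made up for illustration.

```python
import re


def to_cisco_mac(mac):
    """Convert a MAC such as 'AA:BB:CC:DD:EE:FF' to Cisco dotted form 'aabb.ccdd.eeff'."""
    # Keep only the hex digits, whatever separator the client used.
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


def arp_filter_command(mac):
    """Build the 'show arp | inc <mac>' line for a given client MAC."""
    return f"show arp | inc {to_cisco_mac(mac)}"


# Hypothetical client MAC as a Windows machine might display it.
print(arp_filter_command("3C-22-FB-0A-1B-2C"))
# prints: show arp | inc 3c22.fb0a.1b2c
```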

Dale Green

  • 2 Posts
  • 0 Reply Likes
OK, so here's an update. I managed to narrow the problem down to the TCP handshake not completing: a SYN from the wireless client, a SYN/ACK back from the server, but never an ACK back to the server, which didn't make any sense.
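
For anyone wanting to check the same symptom from an affected client without a packet capture, a rough Python probe along these lines can tell you whether a TCP handshake completes at all. This is a sketch, not what was actually used here: the function name is made up, and it only reports success or failure from the client's point of view, so a capture is still needed to see which segment goes missing. The demo uses a throwaway listener on loopback so it runs without any network.

```python
import socket


def tcp_handshake_ok(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) is established within timeout."""
    try:
        # create_connection returns only after the three-way handshake completes.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False


# Demo against a temporary local listener (port chosen by the OS).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

open_result = tcp_handshake_ok("127.0.0.1", port)
listener.close()
closed_result = tcp_handshake_ok("127.0.0.1", port, timeout=1.0)
print(open_result, closed_result)
```

On a guest client you would point it at some outside host and port you know is reachable, and run it in a loop to see when the handshake starts succeeding.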

(We think) the problem turned out to be the VLAN scheme. I had VLANs configured on the Catalyst and on the APs, but the default route on the Catalyst pointed to 10.0.0.250, which was the inside interface of the edge firewall, and the firewall had no VLANs configured, just addressing on the physical interfaces. Also, the uplink port to the edge firewall was built as an access port on the VLAN that uses 10.0.0.0/24 addressing. Not sure how, but this addressing scheme worked without a hitch for years. It looks like the VLAN headers were being stripped on the way through to the edge firewall, even though the packets were still being processed by the rule base at the firewall according to their IP address information. How this didn't affect wired clients on VLANs other than the 10.0.0.0 VLAN, I'm not sure, but that's what was happening.

Anyway, the Aerohive APs, the core switching and the edge firewall all agree on a global VLAN configuration now and traffic seems to be moving seamlessly. As a bonus, CPU usage on the edge firewall and the core switching seems to be down measurably. So I guess the moral of the story is that a basic audit of the fundamentals probably would have uncovered this setup oddity much earlier.
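
The kind of basic audit mentioned above could be as simple as comparing the VLAN IDs each device knows about. Here is a minimal Python sketch, assuming you have already collected the VLAN IDs per device by hand or from config exports. The device names and VLAN numbers below are invented for illustration and are not the real ones from this network.

```python
def vlan_mismatches(device_vlans):
    """Map each VLAN ID to the sorted list of devices that do not define it."""
    # Union of every VLAN ID seen on any device.
    all_vlans = set().union(*device_vlans.values())
    report = {}
    for vlan in sorted(all_vlans):
        missing = sorted(
            name for name, vlans in device_vlans.items() if vlan not in vlans
        )
        if missing:
            report[vlan] = missing
    return report


# Hypothetical inventory: the APs and switch stack agree, the firewall has no VLANs.
inventory = {
    "aerohive-aps": {10, 20, 30},
    "catalyst-stack": {10, 20, 30},
    "edge-firewall": set(),
}
print(vlan_mismatches(inventory))
# prints: {10: ['edge-firewall'], 20: ['edge-firewall'], 30: ['edge-firewall']}
```

An empty result means every device agrees on the same set of VLANs, which is the state described above after the fix.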