Too many open files

  • Question
  • Updated 3 years ago
  • Answered
I've tried to report this bug through the support channel (case 100622) but I'm getting nowhere. Our AP330s and BR200WPs running 6.1r6 run out of file handles approximately 42 days after boot, at which point they start logging "Too many open files" frequently. (I can determine this because all our devices send syslog to Splunk, so I have device logs going back 6 months.) The roughly 42-day period is consistent with leaking one file handle every hour until a limit of 1023 is reached.
The symptoms include: failure to conduct a configuration audit (the whole configuration is returned), failure to update configuration, failure to get tech data in HMOL, inability to connect to HiveManager after a HiveManager upgrade, etc. Basically, anything that would open or create a file fails. (Switching/routing/wifi/VPN operations are not affected.) Rebooting fixes it, but the problem returns 42 days later.

Aerohive folks: lots more data/logs/evidence is attached to the support case.

Bottom line for me: if this is fixed in 6.4r1 I will reboot the affected units and start testing the upgrade.

Are any other customers seeing this?

Fraser Hess

  • 60 Posts
  • 7 Reply Likes

Posted 3 years ago


Mike Kouri, Official Rep

  • 1030 Posts
  • 271 Reply Likes
Fraser,
I do not have direct access to case details, but I have asked for a summary from our Support organization and will look into this.

Fraser Hess

Thanks Mike. I appreciate your attention to this.

Nick Lowe, Official Rep

  • 2491 Posts
  • 451 Reply Likes
Good sleuthing!

There's an undocumented "_shell" command at the CLI that would probably help to get statistics on running processes. I suspect it drops down to BusyBox, bash et al., which would allow a little further drill-down, but I haven't yet had the time or the motivation to reverse engineer how to get into it! :P

Definitely one for a support case!

If this reproduces deterministically in 6.4r1 along the lines you describe, then with an appropriate level of escalation it should be eminently debuggable.

Fraser Hess

Tier 2 support has now looked at it and proposed updating Application Signatures to v4.0.6. Once I get to a compatible HiveManager and HiveOS version I will test it.

Nick Lowe, Official Rep

That's certainly piqued my interest: I'm curious how an updated signature set might solve a handle/reference leak that occurs with a regular cadence, as you describe it. I would have suspected a different cause.

Andrew Garcia, Official Rep

  • 368 Posts
  • 120 Reply Likes
The application signature will not fix the problem.  The file handle leak affects AVC reporting (as well as all the other functions Fraser outlined above), but it has nothing to do with the application signature file itself.

The temporary workaround is to reboot the AP while running 6.1r6.  The fix is to upgrade to 6.2r1 or higher.

I know this issue affected the AP330/350 and the AP121/141.  It did not affect the AP230.  I am not sure about other platforms.

Nick Lowe, Official Rep

Hopefully not the AP320/AP340 then.

Andrew Garcia, Official Rep

Deja vu, man.

Andrew Garcia, Official Rep

There was a file handle leak in 6.1r6 on some platforms.  It was fixed in 6.2r1 and 6.4r1.

Fraser Hess

Thanks, Andrew. As soon as I can get the affected units rebooted (tonight), I will upgrade HMOL to 6.4r1 and test the new firmware.

Andrew Garcia, Official Rep

Nice sleuthing on the exact number of days, by the way.  When I had this problem in my network, I only narrowed down the AP uptime to "a long time."

FYI, I let Mike know the relevant bug number and he is working to get support up to speed on this bug. Apologies for the runaround.

Fraser Hess

Once I figured out the count was increasing once every hour, the math was pretty simple: a limit of 1023 handles consumed at 24 per day lasts about 42.6 days, less some base number of files that are already open. Having all the logs in Splunk plus units with 41 and 44 days uptime to observe really helped.
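
For anyone who wants to rerun that arithmetic against their own units, here is a minimal Python sketch. It assumes a steady leak of one handle per hour and a limit of 1023, as described in this thread. The function names are made up for illustration, and the baseline of open files is an unknown that is back-solved from the observed uptime, not a measured value.

```python
def days_until_exhaustion(fd_limit: int, baseline_open: int,
                          leaks_per_hour: float = 1.0) -> float:
    """Estimate days of uptime before the file handle limit is reached.

    Hypothetical helper: assumes the leak rate is constant and that
    baseline_open handles are already in use when the leak starts.
    """
    if leaks_per_hour <= 0:
        raise ValueError("leaks_per_hour must be positive")
    headroom = max(fd_limit - baseline_open, 0)
    return headroom / leaks_per_hour / 24.0


def implied_baseline(fd_limit: int, observed_days: float,
                     leaks_per_hour: float = 1.0) -> int:
    """Back-solve the baseline of open handles from an observed uptime."""
    leaked = round(observed_days * 24.0 * leaks_per_hour)
    return fd_limit - leaked


if __name__ == "__main__":
    # Upper bound: nothing open at boot, one handle leaked per hour.
    print(days_until_exhaustion(1023, 0))
    # Baseline implied by failures starting at about 42 days of uptime.
    print(implied_baseline(1023, 42))
```

With no baseline the limit lasts 1023 / 24 = 42.625 days, so failures at roughly 42 days would imply on the order of 15 handles already open, if the one-per-hour assumption holds.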