[Dev] nshd lockups

Mon Sep 11 21:40:59 GMT 2017

Of the last 4 outages, one was caused by the hardware being physically
offline.  The other 3 were:

  | # | Started                 | Resolved                | Host                |
  |---|-------------------------|-------------------------|---------------------|
  | 1 | 2017-08-13 23:00:09 UTC | 2017-08-16 16:58:04 UTC | proton.parabola.nu  |
  | 2 | 2017-08-29 23:04:31 UTC | 2017-08-31 02:57:05 UTC | winston.parabola.nu |
  | 3 | 2017-09-11 18:29:35 UTC | 2017-09-11 19:36:27 UTC | winston.parabola.nu |

After resolving incident #2, I said that I would investigate it more
that Friday (so this message is a bit late).  I have concluded that
both that outage, and the other two listed, were caused by nshd
lockups, causing NSS lookups to block forever.

One of the symptoms of the nshd lockup is that ssh logins are
impossible, even on the emergency@ user.

  While responding to incident #2 I noted that it was weird that I was
  able to get an ssh login on repo@.  That succeeded because it was a
  pre-existing background connection created by librerelease
  HOOKPRERELEASE at 2017-07-28 16:33:38 UTC.  New connections are
  impossible.

This causes problems for us developers, but is generally invisible to
users (a good thing).  The flip side is that it may be a while before
the issue is noticed, which explains the long outage times.

On winston, we can use the 1984 VPS control panel to reboot it.
However, on proton, without access to the emergency@ user, we have to
get in touch with n1md4 to have it rebooted.

I believe what was happening is this: a client hangs up in the middle
of an NSS lookup with nshd (the client could be literally any process
on the system getting killed).  nshd then never completes its
response, and so never frees the lock associated with the request.
That turns into a deadlock the next time nshd receives SIGHUP to
reload: the reload routine waits forever to acquire the lock, and
while it waits it blocks incoming requests from getting it.  Boom.
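
To make that sequence concrete, here is a toy model in Python.  It is
not the nshd code; it assumes the lock behaves like a read/write lock
where a waiting writer (the reload) holds back new readers (requests),
and it uses short timeouts to stand in for "blocks forever".  The
class and function names are made up for the illustration.

```python
import threading
import time


class WriterPreferringRWLock:
    """Toy read/write lock: a waiting writer holds back new readers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer_waiting = False
        self._writer_active = False

    def acquire_read(self, timeout):
        with self._cond:
            ok = self._cond.wait_for(
                lambda: not self._writer_waiting and not self._writer_active,
                timeout,
            )
            if ok:
                self._readers += 1
            return ok

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self, timeout):
        with self._cond:
            self._writer_waiting = True
            ok = self._cond.wait_for(
                lambda: self._readers == 0 and not self._writer_active,
                timeout,
            )
            self._writer_waiting = False
            if ok:
                self._writer_active = True
            self._cond.notify_all()
            return ok


def simulate():
    lock = WriterPreferringRWLock()
    results = {}

    # 1. A request takes the lock, the client hangs up, and the handler
    #    never reaches release_read().
    results["leaked_request"] = lock.acquire_read(timeout=0.1)

    # 2. SIGHUP arrives: the reload routine wants exclusive access.
    def reload():
        results["reload"] = lock.acquire_write(timeout=1.0)

    t = threading.Thread(target=reload)
    t.start()
    time.sleep(0.2)  # give the reload time to start waiting

    # 3. A new request arrives while the reload is still waiting.
    results["new_request"] = lock.acquire_read(timeout=0.2)

    t.join()
    return results


if __name__ == "__main__":
    print(simulate())
```

In the model, the reload gives up after its timeout and the new
request gives up after its own; without timeouts both would simply
wait forever, which matches the symptoms described above.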

In response, I have released version 20170908 of nshd (the
parabola-hackers-nshd package) that I believe resolves the deadlock.
It adds per-request limits: most importantly read and write timeouts,
and also a maximum request size, which fixes a different local-user
DoS attack (https://labs.parabola.nu/issues/1068).
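
As a rough sketch of what per-request limits look like in general
(this is not the nshd implementation), the Python below reads one
request from a socket with a read timeout and a size cap.  The
newline-terminated framing, the constant values, and the names
read_request and RequestRejected are all assumptions made for the
example.

```python
import socket

MAX_REQUEST_SIZE = 4096  # bytes; hypothetical limit
READ_TIMEOUT = 0.2       # seconds; hypothetical limit


class RequestRejected(Exception):
    """Raised when a client stalls, hangs up, or sends too much."""


def read_request(conn, timeout=READ_TIMEOUT, max_size=MAX_REQUEST_SIZE):
    # The timeout applies to each blocking socket call on conn, so a
    # stalled client can no longer hold the handler indefinitely.
    conn.settimeout(timeout)
    buf = b""
    while not buf.endswith(b"\n"):
        try:
            chunk = conn.recv(1024)
        except socket.timeout:
            raise RequestRejected("read timed out")
        if not chunk:
            raise RequestRejected("client hung up")
        buf += chunk
        if len(buf) > max_size:
            raise RequestRejected("request too large")
    return buf


if __name__ == "__main__":
    client, server = socket.socketpair()
    client.sendall(b"partial request with no terminator")
    try:
        read_request(server)
    except RequestRejected as err:
        print("rejected:", err)
```

Because settimeout() covers every blocking call on the socket, the
same setting also bounds a sendall() of the response, which is the
write-timeout half.  Whatever the outcome, the handler returns (or
raises) and so gets the chance to release anything it was holding.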

Despite being released on the 8th, the new nshd version had not yet
been deployed to the servers in time to avoid incident #3 today.  In
response to #3, I have updated nshd (and everything else; -Syu) on
both servers.

Next steps:

 - Perhaps have parabola-hackers load the user information into an
   SQL database, and have a standard SQL nss/pam module talk to that;
   get our custom code out of the hot-path.
   https://labs.parabola.nu/issues/1465

 - Find a way to ensure that nshd can't block emergency@.  The user
   lookup succeeds, but when ssh'ing to emergency@, it needs to look
   up emergency's group membership, which *does* hit nshd.  Perhaps
   add a special case in the Group_ByMember handler that before even
   grabbing the lock returns if the user is "emergency".
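
The special case in the second item could look roughly like the
Python sketch below.  It is only an illustration of the idea of
returning before the lock is touched; the GroupHandler class, its
data layout, and the choice to answer with an empty group list for
"emergency" are assumptions, not how nshd is actually structured.

```python
import threading

EMERGENCY_USER = "emergency"


class GroupHandler:
    """Hypothetical stand-in for the Group_ByMember handler."""

    def __init__(self, groups):
        self._lock = threading.Lock()
        self._groups = groups  # group name -> list of member names

    def group_by_member(self, username):
        # Answer for the emergency user before touching the lock, so
        # a stuck lock cannot block an emergency login.
        if username == EMERGENCY_USER:
            return []
        with self._lock:
            return sorted(
                name
                for name, members in self._groups.items()
                if username in members
            )


if __name__ == "__main__":
    handler = GroupHandler({"hackers": ["alice"], "repo": ["alice", "bob"]})
    handler._lock.acquire()  # simulate the lock being stuck
    print(handler.group_by_member("emergency"))  # still answers
    handler._lock.release()
    print(handler.group_by_member("alice"))
```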

-- 
Happy hacking,
~ Luke Shumaker