[Dev] proton.parabola.nu outage/loss post-mortem

Wed Sep 19 21:23:48 GMT 2018

Hello all,

You're probably aware that we just had a rather long outage.

TL;DR:

 - The outage lasted in some form from 2018-08-27 through 2018-09-18
   (22 days / 3 weeks + 1 day)
 - The outage was my (Luke Shumaker's) fault
 - Jonathan "n1md4" Gower / Positive Internet will no longer be
   donating a server; leaving just winston.parabola.nu
 - All services have been migrated from proton.parabola.nu to
   winston.parabola.nu, using backups taken on 2018-08-26.
 - We are in the process of receiving the full disk image of proton,
   to potentially recover any mailing list messages or bug tracker
   changes that occurred in the >24 hours after the backup was taken.

I'd like to specifically thank Jonathan Gower for graciously hosting
us at Positive Internet all these years--I'm pretty sure since before
I was ever a Parabola user, let alone a contributor.

I'd also like to thank Bill Auger for helping to fix my mistake and
migrate services to winston.parabola.nu, when I was too lazy to.

== Part 1: Breakage (2018-08-27) ==

  On the afternoon of Monday 2018-08-27, I performed a software update
  on both servers, winston.parabola.nu and proton.parabola.nu.  In
  doing so on Proton, I made a stupid mistake, and broke the currently
  running sshd, making new logins impossible.  I did not realize this
  before logging out.

  At that time, I believed that rebooting it would fix the issue.  I
  emailed Jonathan, asking him to reboot it.

  At that point, everything user-facing was still functional.

  Because of a memory leak in Redmine, there's a job that restarts the
  bug tracker daily.  For a reason unknown to me (though I can take
  guesses), Redmine failed to start after it was stopped the next
  night.  This resulted in HTTP 502 Bad Gateway when visiting
  <https://labs.parabola.nu/>.

  In the ~2 days following being locked out of SSH, for a reason
  unknown to me, disk use on the / partition grew, and filled the
  partition.

  At some point in there, the mailing list stopped accepting new
  messages.  Because of automatic backoff and retry in mail servers,
  it's difficult to say exactly when that happened (at least until I
  get access to the logs from the disk image).

== Part 2: Downtime (2018-08-30) ==

  On Thursday 2018-08-30, at maybe 5PM EDT, Jonathan rebooted the
  server, and emailed me to say so:

    > Rebooted for you now.  Sorry, I don't have regular access to email,
    > else I would reply sooner.

  Unfortunately, it did not come back up.  When I got home at around
  11PM EDT, I saw Jonathan's email that he rebooted it, and I saw
  reports in the IRC channel that it was now entirely offline.  I
  adjusted the DNS to have the appropriate domains point at the other
  server, winston.parabola.nu, and set up an error page explaining
  that we were having an outage.

  When I updated the DNS, I shortened the TTL from 24 hours to 5
  minutes.  However, I (stupidly) did not adjust the SPF records'
  values, or their TTL.  This will come up later.

  I emailed Jonathan thanking him for rebooting it, explaining that it
  hadn't come back up, and asking for VNC access so that I could repair
  it.  Knowing that the drive is qcow2-backed, I knew that he must be
  using Qemu, so I noted in the email the appropriate flags to tack on
  to qemu; and that I'd just need the IP address of the host.

  That Monday, 2018-09-03, he replied that the host didn't have a
  publicly routable IP address, so that wouldn't work.  I replied back
  suggesting a reverse connection, where the host dials out to my
  server, rather than listening for my computer to dial in to it.

== Part 3: Migration (2018-09-06) ==

  On the evening of 2018-09-06, fed up with more than a week of
  downtime on the bug tracker, Bill Auger decided that it was time to
  begin restoring Proton's backups on to Winston.  I had been holding
  my breath for getting Proton fixed (since once Proton was back the
  migration would be "wasted" work in the long-run).

  On 2018-09-07, Bill migrated labs.parabola.nu from the backups to Winston.

  On 2018-09-09, I began contacting the maintainers of RBL lists to
  have Winston's IP removed from them, so that we would be able to
  send emails from it.

  On 2018-09-13, Bill migrated www.parabola.nu from the backups to Winston.

== Part 4: Sunsetting (2018-09-17) ==

  On Monday, 2018-09-17 (2 weeks since the last email), I received a
  reply from Jonathan:

    > Hi Luke,
    > 
    >   I think it's time to repo my hosting of repo.  I'm in a different
    > job now, and my time is more demanding, so when problems strike I'm
    > unlikely to a/ notice or b/ have the time to work on fixing it.
    > 
    > I would be happy to rsync the 251G vm-103-disk-1.qcow2 file.
    > 
    > Please let me know what you would like doing with this?
    > 
    > Thanks,
    > Jonathan

  to which I replied:

    > Hi Jonathan,
    > 
    > Thanks for responding.  I understand.  Congratulations/best wishes on
    > the new job!
    > 
    > I've set up a temporary rsync server to receive the file.
    > 
    >    host: [redacted]
    >    port: [redacted]
    >    username: [redacted]
    >    password: [redacted]
    > 
    > That is, something like:
    > 
    >     RSYNC_PASSWORD=[redacted] rsync -vz --progress vm-103-disk-1.qcow2 rsync://[redacted]/
    > 
    > It uses a port forwarding setup--I won't be surprised if the
    > connection dies prematurely; just restart it (that is the beauty of
    > rsync, after all).
    > 
    > Thanks for hosting us all these years!
    > 
    > -- 
    > Thanks and happy hacking,
    > ~ Luke Shumaker

  I began migrating over the remaining services:
   - mailing list and email services
   - redirector.parabola.nu
   - XMPP (Prosody)

  I updated the SPF records at approximately 19:00 EDT on 2018-09-17.
  In order to avoid hurting the IP's reputation, I avoided starting
  postfix until at least 2018-09-18 19:00 EDT, because of the SPF
  record's 24 hour TTL.  While migrating the mailing list, I noticed
  that postfix had already been started.  I stopped it until that
  time.
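
  As a rough illustration of the TTL arithmetic behind that wait: a
  resolver that cached the old SPF record just before the change may
  keep serving it for one full TTL.  The sketch below uses the update
  time and the 24 hour TTL stated above; the function name is made up
  for illustration, and EDT is modeled as a fixed UTC-4 offset (an
  assumption that holds for these dates).

```python
from datetime import datetime, timedelta, timezone

# Assumption: EDT modeled as a fixed UTC-4 offset, which is enough here.
EDT = timezone(timedelta(hours=-4), "EDT")


def earliest_safe_send(record_updated_at, old_ttl):
    """Return the time after which no cache should still hold the old record.

    Worst case: a resolver fetched the old record just before the update
    and keeps it for one full TTL.
    """
    return record_updated_at + old_ttl


# Values taken from the post: SPF updated ~19:00 EDT on 2018-09-17, TTL 24h.
updated = datetime(2018, 9, 17, 19, 0, tzinfo=EDT)
safe = earliest_safe_send(updated, timedelta(hours=24))
print(safe.isoformat())
```

  Had the SPF TTL been shortened to 5 minutes along with the other
  records, the same calculation would give a wait of 5 minutes
  instead of a day.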

== Part 5: Remaining steps ==

  proton.parabola.nu served as the parabola_nu node in LibreVPN.
  Before it went down, it was the only operating public LibreVPN node.
  I've since set up public nodes on winston.parabola.nu (Iceland),
  mav.lukeshu.com (Chicago, USA), and ramhost.lukeshu.com (Chicago,
  USA).  However, _all_ old LibreVPN nodes will need to be updated to
  connect to them.  As an alternative, since Proton used
  "parabola.nu:655" as its public address, we could deploy Proton's
  old Tinc host key to Winston, and have Winston become parabola_nu.

  We need to monitor email deliverability for a while.  I've verified
  that Winston's IP is no longer listed in any of the public RBLs, but
  it may still be on others (like Yahoo!'s).  I think the wiki had
  been set up to proxy emails through Proton; we need to make sure
  that all web services that send emails do so through Postfix on
  Winston.

  PostgreSQL and MariaDB are both running on the same server again.
  This makes me nervous, because they could mess with each other's disk
  access patterns, and grind system performance to a halt.
  That said, Winston's disk performance is much better than Proton's
  was.  We'll have to keep an eye on this.  If it becomes a problem,
  maybe we figure out migrating the wiki to Postgres, and stop running
  MariaDB.  Maybe we look for another server, to separate them again.

== Part 6: Lessons learned ==

 - Emergency console: Having a VNC/management console is super
   important.  In the past, Jonathan had been very responsive and
   helpful; but was still a SPOF.  And it eventually bit us.  Perhaps
   having an emergency console needs to be a requirement for any new
   servers, even if it means we turn down some donations?  (This makes
   me a hypocrite, as no one but me has VNC to beefcake.)  On the
   other hand, that SPOF only failed once in the last 4 years... which
   isn't a terrible MTBF, if we can migrate things to other servers
   faster.

 - Backups: On one hand, I'm super psyched that the backups came in
   useful, and worked as needed, IRL.  However, it could have gone
   better:

   * Deploying them was tricky, because we had to merge things with an
     existing server; things would have been easier if we were
     spinning up a new near-exact-copy of Proton.  If we'd (I'd?)
     moved all of Proton's services to Holo configuration packages
     (like I have with many of Winston's), this would have been as
     simple as installing config-parabola-service-FOO and dropping a
     folder from the backups into `/srv`.

   * We aren't doing full system backups; we didn't back up any config
     files that were in places other than /etc.  Trusting the
     backup=() arrays, on Winston those files are currently:

       /usr/bin/pinentry
       /usr/lib/avahi/service-types.db
       /usr/lib/mailman/Mailman/mm_cfg.py
       /usr/share/icons/default/index.theme
       /var/lib/krb5kdc/kdc.conf

     Fortunately, on Proton I'd had the foresight to replace mm_cfg.py
     with a symlink to a file in /etc, and I remembered that I'd done
     that (I've now written a Holo package to codify it).  Perhaps we
     should switch to more full-system backups?

   * Bill had to wait for me to decrypt the backups to begin restoring
     things, because they are encrypted to my PGP key (no one had
     suggested anything better).  I do not like being a SPOF.  We need
     to figure out a better encryption story for the backups.
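
   A list like the backup=() one above could be derived mechanically.
   The following is a minimal sketch, not the method actually used for
   this post.  It assumes pacman's local database layout, where each
   installed package has a `files` entry (under /var/lib/pacman/local/)
   containing a %BACKUP% section of tab-separated "path<TAB>checksum"
   lines, with paths relative to /.  The function name and the sample
   entry (including its checksums) are made up for illustration.

```python
def backups_outside_etc(files_db_text):
    """Return %BACKUP% paths from one `files` entry that are not under /etc."""
    paths = []
    in_backup = False
    for line in files_db_text.splitlines():
        # Section headers look like %FILES% or %BACKUP%.
        if line.startswith("%") and line.endswith("%"):
            in_backup = line == "%BACKUP%"
            continue
        if in_backup and line.strip():
            # Each entry is "relative/path<TAB>checksum"; keep only the path.
            path = "/" + line.split("\t", 1)[0]
            if not path.startswith("/etc/"):
                paths.append(path)
    return paths


# Made-up sample entry; the checksums are placeholders, not real values.
sample = "\n".join([
    "%FILES%",
    "etc/",
    "etc/example.conf",
    "usr/lib/mailman/Mailman/mm_cfg.py",
    "",
    "%BACKUP%",
    "etc/example.conf\t00000000000000000000000000000000",
    "usr/lib/mailman/Mailman/mm_cfg.py\t11111111111111111111111111111111",
    "",
])

print(backups_outside_etc(sample))
```

   Running this over every package's entry would give the set of
   declared config files that an /etc-only backup misses; it would
   still not catch config files that no package declares.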

-- 
Happy hacking,
~ Luke Shumaker