[Dev] proton.parabola.nu outage/loss post-mortem
lukeshu at lukeshu.com
Wed Sep 19 21:23:48 GMT 2018
You're probably aware that we just had a rather long outage.
- The outage lasted in some form from 2018-08-27 through 2018-09-18
(22 days / 3 weeks + 1 day)
- The outage was my (Luke Shumaker's) fault
- Jonathan "n1md4" Gower / Positive Internet will no longer be
donating a server, leaving just winston.parabola.nu
- All services have been migrated from proton.parabola.nu to
winston.parabola.nu, using backups taken on 2018-08-26.
- We are in the process of receiving the full disk image of proton,
to potentially recover any mailing list messages or bug tracker
changes that occurred in the >24 hours after the backup was taken.
I'd like to specifically thank Jonathan Gower for graciously hosting
us at Positive Internet all these years--I'm pretty sure since before
I was ever a Parabola user, let alone a contributor.
I'd also like to thank Bill Auger for helping to fix my mistake and
migrate services to winston.parabola.nu, when I was too lazy to.
== Part 1: Breakage (2018-08-27) ==
On the afternoon of Monday 2018-08-27, I performed a software update
on both servers, winston.parabola.nu and proton.parabola.nu. In
doing so on Proton, I made a stupid mistake, and broke the currently
running sshd, making new logins impossible. I did not realize this
before logging out.
At the time, I believed that rebooting it would fix the issue. I
emailed Jonathan, asking him to reboot it.
At that point, everything user-facing was still functional.
Because of a memory leak in Redmine, there's a job that restarts the
bug tracker daily. For a reason unknown to me (though I can take
guesses), Redmine failed to start after it was stopped the next
night. This resulted in HTTP 503 Bad Gateway when visiting
In the ~2 days following being locked out of SSH, for a reason
unknown to me, disk use on the / partition grew, and filled the
At some point in there, the mailing list stopped accepting new
messages. Because of automatic backoff and retry in mail servers,
it's difficult to say exactly when that happened (at least until I
get access to the logs from the disk image).
== Part 2: Downtime (2018-08-30) ==
On Thursday 2018-08-30, at maybe 5PM EDT, Jonathan rebooted the
server, and emailed me to say so:
> Rebooted for you now. Sorry, I don't have regular access to email,
> else I would reply sooner.
Unfortunately, it did not come back up. When I got home at around
11PM EDT, I saw Jonathan's email that he rebooted it, and I saw
reports in the IRC channel that it was now entirely offline. I
adjusted the DNS to have the appropriate domains point at the other
server, winston.parabola.nu, and set up an error page explaining that we
were having an outage.
When I updated the DNS, I shortened the TTL from 24 hours to 5
minutes. However, I (stupidly) did not adjust the SPF records'
values, or their TTL. This will come up later.
I emailed Jonathan, thanking him for rebooting it, explaining that it
hadn't come back up, and asking for VNC access so that I could repair
it. Knowing that the drive is qcow2-backed, I knew that he must be
using Qemu, so I noted in the email the appropriate flags to tack on
to qemu, and that I'd just need the IP address of the host.
That Monday, 2018-09-03, he replied that the host didn't have a
publicly routable IP address, so that wouldn't work. I replied back
suggesting a reverse connection, where the host dialed out to my
server, rather than listening for my computer to dial to him.
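
For readers unfamiliar with the two VNC modes discussed above, here is
a rough sketch of how the extra Qemu arguments differ. These are not
the flags from the actual email (which is not reproduced here); the
helper name vnc_flags and the addresses are made up for illustration,
and the exact option spelling should be checked against the
documentation of the installed Qemu version.

```python
from typing import List


def vnc_flags(host: str, port: int, reverse: bool) -> List[str]:
    """Return extra Qemu arguments for a VNC console (illustrative only)."""
    if reverse:
        # Reverse mode: Qemu dials out to host:port, where a VNC viewer
        # is already listening. Here the number after ':' is a TCP port.
        return ["-vnc", f"{host}:{port},reverse=on"]
    # Listening mode: the number after ':' is a display number, and Qemu
    # listens on TCP port 5900 + display on the given host address.
    if port < 5900:
        raise ValueError("listening VNC ports start at 5900")
    return ["-vnc", f"{host}:{port - 5900}"]


# 203.0.113.7 and 192.0.2.10 are documentation-only example addresses.
print(vnc_flags("203.0.113.7", 5901, reverse=False))  # ['-vnc', '203.0.113.7:1']
print(vnc_flags("192.0.2.10", 5500, reverse=True))    # ['-vnc', '192.0.2.10:5500,reverse=on']
```

The listening form only helps if the VM host is reachable from
outside; the reverse form only needs the VM host to be able to make an
outbound connection, which is why it suits a host without a publicly
routable address.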
== Part 3: Migration (2018-09-06) ==
On the evening of 2018-09-06, fed up with more than a week of
downtime on the bug tracker, Bill Auger decided that it was time to
begin restoring Proton's backups on to Winston. I had been holding
my breath for getting Proton fixed (since once Proton was back the
migration would be "wasted" work in the long-run).
On 2018-09-07, Bill migrated labs.parabola.nu from the backups to Winston.
On 2018-09-09, I began contacting the maintainers of RBL lists to
have Winston's IP removed from them, so that we would be able to
send emails from it.
On 2018-09-13, Bill migrated www.parabola.nu from the backups to Winston.
== Part 4: Sunsetting (2018-09-17) ==
On Monday, 2018-09-17 (2 weeks since the last email), I received a
reply from Jonathan:
> Hi Luke,
> I think it's time to repo my hosting of repo. I'm in a different
> job now, and my time is more demanding, so when problems strike I'm
> unlikely to a/ notice or b/ have the time to work on fixing it.
> I would be happy to rsync the 251G vm-103-disk-1.qcow2 file.
> Please let me know what you would like doing with this?
to which I replied:
> Hi Jonathan,
> Thanks for responding. I understand. Congratulations/best wishes on
> the new job!
> I've set up a temporary rsync server to receive the file.
> host: [redacted]
> port: [redacted]
> username: [redacted]
> password: [redacted]
> That is, something like:
> RSYNC_PASSWORD=[redacted] rsync -vz --progress vm-103-disk-1.qcow2 rsync://[redacted]/
> It uses a port forwarding setup--I won't be surprised if the
> connection dies prematurely; just restart it (that is the beauty of
> rsync, after all).
> Thanks for hosting us all these years!
> Thanks and happy hacking,
> ~ Luke Shumaker
I began migrating over the remaining services:
- mailing list and email services
- XMPP (Prosody)
I updated the SPF records at approximately 19:00 EDT on 2018-09-17.
In order to avoid hurting the IP's reputation, I avoided starting
postfix until at least 2018-09-18 19:00 EDT, because of the SPF
record's 24 hour TTL. While migrating the mailing list, I noticed
that postfix had already been started. I stopped it until that
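
The waiting period above is simply the time of the SPF update plus the
old record's TTL: until then, a resolver that cached the old record
just before the change may still be serving it. A minimal sketch of
that arithmetic, assuming a fixed UTC-4 offset for EDT and a
hypothetical helper name earliest_safe_start:

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-4 offset standing in for EDT (assumption: no DST change
# occurs inside the window being computed).
EDT = timezone(timedelta(hours=-4), "EDT")


def earliest_safe_start(record_updated_at: datetime, old_ttl: timedelta) -> datetime:
    """Earliest time at which TTL-honoring caches can no longer hold the old record."""
    return record_updated_at + old_ttl


spf_updated = datetime(2018, 9, 17, 19, 0, tzinfo=EDT)

# With the SPF record's original 24 hour TTL:
print(earliest_safe_start(spf_updated, timedelta(hours=24)).isoformat())
# 2018-09-18T19:00:00-04:00

# Had the SPF TTL been shortened to 5 minutes beforehand, like the
# other records were:
print(earliest_safe_start(spf_updated, timedelta(minutes=5)).isoformat())
# 2018-09-17T19:05:00-04:00
```

This only bounds caches that honor the TTL; it says nothing about
resolvers that ignore it.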
== Part 5: Remaining steps ==
proton.parabola.nu served as the parabola_nu node in LibreVPN.
Before it went down, it was the only operating public LibreVPN node.
I've since set up public nodes on winston.parabola.nu (Iceland),
mav.lukeshu.com (Chicago, USA), and ramhost.lukeshu.com (Chicago,
USA). However, _all_ old LibreVPN nodes will need to be updated to
connect to them. As an alternative, since Proton used
"parabola.nu:655" as its public address, we could deploy Proton's
old Tinc host key to Winston, and have Winston become parabola_nu.
We need to monitor email deliverability for a while. I've verified
that Winston's IP is no longer listed in any of the public RBLs, but
it may still be on others (like Yahoo!'s). I think the wiki had
been set up to proxy emails through Proton; we need to make sure
that all web services that send emails do so through Postfix on
PostgreSQL and MariaDB are both running on the same server again.
This makes me nervous, because they could mess with each other's disk
access patterns, and grind system performance to a halt.
That said, Winston's disk performance is much better than Proton's
was. We'll have to keep an eye on this. If it becomes a problem,
maybe we figure out migrating the wiki to Postgres, and stop running
MariaDB. Maybe we look for another server, to separate them again.
== Part 6: Lessons learned ==
- Emergency console: Having a VNC/management console is super
important. In the past, Jonathan had been very responsive and
helpful; but was still a SPOF. And it eventually bit us. Perhaps
having an emergency console needs to be a requirement for any new
servers, even if it means we turn down some donations? (This makes
me a hypocrite, as no one but me has VNC to beefcake.) On the
other hand, that SPOF only failed once in the last 4 years... which
isn't a terrible MTBF, if we can migrate things to other servers
- Backups: On one hand, I'm super psyched that the backups came in
useful, and worked as needed, IRL. However, it could have gone
* Deploying them was tricky, because we had to merge things with an
existing server; things would have been easier if we were
spinning up a new near-exact-copy of Proton. If we'd (I'd?)
moved all of Proton's services to Holo configuration packages
(like I have with many of Winston's), this would have been as
simple as installing config-parabola-service-FOO and dropping a
folder from the backups into `/srv`.
* We aren't doing full system backups; we didn't back up any config
files that were in places other than /etc. Trusting the
backup=() arrays, on Winston that's currently:
Fortunately, on Proton I'd had the foresight to replace mm_cfg.py
with a symlink to a file in /etc, and I remembered that I'd done
that (I've now written a Holo package to codify it). Perhaps we
should switch to more full-system backups?
* Bill had to wait for me to decrypt the backups before he could begin
restoring things, because they are encrypted to my PGP key (no one
had suggested anything better). I do not like being a SPOF. We need
to figure out a better encryption story for the backups.
~ Luke Shumaker