About that downtime…

Posted on May 15, 2008
Filed Under Incidents, Updates

Before I get started with the whys and wherefores, let me take this opportunity to personally apologize to all of you who’ve been affected by our recent downtime. And let me make it clear that the purpose of this item is not to point fingers, but to give a transparent explanation of what happened and why, and how we’re going to stop this sort of problem from happening in the future.

Background

On 12 May 2008, we initiated a request to our datacenter to upgrade the operating system on our primary server (melvin.smilingpeanut.com) to Fedora Core 8. (We had previously been running FC4.) We were told this would be routine and that backups and transfers of data would be taken care of with minimal involvement on our part. That said, we made sure to back up core data such as user files, databases and mail on our own just in case. The update was to commence on 14 May 2008 in the very early hours of the morning so as to impact our customers as little as possible.

And we’re down

Shortly after midnight on 14 May, the server was taken offline to start the upgrade procedure. We were told this would take five hours at the most. We had also instructed the datacenter techs to restore data during the upgrade so that the server would begin serving pages as soon as the upgrade was completed.

About five hours later, we were informed by the datacenter that the server was back up and running.

Later that morning, we were alerted by customers that their files were missing. Upon investigation and communication with the datacenter, we discovered that customer data had not been restored because the server hard drive had been replaced with a fresh one, and all customer data was still on the original drive. This was not what we were told would happen.

We were then forced to wait for the original drive to be physically put back into the server and mounted so that we could recover user data. To compound matters, we were further informed that the settings backup for our account management software (Plesk) was corrupt. (We did not make our own backup of this information because we were told there was no need.)

Then we waited.

Why did it take so long?

Because the Plesk settings backup was corrupt, we could not migrate user data from the original drive until that was fixed. To fix it, the datacenter once again removed the original drive (running FC4) and put it into another server to get a version of Plesk running so that the settings could be exported. Once they had a proper settings backup, they put the original drive back into the server and successfully migrated the Plesk settings to the new drive.

Sometime during this migration, Plesk (on the new drive) crashed and the MySQL database that Plesk depends on became corrupted. It took about three hours before Plesk became usable again and we could proceed with the migration.

The next several hours were spent checking, rebooting, fixing permissions and running down issues. We restored service at approximately 2:25am MST. Later in the morning, we fixed some other issues that were discovered overnight. All known issues were rectified at approximately 9:00am MST.

No data was lost during this upgrade.

The timeline

14 May
0009 Upgrade begins, server is taken offline
0516 Upgrade is completed, no data is available
0910 Receive first client notification that data is not available
1020 Receive notice from datacenter that new drive is in the server, old is not
1125 Original hard drive re-inserted and mounted on server
1147 Receive notice that Plesk backup may be corrupt
1219 Datacenter checks in - still working on Plesk backup
1312 Put up placeholder page for clients
1402 Datacenter checks in - Plesk backup is unusable
1443 Datacenter checks in - Plesk settings rescued from original drive
1554 Plesk crashes
1631 Datacenter checks in - working on Plesk crash
1731 Datacenter checks in - Plesk database is corrupt
1756 Datacenter checks in - still working on Plesk database
2045 Plesk is back
15 May
0225 All websites up, normal operations resume
0653 Alerted to issue with mailing lists
0812 Mailing lists restored
0853 Back to full capacity - all operations restored

All times MST.
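For anyone who wants to check the length of the outage against the timeline above, here is a minimal sketch of the arithmetic. The timestamps are copied from the timeline, the year comes from the date of this post, and the event names are labels made up for this example.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all MST, year 2008 per the post date).
# The dictionary keys are illustrative labels, not official names.
EVENTS = {
    "offline": datetime(2008, 5, 14, 0, 9),
    "sites_restored": datetime(2008, 5, 15, 2, 25),
    "fully_restored": datetime(2008, 5, 15, 8, 53),
}


def elapsed(start, end):
    """Return (hours, minutes) between two events named in EVENTS."""
    total_minutes = int((EVENTS[end] - EVENTS[start]).total_seconds() // 60)
    return divmod(total_minutes, 60)


def describe(start, end):
    """Format the elapsed time between two events as a short string."""
    hours, minutes = elapsed(start, end)
    return f"{start} -> {end}: {hours}h {minutes:02d}m"


if __name__ == "__main__":
    print(describe("offline", "sites_restored"))
    print(describe("offline", "fully_restored"))
```

By this count, websites were unavailable for 26 hours 16 minutes, and it took 32 hours 44 minutes from the start of the upgrade until all operations were restored.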

So what now?

As a result of this extended outage, we are taking additional measures to make sure this never happens again:

  1. Plesk user data will now be backed up on its own, separate from the Sunday night backups.
  2. We will now coordinate with the datacenter over the phone during and after any important system upgrades.
  3. We will add a second offsite backup location for both user data and Plesk settings. In the event of any extended maintenance, we will immediately switch DNS to the backup server in order to continue serving web pages.
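Since a corrupt settings backup was what held up the recovery, one way to catch that kind of problem early is to record a checksum when a backup is made and verify it before relying on the file. The sketch below is only an illustration of that idea, not Plesk's own tooling or our actual procedure. The function names and the `.sha256` sidecar file convention are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_file_for(backup_path):
    """Hypothetical convention: store the checksum in '<backup name>.sha256'."""
    backup_path = Path(backup_path)
    return backup_path.with_name(backup_path.name + ".sha256")


def write_checksum(backup_path):
    """Record the backup's checksum right after the backup is created."""
    sidecar = checksum_file_for(backup_path)
    sidecar.write_text(sha256_of(backup_path) + "\n")
    return sidecar


def backup_is_intact(backup_path):
    """Return True only if a recorded checksum exists and still matches."""
    sidecar = checksum_file_for(backup_path)
    if not sidecar.exists():
        return False
    return sidecar.read_text().strip() == sha256_of(backup_path)
```

The idea is to call `write_checksum` when the backup is taken and `backup_is_intact` before an upgrade begins. A checksum only shows that the file has not changed since it was written; it does not prove the backup can actually be restored, so a test restore is still worth doing.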

In reviewing the last two days, we realized how difficult it was to keep clients updated on what was going on while the server was down. To fix this, we’re going to make use of the Twitter service, located at:

http://twitter.com/smilingpeanut

Twitter allows you to receive these notices via SMS text message, email and web. This is a free service.

Please bookmark this address in your browser. In the event of a loss of communications, we will post notices and updates there. We may dedicate an official offsite website for client communications at a later date.

The server has been running smoothly for nearly ten hours. We will continue to aggressively monitor the server’s health and stability to ensure no other problems crop up.

In closing

On behalf of everyone at SMILING PEANUT, the folks at the datacenter and everywhere in between, I once again apologize for the problems this has caused. We remain committed to learning from this and making sure it never happens again. If you would like to further discuss this, please call us at 970.449.0844.

Thank you, as always — but particularly today — for your support of SMILING PEANUT.

Chris Lanphear
