About that downtime…
Posted on May 15, 2008
Filed Under Incidents, Updates
Before I get started with the whys and wherefores, let me take this opportunity to personally apologize to all of you who’ve been affected by our recent downtime. And let me make it clear that the purpose of this item is not to point fingers, but to give a transparent explanation of what happened and why, and how we’re going to stop this sort of problem from happening in the future.
Background
On 12 May 2008, we initiated a request to our datacenter to upgrade the operating system on our primary server (melvin.smilingpeanut.com) to Fedora Core 8. (We had previously been running FC4.) We were told this would be routine and that backups and transfers of data would be taken care of with minimal involvement on our part. That said, we made sure to back up core data such as user files, databases and mail on our own just in case. The update was to commence on 14 May 2008 in the very early hours of the morning so as to impact our customers as little as possible.
And we’re down
Shortly after midnight on 14 May, the server was taken offline to start the upgrade procedure. We were told this would take five hours at the most. We had also instructed the datacenter techs to restore data in parallel with the upgrade so that the server would begin serving pages as soon as the upgrade was completed.
About five hours later, we were informed by the datacenter that the server was back up and running.
Later that morning, we were alerted by customers that their files were missing. Upon investigation and communication with the datacenter, we discovered that customer data had not been restored because the server hard drive had been replaced with a fresh one, and all customer data was still on the original drive. This was not what we were told would happen.
We were then forced to wait for the original drive to be physically put back into the server and mounted so that we could recover user data. To compound matters, we were further informed that the settings backup for our account management software (Plesk) was corrupt. (We did not make our own backup of this information because we were told there was no need.)
Then we waited.
Why did it take so long?
Because the Plesk settings backup was corrupt, we could not migrate user data from the original drive until that was fixed. To fix it, the datacenter once again removed the original drive (running FC4) and put it into another server to get a version of Plesk running so that the settings could be exported. Once they had a usable settings backup, they put the original drive back into the server and successfully migrated the Plesk settings to the new drive.
Sometime during this migration, Plesk (on the new drive) crashed and the MySQL database that Plesk depends on became corrupted. It took about three hours before Plesk became usable again and we could proceed with the migration.
The next several hours became a process of checking, rebooting, fixing permissions and running down issues. We restored service at approximately 2:25am MST. Later in the morning, we ran down some other issues that were discovered overnight. All known issues were rectified at approximately 9:00am MST.
No data was lost during this upgrade.
The timeline
| 14 May | |
| 0009 | Upgrade begins, server is taken offline |
| 0516 | Upgrade is completed, no data is available |
| 0910 | Receive first client notification that data is not available |
| 1020 | Receive notice from datacenter that new drive is in the server, old is not |
| 1125 | Original hard drive re-inserted and mounted on server |
| 1147 | Receive notice that Plesk backup may be corrupt |
| 1219 | Datacenter checks in - still working on Plesk backup |
| 1312 | Put up placeholder page for clients |
| 1402 | Datacenter checks in - Plesk backup is unusable |
| 1443 | Datacenter checks in - Plesk settings rescued from original drive |
| 1554 | Plesk crashes |
| 1631 | Datacenter checks in - working on Plesk crash |
| 1731 | Datacenter checks in - Plesk database is corrupt |
| 1756 | Datacenter checks in - still working on Plesk database |
| 2045 | Plesk is back |
| 15 May | |
| 0225 | All websites up, normal operations resume |
| 0653 | Alerted to issue with mailing lists |
| 0812 | Mailing lists restored |
| 0853 | Back to full capacity - all operations restored |
All times MST.
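For readers who want the elapsed times rather than clock times, here is a minimal Python sketch that measures a few of the milestones above from the moment the server went offline. The timestamps are copied from the timeline. The function name and the choice of which rows to include are ours, for illustration only.

```python
from datetime import datetime

# Selected milestones copied from the timeline above (all times MST).
MILESTONES = [
    ("2008-05-14 00:09", "Upgrade begins, server is taken offline"),
    ("2008-05-14 05:16", "Upgrade is completed, no data is available"),
    ("2008-05-14 11:25", "Original hard drive re-inserted and mounted"),
    ("2008-05-14 20:45", "Plesk is back"),
    ("2008-05-15 02:25", "All websites up, normal operations resume"),
    ("2008-05-15 08:53", "Back to full capacity"),
]


def elapsed_since_start(milestones):
    """Return (label, 'HhMMm') pairs measured from the first milestone."""
    fmt = "%Y-%m-%d %H:%M"
    start = datetime.strptime(milestones[0][0], fmt)
    rows = []
    for stamp, label in milestones:
        delta = datetime.strptime(stamp, fmt) - start
        minutes = int(delta.total_seconds() // 60)
        rows.append((label, f"{minutes // 60}h{minutes % 60:02d}m"))
    return rows


if __name__ == "__main__":
    for label, elapsed in elapsed_since_start(MILESTONES):
        print(f"{elapsed:>7}  {label}")
```

By this measure, websites were back 26 hours and 16 minutes after the server went offline, and all operations were restored after 32 hours and 44 minutes.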
So what now?
As a result of this extended outage, we are taking additional measures to make sure this never happens again:
- Plesk user data will now be backed up on its own, separate from the Sunday night backups.
- We will now coordinate with the datacenter over the phone during and after any important system upgrades.
- We will add a second offsite backup location of both user data and Plesk settings. In the event of any extended maintenance, we will immediately switch DNS to the backup server in order to continue serving web pages.
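Since a corrupt settings backup was the biggest single cause of delay, one thing these measures suggest is checking that a backup is readable before it is ever needed. The sketch below is only an illustration of that idea, not our actual tooling: it assumes the backup is an ordinary tar archive (optionally compressed) and that a SHA-256 checksum was recorded when the backup was made. The function names and the idea of a stored checksum are assumptions, and nothing here relies on Plesk's own backup format.

```python
import hashlib
import tarfile
import zlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_usable(archive_path, expected_sha256):
    """Hypothetical check: True only if the archive matches the checksum
    recorded at backup time and every file in it can be read to the end."""
    if sha256_of(archive_path) != expected_sha256:
        return False
    try:
        with tarfile.open(archive_path, "r:*") as archive:
            for member in archive:
                if member.isfile():
                    extracted = archive.extractfile(member)
                    # Read the whole member so truncation or corruption
                    # inside the archive surfaces as an exception here.
                    while extracted.read(1 << 20):
                        pass
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        return False
    return True
```

Running a check like this right after each backup, and again on the offsite copy, would flag an unusable archive long before an upgrade depends on it.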
In reviewing the last two days, we realized how difficult it was to keep clients updated on what was going on during an outage like this. To fix this, we’re going to make use of the Twitter service, located at:
http://twitter.com/smilingpeanut
Twitter allows you to receive these notices via SMS text message, email and web. This is a free service.
Please bookmark this address in your browser. In the event of a loss of communications, we will post notices and updates there. We may dedicate an official offsite website for client communications at a later date.
The server has been running smoothly for nearly ten hours. We will continue to aggressively monitor the server’s health and stability to ensure no other problems crop up.
In closing
On behalf of everyone at SMILING PEANUT, the folks at the datacenter and everywhere in between, I once again apologize for the problems this has caused. We remain committed to learning from this and making sure it never happens again. If you would like to further discuss this, please call us at 970.449.0844.
Thank you, as always — but particularly today — for your support of SMILING PEANUT.
Chris Lanphear