Small Business Server Disaster Recovery: A Painful Lesson
A few weeks ago we faced a true disaster recovery scenario. The SBS server (the site's only server) at one of our clients had developed a RAID problem: several drives showed as failed or in error, and we had contacted the manufacturer for replacement parts. Unfortunately, over the following weekend a third drive failed, and the server would no longer boot because the RAID configuration was lost. After talking with Dell to see whether any of the configuration could be rescued, we came to the realization that it could not be saved and we would have to rebuild. No problem, right? Wrong.
To further complicate the issue, the backup media had not been rotated as it should have been, and the backup job had died right in the middle of the previous week's full backup. At this point we had a server with half its drives failed, no RAID configuration, and a choice between a full backup that was partially overwritten and a backup from the last time the drives had been rotated, about three months earlier. Not an ideal situation. Fortunately, the replacement drives we had ordered from Dell when we saw the problem coming arrived that afternoon, so that was a step in the right direction. That left us with three options:
- Restore the server from the partial backup and see what (if anything) was missing
- Restore the server back 3 months and attempt to restore missing data from the partial backup
- Rebuild the server completely and restore what data we could, but start fresh with what we couldn’t
We decided that one of the worst things we could do was restore very old data and attempt to bring it up to date; there seemed to be a million and one problems waiting for us there, such as outdated documents that couldn't be found in the backups, along with all the other unknowns. Rebuilding the server completely (although we had to do that to some extent anyway to restore the backups) also did not seem like the best option, since it would mean a complete rebuild of Active Directory, rejoining all the systems to the domain, rebuilding policies, and so on. So needless to say, we had backups, they looked to contain a lot of data, so let's try door number one!
We rebuilt the server to a bare-bones SBS installation, installed Backup Exec, and began attempting to catalog the data. This is where it got very interesting. After several hours of trying, we found that the only way to build a decent set of data in the Restore tab of Backup Exec was to catalog the media in order from oldest to newest, excluding anything that was not part of the backup set that was running when the server crashed. Once the backups were cataloged and what looked like all the data we needed was there, we booted into Directory Services Restore Mode, sorted through some of the annoying authentication issues that are well documented on Symantec's site, and restored Active Directory and the data. Then came the Exchange and SQL databases, and everything looked like it might be on the right track. Unfortunately, we found out that was not the case. After verifying that Active Directory was working and we could see the user and computer accounts and the Exchange console, we put the server back in place and confirmed that workstations could authenticate to it and shared drives were working again. It was promising, but then we hit a major snag. Well, two of them. The first was that some of the shared data was missing, a byproduct of our partial backup; the second (and major) snag was that Outlook wouldn't connect to Exchange.
Taking a closer look, it was obvious that the entire Mailbox Store was the problem: it was coming in at a total size of around 1 MB, and Exchange wouldn't allow us to mount it. After looking back through the backups again (and again), re-cataloging, and looking once more, we could not find anything that looked any better than what we already had. So I had a brilliant idea: uninstall Exchange, reinstall it, export all of the data from each Outlook client's cached OST file to a PST, and then import it back into Exchange. That should work like a charm. However, we had begun to notice a pattern: every road we chose was uphill the whole way. When I uninstalled Exchange, all of our AD user accounts went with it…
Back to Step 1
This got me to rethink our entire approach. We could not trust the backups (they had not been rotated properly, despite our instructions), the server had died partway into a backup that was overwriting the last one, and even when things seemed to be falling back into place, data was still missing. What else might go wrong down the road? With a bad case of Murphy's Law hanging over this endeavor, we decided that a clean build would be the way to go and developed a new plan of attack:
- Rebuild the server fresh
- Configure Active Directory (DNS, DHCP, Users, Policies, etc.)
- Configure Exchange Server
- Export all cached Outlook data from each client PC and import to the server
- Catalog and restore JUST DATA from the backups
- Reconfigure all workstations on the new domain
- Remap drives, printers, etc.
- Resolve other lingering issues that result
Once we decided to go down this road, the sailing was a little smoother, although, as with everything else we did, there were a few obstacles to climb. Some users logged out of their computers mid-transition and then couldn't log back in, since the computers weren't on the newly rebuilt domain yet. Because the workstations were all Windows 7 machines, I had to boot each one to a password recovery disk and enable the built-in Administrator account (clearing its password where applicable) so I could remove each station from the old domain and add it back to the new one. Finally, after getting all the computers back on the domain, the Exchange Shell commands to import the exported PSTs weren't working, so I had to import them manually on each machine. Despite these complications, we were moving in the right direction!
All in all, the entire process took nearly a week with all of the problems we ran into. Thankfully the impact on the client was mostly limited to email availability (critical, but they were OK for a few days) and documents/data stored on the server. We had also done a few things right well ahead of time that made this a lot less painful for the users: Cached Exchange Mode and Redirected Folders (with offline caching) let them keep all of their email and documents with the server down. They couldn't access shared drive items or new emails, but they still had something, and it gave us a way to restore email with Exchange gone.
The Short-Short Version
Ok, so because I like to do the short-short version, here are some takeaways:
- Always make sure you, your IT person, or your clients are swapping their backup drives/tapes regularly!
- Test your backups. If you aren't backing something up, you may not know until you try to restore something that isn't there.
- Ensure backups, even if rotated properly, overwrite media only when necessary and not by default.
- If it makes sense in your environment, enable Cached Exchange Mode and Redirected Folders (with offline caching)
- Test your backups. (yes, I know it is here twice).
- In the event you do find yourself in a true Disaster Recovery scenario, test each and every bit of the server (Exchange, AD, Domain Trusts, etc.) prior to putting it back into production
- If a RAID drive fails once, and re-seating it doesn’t solve the problem, replace it ASAP because you don’t know if another drive might go while you wait.
- Consider an online backup solution as a replacement for, or in addition to, your local solution. This will help ensure that something as simple as rotation of media doesn't hinder your business's ability to come back after a disaster.
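The "test your backups" advice above lends itself to a simple spot check: restore a sample of files somewhere and compare them byte-for-byte against the live copies. Below is a minimal sketch of that idea; it simulates the check with made-up temp directories and file names (in practice you would point `src` at the live share and `restored` at a test restore):

```shell
#!/bin/sh
# Hypothetical backup spot-check: compare a test restore against the
# live source and flag anything missing or different. The directories
# and files below are stand-ins created just for this demonstration.
src=$(mktemp -d)          # stands in for the live share
restored=$(mktemp -d)     # stands in for a test restore location

# Simulate a share with two files, only one of which made the backup.
echo "budget 2012" > "$src/budget.xls"
echo "minutes"     > "$src/minutes.doc"
cp "$src/budget.xls" "$restored/budget.xls"

ok=0; missing=0; differs=0
for f in "$src"/*; do
  name=$(basename "$f")
  if [ ! -f "$restored/$name" ]; then
    echo "MISSING: $name"            # never backed up / not restored
    missing=$((missing + 1))
  elif ! cmp -s "$f" "$restored/$name"; then
    echo "DIFFERS: $name"            # restored but stale or corrupt
    differs=$((differs + 1))
  else
    echo "OK: $name"
    ok=$((ok + 1))
  fi
done
echo "ok=$ok missing=$missing differs=$differs"

rm -rf "$src" "$restored"
```

Run on the simulated data above, this reports `budget.xls` as OK and `minutes.doc` as missing, which is exactly the kind of gap we only discovered mid-disaster.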
Like always, thanks for reading!