Performing shotgun surgery on a database

(This article is an account of an experience I had early on in my software development life. It is not meant to be a technical overview of how to perform the operations discussed. Some parts may be exaggerated for the sake of readability, or otherwise wholly misremembered.)

Erroring

It would have been a Friday. In the world of software, bad incidents always happen on a Friday. There I was, trying to finish up for the day, when all of a sudden my phone rang. There was an unknown number on the display, and I briefly considered ignoring it and continuing with my work. For some reason, though, I decided to answer. “Hello?” I asked once the call connected. On the other side was a dull but extremely nervous voice that I did not immediately recognize, though they clearly knew who I was. They started explaining that something was wrong with a system and that they needed my help with it. “I’m sorry, who is this?” I asked, naively hoping that they had the wrong person. As luck would have it, they did not. The person on the other end of the line was the database administrator (DBA) of a particularly annoying client, and as I grasped what had happened, it dawned on me that they really would need my help with this. “I’ll get back to you once I know more,” I said and hung up.

SQL is an amazing invention and a powerful tool in the right hands. In the wrong hands, it is also an extremely dangerous one. While working on a task that had been handed to them earlier, the DBA had attempted to fix an issue and accidentally run an UPDATE statement without a WHERE clause on the production database. In SQL, an UPDATE statement changes the records in a table, and the WHERE clause tells it which records to change. Run an UPDATE without a WHERE clause and every record in the table is overwritten with the specified value. At the time we were running MySQL, and the primary tool we used to interface directly with any of the databases on our servers was MySQL Workbench. By default, MySQL Workbench has a safety setting that prevents an UPDATE statement from running without a WHERE clause, so for this to have happened, the DBA must have turned it off.
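
To make the failure mode concrete, here is roughly what the difference looks like (the table and column names are made up for illustration, not the real schema):

    -- Intended: change one specific record
    UPDATE poll_queue
    SET target_reference = 'ABC-123'
    WHERE id = 42;

    -- What actually ran: no WHERE clause, so every row in the table
    -- ends up with target_reference = 'ABC-123'
    UPDATE poll_queue
    SET target_reference = 'ABC-123';

    -- MySQL Workbench's "Safe Updates" option corresponds to this session
    -- variable; with it set to 1, a statement like the one above is rejected
    SET SQL_SAFE_UPDATES = 1;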

My mind switched to problem-solving, but before I could solve anything I had to figure out how much damage had been done. The table the statement had been run against was actually a queue. The queue was used to poll external services for information, and the column that had been overwritten would cause the wrong information to be retrieved from those services and written out to several different tables (between three and six, depending on which service was hit). This was obviously not ideal, as it would corrupt data throughout the system. The only choice was to halt all operations, take the system offline and roll back to a point before the statement was run. Thankfully the client agreed, and we were able to halt the system and take it offline.
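
Getting a feel for the blast radius of something like this usually boils down to a few counting queries over the affected tables, something along these lines (the table names, the timestamp column and the cut-off time are all invented for this sketch):

    -- Roughly how many downstream rows were written after the bad statement ran?
    -- (assumes the tables carry some kind of modification timestamp)
    SELECT COUNT(*) AS possibly_corrupted
    FROM service_results
    WHERE updated_at >= '2016-04-01 15:30:00';

    SELECT COUNT(*) AS possibly_corrupted
    FROM service_links
    WHERE updated_at >= '2016-04-01 15:30:00';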

I was a developer and therefore not really involved in the maintenance and installation of servers, so how the backups actually worked was unknown to me at the time. Time was of the essence and I needed to figure it out posthaste. At the suggestion of a senior colleague, I reached out to our DevOps lead, who had set up almost all of the infrastructure at the company. On a Skype call, he explained to me that the database was backed up incrementally. This was done to save on storage, but in practice a restore could take a long time: we would need to restore the last full backup (which could be days or weeks old) and then wait for every increment taken after it to be applied on top. Not ideal, but we were short on time, so we agreed to begin the process. A moment of silence followed as the DevOps lead set the plan in motion. A while later, his face visibly reddening, I heard what can only be described as the worst words one can ever hear in a situation like this: “We don’t have backups.” Instead of a directory full of incremental backups, all we had was a log full of error messages stating that the files had not been written because the directory could not be accessed.
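
For context, the restore we thought we were about to run would roughly have looked like this (a sketch only: the file names are made up, and replaying the increments is normally driven from the shell with mysqlbinlog rather than from the SQL prompt):

    -- 1. Load the most recent full dump via the mysql client
    SOURCE /backups/full/prod_full.sql;

    -- 2. Replay every increment taken after that dump, stopping just before
    --    the bad statement. With binary-log based increments this happens
    --    outside SQL, for example:
    --    mysqlbinlog --stop-datetime="2016-04-01 15:29:00" binlog.000123 | mysql prod_db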

When things like this happen, there's always a sudden feeling of dread. Not only do you have to explain to the client that you messed up, you also have to think up an imperfect way of getting things back up and running. Thankfully we hadn't lost the entire database and would only need to clean up the tables that had been incorrectly written to by the queue processing. The problem was that there was no easy way to know exactly how many entries in the queue had already been processed. I did know roughly how many, and which tables would have been touched, so it was simple enough to get a rough estimate of how much damage had been done. Once I felt comfortable that I had a possible solution, I contacted the client and explained what had happened and how I intended to fix it. Surprisingly, because the mistake had originally happened on their end, they were far more understanding than I had anticipated. The servers the system ran on belonged to the client, so although we had access to manage them ourselves, the client ultimately controlled the infrastructure. This turned out to be a very good thing, because they had set up full disk backups of production, taken at the start of every day and written to the staging system, which ran on hardware independent of production. This was great news. I had them run me through how I could access the backup and then set to work.

I knew two things at this point. The first was that we could not simply take a full database backup from the disk backup and restore it on production, because a full restore would wipe out essential user data written after the backup had been taken. Anything the queue touched could be reset and reprocessed, but user-created data could not be handled this way and would need to be left in its newest state. The second was that it would be infeasible to download the entire disk backup; I would have to grab only the tables of the database that I needed to overwrite. I had already identified which tables would need to be reset, so I started to prepare a process that would pull those tables out of the backup. I discovered that MySQL stores each table as files in a folder on disk, and that I could copy these to my local machine. Moving them into the data folder of my test database, a clone of the production database running on my laptop, seemed like a good first attempt, so I overwrote the tables in my local database. MySQL Workbench pointed out that some tables had been changed and that the database would need to be rebuilt before it could be used. I executed a command to rebuild the database with the hot-swapped tables, and to my surprise, it worked.

Everything had gone well locally and I felt confident enough that the procedure would work on production, so I started prepping to take it there. At this point the production system had been in maintenance mode for about 3 or 4 hours, so nobody would have been able to make any changes to it, and the local copy I had taken earlier would still be up to date. I proceeded with running the process and restoring the tables on the production database. MySQL Workbench didn't present me with any notification that changes had been made to the database this time, but I ran the rebuild command anyway. Thankfully, after doing this, everything seemed good on production as well. Since the queue table would repopulate itself from the current state of the database, I cleared it out completely and reset any statuses that needed to be reset on the associated entities, then restarted the queue worker. It went well: the system started processing the queue and writing data out to the correct places. The entire outage lasted about 4 to 5 hours, and by now it was Friday evening. We were able to put production back into operation.
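
Concretely, the cleanup at the end amounted to something along these lines (all table, column and status names here are invented, and the exact rebuild step depends on the storage engine, so treat this as a sketch rather than a recipe):

    -- After copying the table files into place, make MySQL pick them up
    -- and sanity-check them
    FLUSH TABLES;
    CHECK TABLE service_results, service_links;

    -- Clear the queue and reset the entities it is built from, so the
    -- worker repopulates it from the restored state
    TRUNCATE TABLE poll_queue;
    UPDATE tracked_entities
    SET poll_status = 'pending'
    WHERE poll_status IN ('queued', 'processing');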

I could present the good news to the client, and they were very happy that it had all gone to plan, but of course we would need to make some changes so that something like this could never happen again. The first and most obvious change was that the DBA would need to turn the safety setting back on in their MySQL Workbench configuration. The second was that we would need to ensure that our backups actually worked as intended and that we could restore them without a lot of effort. To this day I'm not sure whether the DevOps lead ever verified this after the incident, but in the remaining time that I was at the company, we never had an event like this again. In retrospect, though, it is worth reminding yourself to check, and double-check, your backup and restoration procedure.

So yeah, that is how I performed shotgun surgery on a MySQL database in production. I'm still surprised to this day that it all went to plan.


If you enjoyed this article, consider subscribing for updates below. I'll make sure every update is worth your time.