Today at around 10:45pm CET, after a couple of glasses of red wine, we deleted the production database by accident 😨. Over 300.00 scoreboards and their associated data were vaporised in an instant.
Thankfully our database is a managed database from DigitalOcean, which means that DigitalOcean automatically do backups once a day. After 5 minutes of hand-wringing and panic, we took the website into maintenance mode and worked on restoring a backup. At around 11:15pm CET, 30 minutes after the disaster, we went back online, however 7 hours of scoreboard data was gone forever 😵.
To be precise, any scoreboards created or scores added on the 17th October 2020 between 15:47 CET and 23:21 CET have been lost. We are extremely sorry about this.
It’s tempting to blame the disaster on the couple of glasses of red wine. However, the function that wiped the database was written whilst sober. It’s a function that deletes the local database and creates all the required tables from scratch. This evening, whilst doing some late evening coding, the function connected to the production database and wiped it. Why? This is something we’re still trying to figure out.
Here is the code that caused the disaster:
def database_model_create(): """Only works on localhost to prevent catastrophe""" database = config.DevelopmentConfig.DB_DATABASE user = config.DevelopmentConfig.DB_USERNAME password = config.DevelopmentConfig.DB_PASSWORD port = config.DevelopmentConfig.DB_PORT local_db = PostgresqlDatabase(database=database, user=user, password=password, host='localhost', port=port) local_db.drop_tables([Game, Player, Round, Score, Order]) local_db.create_tables([Game, Player, Round, Score, Order]) print('Initialized the local database.')
host is hardcoded to
localhost. This means it should never connect to any machine other than the developer machine. Also: of course we use different passwords and users for development and production. We’re too tired to figure it out right now.
What have we learned? Why won’t this happen again?
We’ve learned that having a function that deletes your database is too dangerous to have lying around. The problem is, you can never really test the safety mechanisms properly, because testing it would mean pointing a gun at the production database.
We’ve learned that having a backup which allows a quick recovery is absolutely essential. Thanks DigitalOcean, for making this part reliable and simple.
We’ve learned that even a disaster can have some up-sides. This blog post generated a lot of interest. When life gives you citrus fruits, and so on.
The truth is, we can never be 100% sure that something like this won’t happen again. Computers are just too complex and there are days when the complexity gremlins win. However, we will figure out what went wrong and ensure that this particular error doesn’t happen again.
Thankfully nobody’s job is at risk due to this disaster. The founder is not going to fire the developer – because they are one and the same person.
Also, this webapp is just a side-project. It’s not the software that’s running a power-plant. Nonetheless, we have many users, some of them paying customers, and we try our very best to make them happy. Today we let those users down and that hurts.
The wonderful irony is that not 4 days earlier we tweeted a hilarious meme about deleting your production database:
Again, we are very sorry. Good night.
PS This generated some great discussion on Hackernews.