Flickr’s Operations Nightmare

As an Operations and Service Delivery professional, this article at TechCrunch about Flickr’s faux pas feels like my worst nightmare come true. It seems that a Flickr employee accidentally deleted a longtime user’s account and the approximately 4,000 photos he’d posted over the years. Five years of work posting photos and building a reputation, a community, and a brand: all lost and apparently beyond recovery.

When faced with an incident like this, or when reading about one, I tend to jump into what I call my ‘Incident Management’ mode. If you’re reading this blog I’m sure you know what I mean. In a nutshell: identify the issue, resolve or reduce the impact, review, learn, and prevent it from happening again. (My apologies to all of my ITIL certified readers.)

Now I admit I don’t have all of the details, and I’m certainly not in a position to do a real post mortem of the incident. But let’s assume for the moment that the TechCrunch article is accurate: this wasn’t a question of a violation of the terms of service; it was an outright mistake by someone in operations who deleted the wrong account.

Right off the bat I can almost hear many of you saying ‘Ah ha, the problem is the tech didn’t follow proper procedure or exercise due diligence!’ In saying that, you’ve pointed your finger squarely at the operations team and thrown the problem over the wall. To you I say ‘Take off the blinders and have a closer look.’

The underlying issue (root cause) is that a design/architecture decision was made that created an operations nightmare waiting to happen. From the comments I’ve seen, this scenario has happened before, and more than likely a post mortem identified the flaw and pushed it back to the architecture/design group to address. Continual Service Improvement at its best.

Perhaps a risk assessment was completed. Maybe the scope of the required code changes was documented. Or perhaps the fix was summarily set at the lowest priority, which is unfortunately how Continual Service Improvement is all too commonly applied.

The bottom line is that, for whatever reason, a business decision was made that put the company at substantial risk. It also placed an undue burden on the operations/support group and resulted in a very embarrassed Flickr management team. Imagine the uproar if this were a transaction-based business model with the customer generating millions of dollars in revenue annually.

Even now they are scrambling and investing many man-hours to assist this paying customer in recovering his data. And if senior management is even remotely connected to the business, I’m sure they’ve spent some valuable time trying to figure out how to save face with their customers.

A Flickr staff member goes on to post in the support forums that ‘We’ve been working on the ability to restore accounts for a while and hope to have it completed early this year.’ It’s a poor public response from a notably junior staff member, and it implies a band-aid rather than a real fix such as an account suspend and hold design.
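
To make the ‘suspend and hold’ idea concrete, here is a minimal Python sketch. It is not Flickr’s code or design, which I have no visibility into; the class names, the in-memory store, and the 90-day hold period are all assumptions made purely for illustration. The point is only that ‘delete’ marks the account as suspended and keeps the data until a hold window expires, so an operator’s mistake can be undone.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Assumed retention window; a real value would be a business decision.
HOLD_PERIOD = timedelta(days=90)


@dataclass
class Account:
    """Hypothetical account record, for illustration only."""
    username: str
    photos: List[str] = field(default_factory=list)
    suspended_at: Optional[datetime] = None  # None means the account is active


class AccountStore:
    """Toy in-memory store where 'delete' means suspend and hold."""

    def __init__(self) -> None:
        self._accounts: Dict[str, Account] = {}

    def add(self, account: Account) -> None:
        self._accounts[account.username] = account

    def delete(self, username: str, now: datetime) -> None:
        # Nothing is destroyed here: the account is hidden and the clock starts.
        self._accounts[username].suspended_at = now

    def restore(self, username: str) -> None:
        # Undo a mistaken deletion while the data is still on hold.
        self._accounts[username].suspended_at = None

    def get_active(self, username: str) -> Optional[Account]:
        # Suspended accounts look deleted to the rest of the system.
        account = self._accounts.get(username)
        if account is None or account.suspended_at is not None:
            return None
        return account

    def purge_expired(self, now: datetime) -> List[str]:
        # Only this step removes data, and only after the hold period has passed.
        expired = [
            name
            for name, account in self._accounts.items()
            if account.suspended_at is not None
            and now - account.suspended_at >= HOLD_PERIOD
        ]
        for name in expired:
            del self._accounts[name]
        return expired
```

With a design along these lines, deleting the wrong account becomes a short support ticket: call `restore` before the hold period runs out and the user’s photos are back, with no recovery scramble required.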

From my ten years in operations and many more in design and coding, this is a scenario that I’ve seen all too often: an engineering mantra and mentality that it’s an ‘operations issue’. Perhaps in some way it really is an operations issue, but it’s one that could have been wholly avoided by common-sense design and architecture.

No doubt the cost of fixing the underlying issue when it was first reported by operations would have been far less than the current mad dash to put out the fire.
