UPS and down times

By dougk, 5 June, 2019

If you were trying to access this website last Monday afternoon or evening (3 June), you might have thought it no longer existed. And for some six and a half hours, it didn't. This is why:

The Cove website is hosted at a data centre in the Melbourne CBD. Just before 3pm on Monday, several buildings in that part of Melbourne suffered a power outage. Blackouts do happen from time to time, and data centres are well prepared for them. So well prepared that we, as customers, usually don't even notice. Uninterruptible Power Supplies (UPS) kick in immediately, and service continues as normal until either the mains power comes back on or the diesel generators leap into life a minute or two later. Our information is further protected by duplicate copies held on parallel storage systems.

But last Monday, Murphy's Law came into play. First, those normally reliable UPS failed. Ouch. The large bank of servers and the two big storage systems shut down with a thud. That cut off some transactions in mid-stream, and some files were left half written.

After the diesel generator powered up, a complete restart of servers and services began. After any crash, the storage system needs checking and repairing. For the massive amount of information stored in a data centre, that is a lengthy but necessary process.

When the mains came back on, it automatically replaced the diesel generators as the source of power, just as the system was designed to do. The data repair process was by then in full flight. All was going well. That is, until soon after 4pm.

That was when the mains power failed a second time. Another blackout. With the UPS still faulty, that meant a second hard crash. That is not good. Quite bad, in fact. Particularly for the storage system, which was still trying to fix the errors created during the first crash. The automated repair process could no longer tell what information was still valid and what had been corrupted. Experts were called in. Detailed analysis of log files was required. A long, difficult rebuild process began.

Sometime between 9.30pm and 10pm, our website was back in operation. An excellent effort under the circumstances. Some services took much longer to restore. Between the data centre, our host provider and all their clients, including us, that adds up to a lot of very relieved people.

Can a similar outage be prevented from happening again? Processes and practices are certainly being reviewed. But there is a snag: this outage was caused by the very UPS systems that were installed to prevent outages.

Adding multiple UPS systems, diesel generators, servers, parallel storage systems and network paths sounds great, and mostly it is. But it also adds complexity, and with that complexity come ever more potential points of failure. More things to go wrong. And if something can go wrong, at some point it probably will.
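To put some rough numbers on that idea (purely illustrative figures, not the data centre's actual statistics): if each of ten independent pieces of equipment has a 1% chance of failing in a given year, the chance that at least one of them fails is about 1 - 0.99^10, or close to 10%. With fifty such pieces, it rises to roughly 40%. The redundancy is still well worth having, since any single failure is far less likely to bring everything down, but every extra component is one more thing that can misbehave.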
