Catastrophic Data Storage Failure in Australia – Could You Be Next?

“…this is the first time this problem has been encountered anywhere in the world,” said acting CIO Steve Hamilton of the Australian Taxation Office (ATO). Except that maybe it has happened before. It’s a reasonable assumption that he was merely repeating a line that was fed to him by somebody he believed. I wonder who that could be?

The ATO transferred its data storage capability from end-of-life EMC/HPE equipment to a new HPE 3PAR SAN (Storage Area Network) ‘as-a-service’ model in November 2015. It failed “catastrophically” just over a year later. The ATO lost 1 petabyte of data because the automatic failover to the second SAN did not come online: corrupted storage blocks on the main SAN had been faithfully copied across to it.

The ATO does have another backup source, so the data loss is neither total nor permanent. Even so, the outage knocked out a large portion of a national government department’s operations for two days. There will undoubtedly be financial repercussions for HPE, and it is another major blow to the Australian government’s reputation for technical capability, coming so soon after the recent census disaster.

Who are HPE and what is 3PAR?

Hewlett Packard Enterprise invites large enterprises to outsource: “We deliver high-quality, high-value products, consulting, and support services in a single package. That’s one of our principal differentiators.” HP (as it then was, before the HPE split) and Dell fought a bidding war to acquire storage systems supplier 3PAR in 2010; HP won, paying $2.35 billion. 3PAR SAN systems combine solid-state flash storage with virtualization and cloud resources, promising faster processing speeds.

What went wrong?

The replication design seemingly allowed undetected corrupted storage blocks to be duplicated to the second SAN, which may indicate a lack of data integrity checking. The root cause analysis of high-profile incidents like this is rarely made public. The culprit could be a defective firmware upgrade or simple human error. We will probably never know. The symptom has surfaced previously with 3PAR SAN solutions, such as an incident two months earlier. Anecdotal evidence seems to indicate other similar occurrences, but IT failures at regular commercial enterprises rarely make it into the headlines.
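To make that concrete, here is a minimal, hypothetical sketch of the kind of integrity check that appears to have been missing: verifying each block’s checksum against the value recorded at write time before mirroring it to the secondary store. This is illustration only, not HPE’s actual replication logic; the data structures and function names are my own assumptions.

```python
import hashlib

def checksum(block: bytes) -> str:
    """Return a SHA-256 digest of a storage block."""
    return hashlib.sha256(block).hexdigest()

def replicate(primary: dict, secondary: dict, expected: dict) -> list:
    """Mirror blocks from primary to secondary only if each block still
    matches the checksum recorded when it was written.
    Returns the IDs of blocks that failed verification."""
    corrupted = []
    for block_id, block in primary.items():
        if checksum(block) != expected[block_id]:
            corrupted.append(block_id)   # flag it, do not propagate corruption
            continue
        secondary[block_id] = block      # safe to mirror
    return corrupted

# Toy usage: one block is silently corrupted after its checksum was recorded.
primary = {1: b"tax return data", 2: b"payroll data"}
expected = {bid: checksum(blk) for bid, blk in primary.items()}
primary[2] = b"payroll dXta"             # simulate silent corruption
secondary = {}
print(replicate(primary, secondary, expected))   # -> [2]
print(sorted(secondary))                         # -> [1]
```

With a check like this, a corrupted block is quarantined rather than faithfully copied to the second SAN.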

What could have been done to mitigate the extent of the impact?

From what we know, the design revolved around a single data domain. That appears to represent a single point of failure no matter how much redundancy is built in.

Who is the usual victim of incidents like this?

Large enterprises up to government level are key clients for vendors such as HPE. Outsourcing deals of this nature are big-budget projects. To put it in perspective, the ATO installation was reportedly a $92 million component of a much larger (AUD 1.29 billion) IT investment. Finger-pointing inevitably occurs, but the client typically puts its faith in the perceived capability and reliability of the vendor for the technical design and support of a fit-for-purpose delivery.

How can this scenario be avoided?

It all boils down to the robustness of the design and whether or not the client is willing to spend sufficient budget for the safest possible option. That is not to criticize the ATO. The optimum solution could involve multiple vendors and come in at a cost that any financial controller would blanch at. As always, the delivered solution is a compromise between suitability and affordability. The ATO incident generated interesting technical debate on forums such as Whirlpool that sheds some light on SAN design and similar incidents.

Whenever an organization outsources an operational function, it places its reputation in the hands of the supplier. The bigger the organization, the harder the fall if things go wrong. And it does not get much bigger than a government’s reputation and the consequent slap in the face to the politicians in charge. The media exposure and feeding frenzy guarantee a major hit for the supplier as well. For the client, the old, tried-and-trusted avenues of due diligence and assigning qualified experts to perform rigorous scrutiny of the vendor’s proposed solution remain the best defense.

Australia shows the world how NOT to run an online census as someone stuffed up their load calculations in a BIG WAY. #CensusFail

On 9th August 2016, Australia was meant to hold its first online census.

What happened instead turned out to be a farce: a non-event, as millions of us tried to log in to complete the form, only to find that the website was down.

The lead-up to the census was controversial enough, with strong privacy concerns over the retention of personally identifiable information. But that was only a sneak preview of the monumental stuff-up that would happen on Census Night, when millions were unable to participate because the servers were down.

While hackers were initially blamed for the downtime, could it have been plain incompetence? But before I dive into that, here’s some background for our friends living outside Australia.

Privacy concerns

The 2016 Census had already drawn opposition from politicians who raised strong privacy concerns about sharing sensitive personally identifiable information with the Australian Bureau of Statistics (ABS). The problem was twofold: the retention of names for four years (previously only 18 months) and the assignment of a unique ID to each person, allowing individuals to be tracked across subsequent censuses over the course of their lives. The census would thus change from a snapshot view into a longitudinal study.

However, it is compulsory to answer every question on the census form, including names and addresses, and the government had announced that citizens would be fined $180 for every day of non-compliance, cumulatively. Despite this, several senators announced they would refuse to divulge their names and addresses on the census form. The hope was that if enough Australians followed suit, it would become exceedingly difficult to fine the large number who failed to comply with census regulations.

The government’s counter-arguments to the privacy concerns were rather nonsensical: statements that the collection of personal data by the ABS was no worse than “Facebook” or a “supermarket loyalty card”.

On census night things worked for a bit… and then fell over…

census.abs.gov.au – that’s where Australians had to go to complete their census forms.

About 2 million census forms were submitted on census night, before the system fell over. Australians looking to complete the census later in the night, myself included, were greeted with this wonderful page.

[Screenshot: the census website’s error page]

And a day later it got worse – the server appeared to be completely offline.

[Screenshot: the census server completely offline]

And now, another day later, the server is back online but unavailable:

[Screenshot: the census website back online but still unavailable]

Unsurprisingly, the ABS was forced to relax the $180-per-day penalty for non-compliance.

The 2016 census, described as the “worst-handled census in history”, will cost Australian taxpayers $470 million.

The Census Blame Game

Predictably, the entire census farce has sparked a blame game amongst politicians, ABS spokespersons and the wider tech world.

  1. Early reports from the ABS were that organized cybercriminals from outside the country were responsible for bringing down the website.
  2. Security experts were quick to dig deep into the alleged Denial of Service (DoS) attack and identified little evidence of a cyber-attack.
  3. The government eventually acknowledged that the ABS network failed not because of hackers or a malicious attack, but because of an “overcautious” response to a sudden influx of traffic perceived as a possible DoS attack. The issue was tracked down to overloaded routers and monitoring systems that raised false alarms.

That’s right: according to the latest explanations, the ABS servers received so much traffic it looked like a DDoS attack, so the overloaded routers were shut down.

Is that actually true? It wouldn’t be surprising if further changes in explanation were given. But let’s run with this argument for a while and see where it leads…

How bad at statistics are the Australian Bureau of Statistics?

The ABS claimed that they had load tested the website to handle 1 million forms per hour, paying a total of $504,017.50 for load testing services, scripts and licences in the last 12 months.

[Screenshot: RevolutionIT load-testing contract details]

Let’s think that through – 1 million form submissions per hour.

Australia has a population of approximately 24.1 million people, with an average household size of 2.6 people per household – meaning approximately 9.27 million households. At one census form per household, that means about 9.27 million census forms.

All Australians are supposed to complete the census on the night of 9th August, with fines for late submissions. Let’s say that gives a window of 6 hours – between 6pm and midnight – where we come home from work and get busy filling in the forms. (And yes, Perth is 2 hours behind, so technically I should allow for that, but we’re just talking averages here.)

Even if all the census form submissions were evenly distributed across those 6 hours, that’s an average of just over 1.5 million forms per hour – about 50% higher than the capacity the ABS claims it load tested.
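For anyone who wants to check that arithmetic, here is the same back-of-the-envelope calculation as a short Python script; the population, household size and six-hour window are the assumptions stated above, not ABS figures:

```python
# Back-of-the-envelope check of the census submission rate.
population = 24.1e6      # approximate Australian population
household_size = 2.6     # average people per household
window_hours = 6         # 6pm to midnight on census night

households = population / household_size   # one form per household
average_rate = households / window_hours   # forms per hour, evenly spread

print(f"Households (forms): {households / 1e6:.2f} million")
print(f"Average rate over {window_hours} h: {average_rate / 1e6:.2f} million forms/hour")
# Roughly 9.27 million forms and ~1.55 million forms/hour,
# i.e. about 50% above the 1 million forms/hour the ABS says it load tested.
```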

However, anyone who understands statistical modelling (yes, ABS, that should be you) understands that things just don’t happen uniformly. During times of peak demand, the load can spike to several times the average. If TV stations understand that viewership peaks in prime time, how could the ABS not have predicted the same?

Factors of Safety

Without going into the statistical modelling (let’s leave that to the ABS), it wouldn’t surprise me if, at peak times, 3 million forms per hour (roughly twice the average) would have been submitted had the website actually been working.

Therefore, in my opinion, the Census website should have been load tested to 5 million forms per hour. That extra 2 million forms per hour of headroom above the estimated peak is the “safety factor”. Civil engineers routinely build a safety factor into bridges and buildings, and software engineers should do the same in their designs.
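To make the safety-factor reasoning concrete, here is a small sketch of the capacity-planning arithmetic. The 2× peak multiplier and the margin on top of it are my own assumptions, not official ABS figures:

```python
# Rough capacity planning with a safety factor (illustrative numbers only).
average_rate = 1.55e6    # forms/hour, from the earlier calculation
peak_multiplier = 2.0    # assume demand spikes to roughly twice the average
safety_factor = 1.6      # engineering margin on top of the estimated peak

estimated_peak = average_rate * peak_multiplier   # ~3.1 million forms/hour
test_target = estimated_peak * safety_factor      # ~5 million forms/hour

tested_capacity = 1.0e6                           # what the ABS load tested
print(f"Estimated peak:   {estimated_peak / 1e6:.1f} M forms/hour")
print(f"Load-test target: {test_target / 1e6:.1f} M forms/hour")
print(f"Shortfall factor: {test_target / tested_capacity:.1f}x")
```

On these assumptions the system was tested to roughly a fifth of the capacity it arguably needed.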

So whoever thought that 1 million form submissions per hour was sufficient didn’t appear to know what they were doing.

What was actually load tested?

My next question was – what was actually load tested?

    • Was it just the web servers, perhaps put into an artificial testing environment and network?
    • OR was it the entire system, including the live production routers and complete network?

Often, testing is done in artificial environments. It’s possible that the web servers were load tested but the production network was not. We don’t know, so that’s only speculation. However, if the entire network had been tested adequately, I don’t believe the problems would have occurred.
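To illustrate the difference, here is a minimal sketch of an end-to-end load generator. This is not the tooling the ABS or its contractors actually used, and the target URL is a placeholder; the point is simply that simulated traffic should traverse the same routers, firewalls and load balancers that real users will hit, not an isolated lab network.

```python
# Minimal end-to-end load-test sketch (illustration only; URL is a placeholder).
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

TARGET = "https://census.example.gov.au/submit"   # placeholder endpoint
WORKERS = 200       # concurrent simulated users
REQUESTS = 5000     # total requests in this run

def hit(_):
    """Issue one request; return (HTTP status or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        with urlopen(TARGET, timeout=10) as resp:
            return resp.status, time.monotonic() - start
    except OSError:   # connection failures, timeouts, HTTP errors
        return None, time.monotonic() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        results = list(pool.map(hit, range(REQUESTS)))
    ok = sum(1 for status, _ in results if status == 200)
    slowest = max(elapsed for _, elapsed in results)
    print(f"{ok}/{REQUESTS} requests succeeded; slowest took {slowest:.2f}s")
```

Run against the real production path at realistic concurrency, a script like this would have exercised the very routers that reportedly fell over on census night.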

Questions unanswered…

IBM was paid $9.6 million to develop the eCensus solution. The last few days have shown that creating robust cloud services is not quite as easy as baking a cake.

As we all know, cloud computing is not immune to outages. For #CensusFail, the apparent root cause was traced back to flawed network design and protection mechanisms. In essence, the incident was a performance and availability failure rather than a security breach, and not one that is inherent to a well-designed cloud network.

But this should get everyone thinking: what would have happened if there actually was a DDoS attack on the Census website? What if there was a privacy breach, and data was either deleted maliciously or copied?

What is your safety factor?

When you use the cloud to store, share and communicate, do you have a safety factor?

What would happen if there were a hacking or DDoS attack on your cloud provider? Organizations must not treat cloud vendors as the last line of defence against security, privacy and even performance-related risks.

Advanced tools that back up critical data locally ensure the required information is always available, even during cloud outages.
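As one example of what that can look like in practice, here is a minimal sketch that copies a locally synced cloud folder into a timestamped local snapshot. The paths are placeholders and the approach is deliberately simple; real tooling would add incremental copies, verification and retention policies.

```python
# Minimal local snapshot of a cloud-synced folder (paths are placeholders).
import shutil
from datetime import datetime
from pathlib import Path

SOURCE = Path.home() / "CloudDrive" / "critical-data"   # synced cloud folder
BACKUP_ROOT = Path.home() / "local-backups"

def snapshot(source: Path, backup_root: Path) -> Path:
    """Copy the source tree into a new timestamped directory and return it."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    destination = backup_root / f"{source.name}-{stamp}"
    shutil.copytree(source, destination)
    return destination

if __name__ == "__main__":
    print(f"Snapshot written to {snapshot(SOURCE, BACKUP_ROOT)}")
```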

Let’s treat #CensusFail as a learning experience for us all.