“. .this is the first time this problem has been encountered anywhere in the world,” said acting CIO Steve Hamilton of the Australian Tax office (ATO). Except that maybe it has happened before. It’s a reasonable assumption that he was merely repeating a line that was fed to him by somebody he believed. I wonder who that could be?
The ATO transferred their data storage capability from end-of-life EMC/HPE equipment to a new HPE 3PAR SAN (Storage Area Network) ‘as-a-service’ model in Nov 2015. It failed “catastrophically” just over a year later. The ATO lost 1 Petabyte of data because the automatic failover to the second SAN did not come online. Corrupted storage blocks on the main SAN had been faithfully copied to the second SAN.
The ATO does have another backup source, so the data loss is not total or permanent. It knocked out the operation of a large portion of a nation state’s government department for two days. Undoubtedly there will be financial repercussions for HPE and it’s another major blow to the Australian government’s reputation for technical capability in a short period of time after the recent census disaster.
Who are HPE and what is 3PAR?
Hewlett Packard Enterprise invites large enterprises to outsource: “We deliver high-quality, high-value products, consulting, and support services in a single package. That’s one of our principal differentiators.” HPE and Dell fought a bidding war to acquire storage systems supplier 3PAR in 2010. HPE won and paid $2.35 billion. 3PAR SAN systems utilize solid state flash storage that takes advantage of virtualization and cloud resources to promise faster processing speeds.
What went wrong?
The backup design seemingly allowed undetected corrupted data storage blocks to be duplicated to the second SAN, which may indicate a lack of data integrity checking. The root cause analysis of high profile incidents like this is rarely made public. The culprit could be a defective firmware upgrade or simple human error. We will probably never know. The symptom has surfaced previously with 3PAR SAN solutions, like this incident two months earlier. Anecdotal evidence seems to indicate other similar occurrences but IT failures at regular commercial enterprises rarely make it into the headlines.
What could have been done to mitigate the extent of the impact?
From what we know, the design revolved around a single data domain. That appears to represent a single point of failure no matter how much redundancy is built in.
Who is the usual victim of incidents like this?
Large enterprises up to government level are key clients to vendors such as HPE. Outsourcing deals of this nature are big budget projects. The ATO installation was part of a $92 million (AUD 1.29 billion) IT investment, to put it in perspective. Finger pointing inevitably occurs but the client typically puts its faith in the perceived capability and reliability of the vendor for technical design and support of a fit-for-purpose delivery.
How can this scenario be avoided?
It all boils down to the robustness of the design and whether or not the client is willing to spend sufficient budget for the safest possible option. That is not to criticize the ATO. The optimum solution could involve multiple vendors and come in at a cost that any financial controller would blanch at. As always, the delivered solution is a compromise between suitability and affordability. The ATO incident generated interesting technical debate on forums such as Whirlpool that sheds some light on SAN design and similar incidents.
Whenever an organization outsources an operational function, it places its reputation in the hands of the supplier. The bigger the organization, the harder the fall if things go wrong. And it does not get much bigger than a government’s reputation and the consequent slap-in-the face to the politicians in charge. The media exposure and feeding frenzy guarantees a major hit to the supplier also. For the client, the old tried and trusted avenues of due diligence and assigning qualified experts to perform rigorous scrutiny of the vendor’s proposed solution remain the best defense.