Australia shows the world how NOT to run an online census as someone stuffed up their load calculations in a BIG WAY. #CensusFail
On 9th August 2016, Australia was meant to hold its first online census.
What happened instead turned out to be a farce – a non-event, as millions of us tried to log in to complete the form, only to find that the website was down.
The lead up to the census was controversial enough, with strong privacy concerns over the retention of personally identifiable information. But that was only a sneak preview of the monumental stuff up that would happen on Census Night when millions were unable to participate in the census because the servers were down.
While hackers were initially blamed for the downtime, could it have been plain incompetence? But before I dive into that, here’s some background for our friends living outside Australia.
The Census 2016 had already drawn opposition from politicians who raised strong privacy concerns over sharing sensitive personally identifiable information with the Australian Bureau of Statistics (ABS). The problem was twofold: the retention of names for a period of 4 years (previously only 18 months) and the assignment of a unique ID to each person, allowing individuals to be tracked across subsequent censuses over the course of their lives. Thus the census would change from being a snapshot view into a longitudinal study.
However, it is compulsory to answer every question on the census form, including name and address. The government announced that citizens would be fined $180 per day, accumulating for every day of non-compliance. Despite this, several senators announced they would refuse to divulge their names and addresses on the census form. The hope was that if enough Australians followed suit, it would become exceedingly difficult to fine the large number who failed to comply with census regulations.
The Government’s counterargument to the privacy concerns was rather nonsensical: statements that the collection of personal data by the ABS was no worse than “Facebook” or a “supermarket loyalty card”.
On census night things worked for a bit… and then fell over…
census.abs.gov.au – that’s where Australians had to go to complete their census forms.
About 2 million census forms were submitted on census night, before the system fell over. Australians looking to complete the census later in the night, myself included, were greeted with this wonderful page.
And a day later it got worse – the server appeared to be completely offline.
And now, another day later, the server is back online but unavailable:
Predictably, the ABS was forced to relax the $180-per-day punishment of non-compliance.
The 2016 census, described as the “worst-handled census in history”, will cost Australian taxpayers $470 million.
The Census Blame Game
Predictably, the entire census farce has sparked a blame game amongst politicians, ABS spokespersons and the wider tech world.
- Early reports from the ABS were that organized cybercriminals from outside the country were responsible for bringing down the website.
- Security experts were quick to dig deep into the alleged Denial of Service (DoS) attack and identified little evidence of a cyber-attack.
- The government eventually acknowledged the ABS network failed not because of hackers or a malicious attack, but because of an “overcautious” response to a sudden influx of traffic perceived as a possible DoS attack. The issue was tracked down to overloaded routers and alarm mechanisms that falsely flagged the traffic as an attack.
That’s right: according to the latest explanations, the ABS servers received so much traffic it looked like a DDoS attack, so the overloaded routers were shut down.
Is that actually true? It wouldn’t be surprising if further changes in explanation were given. But let’s run with this argument for a while and see where it leads…
How bad at statistics are the Australian Bureau of Statistics?
The ABS claimed that they had load tested the website to handle 1 million forms per hour, paying a total of $504,017.50 for load testing services, scripts and licences in the last 12 months.
Let’s think that through – 1 million form submissions per hour.
Australia has a population of approximately 24.1 million people, with an average household size of 2.6 people per household – meaning approximately 9.27 million households. At one census form per household, that means about 9.27 million census forms.
All Australians are supposed to complete the census on the night of 9th August, with fines for late submissions. Let’s say that gives a window of 6 hours – between 6pm and midnight – where we come home from work and get busy filling in the forms. (And yes, Perth is 2 hours behind, so technically I should allow for that, but we’re just talking averages here.)
Even if all the census form submissions were evenly distributed across those 6 hours, that’s an average of just over 1.5 million forms per hour – 50% higher than the load the ABS claimed to have tested.
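That arithmetic can be checked in a few lines (a back-of-the-envelope sketch using the approximate figures quoted above):

```python
# Back-of-the-envelope load estimate, using the figures quoted above.
population = 24.1e6       # approximate Australian population
household_size = 2.6      # average people per household
window_hours = 6          # assumed 6pm-to-midnight submission window

households = population / household_size        # one form per household
avg_forms_per_hour = households / window_hours

print(f"~{households / 1e6:.2f} million census forms")
print(f"~{avg_forms_per_hour / 1e6:.2f} million forms/hour on average")
```

That works out to roughly 9.27 million forms, or about 1.54 million per hour on average – already past the tested capacity before accounting for any peaks.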
However, anyone who understands statistical modelling (yes, ABS, that should be you) knows that demand doesn’t arrive uniformly. During times of peak demand, the load can spike to several times the average. If TV stations understand that viewership peaks in prime time, how could the ABS not have predicted the same?
Factors of Safety
Without going into the statistical modelling (let’s leave that to the ABS), it wouldn’t surprise me if, at peak times, 3 million forms per hour (twice the average) would have been submitted had the website actually been working.
Therefore, in my opinion, the Census website should have been load tested to 5 million forms per hour. The extra 2 million is called a “safety factor”. Civil engineers regularly incorporate a safety factor into the design of bridges and buildings, and so should software engineers in their designs.
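The sizing argument can be condensed the same way (note the 2× peak multiplier and the 2-million-forms margin are this article’s assumptions, not measured figures):

```python
# Rough capacity target based on the assumptions above (all approximate).
avg_load = 1.5e6             # average forms/hour over the 6-hour window
peak_load = 2 * avg_load     # assumed peak demand: twice the average
safety_margin = 2e6          # extra headroom -- the "safety factor"
design_load = peak_load + safety_margin
tested_load = 1e6            # capacity the ABS said it load tested to

print(f"Design target ~{design_load / 1e6:.0f}M forms/hour; "
      f"tested to only {tested_load / 1e6:.0f}M/hour.")
```

On those assumptions, the system was tested to a fifth of the load it should have been designed for.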
So whoever thought that 1 million form submissions per hour was sufficient didn’t appear to know what they were doing.
What was actually load tested?
My next question was – what was actually load tested?
- Was it just the web servers, perhaps put into an artificial testing environment and network?
- OR was it the entire system, including the live production routers and complete network?
Often, testing is done in artificial environments. It’s possible that the web servers were load tested, but not the production network. We don’t know, so that’s only speculation. However, had the entire network been tested adequately, I don’t believe these problems would have occurred.
IBM were paid $9.6m to develop the eCensus solution. The last few days have shown that creating robust cloud services is not quite as easy as baking a cake.
As we all know, cloud computing is not immune to outages. For #CensusFail, the apparent root cause was tracked back to flawed network design and protection mechanisms. In essence, the incident was a performance failure rather than a security breach – and neither is inherent to a well-designed cloud network.
But this should get everyone thinking: what would have happened if there actually was a DDoS attack on the Census website? What if there was a privacy breach, and data was either deleted maliciously or copied?
What is your safety factor?
When you use the cloud to store, share and communicate, do you have a safety factor?
What would happen if there were a hacking or DDoS attack on your cloud provider? Organizations must not treat cloud vendors as the last line of defence against security, privacy and even performance-related risks.
Tools that back up critical data locally ensure the required information remains available, even during cloud outages.
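As a minimal sketch of that idea – the folder paths and the `snapshot` helper below are hypothetical, not any particular vendor’s tool – a periodic local copy of a cloud-synced folder might look like:

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

def snapshot(cloud_synced_dir: str, backup_root: str) -> Path:
    """Copy a cloud-synced folder into a timestamped local snapshot."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = Path(backup_root) / stamp
    shutil.copytree(cloud_synced_dir, dest)  # fails loudly if dest exists
    return dest
```

Run on a schedule (cron, Task Scheduler), this keeps local copies you can fall back on if the cloud service goes dark.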
Let’s treat #CensusFail as a learning experience for us all.