How hard is it to count people?

Graph

Census forms are being filled in the length and breadth of the UK, but how hard can it be to count people, asks Michael Blastland in his regular column.

How wrong can the Census go? Not that I want to dampen anyone's enthusiasm. Quite the reverse: I have a weird admiration for people who count people. They're grappling with one of the most fiendish problems - us.

The graph above is the most extreme example I know of what can go wrong. The graphic is from the US Census Bureau - an impressive outfit. It shows the attempt to find out how many people were aged over 100 at each 10-yearly Census - and it shows two figures.

The first, the enumerated, is the number who said they were over 100. The second, the preferred estimate, is the number the Census Bureau thought really were over 100.

What happened in 1970 is anyone's guess. My hunch is that it was 1970 and they were all on acid. "Yeah, I'm 100, man. I was 100 in the last life too."

Lies and mistakes

In truth, I don't know what went wrong. Maybe the form was confusing that year, though how hard it can be to ask for an age, or answer the question, I'm not sure.

Maybe new benefits were announced which encouraged people to be vague about their birth date. Maybe there was a TV show the night before the Census celebrating the hip lifestyle of the new centenarian, maybe an organised conspiracy by grey pressure groups to increase healthcare provision.

People who campaign for open data - the easy availability of official and unofficial statistics of all kinds - often hate the fact that the people who gather and release it like to present it their own way.

These campaigners say things like "just give us the data!" Whole conferences have chanted that phrase. I'm with them, but only so far. Raw data is hellish hard work. It includes lies and mistakes and gaps that require endless cross-checking, investigation, weighting and adjustment.

Can we just extrapolate from the data we gathered successfully and assume the same pattern applies to the households that didn't reply? Not necessarily. Maybe a fair proportion of those who didn't reply were all from one group, like young men who couldn't give a... maybe.

But how would we know for sure who didn't reply? How do you count the stuff that wasn't counted? That's why the real work of counting starts when Census day is done.
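To make the extrapolation problem concrete, here is a minimal sketch in Python. Every number in it is invented for illustration, and the group names, the `population`, `replied` and `replied_yes` fields and the two functions are hypothetical, not anything a census office actually uses. It compares a naive estimate, which assumes the people who didn't reply look just like those who did, with one that weights each group's replies up to its full size.

```python
# Invented figures for an imaginary town of 10,000 people.
# "replied_yes" is the number of responders who ticked some box
# (say, "lives alone"). Young men reply less often and tick it more often.
groups = {
    "young men":     {"population": 1000, "replied": 500,  "replied_yes": 200},
    "everyone else": {"population": 9000, "replied": 8100, "replied_yes": 810},
}


def naive_estimate(groups):
    """Share of 'yes' among all replies, ignoring who failed to reply."""
    replied = sum(g["replied"] for g in groups.values())
    yes = sum(g["replied_yes"] for g in groups.values())
    return yes / replied


def weighted_estimate(groups):
    """Scale each group's replies up to its full population first.

    This only works if the true size of each group is already known,
    which is an assumption made here purely for the sake of the example.
    """
    total = sum(g["population"] for g in groups.values())
    yes = sum(
        g["replied_yes"] * g["population"] / g["replied"]
        for g in groups.values()
    )
    return yes / total


print(f"naive:    {naive_estimate(groups):.3f}")
print(f"weighted: {weighted_estimate(groups):.3f}")
```

With these made-up figures the naive estimate comes out at roughly 11.7% and the weighted one at 13%. The catch is that the weighting needs to know how many young men there really are, which is exactly the thing nobody managed to count.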

If the US Census Bureau had just given us the raw data in 1970, we could have produced some beautiful graphics about the astonishing, breathtaking, apocalyptic, budget-busting rise in the very old. Raw data isn't fact, still less is it information.

Problems and lies

So, in extremis, raw data might even produce something like 21 times too many centenarians. Sorting this out is often called data cleaning. Some react to that phrase as if it concealed the black arts of statistical fiddling. But it's usually just the recognition that counting people is tough - people who don't always co-operate, who lie, who are confused, who can't be bothered, who don't understand, who think it's hilarious to invent new religions, who lost the form, who…

Sure, the Census is an evil conspiracy to pry, so that they, whoever they are, can know all about us. Until you see raw data. A good antidote to the evil-empire view is to come face to face with real-life counting. You soon realise that governments know half as much as they like to pretend, largely because gathering information is a bigger, messier pig-sty of labour and guesswork than is often assumed.

Which is why they do it. Because they know a lot less than you probably think and always will. Every source of data is riddled with problems. For a sound guide to the travails of harvesting simple numbers, try Information Generation, a book by David Hand, a great insight into a simple business.

So, done your form? How was it for you? Nothing, I tell you, to how hard it'll be for them.