Tag-stacking: happy days

Today I undertook some more awesome analytics.

I received some data at lunchtime courtesy of one of my more distant Twitter connections.  I’d been recommended to her by one of her Twitter-hungry ex-colleagues as a trusted member of the government community (I think), and importantly as someone who knows their way around Excel.

The data was very high profile—its basis has adorned the front page of BBC News in the not-too-distant past—but I won’t divulge any more information about its contents, nor those requesting the analysis.  Suffice to say it had nothing to do with school repairs.

Here’s the challenge I faced.

There were over 10,000 rows of data.  And there was a single column at the end that contained comma-separated tags for each record.  And the most tag-heavy record was furnished with a smidgeon over 200 tags.  My mission was to create a single list of tags, together with their frequencies.

The first step was to get each tag into a separate cell.  Data | Text to Columns allowed me to declare the comma as the separator, and what was a single column became 209 columns of data.  But there wasn’t an easy way to stack these that I could think of.

So I went for a wee.  And suddenly thought of an ideal solution.

The ADDRESS function brings back the A1-style (or even an R1C1-style, if you choose) cell reference.  So =ADDRESS(2,3) will bring back the cell reference of row 2, column 3, i.e. “C2″.

But it brings this back as text.  Which is where the INDIRECT function comes into play.  INDIRECT takes some text and interprets it as a formula.  So =INDIRECT(C2) will give you the contents of C2.

Combine the two, and you have:

=INDIRECT(ADDRESS(2,3))

This will bring back the contents of cell C2.  But if you make the two arguments of the ADDRESS formula themselves into cell references, you have dynamite.

So I created two columns, one to contain the row references of the tags and one containing their column references.  The rows data contained 209 1s followed by 209 2s, 209 3s etc. until we hit 10,000.  The columns data contained the numbers 1 through 209, repeated 10,000 times.

2,144,967 rows of data (to be precise) were too much for Excel 2007, so I had to break it into three chunks.  A minor inconvenience.

Applying the above formula to each of those row/column combinations brought back every single tag in a list, with zeros where there was no tag.  (I.e. a record with five tags would have those listed followed by 204 zeros.  Not the most efficient mechanism maybe, but it worked.)

The three tag-lists were sorted, the zeros removed and the resulting lists were stacked upon one another.  A simple PivotTable then gave me the frequency count for each tag.  Over 25,000 tags in total (the zeros having been removed), made up of over 6,000 unique values.

One satisfied customer.  And one satisfied analyst.  Happy days indeed.

The UK’s tax distribution

Someone on Question Time the other night—I forget who—stated that the only way to recoup significant levels of tax was to tax the poor.  I decided to do some analysis, and found that it’s largely true.

As of 2004/5, according to HMRC, the mean income of taxpaying individuals was £22,800, the median being a good deal lower than this—61% of taxpayers earned £20,000 or less.  If every pound of income generated the same amount of tax, then 52.3% of tax would have come from those earning £30,000 or less six years ago.  And less than 25% of tax would have come from those earning £50,000 or more.

Now arguably, the greater one’s disposable income, the more tax those pounds generate—some of the essentials in life are exempt from VAT.  But this effect will be marginal.

So the panelist was pretty much right.  To have a significant bearing on the amount of tax brought into HM Treasury (via the front door of 100 Parliament Street), you have to go after the low-end of the pay scale.

Distances as the crow flies: the solution

Yesterday I posted about joyful Tuesday, a day I spent lost in Excel, data and maps.  And as requested by @lesteph, here’s the solution I developed.  I’ve sanitised it, but the principles still stand.

The spreadsheet takes six places: Edinburgh, Bristol, Cambridge, London, Cardiff and Halifax.  It shows their latitude and longitude and uses the following formula to calculate the distance in kilometres to one decimal place between each pairing.

=ROUNDDOWN(6371.0072*ACOS(COS(RADIANS(90-lat_from))*COS(RADIANS(90-lat_to))+SIN(RADIANS(90-lat_from))*SIN(RADIANS(90-lat_to))*COS(RADIANS(long_from-long_to))),1)

lat_from is the latitude of the “from” location, and you can guess the other three such references.  The references in the spreadsheet itself are cell references to allow easy copy/paste action.

Conditional formatting hides the lower half of the square, and emphasises those places that are closest together.

I made the schoolboy error of hardcoding the earth’s radius (6,371.0072km), meaning that the model does not scale to other planets or moons in our solar system.  Apologies.

Mapping on a grand scale: poetry

I had a riot yesterday.  At work.  I was truly in my element.

I was given a task to map the 300+ offices that my client had across the UK.  And from that to identify clusters.  Although they relied on the same underlying data, I treated these two tasks largely in isolation to one another.  After all, identifying clusters from a visual map is rather an inexact science.

The mapping was sublime.  I used a Google Spreadsheet to convert postcodes to latitude and longitude coordinates (using the advice I’d previously documented here), used this to generate a KML file (through some fancy Excel formulae) which I could then import into Google Maps to create a visual representation of the locations.  Awesome.

Then on to the distances.  I’d had a vision the previous evening that I wanted to create one of those triangles that you used to see in the back of road atlases, across which you could trace your finger to figure out the distance from Plymouth to Edinburgh, for example.  So I went about creating one.

I made column headings out of the locations in Excel, transposing the data to duplicate them as row headings, together with their respective latitude and longitude coordinates.  I then used trigonometry (with the earth’s radius as a constant) to find the distance between each pair of locations (as the crow flies rather than driving distance).  A lovely piece of conditional formatting (if row number is greater than column number) allowed me to hide all the lower half of the triangle—a number format of ;;; hides a cell’s contents without affecting the value of the cell.

I was lost in Excel and Google Maps for an entire morning.  The subject matter had me gripped.  The quality of the output was top drawer; and the importance of the messages therein was right up there.  I love days like that.

155.6 swimming pools of oil

I love data.  And I love it when people take abstract concepts and try to put them in terms that people can relate to.  The most recent example has been the oil spill.  It’s nigh-on impossible to visualise the quantity of oil that has leaked from the Deepwater Horizon well since 22 April.  So many have tried to give estimates that we can relate to.  All of the quotes below are from different sources, found courtesy of Alan.

[The] flow rate […] would fill more than 50 swimming pools per day.

The Coast Guard [said] earlier it was at least 1.6 million gallons [since the spill started]—equivalent to about 2½ Olympic-sized swimming pools.

BP’s estimated 5,000 barrels a day […] would be approximately 150,000 barrels (or 6,300,000 gallons). That’s barely enough to fill 286 swimming pools: sixteen feet, by thirty-two feet, by eight and a half feet deep.

Once the pipe has been cut, the oil will spew into the Gulf of Mexico unobstructed, enough to fill an average swimming pool every hour

In just one day, the oil leaked would be enough to fill up six Olympic-sized swimming pools .

The only consistency here is the unit of measurement.  (It seems that the SI unit for oil is the Olympic-size swimming pool.)  The quantities are either formed of very varied guesswork or extremely poor arithmetic.  Probably a combination of the two.

My own estimate, based on an average between a low and a high estimate of the volume, was that as of 3 June, the oil would be sufficient to fill 155 Olympic-size swimming pools, with enough left over to start filling the 156th pool.  But this meant nothing to me.  So I changed the frame of reference.

If the oil was solid and laid evenly over the entirety of Hyde Park, it would form a paste 40 metres deep, equivalent to nine London buses stacked vertically.  Sounds a lot, doesn’t it?

Yet if the oil was to fall as snow over the entirety of the UK, it would amount to 0.2 seconds of snowfall.  Nothing at all, right?

While units that we are more familiar with help us visualise things that are intangible to us, they also serve to create dramatic headlines, headlines that could equally be replaced with innocuous ones if the newspapers were driven by ambivalence.

And who’d want to swim in oil anyway?

Killer Sudoku Pro: spreadsheet helper

My dad called yesterday, leaving a voicemail informing me that he had a technical problem.  I returned the call, expecting to have to solve some form of driver compatibility issue or the like, but instead he gave me the following conundrum.

There is a variant of Sudoku called Killer Sudoku, in which areas of the grid, either square, rectangular or some other irregular shape, are caged off, a small number in the corner indicating the value to which the numbers therein must sum.  My dad informed me that there is a further variant (Killer Sudoku Pro) in which these cages are further marked to indicate whether the number in the corner represents the sum of the numbers therein, their product, or their subtraction.  (For subtraction, this is taken as the highest number minus all other numbers.)  There is also a divide operator, saved for the cages two squares in size—this equates to the higher number divided by the lower number. He asked whether there was a way to understand what options were valid in cages three-squares in size (squares are considered sufficiently basic for the human brain to fathom immediately) for each of the three valid operators.

I turned to Excel.  (No shit, Sherlock.)  And here is the result.  It’s a thing of beauty, if I may say so, the most poetic moment being when I established that the result of the minus operator could be established through the following formula:

(2 * MAX(cells)) – SUM(cells)

Hope you enjoy playing with it as much as I did creating it.

From what point do they measure age?

I took our daughter to the London Aquarium today to celebrate her third birthday.  Quite fitting really, given that it sits in the shadows of St. Thomas’ hospital, the wonderful hospital at which she came into the world on what will always remain the most fabulous day of my life.

The Aquarium charges £15.50 for three-year-olds, but two-year-olds get in for free.  We joined the queue at 12.50pm, at which time she was twelve minutes shy of being three.  Our tickets were stamped at 1.10pm, but I lied about her age, claiming she was at least eight minutes younger than she was, saving £15.50 in the process.

Interestingly, the regular queue was about 90 minutes in length.  But you could join the priority queue (which contained no one) if you were willing to pay an extra £3 per ticket, taking an adult ticket price from £17.50 to £20.50.  (That’s £2 per person-hour saved.)

I was the only person I noticed that took the lady up on the offer as she advertised it to the people in the queue.  However I bet that if the proposition was changed, the respective queues would be very different.  If a regular ticket cost £20.50, but you could save £3 by taking the alternative queue, I bet lots more would go for the default, more expensive option, despite it being less appealing do to its, er, appeal.  I find this conundrum very fascinating.

March 14, 1592

With American pi day fast approaching (this Sunday), I got to thinking that on March 14, 1592, this was a big, big, geeky event.  Not that America had been invented by then, of course.

Proof by induction: Tube carriages are infinitely large

Take a standard Northern Line Tube carriage.

However many people are on board that carriage, at least one more adult can squeeze on.  This step has been proven at every station on the northbound southern leg of the Northern Line for many a year.

The smallest adult weighs 4.5kg, and with the average human having a body density of around one gram per cubic centimetre, that equates to a lower bound of 0.0045 cubic metries occupied by a person.

Therefore the lower bound for the volume taken up by a Northern Line Tube carriage is 0.0045 cubic metres multiplied by the number of occupants.

Given that the number of occupants can always be increased by one, a Northern Line Tube carriage is infinitely large.

QED.

It’s a lottery

Let’s assume I bought a Euromillions lottery ticket in the UK on Friday.  And I didn’t check the result that evening, but knew that 300 million likeminded people across Europe also bought a ticket for that draw.

I am then told that the £90m jackpot is to be split between the owners of two tickets, both of which were bought in the UK.  On knowing that information, do the odds of me having the winning ticket increase?

I then find out that one of the winning tickets is owned by someone who lives on my street.  (This piece is fiction, btw.)  Do my chances of winning increase?

It’s like the Monty Hall problem.  And no.  They don’t.

Next Page →