Tag-stacking: happy days

Today I undertook some more awesome analytics.

I received some data at lunchtime courtesy of one of my more distant Twitter connections.  I’d been recommended to her by one of her Twitter-hungry ex-colleagues as a trusted member of the government community (I think), and importantly as someone who knows their way around Excel.

The data was very high profile—its basis has adorned the front page of BBC News in the not-too-distant past—but I won’t divulge any more information about its contents, nor those requesting the analysis.  Suffice to say it had nothing to do with school repairs.

Here’s the challenge I faced.

There were over 10,000 rows of data.  And there was a single column at the end that contained comma-separated tags for each record.  And the most tag-heavy record was furnished with a smidgeon over 200 tags.  My mission was to create a single list of tags, together with their frequencies.

The first step was to get each tag into a separate cell.  Data | Text to Columns allowed me to declare the comma as the separator, and what was a single column became 209 columns of data.  But there wasn’t an easy way to stack these that I could think of.

So I went for a wee.  And suddenly thought of an ideal solution.

The ADDRESS function brings back the A1-style (or even an R1C1-style, if you choose) cell reference.  So =ADDRESS(2,3) will bring back the cell reference of row 2, column 3, i.e. “C2″.

But it brings this back as text.  Which is where the INDIRECT function comes into play.  INDIRECT takes some text and interprets it as a formula.  So =INDIRECT(C2) will give you the contents of C2.

Combine the two, and you have:

=INDIRECT(ADDRESS(2,3))

This will bring back the contents of cell C2.  But if you make the two arguments of the ADDRESS formula themselves into cell references, you have dynamite.

So I created two columns, one to contain the row references of the tags and one containing their column references.  The rows data contained 209 1s followed by 209 2s, 209 3s etc. until we hit 10,000.  The columns data contained the numbers 1 through 209, repeated 10,000 times.

2,144,967 rows of data (to be precise) were too much for Excel 2007, so I had to break it into three chunks.  A minor inconvenience.

Applying the above formula to each of those row/column combinations brought back every single tag in a list, with zeros where there was no tag.  (I.e. a record with five tags would have those listed followed by 204 zeros.  Not the most efficient mechanism maybe, but it worked.)

The three tag-lists were sorted, the zeros removed and the resulting lists were stacked upon one another.  A simple PivotTable then gave me the frequency count for each tag.  Over 25,000 tags in total (the zeros having been removed), made up of over 6,000 unique values.

One satisfied customer.  And one satisfied analyst.  Happy days indeed.

Excel-based site map generator

A good six or seven years back, I developed an Excel spreadsheet that took a hierarchical data feed of pages (page name, level in the IA, page type) and generated a visual representation for you on the fly. A crop of the visual is shown below.

It used conditional formatting up the ying-yang, with particular complexities around developing the interconnects between the boxes. While there were some macros to re-hash the incoming data, the visuals were all based on conditional formats which were at the time limited to a maximum of three (plus the default format) per cell.

All in all, it was awesome. (If I may say so.) If anyone out there has a use for it, let me know.

Sitemap

Distances as the crow flies: the solution

Yesterday I posted about joyful Tuesday, a day I spent lost in Excel, data and maps.  And as requested by @lesteph, here’s the solution I developed.  I’ve sanitised it, but the principles still stand.

The spreadsheet takes six places: Edinburgh, Bristol, Cambridge, London, Cardiff and Halifax.  It shows their latitude and longitude and uses the following formula to calculate the distance in kilometres to one decimal place between each pairing.

=ROUNDDOWN(6371.0072*ACOS(COS(RADIANS(90-lat_from))*COS(RADIANS(90-lat_to))+SIN(RADIANS(90-lat_from))*SIN(RADIANS(90-lat_to))*COS(RADIANS(long_from-long_to))),1)

lat_from is the latitude of the “from” location, and you can guess the other three such references.  The references in the spreadsheet itself are cell references to allow easy copy/paste action.

Conditional formatting hides the lower half of the square, and emphasises those places that are closest together.

I made the schoolboy error of hardcoding the earth’s radius (6,371.0072km), meaning that the model does not scale to other planets or moons in our solar system.  Apologies.

Mapping on a grand scale: poetry

I had a riot yesterday.  At work.  I was truly in my element.

I was given a task to map the 300+ offices that my client had across the UK.  And from that to identify clusters.  Although they relied on the same underlying data, I treated these two tasks largely in isolation to one another.  After all, identifying clusters from a visual map is rather an inexact science.

The mapping was sublime.  I used a Google Spreadsheet to convert postcodes to latitude and longitude coordinates (using the advice I’d previously documented here), used this to generate a KML file (through some fancy Excel formulae) which I could then import into Google Maps to create a visual representation of the locations.  Awesome.

Then on to the distances.  I’d had a vision the previous evening that I wanted to create one of those triangles that you used to see in the back of road atlases, across which you could trace your finger to figure out the distance from Plymouth to Edinburgh, for example.  So I went about creating one.

I made column headings out of the locations in Excel, transposing the data to duplicate them as row headings, together with their respective latitude and longitude coordinates.  I then used trigonometry (with the earth’s radius as a constant) to find the distance between each pair of locations (as the crow flies rather than driving distance).  A lovely piece of conditional formatting (if row number is greater than column number) allowed me to hide all the lower half of the triangle—a number format of ;;; hides a cell’s contents without affecting the value of the cell.

I was lost in Excel and Google Maps for an entire morning.  The subject matter had me gripped.  The quality of the output was top drawer; and the importance of the messages therein was right up there.  I love days like that.

Excel is like life

Life is complex.  Problems are complex.  And it’s rare that an intricate problem can be solved by a simple solution.  Or indeed that a single solution is the only one.  Excel is a good analogy here.

It has rows and columns and it was invented to organise data.

Yet in its 2007 version, it comes with eight standard ribbons.  (Further ribbons present themselves in specific situations.)  The Home ribbon alone has 42 separate items within it.  Twenty of these have dropdowns from which further options can be selected.  A very conservative estimate would be that an average of five sub-options are available for each of these.  If the other seven ribbons are similar, then that’s 976 options, and that doesn’t account for the plethora of formulae that can be written in each of the cells, and the canvas of colours available.  (As a slight aside, that would allow 16 trillion possible actions across the cells of a single worksheet.)

Admittedly, Excel was not designed to solve a single problem.  But the arsenal of tools available merely highlights the huge variety of ways in which problems can be addressed.  If you give ten people the same problem and ask them to solve it in Excel, each one will address it in a different way, sometimes subtly different, sometimes wildly different.  If you don’t tell them which tool to use, then your range of solutions will widen further.

Killer Sudoku Pro: spreadsheet helper

My dad called yesterday, leaving a voicemail informing me that he had a technical problem.  I returned the call, expecting to have to solve some form of driver compatibility issue or the like, but instead he gave me the following conundrum.

There is a variant of Sudoku called Killer Sudoku, in which areas of the grid, either square, rectangular or some other irregular shape, are caged off, a small number in the corner indicating the value to which the numbers therein must sum.  My dad informed me that there is a further variant (Killer Sudoku Pro) in which these cages are further marked to indicate whether the number in the corner represents the sum of the numbers therein, their product, or their subtraction.  (For subtraction, this is taken as the highest number minus all other numbers.)  There is also a divide operator, saved for the cages two squares in size—this equates to the higher number divided by the lower number. He asked whether there was a way to understand what options were valid in cages three-squares in size (squares are considered sufficiently basic for the human brain to fathom immediately) for each of the three valid operators.

I turned to Excel.  (No shit, Sherlock.)  And here is the result.  It’s a thing of beauty, if I may say so, the most poetic moment being when I established that the result of the minus operator could be established through the following formula:

(2 * MAX(cells)) – SUM(cells)

Hope you enjoy playing with it as much as I did creating it.

Tubewhacking

Paul Clarke introduced me today to the pastime of Tubewhacking.  Similar to Googlewhacking, it involves finding an English word none of whose letters appears in the name of a Tube station, and for that station to be unique in that quality.  The most famous example I know is St. John’s Wood, none of the letters of the word mackerel appearing in its name, a claim that no other station can make.

I wondered whether any stations were themselves Tubewhacks of other stations.  So I got to work.

Fortunately, the number of columns in Excel has increased recently—I needed 7,616 columns to complete my logic, along with a tidy 11.8Mb.  And below is a summary of the results.

There are 59 stations that have Tubewhacks, although their Tubewhacks come from only 22 unique stations.  Bank is the most common, being the Tubewhack of a whopping (not Wapping) nine stations. Each of Vauxhall and Woodford accounts for seven Tubewhacks.

Below is the full list—station on the left, Tubewhack on the right.

Barbican: Temple
Becontree: Vauxhall
Bermondsey: Vauxhall
Bond Street: Vauxhall
Boston Manor: Chigwell
Brent Cross: Vauxhall
Burnt Oak: Chigwell
Camden Town: Ruislip
Canary Wharf: Temple
Canning Town: Shepherd’s Bush
Charing Cross: Temple
Chorleywood: Bank
Colliers Wood: Bank
Dagenham Heathway: Ruislip
East Putney: Woodford
Eastcote: Kilburn
Edgware: Pimlico
Elephant & Castle: Woodford
Elm Park: St. John’s Wood
Farringdon: Temple
Fulham Broadway: Epping
Gants Hill: Woodford
Goldhawk Road: Upney
Green Park: Dollis Hill
Hainault: Woodford
Hampstead: Kilburn
Highgate: Woodford
Holloway Road: Epping
Ickenham: Woodford
Kew Gardens: Pimlico
Knightsbridge: Oval
Leyton: Chiswick Park
Limehouse: Bank
Liverpool Street: Bank
Mill Hill East: Woodford
Mornington Crescent: Vauxhall
Newbury Park: Dollis Hill
Perivale: St. John’s Wood
Piccadilly Circus: Kenton
Plaistow: Debden
Putney Bridge: Oval
Queensway: Pimlico
Richmond: St. Pauls
Rotherhithe: Bank
Royal Victoria: Debden
Shadwell: Brixton
Shoreditch: Bank
Southfields: Bank
St. James’s Park: Hillingdon
Tooting Bec: Vauxhall
Tower Hill: Bank
Upminster Bridge: Oval
Upton Park: Chigwell
Warren Street: Pimlico
West Brompton: Vauxhall
West Finchley: Moor Park
West Ham: Kilburn
West Hampstead: Kilburn
West Ruislip: Bank

Liverpool Street and Bank form the only pairing in the above list that are one stop away from one another.

Formatted hashtags

Paul Clarke speculates here about the usefulness of Twitter, with a specific focus on whether its penetration can be used to local effect: in reporting weather conditions or traffic disruption, for example.

He cites a couple of examples of hashtags (#uksnow, #uktraffic), both of which require(d) a specific syntax following the hashtag.  For example, #uktraffic needs to of the form:

#uktraffic [road] [where: jcts or place] [direction] [description] RT&seewhathappens

By submitting hashtags in that form, it enables them to be parsed for use in other applications, allowing them to be mapped, summarised etc.

With all due respect, I don’t think the Twitter community is sufficiently diligent to obey what seem like simple instructions such as this.  Yes, they can probably remember the hashtag.  But I’m sceptical as to whether they can remember the specific items that follow, let alone the order in which they need to appear.

For syntax-dependent hashtags like this to work, there needs to be a mechanism by which they can be registered, complete with the required syntax.  That way, end users can subscribe to such hashtags, allowing their client applications to prompt them with the required syntax once they’ve entered that hashtag.  Much in the way that Excel prompts you with the arguments needed for a formula that you’ve started to use.  (Hell, even I never remember the order in which SUMIF arguments should be entered.)

Without this, the hashtags might be used, but the data they generate will be next-to-useless.

Monday 6 July, 2009. Statistically significant

I’m sure you’re all aware of the double significance of Monday.  But I thought I’d document it for completeness and for the heathens among you.

Monday 6 July, 2009 will mark the fifth anniversary of this humble little blog.  And it will also be the 40,000th day since the start of the 20th century. *

The latter fact is, of course, significant given that Excel calculates time in days using 1 January, 1900 as its point of reference.  (50,000 will occur on 21 November, 2036, in case you want to book a venue in advance.)

Happy fifth birthday for Monday, little blog.  And happy 40,000th birthday, Excel.

* Actually, Tuesday 7 July, 1900 is the 40,000th day since the start of the 20th century.  But Excel purposely included a bug making 1900 a leap year (it wasn’t; remember?) to comply with other flawed date systems.  So Excel thinks 6 July, 2009 is represented as 40,000, even though it should be 39,999.

Re-categorising: I’m scared

I’ve been putting Google Spreadsheets posts in the All things Excel category for want of a better place to put them.  Should I change the name of the category to Spreadsheets?  It seems a bit like the end of an era to get rid of the Excel word.  But times are achangin’, and I think I need to change with them.  Thoughts?

Next Page →