Tag-stacking: happy days

Today I undertook some more awesome analytics.

I received some data at lunchtime courtesy of one of my more distant Twitter connections.  I’d been recommended to her by one of her Twitter-hungry ex-colleagues as a trusted member of the government community (I think), and importantly as someone who knows their way around Excel.

The data was very high profile—its basis has adorned the front page of BBC News in the not-too-distant past—but I won’t divulge any more information about its contents, nor those requesting the analysis.  Suffice to say it had nothing to do with school repairs.

Here’s the challenge I faced.

There were over 10,000 rows of data.  And there was a single column at the end that contained comma-separated tags for each record.  And the most tag-heavy record was furnished with a smidgeon over 200 tags.  My mission was to create a single list of tags, together with their frequencies.

The first step was to get each tag into a separate cell.  Data | Text to Columns allowed me to declare the comma as the separator, and what was a single column became 209 columns of data.  But there wasn’t an easy way to stack these that I could think of.

So I went for a wee.  And suddenly thought of an ideal solution.

The ADDRESS function brings back the A1-style (or even an R1C1-style, if you choose) cell reference.  So =ADDRESS(2,3) will bring back the cell reference of row 2, column 3, i.e. “C2”.

But it brings this back as text.  Which is where the INDIRECT function comes into play.  INDIRECT takes some text and interprets it as a formula.  So =INDIRECT(C2) will give you the contents of C2.

Combine the two, and you have:

=INDIRECT(ADDRESS(2,3))

This will bring back the contents of cell C2.  But if you make the two arguments of the ADDRESS formula themselves into cell references, you have dynamite.

So I created two columns, one to contain the row references of the tags and one containing their column references.  The rows data contained 209 1s followed by 209 2s, 209 3s etc. until we hit 10,000.  The columns data contained the numbers 1 through 209, repeated 10,000 times.

2,144,967 rows of data (to be precise) were too much for Excel 2007, so I had to break it into three chunks.  A minor inconvenience.

Applying the above formula to each of those row/column combinations brought back every single tag in a list, with zeros where there was no tag.  (I.e. a record with five tags would have those listed followed by 204 zeros.  Not the most efficient mechanism maybe, but it worked.)

The three tag-lists were sorted, the zeros removed and the resulting lists were stacked upon one another.  A simple PivotTable then gave me the frequency count for each tag.  Over 25,000 tags in total (the zeros having been removed), made up of over 6,000 unique values.

One satisfied customer.  And one satisfied analyst.  Happy days indeed.

Comments

Leave a Reply