Paying it forward

I received a request yesterday for some Excel help. It came via Twitter from an intern working for quite a big internet company in San Francisco.

There was nothing in it for me beyond shits and giggles. But I helped.

I received a dataset and did half an hour’s worth of analysis last night. We were going to speak about it this morning UK time, but his internet failed him, so that didn’t happen. But I instead spent another half hour cleaning up the analysis and polishing things off, then I was done. I spent another hour or so on a video call tonight explaining the steps I went through.

To me, it was simple stuff. Some text manipulation and data cleansing, some PivotTables, that was pretty much it.

To him, he was blown away. (“This is incredible.”) The first draft of the output was well-received, and it looks like the second iteration has put it to bed.

The commitment for me was relatively trivial. A couple of hours’ effort, a few moments wasted waiting for a Skype call, a little electricity. But to him, I get the feeling that the output is immense, and that it will allow him to shine that bit brighter in his internship.

I’ll likely never meet the guy. But helping him feels good. Really good. I guess it’s what the Americans would call “paying it forward”.

Excel KML creator

So. I created an Excel KML creator. For those not in the know, KML files are used by Google Maps as the basis for automatically uploading icons.

The KML creator allows you to create your data in a familiar setting, the associated KML generating itself in the background. This version allows for up to 1,000 points to be created.

Here’s what you do to create the KML:

To create the associated map:
  • Go to Google Maps,  logged in with a Google account
  • Hit My places
  • Hit Create map
  • Give it a name and description and hit Import
  • Choose the txt file you just created.
If you need help creating latitude and longitude coordinates from postcodes or zip codes, read this post. It describes how this can be done in Google Docs, albeit 50 locations at a time.

Wizard of Excel

I have a new website. It’s called Wizard of Excel. And it can be found here.

A few weeks back, my friend Steph suggested that I create a platform on which to share my Excel knowledge and experience. As well as offering advice to anyone who might benefit, it might, in the future, provide an advertising platform, as well as hopefully bringing in some consultancy and training business.

So far, I’ve written 32 posts that have attracted 1,100 viewings. There’s lots of content still to be written, and my hope is that traffic continues to grow.

The highlight to date has been a spreadsheet designed to work out which Olympic tickets you might have received from Lord Coe. Try it out.

Huge credit to Steph for the design and hosting. He’s done a fabulous job to get it looking fab and to put it on what appears to be a solid platform.

Enjoy!

Please let me know what you think of the new site

Tag-stacking: happy days

Today I undertook some more awesome analytics.

I received some data at lunchtime courtesy of one of my more distant Twitter connections.  I’d been recommended to her by one of her Twitter-hungry ex-colleagues as a trusted member of the government community (I think), and importantly as someone who knows their way around Excel.

The data was very high profile—its basis has adorned the front page of BBC News in the not-too-distant past—but I won’t divulge any more information about its contents, nor those requesting the analysis.  Suffice to say it had nothing to do with school repairs.

Here’s the challenge I faced.

There were over 10,000 rows of data.  And there was a single column at the end that contained comma-separated tags for each record.  And the most tag-heavy record was furnished with a smidgeon over 200 tags.  My mission was to create a single list of tags, together with their frequencies.

The first step was to get each tag into a separate cell.  Data | Text to Columns allowed me to declare the comma as the separator, and what was a single column became 209 columns of data.  But there wasn’t an easy way to stack these that I could think of.

So I went for a wee.  And suddenly thought of an ideal solution.

The ADDRESS function brings back the A1-style (or even an R1C1-style, if you choose) cell reference.  So =ADDRESS(2,3) will bring back the cell reference of row 2, column 3, i.e. “C2”.

But it brings this back as text.  Which is where the INDIRECT function comes into play.  INDIRECT takes some text and interprets it as a formula.  So =INDIRECT(C2) will give you the contents of C2.

Combine the two, and you have:

=INDIRECT(ADDRESS(2,3))

This will bring back the contents of cell C2.  But if you make the two arguments of the ADDRESS formula themselves into cell references, you have dynamite.

So I created two columns, one to contain the row references of the tags and one containing their column references.  The rows data contained 209 1s followed by 209 2s, 209 3s etc. until we hit 10,000.  The columns data contained the numbers 1 through 209, repeated 10,000 times.

2,144,967 rows of data (to be precise) were too much for Excel 2007, so I had to break it into three chunks.  A minor inconvenience.

Applying the above formula to each of those row/column combinations brought back every single tag in a list, with zeros where there was no tag.  (I.e. a record with five tags would have those listed followed by 204 zeros.  Not the most efficient mechanism maybe, but it worked.)

The three tag-lists were sorted, the zeros removed and the resulting lists were stacked upon one another.  A simple PivotTable then gave me the frequency count for each tag.  Over 25,000 tags in total (the zeros having been removed), made up of over 6,000 unique values.

One satisfied customer.  And one satisfied analyst.  Happy days indeed.

Excel-based site map generator

A good six or seven years back, I developed an Excel spreadsheet that took a hierarchical data feed of pages (page name, level in the IA, page type) and generated a visual representation for you on the fly. A crop of the visual is shown below.

It used conditional formatting up the ying-yang, with particular complexities around developing the interconnects between the boxes. While there were some macros to re-hash the incoming data, the visuals were all based on conditional formats which were at the time limited to a maximum of three (plus the default format) per cell.

All in all, it was awesome. (If I may say so.) If anyone out there has a use for it, let me know.

Sitemap

Distances as the crow flies: the solution

Yesterday I posted about joyful Tuesday, a day I spent lost in Excel, data and maps.  And as requested by @lesteph, here’s the solution I developed.  I’ve sanitised it, but the principles still stand.

The spreadsheet takes six places: Edinburgh, Bristol, Cambridge, London, Cardiff and Halifax.  It shows their latitude and longitude and uses the following formula to calculate the distance in kilometres to one decimal place between each pairing.

=ROUNDDOWN(6371.0072*ACOS(COS(RADIANS(90-lat_from))*COS(RADIANS(90-lat_to))+SIN(RADIANS(90-lat_from))*SIN(RADIANS(90-lat_to))*COS(RADIANS(long_from-long_to))),1)

lat_from is the latitude of the “from” location, and you can guess the other three such references.  The references in the spreadsheet itself are cell references to allow easy copy/paste action.

Conditional formatting hides the lower half of the square, and emphasises those places that are closest together.

I made the schoolboy error of hardcoding the earth’s radius (6,371.0072km), meaning that the model does not scale to other planets or moons in our solar system.  Apologies.

Mapping on a grand scale: poetry

I had a riot yesterday.  At work.  I was truly in my element.

I was given a task to map the 300+ offices that my client had across the UK.  And from that to identify clusters.  Although they relied on the same underlying data, I treated these two tasks largely in isolation to one another.  After all, identifying clusters from a visual map is rather an inexact science.

The mapping was sublime.  I used a Google Spreadsheet to convert postcodes to latitude and longitude coordinates (using the advice I’d previously documented here), used this to generate a KML file (through some fancy Excel formulae) which I could then import into Google Maps to create a visual representation of the locations.  Awesome.

Then on to the distances.  I’d had a vision the previous evening that I wanted to create one of those triangles that you used to see in the back of road atlases, across which you could trace your finger to figure out the distance from Plymouth to Edinburgh, for example.  So I went about creating one.

I made column headings out of the locations in Excel, transposing the data to duplicate them as row headings, together with their respective latitude and longitude coordinates.  I then used trigonometry (with the earth’s radius as a constant) to find the distance between each pair of locations (as the crow flies rather than driving distance).  A lovely piece of conditional formatting (if row number is greater than column number) allowed me to hide all the lower half of the triangle—a number format of ;;; hides a cell’s contents without affecting the value of the cell.

I was lost in Excel and Google Maps for an entire morning.  The subject matter had me gripped.  The quality of the output was top drawer; and the importance of the messages therein was right up there.  I love days like that.

Excel is like life

Life is complex.  Problems are complex.  And it’s rare that an intricate problem can be solved by a simple solution.  Or indeed that a single solution is the only one.  Excel is a good analogy here.

It has rows and columns and it was invented to organise data.

Yet in its 2007 version, it comes with eight standard ribbons.  (Further ribbons present themselves in specific situations.)  The Home ribbon alone has 42 separate items within it.  Twenty of these have dropdowns from which further options can be selected.  A very conservative estimate would be that an average of five sub-options are available for each of these.  If the other seven ribbons are similar, then that’s 976 options, and that doesn’t account for the plethora of formulae that can be written in each of the cells, and the canvas of colours available.  (As a slight aside, that would allow 16 trillion possible actions across the cells of a single worksheet.)

Admittedly, Excel was not designed to solve a single problem.  But the arsenal of tools available merely highlights the huge variety of ways in which problems can be addressed.  If you give ten people the same problem and ask them to solve it in Excel, each one will address it in a different way, sometimes subtly different, sometimes wildly different.  If you don’t tell them which tool to use, then your range of solutions will widen further.

Killer Sudoku Pro: spreadsheet helper

My dad called yesterday, leaving a voicemail informing me that he had a technical problem.  I returned the call, expecting to have to solve some form of driver compatibility issue or the like, but instead he gave me the following conundrum.

There is a variant of Sudoku called Killer Sudoku, in which areas of the grid, either square, rectangular or some other irregular shape, are caged off, a small number in the corner indicating the value to which the numbers therein must sum.  My dad informed me that there is a further variant (Killer Sudoku Pro) in which these cages are further marked to indicate whether the number in the corner represents the sum of the numbers therein, their product, or their subtraction.  (For subtraction, this is taken as the highest number minus all other numbers.)  There is also a divide operator, saved for the cages two squares in size—this equates to the higher number divided by the lower number. He asked whether there was a way to understand what options were valid in cages three-squares in size (squares are considered sufficiently basic for the human brain to fathom immediately) for each of the three valid operators.

I turned to Excel.  (No shit, Sherlock.)  And here is the result.  It’s a thing of beauty, if I may say so, the most poetic moment being when I established that the result of the minus operator could be established through the following formula:

(2 * MAX(cells)) – SUM(cells)

Hope you enjoy playing with it as much as I did creating it.

Tubewhacking

Paul Clarke introduced me today to the pastime of Tubewhacking.  Similar to Googlewhacking, it involves finding an English word none of whose letters appears in the name of a Tube station, and for that station to be unique in that quality.  The most famous example I know is St. John’s Wood, none of the letters of the word mackerel appearing in its name, a claim that no other station can make.

I wondered whether any stations were themselves Tubewhacks of other stations.  So I got to work.

Fortunately, the number of columns in Excel has increased recently—I needed 7,616 columns to complete my logic, along with a tidy 11.8Mb.  And below is a summary of the results.

There are 59 stations that have Tubewhacks, although their Tubewhacks come from only 22 unique stations.  Bank is the most common, being the Tubewhack of a whopping (not Wapping) nine stations. Each of Vauxhall and Woodford accounts for seven Tubewhacks.

Below is the full list—station on the left, Tubewhack on the right.

Barbican: Temple
Becontree: Vauxhall
Bermondsey: Vauxhall
Bond Street: Vauxhall
Boston Manor: Chigwell
Brent Cross: Vauxhall
Burnt Oak: Chigwell
Camden Town: Ruislip
Canary Wharf: Temple
Canning Town: Shepherd’s Bush
Charing Cross: Temple
Chorleywood: Bank
Colliers Wood: Bank
Dagenham Heathway: Ruislip
East Putney: Woodford
Eastcote: Kilburn
Edgware: Pimlico
Elephant & Castle: Woodford
Elm Park: St. John’s Wood
Farringdon: Temple
Fulham Broadway: Epping
Gants Hill: Woodford
Goldhawk Road: Upney
Green Park: Dollis Hill
Hainault: Woodford
Hampstead: Kilburn
Highgate: Woodford
Holloway Road: Epping
Ickenham: Woodford
Kew Gardens: Pimlico
Knightsbridge: Oval
Leyton: Chiswick Park
Limehouse: Bank
Liverpool Street: Bank
Mill Hill East: Woodford
Mornington Crescent: Vauxhall
Newbury Park: Dollis Hill
Perivale: St. John’s Wood
Piccadilly Circus: Kenton
Plaistow: Debden
Putney Bridge: Oval
Queensway: Pimlico
Richmond: St. Pauls
Rotherhithe: Bank
Royal Victoria: Debden
Shadwell: Brixton
Shoreditch: Bank
Southfields: Bank
St. James’s Park: Hillingdon
Tooting Bec: Vauxhall
Tower Hill: Bank
Upminster Bridge: Oval
Upton Park: Chigwell
Warren Street: Pimlico
West Brompton: Vauxhall
West Finchley: Moor Park
West Ham: Kilburn
West Hampstead: Kilburn
West Ruislip: Bank

Liverpool Street and Bank form the only pairing in the above list that are one stop away from one another.

Next Page →