Distributed editorial for data.gov

At GovCamp on Saturday, Hadley Beeman explored the concept of reward when it came to editing government datasets.

Discussion soon turned to the Wikipedia model, and this fascinated me—the idea that people out there in the street might contribute to enhancing datasets without a tangible reward.

Firstly, is a distributed editorial model possible? I don’t see why not.

Wikipedia has evolved with rules about grammar, content, structure within pages; and about relevance across pages. The former set is, as I understand it, largely self-policing. I’ve switched many a hyphen to an en dash in my time, and clarified grammar where needed. I once added a sentence detailing Hazel Blears’ height (4’10”) in the Trivia section of her entry. Someone added its metric equivalent (147cm) soon thereafter, before the entire sentence was removed about six months later.

Page inventory is more closely monitored and managed. And necessarily so. If Wikipedia were full of half-baked articles about everything and anything, then it would not be half as successful as it is. Hence my page containing a half-baked list of UK government web domains was pulled, albeit quite some months after its original submission. (Quite how Wikipedia’s original set of articles was defined, I’m not sure.)

In the data world, the grammar and structure rules can be replaced by field definitions, file formats and the like. But my feeling is that relevance is a huge issue, one that needs centralised coordination. There is so much data in government. And I feel that it would be wholly wrong to release all of the non–protectively marked data to the public for analysis. Not because it would be dangerous, but because it would cause chaos and confusion. Data requests should be advertised, and responses to such requests should be sponsored by the departments or authorities best placed to source them. But those requests should be limited.
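To make that first point concrete, here is a minimal sketch of how “field definitions and file formats” could play the role that grammar and style rules play on Wikipedia: a contributed row either passes the agreed rules or gets flagged for correction. The dataset, field names and rules below are invented for illustration, not a real data.gov.uk schema.

```python
# A minimal sketch of field definitions acting as editorial rules.
# The fields and constraints are hypothetical, not a real data.gov.uk schema.
import csv
import io
from datetime import datetime

# Editorial rules for a (hypothetical) spending dataset.
FIELD_DEFINITIONS = {
    "authority_code": lambda v: v.isalnum() and len(v) == 9,            # nine-character code
    "spend_gbp":      lambda v: float(v) >= 0,                          # non-negative amount
    "payment_date":   lambda v: bool(datetime.strptime(v, "%Y-%m-%d")), # ISO 8601 date
}

def validate_row(row: dict) -> list[str]:
    """Return a list of problems with a contributed row; empty if it passes."""
    problems = []
    for field, check in FIELD_DEFINITIONS.items():
        value = row.get(field, "")
        try:
            if not check(value):
                problems.append(f"{field}: {value!r} fails the rule")
        except ValueError:
            problems.append(f"{field}: {value!r} is malformed")
    return problems

# A contributed CSV snippet; the second row would be flagged for correction.
sample = io.StringIO(
    "authority_code,spend_gbp,payment_date\n"
    "E09000001,1250.00,2010-01-23\n"
    "E090000,-50,23/01/2010\n"
)
for row in csv.DictReader(sample):
    print(validate_row(row))
```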

Without this central control, data.gov.uk will become akin to the internet, with lots of tosh out there obscuring the rare nuggets. But good, useful data is much more difficult to identify than is useful content, for the moment at least—hence the need for some centralised control.

Once those datasets are defined, I see no reason why their creation and enhancement cannot be outsourced to the community, with well-placed subject-matter experts dotted around for quality control.

Would that work?
