We’ve discussed why to use CSVW, but how can you do that?
Let’s annotate a csv table with csvw metadata to accurately describe it’s contents. The annotation will improve the machine-readability of the dataset and connect it to a wider knowledge graph spread across the web.
We’ll start with something simple.
Data Mill North publishes a list of Grit Bins on behalf of Leeds City Council.
The first five lines of the csv looks like this:
On it’s own, this is of course a little tricky to interpret.
Fortunately the page we got the file from tells us that the columns are:
location as an identifier for the spot where the bin is placed - it serves to distinguish the location, not to locate it (i.e. on a map). The second two columns locate these location with Ordnance Survey National Grid coordinates.
We can use the csvw standard to make a machine-readable definition of how to interpret the csv. This is done with a json file.
Here’s how we would give the columns names:
We’ve also specified here that the data table doesn’t have column names specified in a header row.
We can then add simple datatypes:
We can also explain that the second and third are grid references.
We could describe the datatype of these columns as being in metres, which are the units for eastings and northings (and yes the data does seem rather more precise than it needs to be). This wouldn’t be terribly helpful as we’re not really talking about distance per se, but a 2-dimensional distance from the National Grid’s origin point (
0, 0 is somewhere south-west of the Isles of Scilly). Indeed it’s only possible to interpret those distances when you know the origin and the projection used to get from the curved surface of the globe to a flat 2D map.
Fortunately, we don’t need to worry about transforming between coordinate systems or projections. We can just say that we’re using the Ordnance Survey system by using the identifiers that they provide for this purpose.
We can find those identifiers with the Linked Open Vocabularies search engine. When we search for “easting”, the Ordnance Survey’s Spatial Relations Ontology is the top hit. If we browse to the ontology, we can see that it also declare the “northing” property.
We declare them by setting the “propertyUrl” for each column.
Now applications that know what to do with eastings and northings can interpret the values correctly.
We also want to identify the entities in our data - the grit bins themselves. This lets other uses refer to them unambigiously.
We can tell the data user that the first column serves as an identifier to distinguish each row.
We can do that by saying that each row is about the location - or what csvw calls “aboutUrl”. We don’t just want to set this on the location column, as it applies to all 3 columns. Instead we set it for the whole table schema. We also want to include the value from the location column in our URIs. To do this we can use a URI template, this standard lets us describe a URI in terms of some variables.
That is basically all the description we really need to provide for the schema.
Finally, we can connect our annotation to the original csv file. We can do this by including the URL of the table. There’s a few other options described in the standard but typically you’ll want to publish the json alongside the csv like this.
We also need to say that this json file is specified in terms of the csvw standard, we set this using a “@context” key which follows the lead of the json-ld standard (a json serialisation for linked-data).
Our final metadata annotation for the table looks like this:
You may be wondering how you can use this newly annotated table…
Ideally we’d be able to open this up in a mapping tool like QGIS, and have it automatically recognise which columns to use as X and Y coordinates so we could create a map of grit bins.
Sadly we’re not quite at that point yet. This standard is still new and hasn’t been adopted by applications like this yet.
What we do have is a good deal of libraries which implement csvw in various programming languages.
These can be used to validate the compatibility of csv and it’s json annotations. This is useful either to check the annotations are correct, or to validate that a given csv file conforms to a schema.
They can also translate the csv into either json or rdf formats (for example, using a csv2rdf tool). This is what a couple of lines of our csv look like as linked data:
- ★★★★ using identifiers to denote things, so that people can talk about your resources unambigiously
- ★★★★★ linking your data to other data to provide context
We’ve given the bins URIs. Here location 696 get’s the URI
http://opendata.leeds.gov.uk/downloads/gritting/grit_bins.csv#696. People can talk about this anywhere on the web and know exactly what we’re talking about.
We’ve also adopted URIs from other vocabularies both the Ordnance Survey’s Spatial Relations Ontology and the venerable XSD Datatypes, linking this csv file into the web of data. Any tools that know how to deal with Ordnance Survey points or decimal datatypes will know how to interpret this csv.
This example was nice and simple. It’s almost certain that the data you’re working with is considerably more complex! It mightn’t be immediately obvious how to apply this lesson to your data. Which vocabularies should you use? How should you combine them? How should you structure the table? We’ll provide answers to these questions in our next guide in this series, with further examples of how to publish csvw.