Why use CSVW?

What’s wrong with CSV?

The CSV format has proven enormously popular in the open data world. Historically, when so much data was being published in PDFs, CSV was a byword for machine-readability.

Its enduring popularity is no doubt helped by the fact that CSV files are near universally accessible. Practically everyone has access to a spreadsheet program - like Microsoft Excel - that can read and write CSV files. Unlike Excel’s proprietary xls format, the humble CSV is also easy to use programmatically - you don’t even really need to use a library.

CSV does have some problems, however.

Although the IETF published a specification - RFC 4180 - for CSV in 2005, there is still a wide range of interpretations to be found in the wild. These different dialects of CSV take different approaches to delimiting and quoting fields and to escaping special characters. That’s not to mention the differences in encoding that plague any text format.
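To make the dialect problem concrete, here is a minimal sketch using Python's built-in csv module. The two snippets of data are invented for illustration: they hold the same record, once with commas and doubled quotes and once with semicolons and backslash escapes. The reader only recovers the same values when it is told which dialect it is looking at.

```python
import csv
import io

# The same record in two hypothetical dialects (invented sample data).
comma_doubled_quotes = 'name,comment\r\n"Smith, J","said ""hi"""\r\n'
semicolon_backslash = 'name;comment\n"Smith, J";"said \\"hi\\""\n'


def parse(text, **dialect):
    """Parse CSV text with the csv module, passing any dialect options through."""
    return list(csv.reader(io.StringIO(text, newline=""), **dialect))


# The csv module's default dialect handles the first form.
rows_a = parse(comma_doubled_quotes)

# The second form needs the delimiter and escaping rules spelled out.
rows_b = parse(semicolon_backslash, delimiter=";", doublequote=False, escapechar="\\")

# Reading the second form with the defaults silently gives the wrong answer:
# the header comes back as a single field containing a semicolon.
rows_wrong = parse(semicolon_backslash)

print(rows_a == rows_b)
print(rows_wrong[0])
```

Nothing in the file itself says which set of options is the right one, which is the gap a dialect description is meant to fill.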

It’s also not possible to say in advance how the fields in a CSV file ought to be interpreted - i.e. their data type. You have to read a few lines of a file and take a guess from these initial rows - e.g. “I’ve not seen any letters in this column, only numbers, and no decimal points, so it’s probably an integer variable”. There are clever tools for doing this guesswork for you, but whenever you’re ingesting a CSV file you will likely need to devote a few lines to interpreting the data. Typically this will be to parse dates or give the columns syntactically-valid variable names.
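As a rough illustration of that clean-up code, the sketch below reads a tiny invented CSV with Python's standard library. The column names, the sample row and the assumption that dates are written day-first are all made up for the example; the point is that every value arrives as a string and the reader has to decide what it means.

```python
import csv
import io
from datetime import datetime

# Invented sample data: a header with a space in it and an ambiguous date.
raw = "area code,date,population\nE09000001,01/02/2021,8600\n"

rows = list(csv.DictReader(io.StringIO(raw)))
# Every cell is a plain string at this point, e.g. rows[0]["population"] == "8600".


def clean(row):
    """Hand-written interpretation: rename columns and cast each value."""
    return {
        "area_code": row["area code"],  # make the name a valid identifier
        # Guess: day comes first. Nothing in the file confirms this.
        "date": datetime.strptime(row["date"], "%d/%m/%Y").date(),
        "population": int(row["population"]),
    }


cleaned = [clean(row) for row in rows]
print(cleaned[0])
```

Every consumer of the file ends up writing some version of clean, and each one may guess differently.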

CSVW solves these problems by adding a metadata annotation to a CSV file that instructs parsers on how it should be interpreted. This specifies a dialect and describes a table schema. The dialect defines how the characters of text should be read. CSVW provides sensible defaults, so if you follow this convention you can effectively forget about this problem! The table schema explains how to interpret the data values in the CSV so that users don’t need to waste any time preparing the data. They get syntactically-valid variable names and cells cast into the appropriate data types.
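The sketch below shows roughly what such an annotation looks like and how a reader could act on it. The metadata uses real CSVW property names (dialect, tableSchema, columns, name, titles, datatype), but the file name, the column names and the data are invented, and the read_table and cast helpers are hypothetical: they handle only the handful of cases in this example and are nowhere near a conformant CSVW processor.

```python
import csv
import io
import json
from datetime import datetime

# A small CSVW-style metadata document (file and column names are invented).
metadata = json.loads("""
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "population.csv",
  "dialect": {"delimiter": ";", "encoding": "utf-8", "header": true},
  "tableSchema": {
    "columns": [
      {"name": "area_code", "titles": "area code", "datatype": "string"},
      {"name": "date", "titles": "date",
       "datatype": {"base": "date", "format": "dd/MM/yyyy"}},
      {"name": "population", "titles": "population", "datatype": "integer"}
    ]
  }
}
""")

# Invented contents for population.csv, using the dialect described above.
csv_text = "area code;date;population\nE09000001;01/02/2021;8600\n"


def cast(value, datatype):
    """Cast one cell; supports only the datatypes used in this example."""
    base = datatype["base"] if isinstance(datatype, dict) else datatype
    if base == "integer":
        return int(value)
    if base == "date":
        fmt = datatype.get("format") if isinstance(datatype, dict) else None
        if fmt != "dd/MM/yyyy":
            raise NotImplementedError("this sketch handles one date format only")
        return datetime.strptime(value, "%d/%m/%Y").date()
    return value  # treat everything else as a string


def read_table(text, meta):
    """Read CSV text using the dialect and schema from the metadata."""
    reader = csv.reader(io.StringIO(text), delimiter=meta["dialect"]["delimiter"])
    if meta["dialect"].get("header", True):
        next(reader)  # skip the header row; names come from the schema
    columns = meta["tableSchema"]["columns"]
    return [
        {col["name"]: cast(cell, col["datatype"]) for col, cell in zip(columns, row)}
        for row in reader
    ]


table = read_table(csv_text, metadata)
print(table[0])
```

The guesses that would otherwise live in each consumer's code (the delimiter, the date format, the integer column) are stated once by the publisher.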

Patching up the problems with CSV is really only the beginning of what’s possible with CSVW. Its real strength lies in putting CSV on the web…

Putting CSV on the Web

CSVW lets you connect your dataset to others on the web, and lets others connect theirs to yours.

Despite these problems, CSV has become the lingua franca of open data, and not without good reason. It’s an incredibly simple way to reach the third star of the 5 star deployment scheme for open data:

★ you can publish csv on the web (with an open license)
★★ csv is a machine-readable table
★★★ csv is non-proprietary

But what about the final 2 stars? On its own, CSV doesn’t really help you to achieve those:

★★★★ using identifiers to denote things, so that people can talk about your resources unambiguously
★★★★★ linking your data to other data to provide context

These are the distinguishing characteristics of linked-open-data. These features enable data from different sources to be connected and queried, forming a web of data. This network is most strikingly depicted in the linked-open-data cloud.

Publishers wishing to contribute to this cloud, or leverage the informative context it can provide for their own data, have to shift their perspective from a tabular view of the world with its rows, columns and cells to one of graphs, nodes and edges.

They have also had to abandon CSV along the way and instead learn a new set of technologies like RDF ontologies and SPARQL queries.

It’s against this backdrop that a W3C working group set out to provide recommendations for working with CSV on the Web. This “CSVW” standard provides a way to resolve the problems with CSV (standardising dialects and expressing types) and to extend it with identifiers to make 5 star linked-data in CSV format. In practical terms, this means associating your CSV file with a JSON document that provides some additional metadata to describe and clarify the content of the CSV table.
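As a sketch of how identifiers fit in, the example below builds a metadata document in Python and serialises it as JSON. The aboutUrl and propertyUrl properties and the "-metadata.json" file naming convention come from the CSVW recommendations; the example.org URLs, the file name and the column names are invented, and the expand helper is a crude stand-in for the URI template expansion a real CSVW processor would perform.

```python
import json

# Metadata linking each row and column to a (made-up) identifier.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "population.csv",
    "tableSchema": {
        # Each row describes the thing identified by this URI template.
        "aboutUrl": "http://example.org/area/{area_code}",
        "columns": [
            {"name": "area_code", "titles": "area code", "datatype": "string"},
            {
                "name": "population",
                "titles": "population",
                "datatype": "integer",
                # The column's meaning is also given an identifier.
                "propertyUrl": "http://example.org/def/population",
            },
        ],
    },
}


def metadata_filename(csv_filename):
    """Conventional name for a metadata file that sits next to its CSV."""
    return csv_filename + "-metadata.json"


def expand(template, row):
    """Simplified template expansion: substitute {column_name} placeholders."""
    for name, value in row.items():
        template = template.replace("{" + name + "}", str(value))
    return template


row = {"area_code": "E09000001", "population": 8600}
subject = expand(metadata["tableSchema"]["aboutUrl"], row)
metadata_json = json.dumps(metadata, indent=2)

print(metadata_filename("population.csv"))
print(subject)
```

The CSV file itself stays exactly as it was; the identifiers live entirely in the accompanying JSON document.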

The UK Government has recently adopted the CSVW standard and recommends that government organisations “Use the CSV on the Web (CSVW) standard to add metadata to describe the contents and structure of comma-separated values (CSV) data files”.

We’re delighted to see the power of linked-data being brought to the venerable CSV format and hope that this will ensure its continued popularity for many years to come.

Now that you know why you should use CSVW, you might like to learn how to make CSVW.