Standards

The Core CSVW Specification

The CSVW specification is deep and builds on multiple standards, so it can be overwhelming to read from start to finish. This page provides an overview to help you find the parts most relevant to your interests.

The CSVW syntax specification describes a data model for tabular data. That is to say, it defines that a table is a collection of cells arranged into columns and rows. It also describes how to annotate a table with a metadata file to help processors parse and interpret the contents.

The schema for the metadata file is provided by the CSVW metadata vocabulary. The vocabulary defines the properties that can be used in an annotation. This includes things like a schema of column descriptions, datatypes and foreign key relations.
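As a rough illustration of what such an annotation can look like, the sketch below builds a small metadata document as a Python dictionary and serialises it to JSON. The file name countries.csv and its two columns are hypothetical; the property names (@context, url, tableSchema, columns, name, titles, datatype, primaryKey) come from the CSVW metadata vocabulary. Foreign keys are left out to keep the example short.

```python
import json

# Hypothetical table: countries.csv with the columns "code" and "population".
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "countries.csv",
    "tableSchema": {
        "columns": [
            {"name": "code", "titles": "Country code", "datatype": "string"},
            {"name": "population", "titles": "Population", "datatype": "integer"},
        ],
        # The primary key refers to a column by its "name".
        "primaryKey": "code",
    },
}

# Serialise the annotation so it can be saved alongside the CSV file.
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

By convention a file like this would typically be saved next to the table as countries.csv-metadata.json so that processors can discover it.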

CSVW also defines how a table may be translated into RDF via the csv2rdf procedure or into JSON via the csv2json procedure.

Building on a deep stack of standards

The specification depends upon a range of other technical standards which may be of interest themselves.

CSV Dialect provides a standard for describing what “kind” of CSV file you have - i.e. the terminator strings, quoting rules, escape rules, etc. This allows you to use tab-separated instead of comma-separated values, for example, or to deal with the often invisible differences between applications and operating systems (like encoding or line endings). The CSVW syntax specification provides some sensible Best Practice recommendations for standardising your CSV serialisation (e.g. use UTF-8 encoding and CRLF line endings).
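To make the idea of a dialect concrete, here is a minimal sketch that maps a few dialect properties onto Python's built-in csv module. The property names delimiter, quoteChar, doubleQuote and header are taken from the CSVW dialect description; the read_rows helper and the sample data are hypothetical, and a real processor would honour many more properties (encoding, lineTerminators, skipRows and so on).

```python
import csv
import io

# Hypothetical dialect description for a tab-separated file.
dialect = {
    "delimiter": "\t",
    "quoteChar": '"',
    "doubleQuote": True,
    "header": True,
}

def read_rows(text, dialect):
    """Parse text with csv.reader configured from a CSVW-style dialect dict."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=dialect.get("delimiter", ","),
        quotechar=dialect.get("quoteChar", '"'),
        doublequote=dialect.get("doubleQuote", True),
    )
    rows = list(reader)
    # Split off the header row when the dialect says there is one.
    if dialect.get("header", True) and rows:
        return rows[0], rows[1:]
    return [], rows

# Tab-separated sample with CRLF line endings and a doubled quote character.
sample = 'code\tname\r\nGB\t"United ""Kingdom"""\r\n'
header, rows = read_rows(sample, dialect)
print(header, rows)
```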

JSON-LD is a serialisation for linked-data in JSON which is used for the CSVW metadata file. This standard gives JSON documents an @context property which defines how to get from keys and values to URIs through processes called “expansion” and “compaction”. In CSVW the context is fixed for compatibility with non-JSON-LD-aware processors.

XML Schema Datatypes lists the various datatypes such as string, decimal, and datetime. The Datatypes section of the CSVW syntax specification builds upon this to allow annotated datatypes that might restrict the range of a number or provide format strings for parsing dates.
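The following sketch shows how a range restriction on a numeric datatype might be checked. The base, minimum and maximum properties are part of the CSVW datatype description; the check_cell function is a hypothetical helper written for this example, and it ignores format strings and every base type other than decimal.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical annotated datatype: a decimal restricted to the range 0..100.
datatype = {"base": "decimal", "minimum": 0, "maximum": 100}

def check_cell(value, datatype):
    """Return True if the cell string parses as a finite decimal within range."""
    try:
        number = Decimal(value)
    except InvalidOperation:
        return False
    if not number.is_finite():
        return False
    minimum = datatype.get("minimum")
    maximum = datatype.get("maximum")
    if minimum is not None and number < Decimal(str(minimum)):
        return False
    if maximum is not None and number > Decimal(str(maximum)):
        return False
    return True

print(check_cell("42.5", datatype))
print(check_cell("101", datatype))
```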

Compact URIs is a standard which describes how to make URIs shorter by using prefixes (i.e. if I tell you dcterms means “http://purl.org/dc/terms/” then I can write dcterms:title or dcterms:publisher without having to spell the whole URI out each time). Compact URIs are much easier for humans to read!
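The prefix substitution described above can be sketched in a few lines. The PREFIXES table and the expand_curie function are hypothetical names chosen for this example; only the dcterms mapping comes from the text.

```python
# Prefix table; only the dcterms entry is taken from the text above.
PREFIXES = {"dcterms": "http://purl.org/dc/terms/"}

def expand_curie(curie, prefixes=PREFIXES):
    """Expand prefix:reference into a full URI when the prefix is known."""
    prefix, separator, reference = curie.partition(":")
    if separator and prefix in prefixes:
        return prefixes[prefix] + reference
    # Unknown prefix (or no colon at all): return the input unchanged.
    return curie

print(expand_curie("dcterms:title"))
print(expand_curie("dcterms:publisher"))
```

Values with an unrecognised prefix, including full URIs such as those beginning http:, pass through untouched in this sketch.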

URI Templates can be used to create URIs by interpolating variables from table cells. A template like http://example.net/thing/{id} when combined with the cell value “123456”, for example, will expand to the URI http://example.net/thing/123456. This is particularly useful for aboutUrl or valueUrl properties.
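The expansion above can be approximated with a small function. This is only a sketch of the simplest form of template, a bare {name} variable, and does not implement the operators defined by the full URI Template standard. The expand_template name is hypothetical.

```python
import re
from urllib.parse import quote

def expand_template(template, variables):
    """Replace each {name} with the percent-encoded value of that variable."""
    def substitute(match):
        value = variables.get(match.group(1))
        if value is None:
            # Undefined variables expand to nothing.
            return ""
        # safe="" percent-encodes reserved characters such as "/" as well.
        return quote(str(value), safe="")
    return re.sub(r"\{(\w+)\}", substitute, template)

print(expand_template("http://example.net/thing/{id}", {"id": "123456"}))
```

In a CSVW context the variables would usually be the cell values of the current row, keyed by column name.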

CSV Fragment Identifiers make it possible to refer to parts of a CSV file identified by row, column, or cell e.g. http://example.net/data.csv#row=5-7. These are used by the default metadata to give any CSV file a linked-data translation without requiring any configuration.
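As an illustration, the sketch below pulls the fragment out of a URI and parses the row and column forms. The parse_fragment function is a hypothetical helper; it deliberately ignores cell selections and multiple selections, which the fragment identifier standard also allows.

```python
from urllib.parse import urldefrag

def parse_fragment(fragment):
    """Parse 'row=5', 'row=5-7' or 'row=5-*' (likewise for 'col').

    Returns (kind, first, last); last is None for an open-ended range.
    """
    kind, _, spec = fragment.partition("=")
    if kind not in ("row", "col"):
        raise ValueError(f"unsupported fragment: {fragment!r}")
    start, separator, end = spec.partition("-")
    first = int(start)
    if not separator:
        last = first          # a single row or column
    elif end == "*":
        last = None           # open-ended: to the end of the file
    else:
        last = int(end)
    return kind, first, last

# Split the example URI from the text into its base and fragment parts.
base, fragment = urldefrag("http://example.net/data.csv#row=5-7")
print(base, parse_fragment(fragment))
```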

Vocabularies for interoperability

CSVW suggests a set of ontologies you can adopt to integrate with other data on the web.

The CSVW namespace defines a set of ontologies that may be used in an annotation in their prefixed form (i.e. it provides a JSON-LD @context). This means that instead of writing:

{"http://purl.org/dc/terms/title": "My Great Table Group"}
you’re able to write:
{"dcterms:title": "My Great Table Group"}
which is much more legible.

These ontologies aren’t required in an annotation, but the fact that the authors chose to include them in the CSVW Namespace implies a sort of tacit recommendation. Indeed, these ontologies are a great place to find useful properties for annotating your tables.

  • Data Catalog Vocabulary (DCAT) defines Catalogs and Datasets. It also makes heavy use of Dublin Core Terms.
  • Dublin Core Terms (DCTERMS) defines fundamental dataset metadata terms like title, modified date, publisher and license.
  • RDF Data Cube (QB) provides a vocabulary for describing multi-dimensional statistical data, building upon the SDMX standard. This defines things like observations, dimensions, and measures.
  • Simple Knowledge Organisation System (SKOS) can be used to express concept schemes such as “thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary”.
  • Good Relations (gr) is designed for e-commerce and defines things like Business Entities, Products or Services and Price Specifications.
  • PROV lets you explain the provenance of a dataset with reference to Entities and Activities.

This is just a shortlist; consult the namespace document itself for the full details.

If you’re looking for URIs to use in your annotations, you might also like to try searching with the Linked Open Vocabularies project or browsing the prefix.cc namespace lookup (just bear in mind that if the prefixes aren’t used in the CSVW namespace then you’ll need to spell out the URIs in full).