An aligned vision for CSVW and Linked Data

NOTE: you may want to read "What is the annotated table model?" first.

One of CSVW's stated use cases was to provide annotations on tabular data, in particular data in the form of a CSV file.

We believe annotations on tables are an important feature of CSVW which we want to surface to publishers and users alike. The reasoning here is that both publishers and consumers are familiar with tables, so tables and their components (e.g. columns, rows, cells) are a natural and familiar place to add extra metadata annotations.

We believe the components of a table provide a familiar structure which users can arbitrarily extend and annotate. In practice these extensions mean annotating various locations within the table (subjects) with pairs of predicates and objects. In this way CSVW provides a more familiar on-ramp to the world of linked data.

Similarly, we'd like these tables (datasets) to actually be linked data; that is, they should be identified by their locations on the web. Visiting a "table" would then mean dereferencing the table into an appropriate representation, for example an HTML interface to the annotated table model or a text/csv representation of the data, depending on content negotiation.

In this model, there would be one table with many representations, and those representations would be isomorphic with each other. For example, a row in the CSV would be the same row in the annotated table model, and that row would contain substantially the same observation as the RDF cube. Representations would be different perspectives on the same thing, rather than representing different things entirely.

What we want to happen

Before we look at the problems, we should first define what outcome we would like from integrating CSVW with linked data.

Below is a prototype UI for a CSVW interface. It may not be apparent from looking at it, but this UI demonstrates some interesting properties that attempt to harmonise CSVW and linked data.

Firstly, we anticipate that the dataset would be identified by a format-independent @id; that is, the dataset's URI would literally be the same as the CSV's @id, and visiting it in a browser, as shown below, would return an HTML representation of the data:

[Image: CSVW Preview]

Logically for this to occur, the URI of the resource would be independent of the CSV file itself, and the @id would identify the annotated csvw:Table, which would be an abstraction over the CSVW.

The URI </data/life-expectancy> would then essentially provide a uniform interface to the resource and the representation(s) people want. For example, a request for application/csvm+json would be directed to the metadata file and a request for text/csv to the CSV, whilst text/html or an RDF serialisation such as application/n-triples would combine the two documents to yield the expected representation.
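The routing described above can be sketched as a small content negotiation function. This is only an illustration under stated assumptions: the `REPRESENTATIONS` table, the `negotiate` and `parse_accept` helpers, and the choice of text/html as the fallback are all hypothetical, and wildcard media ranges such as `*/*` are simply treated as "use the default".

```python
# Hypothetical mapping from a requested media type to what would be served at
# the format-independent dataset URI. "combined" stands for a representation
# built from both the CSV and its metadata file.
REPRESENTATIONS = {
    "application/csvm+json": "metadata",
    "text/csv": "csv",
    "text/html": "combined",
    "application/n-triples": "combined",
}


def parse_accept(header):
    """Return the media types in an Accept header, highest q-value first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q <= 0:
            continue  # q=0 means "not acceptable"
        # Sort by descending q, then by original position in the header.
        entries.append((-q, position, media_type))
    return [media_type for _, _, media_type in sorted(entries)]


def negotiate(accept_header, default="text/html"):
    """Pick the first acceptable media type we can serve, else the default."""
    for media_type in parse_accept(accept_header):
        if media_type in REPRESENTATIONS:
            return media_type, REPRESENTATIONS[media_type]
    return default, REPRESENTATIONS[default]
```

For instance, `negotiate("text/csv")` selects the CSV, while a browser-style header that lists text/html first selects the combined HTML view.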

Similarly, it would be extremely desirable for all of the URIs to align appropriately, such that dereferencing an observation by its aboutUrl would return an appropriate representation of it in the context of the table:

[Image: Row dereferencing]
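As a rough sketch of how aboutUrl alignment could support row dereferencing, the example below expands a template for every row and uses the result as a lookup key. The template, column names, and cell values are made up for illustration, and `expand` only handles simple `{column}` variables rather than the full URI Template syntax.

```python
import re
from urllib.parse import quote

# Hypothetical aboutUrl template and rows; the column names and values are
# made up for illustration and are not real statistics.
ABOUT_URL = "/data/life-expectancy/{area}/{period}"
ROWS = [
    {"area": "area-a", "period": "2019", "value": "80.0"},
    {"area": "area-a", "period": "2020", "value": "80.5"},
]


def expand(template, row):
    """Fill each {column} variable with the percent-encoded cell value."""
    return re.sub(
        r"\{(\w+)\}",
        lambda match: quote(str(row[match.group(1)]), safe=""),
        template,
    )


def build_index(template, rows):
    """Map each row's expanded aboutUrl to (row number, row)."""
    return {
        expand(template, row): (number, row)
        for number, row in enumerate(rows, start=1)
    }


INDEX = build_index(ABOUT_URL, ROWS)


def dereference(uri):
    """Return (row number, row) for the row identified by uri, or None."""
    return INDEX.get(uri)
```

Because the row number is kept alongside the row, a UI built on this could render the observation in place within its table rather than as an isolated resource.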

The above feels highly intuitive, and brings the combined benefits of linked data and CSVW to more typical data users. However it's worth noting that to make this work, we need to unpick some subtle issues in the specification and clarify our terms of engagement such that this can occur.

In particular, it's worth noting that there are substantial differences in CSVW between the RDFization of data and the representation of the source data in the CSV. In CSVW the RDF outputs of csv2rdf are not typically thought of as tabular; they belong to the world of graphs rather than tables. However, in the case of CSVW we'd like to present the derived graph in terms of the table.

This means that in this view csvw:Column definitions in the csvw:TableSchema are used as a lens through which we can view the projected RDF graph. Is a csvw:Column the same as an rdf:Property? No, at least not always, but they are in some cases so closely linked that for practical purposes it is worth treating them as highly related, and in the case of dataset specific properties they could for brevity share the same @id and be maintained in the same place.

The UI could for example incorporate affordances for accessing metadata on the columns themselves:

[Image: Column metadata]

In particular, exposing annotations on the csvw:TableSchema and csvw:Columns gives us structural locations for publishers and users to access knowledge in the DSD (data structure definition). For CSVW cubes a tableSchema could share the same @id as the cube's DSD. Exploiting this would benefit maintenance and understanding, and minimise the need to develop whole new features to handle artificially distinct structures.

I'd like to encourage the view that the same csvw:TableSchema is the most useful lens through which to view the input CSV, the output RDF, and arbitrary internal stages of processing (such as viewing validation errors as annotations on what is substantially the same table). Having publishers and users alike work with the same homoiconic representation is highly beneficial to understanding, and lets users leverage all representations simultaneously as extensions within the same model.

In order to do this, and to use the csvw:TableSchema as a lens for viewing the RDF output, there is one small complication, which is that a CSV row may itself yield multiple subjects (aboutUrls). This can be solved by mapcat/flatMapping over the outputs to remove the layer of nesting that results. This may in some circumstances result in one input row becoming several output rows, and may increase the likelihood that some columns contain null values.
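The flattening step can be sketched as follows. This is a minimal illustration, assuming each column declares the aboutUrl template of the subject it describes; the `COLUMNS` mapping, the templates, and the helper names are hypothetical, and `str.format` stands in for proper URI template expansion.

```python
from itertools import chain

# Hypothetical schema: each column is paired with the aboutUrl template of
# the subject it describes. Two distinct templates mean two subjects per row.
COLUMNS = {
    "area_code": "/area/{area_code}",
    "area_name": "/area/{area_code}",
    "period": "/obs/{area_code}/{period}",
    "value": "/obs/{area_code}/{period}",
}


def split_row(row, columns):
    """Return one output row per distinct subject found in the input row."""
    by_subject = {}
    for name, template in columns.items():
        subject = template.format(**row)
        # Start each output row with every column set to None, so columns
        # that describe a different subject stay null.
        out = by_subject.setdefault(
            subject, {"@id": subject, **dict.fromkeys(columns)}
        )
        out[name] = row[name]
    return list(by_subject.values())


def flat_map(fn, items):
    """Apply fn to each item and concatenate the resulting lists (mapcat)."""
    return list(chain.from_iterable(fn(item) for item in items))


def project(rows, columns=COLUMNS):
    """Flatten input rows into one output row per subject."""
    return flat_map(lambda row: split_row(row, columns), rows)
```

With this schema a single input row becomes two output rows, one for the area and one for the observation, each with nulls in the columns that belong to the other subject.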

It's also worth noting that typically for statistical data cubes we would not expect Tidy data (essentially 3rd normal form) to have multiple subjects, as that would typically imply a level of denormalisation.

A prototype

A prototype showing some of these ideas and others is available at:

https://platform-prototype.vercel.app/