[Discussion] Better aggregate metric for dataset comparison #324

patcon · 2016-11-28T16:56:44Z

Context

The City of Toronto has led the way in Canada with its initial creation of its data portal and its dataset offerings. In recent years, the open data community has felt that there's been some stagnation in the City's open data policies, and in the departmental embrace of the underlying principles. There is renewed debate in Toronto about how the City can do better.

During these conversations, City staff stakeholders (in particular Harvey Low) have repeatedly expressed frustration at the metric with which the community shallowly compares progress between cities -- often via dataset counts. They've rightfully brought up that dataset organization greatly colours any comparison. For example, it's frequently mentioned that the City of Toronto packages city-wide data together as a single dataset, whereas NYC often releases borough-specific datasets.

To be clear, the community critique of City of Toronto open data policy is more nuanced than criticism of the dataset count (e.g. the value of datasets to citizens, rather than numerical criteria). But the city staff definitely have a point: having dataset count as the only overall metric for comparing cities does a disservice to the conversation.

It would be great to use the Data Package Spec as a launch point to discuss better metrics, so that this criticism can be accounted for when comparing open data policy between cities.

Solution

I feel the following would work to resolve the above concerns for Tabular Data Packages:

Add a boolean dataColumn property to mark specific columns as containing countable data.
Add an integer dataPointCount property to resource metadata (and perhaps summed in overall data package metadata).

Since the columns that contain significantly countable data are labelled as such, we can easily script the generation of the data point count. At the portal level, we could then have a much better basis of comparison both within cities (i.e. city departments, districts, stewards, etc.) and between cities themselves.
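
As a rough sketch of what such a script could look like: the snippet below reads a descriptor whose fields carry the proposed dataColumn flag, counts the cells in those columns, and writes the result to the proposed dataPointCount property. Neither property exists in the published spec, the sample descriptor and CSV are made up for illustration, and skipping empty cells is just one assumption about what should count as a data point.

```python
import csv
import io

# Hypothetical resource descriptor. `dataColumn` and `dataPointCount` are the
# properties proposed in this issue; they are NOT part of the published spec.
resource = {
    "name": "bike-counts",
    "schema": {
        "fields": [
            {"name": "station", "type": "string"},
            {"name": "date", "type": "date"},
            {"name": "riders", "type": "integer", "dataColumn": True},
            {"name": "temperature", "type": "number", "dataColumn": True},
        ]
    },
}

# Made-up sample data; in practice this would be the resource's CSV file.
CSV_TEXT = """station,date,riders,temperature
A,2016-11-01,120,8.5
A,2016-11-02,,9.0
B,2016-11-01,75,8.5
"""


def count_data_points(resource, rows):
    """Count non-empty cells in the columns flagged with dataColumn."""
    data_columns = [
        field["name"]
        for field in resource["schema"]["fields"]
        if field.get("dataColumn")
    ]
    count = 0
    for row in rows:
        for name in data_columns:
            value = row.get(name)
            # Assumption: empty or missing cells are not data points.
            if value is not None and value.strip() != "":
                count += 1
    return count


rows = csv.DictReader(io.StringIO(CSV_TEXT))
resource["dataPointCount"] = count_data_points(resource, rows)
print(resource["dataPointCount"])  # 5: six flagged cells, one of them empty
```

Summing dataPointCount across a package's resources would give the package-level number, and a portal could sum those in turn per department or per city.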

Would the above suggestion be something we'd consider adding to the spec? Obviously, I'm interested in further conversation and other ideas :)


pwalsh · 2017-02-05T06:46:18Z

Hey @patcon

We are doing lots of work on data quality tooling and specs, which I know you know as you are using goodtables.

I'm super interested in codifying data points other than the raw count of published datasets, as part of a much wider discussion around open data portals and so on.

In terms of what can be specified in these specs, let's continue this discussion over at #364 and I'll close this for now as a duplicate.

pwalsh closed this as completed Feb 5, 2017

patcon mentioned this issue Feb 7, 2017

Data Summary and Quality description spec #364

Closed

roll added this to Open Knowledge Jun 27, 2023
