Thoughts on data SEO, curation and more from researching a contemporary data question (military support for Ukraine) #1182

rufuspollock · 2024-06-09T21:29:11Z

rufuspollock
Jun 9, 2024
Maintainer

Wanted to see which countries have been providing the most military support for Ukraine.

Did a search for:

military support for ukraine by gdp per capita

Here's the journey i went on and some reflections from that. Quick tl;dr:

Micro-slicing data is a good idea for SEO (and even for UX). And something we've started on in DataHub but could take much further.
Data intermediaries matter: taking data from raw sources (especially analysts and researchers), then curating and presenting it, is a valuable thing to do
That's one major reason DataHub and Data Collective are here

First, reflections on data SEO (and how Statista have mastered this)

First results are specifically tagged as datasets and come from Statista:

Statista seem to have mastered the art of "content-spamming" (? data spamming) search results very effectively. Essentially huge numbers of results, each of which is a small (micro) slice of a larger dataset (? or do they really collect and manage data at this level).

Aside: something to look into, perhaps. And something that we were already doing in a way with the github.com/datasets project (but that aspect of micro-slicing is another level, and something we thought about in 2016-2017, e.g. for a dataset like GDP per year and country we would not have the main page but sub pages for each country and each year and even each country in each year).
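To make the micro-slicing idea a bit more concrete, here is a minimal sketch in Python. It assumes a tidy table of (country, year, value) rows and produces one slice per country, per year and per country-year. The page paths (`/gdp/country/...`), the `micro_slices` helper and the numbers are all made up for illustration; this is not how DataHub or Statista actually do it.

```python
from collections import defaultdict

# Made-up illustrative rows (country, year, value); not real GDP figures.
ROWS = [
    {"country": "uk", "year": 2021, "gdp": 100.0},
    {"country": "uk", "year": 2022, "gdp": 110.0},
    {"country": "fr", "year": 2021, "gdp": 90.0},
    {"country": "fr", "year": 2022, "gdp": 95.0},
]


def micro_slices(rows):
    """Group rows into one slice per country, per year, and per country-year.

    Returns a dict mapping a hypothetical page path to the rows shown on it.
    """
    slices = defaultdict(list)
    for row in rows:
        country, year = row["country"], row["year"]
        slices[f"/gdp/country/{country}"].append(row)
        slices[f"/gdp/year/{year}"].append(row)
        slices[f"/gdp/country/{country}/{year}"].append(row)
    return dict(slices)


if __name__ == "__main__":
    # One line per would-be page: its path and how many rows it contains.
    for path, subset in sorted(micro_slices(ROWS).items()):
        print(path, len(subset))
```

With 2 countries and 2 years this already gives 8 small pages from one 4-row dataset, which is roughly the multiplication effect that makes micro-slicing interesting for search.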

Statista has a nice simple page - just the data i want in visual form

Aside: another thing to look at - this creation of specific small datasets and graphs. This simplicity of data (and presentation) is a direction that DataHub has been going in (to an extent) for a decade now.

The limitations of Statista (IMO)

Once you get to the Statista site from Google, though, everything is locked down. Clicking on almost anything, certainly downloading anything, and even just checking the source of the data leads to this ...

And the irony is that Statista's data is coming from the Kiel academic research institute

It looks like Statista data is actually getting sourced from Kiel (i can't tell for sure as the Statista sources tab is only available with a paid account)

Which was actually the 4th result on that page (and the first non-dataset one)

First page is not the data, it's the analysis

As is usual for a think-tank / research institute.

And a couple of clicks onwards we do get ... yay 🎉

And here's the data

https://www.ifw-kiel.de/fileadmin/Dateiverwaltung/IfW-Publications/fis-import/f34881d0-26f2-4a47-885e-542fe168f9ad-Ukraine_Support_Tracker_Release_17.xlsx

Here's a cached version f34881d0-26f2-4a47-885e-542fe168f9ad-Ukraine_Support_Tracker_Release_17.xlsx

But ... in a "nice" xlsx, disaggregated and large, and not machine readable ...

So it's a 6.7 MB xlsx. Even opening this is a hassle for me: it takes seconds and opens an app i never use (Numbers on Mac), and i don't have a very good app for this (the default Numbers is so-so).

Plus it's basically a "dataset in xlsx" replete with a "README" sheet, "updates and corrections" and data pipeline - there's a raw data sheet and then various computed subsheets derived from that and with additional integrated data e.g. GDP so that we can do aid as a percentage of GDP.

And finally a bunch of tables are messed up with human readable metadata ...

Dataset notes ... in Excel 😉

README .... in Excel 😉

And because

And of course, because this is prepped by analysts not data folks (engineers / wranglers / scientists) we have a nice chunk of human-readable metadata in our table - nicely merged cell over two rows and 4 columns!

Good for humans to read .... not so good for machines. (cf https://rufuspollock.com/2013/11/19/bad-data-real-world-examples-of-how-not-to-do-data/)

And buried in their xlsx there is the figure we wanted

It's sheet 10 afaict. Note how Statista have literally reproduced this on a single page that is SEO-able with better UX (e.g. you can hover on the graph, it's just a graph ...)

Human-readable again ...

Note again how we have human-readable not machine-readable with mixed table and graph. Plus table is offset with title etc ...
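As a rough sketch of what "making it machine-readable" involves for an offset table like this: find the header row wherever it sits in the sheet, then read rows until the first blank one. The code below works on plain lists of cell values (for example what you might get from openpyxl's `iter_rows(values_only=True)`, though reading the file is left out here). The `extract_table` helper, the sample sheet and its column names are made up for illustration and are not the layout of the Kiel file.

```python
def extract_table(rows, header):
    """Pull a tidy table out of sheet rows that have titles/notes around it.

    rows: list of lists of cell values (None for empty cells).
    header: the column names we expect, used to locate the table.
    Returns a list of dicts, one per data row.
    """
    header = list(header)
    for r, row in enumerate(rows):
        cells = list(row)
        # Look for the header as a contiguous run of cells in this row.
        for c in range(len(cells) - len(header) + 1):
            if cells[c:c + len(header)] == header:
                return _read_body(rows[r + 1:], c, header)
    raise ValueError("header row not found")


def _read_body(rows, col, header):
    records = []
    for row in rows:
        values = list(row[col:col + len(header)])
        # Stop at the first blank or short row: what follows is notes or a chart.
        if len(values) < len(header) or all(v is None for v in values):
            break
        records.append(dict(zip(header, values)))
    return records


# Hypothetical sheet: a title, a blank row, an offset table, then a note.
SHEET = [
    ["Figure: some title", None, None, None],
    [None, None, None, None],
    [None, "Country", "Share", None],
    [None, "A", 1.5, None],
    [None, "B", 0.7, None],
    [None, None, None, None],
    [None, "Source: see README sheet", None, None],
]

if __name__ == "__main__":
    print(extract_table(SHEET, ["Country", "Share"]))
```

This only handles the "offset with a title" problem. Merged cells and mixed tables and graphs need more case-by-case wrangling, which is exactly the curation work being argued for.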

Why we need data curators

For me, this is a classic demonstration of the value of data intermediaries who curate / refine / prepare -- like DataHub / Data Collective (or Statista).

Kiel are a research institute. Their job is to do research. Their main output there is shiny PDFs (or shiny HTML if we're lucky). Internally they'll use xlsx or similar and if we're lucky in this "open science" / "open knowledge" day and age they'll dump out their xlsx as they do here. Any graphs will be in the PDF etc.

Their job is not to publish data. It's to publish research.

So it's the job of someone else to take that excellent raw data and make it accessible e.g. graphs and consumable e.g. nice simple CSV.
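For illustration, the "nice simple CSV" step could look something like the sketch below: take aid totals and GDP per country, compute aid as a percentage of GDP, and write a flat two-column CSV sorted by share. The function name `aid_share_csv`, the column names and the numbers are placeholders i've made up to show the shape of the output; they are not figures from the Kiel tracker.

```python
import csv
import io


def aid_share_csv(aid, gdp):
    """Return CSV text with aid as a percentage of GDP, largest share first.

    aid, gdp: dicts mapping country -> amount, in the same currency unit.
    Countries with no (or zero) GDP entry are skipped rather than guessed.
    """
    rows = [
        (country, round(100 * amount / gdp[country], 2))
        for country, amount in aid.items()
        if gdp.get(country)
    ]
    rows.sort(key=lambda r: r[1], reverse=True)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["country", "aid_pct_gdp"])
    writer.writerows(rows)
    return out.getvalue()


# Placeholder numbers purely to show the shape of the output; not real figures.
AID = {"A": 2.0, "B": 5.0, "C": 1.0}
GDP = {"A": 100.0, "B": 1000.0}

if __name__ == "__main__":
    print(aid_share_csv(AID, GDP), end="")
```

Note that country B gives more in absolute terms but ranks below A as a share of GDP, which is the same kind of reordering that shows up when comparing large and small donors.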

Of course, in the long run i hope we get more "data literate" research -- and more research-literate data. But for now, this division of labor makes sense.

And it means there is a big role for DataHub and associated Data Collective.

And in conclusion ... here's a nice dataset with chart on DataHub ...

TODO 😉

And in conclusion ... what about our original question?

Which countries have been providing the most military support for Ukraine?

Answer: based on 2022-2024 it's Estonia at 1.6% of GDP, followed by Denmark.

And the US is actually lower than Germany (and Canada). Overall the Nordics and EU countries in general are the largest contributors based on GDP.

rufuspollock · 2024-06-12T06:28:59Z

rufuspollock
Jun 12, 2024
Maintainer Author

@davidgasquez let me know if you have any thoughts 🙂

2 replies

davidgasquez Jun 14, 2024

Don't have any large thoughts but let me comment on some interesting things you shared!

Aside: something to look into, perhaps. And something that we were already doing in a way with the github.com/datasets project (but that aspect of micro-slicing is another level, and something we thought about in 2016-2017, e.g. for a dataset like GDP per year and country we would not have the main page but sub pages for each country and each year and even each country in each year).

This fits well with the "each dataset is also an API" viewpoint. You can have datasets like GDP or mortgage rates, partitioned by country and, when exposing them, also expose subpages (built at build time like Evidence does) where you can get a /country/uk or /uk/2022 with prepopulated charts, text and metrics.

Their job is not to publish data. It's to publish research. So it's the job of someone else to take that excellent raw data and make it accessible e.g. graphs and consumable e.g. nice simple CSV. And it means there is a big role for DataHub and associated Data Collective.

Love this angle and 100% agree with it! The main challenge I see in this example is that the dataset is very isolated. In this case, say you spent a few hours digging into this and end up publishing the dataset in DataHub. Since it is not tied to a larger community that cares about this specific dataset, it will get outdated.

That doesn't mean there isn't value in the work of opening up the dataset and making the UX smoother though. That's very useful on its own!

Organizations like OWID or PUDL spend lots of person hours doing similar manual work, but each dataset is linked in some way (country, year, indicator in OWID, energy in PUDL).

Not sure if I explained it well enough as I don't have the point fully clear in my head yet. I'd say the TLDR would be that opening smaller datasets like this one is a valuable thing to do, but if they're not connected to a community (with more datasets) and/or automated (up to date-ish) the value will fade quickly and the incentives for folks to work on them won't be strong.

Thanks for sharing the entire process! What do you think about making a small Twitter thread with these findings/open questions? Would love to hear more folks' thoughts.

rufuspollock Jun 18, 2024
Maintainer Author

@davidgasquez

This fits well with the "each dataset is also an API" viewpoint. You can have datasets like GDP or mortgage rates, partitioned by country and, when exposing them, also expose subpages (built at build time like Evidence does) where you can get a /country/uk or /uk/2022 with prepopulated charts, text and metrics.

Exactly 👍

Love this angle and 100% agree with it! The main challenge I see in this example is that the dataset is very isolated. In this case, say you spent a few hours digging into this and end up publishing the dataset in DataHub. Since it is not tied to a larger community that cares about this specific dataset, it will get outdated.

Totally agree - this is a major "social" point that we've discussed and i very much agree with. You want a community of interest if you can.

That doesn't mean there isn't value in the work of opening up the dataset and making the UX smoother though. That's very useful on its own!

Also true ... and e.g. Statista works thanks to pure SEO and random traffic -- without the community of interest.

Thanks for sharing the entire process! What do you think about making a small Twitter thread with these findings/open questions?

Great idea - would love your help 😉
