OpenRefine+Wikidata quick demo #5

Closed
wetneb opened this issue Feb 6, 2017 · 8 comments

Comments

wetneb commented Feb 6, 2017

I have been working on a tool that sounds quite relevant for the event:

https://tools.wmflabs.org/openrefine-wikidata/

It helps align datasets with Wikidata in OpenRefine, a super cool piece of software for dealing with messy data. If you are still looking for lightning talks for the event, I would be happy to give a quick demo of the tool. I'd love it if we could then play with it on some research data (and I'm sure some attendees will know of many interesting datasets).

@Daniel-Mietchen (Collaborator)

Yes, your reconciliation tool looks great and would be a good thing to demo, play with and hack on. We haven't fully figured out how to organize the lightning talks, other than the generic slot for them in the program. We'll update that as things become more concrete.

@Daniel-Mietchen (Collaborator)

I just gave this a try.

I downloaded the results of this SPARQL query

# Find common strings for authors of scientific articles
SELECT * WHERE {
  {
    SELECT ?authorstring (COUNT(?paper) AS ?count) WHERE {
      ?paper wdt:P2093 ?authorstring .
    }
    GROUP BY ?authorstring
  }
  FILTER(?count > 30)
}

and fed them into OpenRefine, which resulted in 351 matching rows. I then converted those to the format of the new author resolver, which gave this list, from which I picked cases to look at in more detail. That led to ca. 1k replacements of P2093 (author name string) statements with the corresponding P50 (author) statements, which looks promising.
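
As a side note (not part of the workflow above), a quick sanity check for such replacements could be a simple count query for a given name string; "J. Doe" below is just a hypothetical placeholder:

# Hypothetical check: how many papers still carry this author name only as a
# plain P2093 string after the P2093 -> P50 replacements.
SELECT (COUNT(?paper) AS ?remaining) WHERE {
  ?paper wdt:P2093 "J. Doe" .
}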

Things that still need attention in this workflow:

  • The transitions from SPARQL to OpenRefine and from there to the author resolver are currently manual, and some information gets lost along the way. For instance, OpenRefine comes up with a good set of potential matches between the P2093 text strings from the query and the item labels of instances of Q5 (human), but the author resolver is not aware of these potential matches and basically repeats the same search (though it struggles with periods in names).

Pinging @magnusmanske

I also tried to tackle another problem by way of a similar pipeline:

# scientific journal (Q5633421) as main subject (P921)
SELECT ?item ?topic ?topicLabel WHERE {
  ?item wdt:P921 ?topic .
  ?topic wdt:P31 wd:Q5633421 . 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?topic

This is where I got stuck. Can the tool be used in any way that would help with replacing those journal items in P921 (main subject) statements with the corresponding items about the actual topics, as per the mapping here?

wetneb (Author) commented Feb 11, 2017

@Daniel-Mietchen Thanks for giving it a spin! But your links to OpenRefine refer to your local instance of the tool, which we cannot access. Could you post screenshots?

Concerning your mapping, I have some ideas to make this work if the mapping is stored on Wikidata (as journal-to-topic statements). I'll add the relevant endpoint and make screenshots to explain how to use it.
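
To illustrate the idea (just a sketch, and the choice of property is an assumption): if each journal item carried, say, a P921 (main subject) statement pointing at its actual topic, the mapping could then be pulled from the query service like this:

# Sketch only: assumes the journal-to-topic mapping is stored as P921 statements
# on the journal items themselves.
SELECT ?journal ?journalLabel ?topic ?topicLabel WHERE {
  ?journal wdt:P31 wd:Q5633421 ;   # instance of: scientific journal
           wdt:P921 ?topic .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}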

@Daniel-Mietchen (Collaborator)

This was my first try with OpenRefine, so I'm still trying to find my way around. Is there no way to make my OpenRefine projects open, perhaps even by default? It would be nice to have them synced with Zenodo or similar for every "release".

In this specific case, though, I don't think it matters too much (and screenshots wouldn't make much of a difference), since I simply took the outputs of both SPARQL queries (in csv format) and imported them into OpenRefine.

@Daniel-Mietchen (Collaborator)

Re lightning talks, we now have #13 to get them organized.

ekoner (Contributor) commented Mar 4, 2017

+1

kshamash commented Mar 4, 2017

Very cool! I'm testing it out on this dataset: https://figshare.com/articles/COAF_Jisc_and_RCUK_APC_data_2013-2015/3462620

wetneb (Author) commented Mar 5, 2017

So, we've done a lot of things on this topic.

So many thanks to all who got involved!
