Related work: Wikidata "subsetting" collaborations #1
Comments
Hi @danbri! Thanks for the note. It's lovely to hear that you're a fan, and thanks for sharing it on the wikidata page. We actually created this while working on Bootleg--for much the same reasons you've cited. We frequently wanted to pull triples for a certain entity, or find all entities which had a certain property (e.g. the alias "Lincoln"). Using query.wikidata.org started becoming inefficient. Please let me know if there are any updates or features you think could make this more useful to a larger community!
There's talk of having more meetings around subsetting; I'll let you know if anything comes of it. Oh, if you dig around in https://github.com/google/schemarama/tree/main/kgx you'll find SPARQL queries that pull out some life-science-related pieces of Wikidata (intern work that I should finish open-sourcing!). The goal we were pursuing there was to extract from Wikidata only those entities and relationships corresponding to Figure 1 in https://elifesciences.org/articles/52614 . This shouldn't be rocket science, but it turns out to be fiddly: the data dumps are huge and unwieldy, as you note, and the official SPARQL endpoint is heavily loaded. Some related work, https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits/ , might help there, but it's definitely still fiddly!
Interesting! I'll poke around -- thanks for the pointer!
Hi @neelguha! This is neat. It is close to the concerns of some in the Wikidata community around "subsetting", so I've linked it from the Tools section in https://www.wikidata.org/wiki/Wikidata:WikiProject_Schemas/Subsetting#Tools_and_Data
One of the reasons folk are interested in Wikidata subsets is that it can be too large a dataset to work with comfortably - so pulling out just the bits most relevant to some application is appealing. There's also a concern to encourage offsite usage of the data so that the load on query.wikidata.org remains manageable while the project and datasets grow. In both cases, tools like yours seem relevant, although the problem of characterising what goes in the subset can be tricky.
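To make the "pull out just the relevant bits" idea concrete, here is a minimal sketch of offline subsetting over the standard Wikidata JSON dump layout (one entity per line, wrapped in `[` and `]`, each line ending in a comma). It is not how this repository or any tool mentioned in the thread does it; the function names `iter_entities` and `subset` are hypothetical, and the only filters shown are "has this alias" and "has this property".

```python
import json
from typing import Iterable, Iterator, Optional


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per line of a Wikidata JSON dump.

    Assumes the usual dump layout: a '[' line, one JSON object per line
    with a trailing comma, and a closing ']' line.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue
        yield json.loads(line)


def subset(
    lines: Iterable[str],
    alias: Optional[str] = None,
    prop: Optional[str] = None,
    lang: str = "en",
) -> Iterator[dict]:
    """Keep entities that match every filter that was supplied.

    alias: exact alias string in the given language (e.g. "Lincoln")
    prop:  property ID that must appear in the entity's claims (e.g. "P31")
    """
    for entity in iter_entities(lines):
        if alias is not None:
            values = {a["value"] for a in entity.get("aliases", {}).get(lang, [])}
            if alias not in values:
                continue
        if prop is not None and prop not in entity.get("claims", {}):
            continue
        yield entity
```

Because the dump is read as a stream, memory use stays flat regardless of dump size. Any iterable of text lines works as input, for example a handle from `gzip.open(path, "rt")`, where `path` points at a locally downloaded dump.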