Research: ability/feasibility of sentence import from WikiData #260

Adrijaned · 2019-10-16T21:06:40Z

Wikidata is a fairly new project by the Wikimedia Foundation that provides database-style data on just about everything, all under CC0. One specific subset of the data they provide is lexemes, which sometimes include example sentences. Those might be fit for inclusion in the Common Voice dataset.
Pros:

Everything is CC0
Because these are example sentences for all kinds of lexemes, the dataset could come to include even less common words
These are already individual sentences, so there should be none of the sentence-splitting issues that come up when harvesting complete works

Cons:

This Wikimedia Foundation project is not really that successful yet: at the time of writing there are 130 English sentences, and some of them are duplicates or, occasionally, Swedish
With multiple consecutive imports from that source, the same sentences could get included more than once
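
The duplicate concern in the cons above could be handled with a small filter run before each import. The following is only a minimal sketch: the function names `normalize` and `filter_new_sentences` are hypothetical, and it assumes the previously imported sentences are available as a plain list of strings. It treats two sentences as the same if they match after Unicode normalization, whitespace collapsing and case folding. It does not detect sentences in the wrong language.

```python
import re
import unicodedata


def normalize(sentence: str) -> str:
    """Build a comparison key: NFKC form, single spaces, case-folded."""
    text = unicodedata.normalize("NFKC", sentence)
    text = re.sub(r"\s+", " ", text).strip()
    return text.casefold()


def filter_new_sentences(candidates, already_imported):
    """Return candidates that are neither already imported nor repeated.

    Hypothetical helper: `already_imported` is assumed to be an iterable
    of sentences kept from earlier imports.
    """
    seen = {normalize(s) for s in already_imported}
    fresh = []
    for sentence in candidates:
        key = normalize(sentence)
        # Skip empty lines, earlier imports, and repeats within this batch.
        if key and key not in seen:
            seen.add(key)
            fresh.append(sentence.strip())
    return fresh
```

Running every batch through such a filter would make repeated imports from the same source idempotent, at the cost of keeping the earlier sentences around for comparison.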

Expected outcome of this:
Unless someone gets some geeky pleasure from implementing this, or the Wikidata project really takes off, this should probably be left alone as long as there are other things to do. Making this issue just in case, though.
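
For anyone who does want to try implementing this, a rough starting point might look like the sketch below. It is not an existing Common Voice tool. It assumes that example sentences are stored under the Wikidata property P5831 ("usage example"), that the public SPARQL endpoint at `https://query.wikidata.org/sparql` is used, and that filtering on the language tag is good enough; all three should be verified first. The function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumptions, not confirmed by this issue: endpoint and property ID.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
USAGE_EXAMPLE_PROPERTY = "P5831"


def build_query(language_code: str) -> str:
    """Build a SPARQL query for example sentences tagged with one language."""
    return (
        "SELECT DISTINCT ?sentence WHERE {\n"
        f"  ?lexeme wdt:{USAGE_EXAMPLE_PROPERTY} ?sentence .\n"
        f'  FILTER(LANG(?sentence) = "{language_code}")\n'
        "}"
    )


def parse_sentences(response: dict) -> list:
    """Extract sentence strings from a SPARQL JSON results document."""
    bindings = response.get("results", {}).get("bindings", [])
    return [b["sentence"]["value"] for b in bindings if "sentence" in b]


def fetch_sentences(language_code: str) -> list:
    """Run the query over HTTP and return the sentences (needs network)."""
    params = urllib.parse.urlencode(
        {"query": build_query(language_code), "format": "json"}
    )
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={
            "User-Agent": "cv-sentence-import-sketch/0.1",  # hypothetical
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=60) as reply:
        return parse_sentences(json.load(reply))
```

Calling `fetch_sentences("en")` would return whatever is tagged as English at that moment. Since some entries are reportedly mislabeled, the result would still need a manual or automated language check.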

nukeador · 2019-10-17T10:56:26Z

Thanks for bringing this up. Let's definitely keep an eye on how this project evolves.

MichaelKohler · 2020-05-03T23:00:33Z

Let's track this in common-voice/cv-sentence-extractor#92

MichaelKohler added the "question" label (Further information is requested) Dec 3, 2019

Adrijaned mentioned this issue Apr 7, 2020

Scrape sentences from wikidata common-voice/cv-sentence-extractor#92 (Open)

MichaelKohler closed this as completed May 3, 2020
