This repository has been archived by the owner on May 10, 2023. It is now read-only.

Research: ability/feasibility of sentence import from WikiData #260

Closed
Adrijaned opened this issue Oct 16, 2019 · 2 comments
Labels
question Further information is requested

Comments

@Adrijaned
Contributor

Wikidata is a relatively new project by the Wikimedia Foundation that provides database-style data on just about everything, all under CC0. One specific subset of the data they provide is lexemes, which sometimes include example sentences. Those might be fit for inclusion in the Common Voice dataset.
Pros:

  • Everything is CC0
  • Because these are example sentences for all kinds of lexemes, the dataset could come to include less common words
  • They are already individual sentences, so there should be none of the splitting issues that come up when harvesting complete works

Cons:

  • That Wikimedia Foundation project has not really been successful so far: at the time of writing, there are 130 English sentences, and some of them are duplicates or, occasionally, Swedish
  • In case of multiple consecutive imports from that source, sentences could get included multiple times
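
For what it's worth, the duplicate problem (both within one import and across repeated imports) could probably be handled with a normalized-key check before adding anything. This is only a minimal sketch: `normalize` and `merge_import` are hypothetical names, not part of any existing Common Voice tooling, and the normalization rules (NFC, whitespace collapsing, casefolding) are my own assumption about what should count as "the same sentence". It does nothing about the Swedish sentences, which would need a separate language check.

```python
import unicodedata
from typing import Iterable, List


def normalize(sentence: str) -> str:
    """Build a comparison key: NFC-normalize, collapse whitespace, casefold."""
    text = unicodedata.normalize("NFC", sentence)
    return " ".join(text.split()).casefold()


def merge_import(existing: Iterable[str], incoming: Iterable[str]) -> List[str]:
    """Return the incoming sentences that are not already present.

    Duplicates are detected on the normalized key, so sentences differing
    only in case or spacing are treated as the same sentence.
    """
    seen = {normalize(s) for s in existing}
    added = []
    for sentence in incoming:
        key = normalize(sentence)
        if key and key not in seen:  # skip empty lines and known sentences
            seen.add(key)
            added.append(" ".join(sentence.split()))
    return added
```

Running `merge_import` with the current sentence list as `existing` on every import would make repeated imports from the same source harmless, at the cost of keeping all existing sentences in memory.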

Expected outcome of this:
Unless someone gets some geeky pleasure from implementing this, or the Wikidata project really takes off, this should probably be left alone as long as there are other things to do. I'm filing this issue just in case, though.

@nukeador

Thanks for bringing this up. Let's definitely keep an eye on how this project evolves.

@MichaelKohler
Member

Let's track this in common-voice/cv-sentence-extractor#92
