Scrape sentences from wikidata #92
Comments
Looks like these are indeed CC0. I don't think we need to ask legal for this. @nukeador do you agree? Would love to see a selection of these sentences. Also, I assume you are aware of the scraper's capabilities for other resources? As long as we can get the data into a parseable state, it can be integrated directly into the scraper and use its rules and everything else. More details are in the last part of the README. Also happy to explain further if needed.
This was exactly what I was thinking. The example sentences belong to a data type called "lexemes", which is relatively new: lexemes have existed since 2018. But they are planning to move all Wiktionary data into Wikidata, so we will likely have more sentences in the future. Wikidata is huge, and I am sure there are more data types that contain sentences.
I always wanted to learn Wikidata queries, and this is a nice little project to finally do it. I will post some examples tomorrow or so.
Note that only these 4 namespaces are CC0.
Do we have data on how many sentences we have for each language?
I've already suggested using P5831 earlier in the sentence-collector project (common-voice/sentence-collector#260), but, as per this query, there are currently only about 4000 sentences in P5831, some of which are probably repetitions. (After uncommenting the first line of the query, you should be able to filter sentences by language using the query helper, accessible by clicking the (i) on the left sidebar.) All of those should be in the Lexeme namespace, so license-wise there should be no issue.
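To get a rough per-language count that ignores repetitions, the result of such a query could be post-processed along these lines. This is only a sketch: it assumes the query was exported in the standard SPARQL JSON results format and that the sentence variable is named `?sentence`; the function name and the sample rows are made up for illustration.

```python
from collections import Counter


def count_sentences_by_language(bindings):
    """Count distinct sentences per language tag.

    `bindings` is the list found under results -> bindings in a SPARQL
    JSON result. The variable name "sentence" is an assumption.
    """
    seen = set()
    counts = Counter()
    for row in bindings:
        literal = row["sentence"]
        # Monolingual text literals carry their language in "xml:lang".
        lang = literal.get("xml:lang", "und")
        key = (lang, literal["value"].strip())
        if key in seen:
            continue  # skip exact repetitions
        seen.add(key)
        counts[lang] += 1
    return counts


# Made-up sample rows, shaped like SPARQL JSON bindings.
sample_bindings = [
    {"sentence": {"type": "literal", "xml:lang": "en", "value": "The cat sleeps."}},
    {"sentence": {"type": "literal", "xml:lang": "en", "value": "The cat sleeps."}},
    {"sentence": {"type": "literal", "xml:lang": "en", "value": "She reads a book."}},
    {"sentence": {"type": "literal", "xml:lang": "de", "value": "Das Haus ist alt."}},
]

language_counts = count_sentences_by_language(sample_bindings)
```

Only exact duplicates within the same language are dropped here; near-duplicates would need the scraper's own rules.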
Wikidata is completely under CC0, which makes it very attractive for the project. It contains both sentences and sometimes audio, but for this issue I want to focus on sentences.
This issue is a work in progress. I want to collect possible sources for sentences in Wikidata:
The next step would be to write a script to scrape these sentences.
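A first version of such a script might look roughly like this. It is a sketch under assumptions, not the project's scraper: it uses the public Wikidata Query Service endpoint and the P5831 (usage example) property mentioned in the comments, while the function names, the `User-Agent` string, and the default limit are placeholders.

```python
import json
import re
import urllib.parse
import urllib.request

# Public Wikidata Query Service SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(lang, limit=100):
    """Build a SPARQL query for usage examples (P5831) in one language."""
    if not re.fullmatch(r"[A-Za-z0-9-]+", lang):
        raise ValueError(f"unexpected language code: {lang!r}")
    return (
        "SELECT ?lexeme ?sentence WHERE {\n"
        "  ?lexeme wdt:P5831 ?sentence .\n"
        f'  FILTER(LANG(?sentence) = "{lang}")\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def extract_sentences(result):
    """Return the distinct sentences of a SPARQL JSON result, in order."""
    sentences = []
    seen = set()
    for row in result["results"]["bindings"]:
        text = row["sentence"]["value"].strip()
        if text and text not in seen:
            seen.add(text)
            sentences.append(text)
    return sentences


def fetch_sentences(lang, limit=100):
    """Run the query against the live endpoint (needs network access)."""
    params = urllib.parse.urlencode(
        {"query": build_query(lang, limit), "format": "json"}
    )
    request = urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={
            # Placeholder value; use something that identifies your tool.
            "User-Agent": "sentence-scraper-sketch/0.1",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_sentences(json.load(response))
```

Calling `fetch_sentences("en", limit=10)` would hit the live endpoint; the returned list could then be written out one sentence per line so the existing scraper rules can be applied to it.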