GitHub - CooperData/SafeLanguage: Bot to clean wikipedia from grammar mistakes

SafeLanguage

SafeLanguage aims to improve the orthography of articles in the Spanish Wikipedia. Young people make up a large share of Wikipedia's readers, so improving the quality of articles can help improve their use of the language. Conversely, articles that contain misspellings or are badly written could give young readers the idea that language usage is a minor issue in communication.

Goal

SafeLanguage aims to improve the overall quality of articles in the Spanish Wikipedia by improving their orthography. Although reading and improving one article at a time is probably the most effective approach, the reviewer of an article has to deal with both minor and major problems. To ease the correction of minor problems (such as orthography), SafeLanguage could be run over an article before human intervention is needed.

For several years, automated tools called bots have been used to help improve the orthography of articles. However, these changes have to be reviewed one by one by a human. In particular, the bot called CEM-bot has made over one million supervised corrections to the Spanish Wikipedia, with fewer than 4% of them being wrong changes that had to be reverted by a human.

The goal of this project is to reduce the percentage of changes that a human has to revert after the execution of the bot.

Current steps in the execution of the bot

1. Download the latest backup of the Spanish Wikipedia.
2. Transform its XML format into a grep’able format, i.e., one in which only content lines of articles are present, each prepended by the name of the article.
3. Filter lines with “radicals”, i.e., lines in which one of the correction rules can be applied.
4. Filter to split paragraphs into sentences.
5. Correct selected articles with CEM-bot.
6. Manually check each of the changes made by the bot and revert any incorrect change.
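As an illustration of the transform, radical filter and sentence split described above, here is a minimal sketch in Python. It assumes a MediaWiki-style dump with `page`, `title` and `text` elements, a tab as the separator between article name and line, and a tiny hypothetical `RULES` table. None of these names come from CEM-bot, whose real rules and formats may differ.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical rules (radical regex -> replacement); not CEM-bot's real rule set.
RULES = {
    r"\bpoesia\b": "poesía",
    r"\btambien\b": "también",
}


def greppable_lines(xml_stream):
    """Yield 'title<TAB>line' for every non-empty content line of every page."""
    title = None
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if tag == "title":
            title = elem.text
        elif tag == "text":
            for line in (elem.text or "").splitlines():
                if line.strip():
                    yield f"{title}\t{line.strip()}"
        elif tag == "page":
            elem.clear()  # release the page's children; real dumps are large


def has_radical(line):
    """True if at least one rule could be applied to the line."""
    return any(re.search(pattern, line) for pattern in RULES)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


SAMPLE = (
    "<mediawiki><page><title>Ejemplo</title><revision><text>"
    "La poesia es un arte. Es antigua.\n"
    "Otra línea sin errores."
    "</text></revision></page></mediawiki>"
)

lines = list(greppable_lines(io.BytesIO(SAMPLE.encode("utf-8"))))
selected = [line for line in lines if has_radical(line)]
sentences = split_sentences("La poesia es un arte. Es antigua.")
```

Only the first line of the sample article is selected, because it is the only one containing a radical. The sentence splitter is deliberately naive and would mishandle abbreviations.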

Given that the number of articles with radicals is very high, steps 3 and 4 are normally applied over subsets of the rules, giving a slightly more complicated sequence of steps:

1. Download the latest backup of the Spanish Wikipedia.
2. Transform its XML format into a grep’able format, i.e., one in which only content lines of articles are present, each prepended by the name of the article.
3. Filter to split paragraphs into sentences.
4. Create an empty set of checked articles.
5. Create an empty set of applied rules.
6. While there are rules to be applied:
   1. Select a subset of the rules not yet applied.
   2. Filter lines with radicals of the selected rules in articles not previously checked.
   3. Correct selected articles with CEM-bot.
   4. Include corrected files among the checked files (the corrected files are a subset of the filtered files).
   5. Include selected rules among the applied rules.
   6. Manually check each of the changes made by the bot and revert any incorrect change.
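The loop over rule subsets can be sketched as follows. This is only an outline under stated assumptions: `run_in_batches`, its parameters and the `correct` callback (a stand-in for the CEM-bot run, returning `True` when an article was actually changed) are hypothetical names, and the manual review step is left outside the code.

```python
import re


def run_in_batches(lines_by_article, rules, batch_size, correct):
    """Apply rules in subsets, skipping articles that were already corrected.

    lines_by_article: dict mapping article title -> list of content lines.
    rules: list of regex strings, one per radical (hypothetical format).
    correct: callable(title, batch) -> bool, stand-in for the bot run.
    """
    checked = set()   # articles already corrected in an earlier batch
    applied = []      # rules already applied
    pending = list(rules)
    while pending:
        # Select a subset of the rules not yet applied.
        batch, pending = pending[:batch_size], pending[batch_size:]
        patterns = [re.compile(rule) for rule in batch]
        # Filter articles with radicals of the selected rules, not yet checked.
        selected = [
            title
            for title, lines in lines_by_article.items()
            if title not in checked
            and any(p.search(line) for p in patterns for line in lines)
        ]
        # Correct the selected articles; only some may actually change.
        corrected = {title for title in selected if correct(title, batch)}
        checked |= corrected
        applied.extend(batch)
        # The manual review of each change happens outside this sketch.
    return checked, applied


articles = {
    "A": ["tambien llueve"],
    "B": ["la poesia"],
    "C": ["sin errores"],
}
rules = [r"\btambien\b", r"\bpoesia\b"]
calls = []


def fake_correct(title, batch):
    calls.append(title)
    return True


checked, applied = run_in_batches(articles, rules, 1, fake_correct)
```

With a batch size of 1, article `A` is picked up by the first rule and `B` by the second, while `C` is never selected.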

Problems with this approach

The selector works line by line, but sometimes the reason for not correcting lies outside the scope of a single line, for instance a quote that spans several lines (quotes are not corrected). The result is an article that is selected but not corrected.

In a selected file, the bot introduces an error when it changes a correct word in Spanish (for instance, it incorrectly changes trabajo into trabajó).
In a selected file, the bot introduces an error when it changes a correct word in another language (for instance, it incorrectly changes poesia into poesía in a phrase written in Catalan).
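The first problem can be reproduced with a small sketch. It assumes, purely for illustration, that quotes are delimited with «…» and that the rule is a single hypothetical regex. The line-by-line selector flags a line, while a check that looks at the whole article finds that the radical sits inside a quote and would therefore not be corrected.

```python
import re

RADICAL = re.compile(r"\bpoesia\b")  # hypothetical rule, for illustration only


def select_lines(lines):
    """Line-by-line selector: each line is examined in isolation."""
    return [i for i, line in enumerate(lines) if RADICAL.search(line)]


def radical_in_quote(lines, index):
    """Article-level check: is the radical on line `index` inside an open quote?

    Assumes the line at `index` contains the radical (it was selected).
    """
    match = RADICAL.search(lines[index])
    before = "\n".join(lines[:index] + [lines[index][: match.start()]])
    return before.count("«") > before.count("»")


article = [
    "Dijo: «La vida es",
    "una poesia larga",
    "y breve». Fin.",
]

selected = select_lines(article)
skipped = [i for i in selected if radical_in_quote(article, i)]
```

Here the second line is selected, yet it belongs to a quote opened on the previous line, so the article ends up selected but not corrected.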

Proposed steps: First phase

1. Download the latest backup of the Spanish Wikipedia.
2. Transform its XML format into a grep’able format, i.e., one in which only content lines of articles are present, each prepended by the name of the article.
3. Filter to split paragraphs into sentences.
4. Filter lines with “radicals”, i.e., lines in which one of the correction rules can be applied.
5. Filter lines with the Spanish language selector.
6. Correct selected articles with CEM-bot.
7. Manually check each of the changes made by the bot and revert any incorrect change.

This sequence could reduce the number of corrections applied to phrases in other languages.
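To make the idea of a Spanish language selector concrete, here is a toy sketch based on counting function words. The word lists and the `looks_spanish` name are assumptions made for this example, and Catalan is the only other language considered. A real selector would more likely rely on a trained language-identification model.

```python
import re

# Small, hand-picked marker words; hypothetical and far from complete.
SPANISH_MARKERS = {"y", "los", "las", "con", "para", "por", "es", "una", "del"}
CATALAN_MARKERS = {"i", "els", "les", "amb", "per", "és", "dels", "als", "són"}


def looks_spanish(sentence):
    """Keep a sentence only if Spanish markers outnumber the foreign ones."""
    tokens = re.findall(r"\w+", sentence.lower())
    spanish = sum(token in SPANISH_MARKERS for token in tokens)
    foreign = sum(token in CATALAN_MARKERS for token in tokens)
    return spanish > foreign


sentences = [
    "La poesia es una forma de arte y de vida.",
    "La poesia i la música són arts amb molts seguidors.",
]
kept = [s for s in sentences if looks_spanish(s)]
```

Only the first sentence is kept, so the rule that turns poesia into poesía would no longer be applied to the Catalan phrase. A tie, including a sentence with no marker words at all, is treated as not Spanish, which errs on the side of leaving text untouched.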

Proposed steps: Second phase

1. Download the latest backup of the Spanish Wikipedia.
2. Transform its XML format into a grep’able format, i.e., one in which only content lines of articles are present, each prepended by the name of the article.
3. Filter to split paragraphs into sentences.
4. Filter lines with “radicals”, i.e., lines in which one of the correction rules can be applied.
5. Filter radicals with the Spanish language selector.
6. Filter out radicals that occur in correct phrases.
7. Correct selected articles with CEM-bot.
8. Manually check each of the changes made by the bot and revert any incorrect change.

This sequence could additionally reduce the introduction of errors in correct phrases.
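One possible way to filter out radicals in correct phrases is to look at the word that precedes the radical. The sketch below does this for the trabajo/trabajó case. The `NOUN_MARKERS` list and the `needs_correction` function are hypothetical, and the heuristic is intentionally crude: a present-tense use such as "yo trabajo" would still be flagged.

```python
import re

# Hypothetical context words: right before "trabajo" they suggest the noun,
# which is already correct and must not become "trabajó".
NOUN_MARKERS = {"el", "un", "su", "mi", "este", "del", "de"}


def needs_correction(sentence):
    """True if some occurrence of 'trabajo' is not preceded by a noun marker."""
    tokens = re.findall(r"\w+", sentence.lower())
    for i, token in enumerate(tokens):
        if token == "trabajo":
            previous = tokens[i - 1] if i > 0 else ""
            if previous not in NOUN_MARKERS:
                return True
    return False


sentences = [
    "Su trabajo fue premiado.",
    "Ella trabajo en Madrid durante años.",
]
to_correct = [s for s in sentences if needs_correction(s)]
```

Only the second sentence is passed on to the bot, since in the first one "trabajo" follows "su" and is treated as a correctly written noun.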

Measure of impact

The impact of the project can be measured by comparing the number of reverts before and after one or both filters are applied.
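The comparison could be expressed as a revert rate, as in the following sketch. The function name and all counts are made up for illustration and are not measurements from the project.

```python
def revert_rate(changes, reverts):
    """Fraction of bot changes that a human had to revert."""
    if changes == 0:
        return 0.0
    return reverts / changes


# Hypothetical counts, for illustration only.
baseline = revert_rate(changes=1000, reverts=40)   # no filters
filtered = revert_rate(changes=1000, reverts=25)   # with one or both filters

# Relative reduction in the revert rate achieved by the filters.
relative_reduction = (baseline - filtered) / baseline
```

Using a rate rather than a raw count keeps the comparison fair when the filters also change how many articles the bot corrects.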

