Improve Wiktionary parser
Change-Id: Id22ba1ad117c640a146f025ba6521db5ec938202
sentinelt committed Jul 10, 2021
1 parent 885c660 commit 8ff9a8b
Showing 13 changed files with 578 additions and 242 deletions.
73 changes: 56 additions & 17 deletions README.md
@@ -7,14 +7,63 @@ About

Evo Inflector implements an English pluralization algorithm based on [Damian Conway's](https://en.wikipedia.org/wiki/Damian_Conway) paper ["An Algorithmic Approach to English Pluralization"](http://www.csse.monash.edu.au/~damian/papers/HTML/Plurals.html).

The tests performed (May 2014) based on data from [Wiktionary](http://dumps.wikimedia.org/enwiktionary/latest/) show that:
- for entire set of 163518 words from Wiktionary, Evo Inflector returns correct answer for 68.4% of them,
- for 979 words marked as basic words almost all answers are correct, the sole exception being the word ['worse'](https://en.wiktionary.org/wiki/worse) which when used as a noun does not have a plural form,
- for 24.9% of all words Evo Inflector returns some form, but the word is marked as uncountable in Wiktionary,
- for 4.1% of all words Wiktionary does not specify the plural form for given word so whatever Evo Inflector returns will always be wrong,
- for 2.6% Evo Inflector returns an answer which is different than the one provided in Wiktionary.
Usage
=====

The usage is pretty simple:

```java
English.plural("word") == "words"
```

Additionally, you can provide a required count to select the singular or plural form automatically:

```java
English.plural("foot", 1) == "foot"
English.plural("foot", 2) == "feet"
```


Features
========
The algorithm tries to preserve the capitalization of the original word, for instance:

```java
English.plural("NightWolf") == "NightWolves"
```

Limitations
===========

* The algorithm cannot reliably detect uncountable words. It will pluralize them anyway.
* Some words have a single singular form but several plural forms, e.g.:
  * die (plural dies) - the cubical part of a pedestal; a plinth,
  * die (plural dice) - an isohedral polyhedron, usually a cube.

Tests
=====

As part of the unit tests the results of the algorithm are compared with data from Wiktionary.

There are (July 2021) 282070 single-word English nouns in the English Wiktionary, of which:
- 71.81% (202551) are countable nouns,
- 25.00% (70532) are uncountable nouns,
- for 2.91% (8212) of nouns the plural is unknown,
- for 0.27% (775) of nouns the plural is not attested.

Evo Inflector returns the correct answer for 96.28% (195034) of all countable nouns.


There are (2021-07-10) 276574 single-word English nouns in the English Wiktionary, of which:
- 69.26971% (191582) are countable nouns,
- 27.56839% (76247) are uncountable nouns,
- for 2.8863885% (7983) of nouns the plural is unknown,
- for 0.27551398% (762) of nouns the plural is not attested.

Evo Inflector returns the correct answer for 96.19432% (184291) of all countable nouns,
but only for 8.56296% (6529) of uncountable nouns.
Overall, it returns the correct answer for 68.994194% (190820) of all nouns.

(If you are curious this test is part of the [unit tests](https://github.com/atteo/evo-inflector/blob/master/src/test/java/org/atteo/evo/inflector/EnglishInflectorTest.java).)
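
The percentages in the 2021-07-10 list follow directly from the raw counts in parentheses. The snippet below is only a sanity check of that arithmetic; the class `WiktionaryStats` and its `percent` helper are hypothetical names and are not part of Evo Inflector or its test suite.

```java
import java.util.Locale;

// Hypothetical helper, not part of Evo Inflector: recomputes the README
// percentages from the raw counts reported for 2021-07-10.
final class WiktionaryStats {
    // Counts copied from the README.
    static final int ALL_NOUNS = 276574;
    static final int COUNTABLE = 191582;
    static final int UNCOUNTABLE = 76247;
    static final int CORRECT_COUNTABLE = 184291;
    static final int CORRECT_UNCOUNTABLE = 6529;

    // Share of `part` in `whole`, expressed as a percentage.
    static double percent(long part, long whole) {
        return 100.0 * part / whole;
    }

    public static void main(String[] args) {
        int correctOverall = CORRECT_COUNTABLE + CORRECT_UNCOUNTABLE;
        System.out.printf(Locale.ROOT, "countable:   %.5f%%%n",
                percent(CORRECT_COUNTABLE, COUNTABLE));
        System.out.printf(Locale.ROOT, "uncountable: %.5f%%%n",
                percent(CORRECT_UNCOUNTABLE, UNCOUNTABLE));
        System.out.printf(Locale.ROOT, "overall:     %.5f%%%n",
                percent(correctOverall, ALL_NOUNS));
    }
}
```

The overall figure counts a result as correct for both countable and uncountable nouns, which is why its numerator is the sum of the two per-category counts (184291 + 6529 = 190820).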

Changes
=======
@@ -37,16 +86,6 @@ Changes

1.0 Initial revision

Usage
=====

```java
System.out.println(English.plural("word")); // == "words"

System.out.println(English.plural("word", 1)); // == "word"
System.out.println(English.plural("word", 2)); // == "words"
```

License
=======

48 changes: 43 additions & 5 deletions pom.xml
@@ -4,7 +4,7 @@
<parent>
<artifactId>parent</artifactId>
<groupId>org.atteo</groupId>
<version>1.16</version>
<version>1.19</version>
</parent>
<artifactId>evo-inflector</artifactId>
<version>1.0-SNAPSHOT</version>
@@ -32,13 +32,42 @@
</scm>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.ant</groupId>
<artifactId>ant</artifactId>
<groupId>org.assertj</groupId>
<artifactId>assertj-core</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-compress</artifactId>
<version>1.20</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>2.12.4</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.12.4</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.dataformat</groupId>
<artifactId>jackson-dataformat-xml</artifactId>
<version>2.12.4</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.assertj</groupId>
<artifactId>assertj-core</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
@@ -51,6 +80,15 @@
<source>1.6</source>
<target>1.6</target>
</configuration>
<executions>
<execution>
<id>default-testCompile</id>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</pluginManagement>
27 changes: 27 additions & 0 deletions src/main/java/org/atteo/evo/inflector/CategoryRule.java
@@ -0,0 +1,27 @@
package org.atteo.evo.inflector;

class CategoryRule implements Rule {
private final String[] list;
private final String singular;
private final String plural;

public CategoryRule(String[] list, String singular, String plural) {
this.list = list;
this.singular = singular;
this.plural = plural;
}

@Override
public String getPlural(String word) {
String lowerWord = word.toLowerCase();
for (String suffix : list) {
if (lowerWord.endsWith(suffix)) {
if (!lowerWord.endsWith(singular)) {
throw new RuntimeException("Internal error");
}
return word.substring(0, word.length() - singular.length()) + plural;
}
}
return null;
}
}
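
To illustrate what the extracted `CategoryRule` does, here is a stand-alone sketch of the same suffix-swap logic. The class name `CategoryRuleSketch`, the sample word list and the use of `Locale.ROOT` are assumptions made for this example; the real class is package-private and is wired up through `TwoFormInflector`.

```java
import java.util.Locale;

// Stand-alone sketch mirroring the CategoryRule logic shown in this commit.
// The class name and the sample word list are illustrative assumptions.
final class CategoryRuleSketch {
    private final String[] list;
    private final String singular;
    private final String plural;

    CategoryRuleSketch(String[] list, String singular, String plural) {
        this.list = list;
        this.singular = singular;
        this.plural = plural;
    }

    // Returns the plural form, or null when no listed suffix matches.
    String getPlural(String word) {
        // Locale.ROOT is an assumption of this sketch; the commit calls
        // toLowerCase() without arguments.
        String lowerWord = word.toLowerCase(Locale.ROOT);
        for (String suffix : list) {
            if (lowerWord.endsWith(suffix)) {
                if (!lowerWord.endsWith(singular)) {
                    throw new RuntimeException("Internal error");
                }
                // Only the singular ending is cut off, so the stem keeps
                // whatever capitalization the caller passed in.
                return word.substring(0, word.length() - singular.length()) + plural;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        CategoryRuleSketch rule =
                new CategoryRuleSketch(new String[] {"alga", "vertebra"}, "a", "ae");
        System.out.println(rule.getPlural("Alga"));
        System.out.println(rule.getPlural("word"));
    }
}
```

Because the plural ending is appended verbatim, an all-caps input such as `"ALGA"` would come back as `"ALGae"` in this sketch: the stem's case is preserved, but the ending's case is not adapted.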
31 changes: 31 additions & 0 deletions src/main/java/org/atteo/evo/inflector/RegExpRule.java
@@ -0,0 +1,31 @@
package org.atteo.evo.inflector;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegExpRule implements Rule {
private final Pattern singular;
private final String plural;

RegExpRule(Pattern singular, String plural) {
this.singular = singular;
this.plural = plural;
}

RegExpRule(String singular, String plural) {
this.singular = Pattern.compile(singular);
this.plural = plural;
}

@Override
public String getPlural(String word) {
StringBuffer buffer = new StringBuffer();
Matcher matcher = singular.matcher(word);
if (matcher.find()) {
matcher.appendReplacement(buffer, plural);
matcher.appendTail(buffer);
return buffer.toString();
}
return null;
}
}
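
The following sketch shows how a rule of this shape behaves on its own. It copies the `find` / `appendReplacement` / `appendTail` sequence from `RegExpRule`; the class name `RegExpRuleSketch` and the pattern `(?i)([aeo])lf$` are assumptions chosen for illustration and are not necessarily the rule the library registers for words like "wolf".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone sketch of the RegExpRule behavior shown in this commit.
// The class name and the example pattern are illustrative assumptions.
final class RegExpRuleSketch {
    private final Pattern singular;
    private final String plural;

    RegExpRuleSketch(String singular, String plural) {
        this.singular = Pattern.compile(singular);
        this.plural = plural;
    }

    // Rewrites the first match of the singular pattern, or returns null
    // when the pattern does not match at all.
    String getPlural(String word) {
        // StringBuffer is kept because the main sources target Java 1.6,
        // where Matcher.appendReplacement only accepts a StringBuffer.
        StringBuffer buffer = new StringBuffer();
        Matcher matcher = singular.matcher(word);
        if (matcher.find()) {
            matcher.appendReplacement(buffer, plural);
            matcher.appendTail(buffer);
            return buffer.toString();
        }
        return null;
    }

    public static void main(String[] args) {
        // The captured vowel is echoed back through $1, so its case survives.
        RegExpRuleSketch rule = new RegExpRuleSketch("(?i)([aeo])lf$", "$1lves");
        System.out.println(rule.getPlural("NightWolf"));
        System.out.println(rule.getPlural("word"));
    }
}
```

Text before the match is copied through unchanged by `appendReplacement`, which is what lets a prefix such as "Night" keep its capitalization.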
5 changes: 5 additions & 0 deletions src/main/java/org/atteo/evo/inflector/Rule.java
@@ -0,0 +1,5 @@
package org.atteo.evo.inflector;

interface Rule {
String getPlural(String singular);
}
68 changes: 9 additions & 59 deletions src/main/java/org/atteo/evo/inflector/TwoFormInflector.java
@@ -15,62 +15,13 @@

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public abstract class TwoFormInflector {
private interface Rule {
String getPlural(String singular);
}

private static class RegExpRule implements Rule {
private final Pattern singular;
private final String plural;

private RegExpRule(Pattern singular, String plural) {
this.singular = singular;
this.plural = plural;
}

@Override
public String getPlural(String word) {
StringBuffer buffer = new StringBuffer();
Matcher matcher = singular.matcher(word);
if (matcher.find()) {
matcher.appendReplacement(buffer, plural);
matcher.appendTail(buffer);
return buffer.toString();
}
return null;
}
}
import static java.lang.Character.toLowerCase;
import static java.lang.Character.toUpperCase;

private static class CategoryRule implements Rule {
private final String[] list;
private final String singular;
private final String plural;

public CategoryRule(String[] list, String singular, String plural) {
this.list = list;
this.singular = singular;
this.plural = plural;
}
public abstract class TwoFormInflector {

@Override
public String getPlural(String word) {
String lowerWord = word.toLowerCase();
for (String suffix : list) {
if (lowerWord.endsWith(suffix)) {
if (!lowerWord.endsWith(singular)) {
throw new RuntimeException("Internal error");
}
return word.substring(0, word.length() - singular.length()) + plural;
}
}
return null;
}
}

private final List<Rule> rules = new ArrayList<Rule>();

protected String getPlural(String word) {
@@ -89,15 +40,14 @@ protected void uncountable(String[] list) {

protected void irregular(String singular, String plural) {
if (singular.charAt(0) == plural.charAt(0)) {
rules.add(new RegExpRule(Pattern.compile("(?i)(" + singular.charAt(0) + ")" + singular.substring(1)
+ "$"), "$1" + plural.substring(1)));
rules.add(new RegExpRule("(?i)(" + singular.charAt(0) + ")" + singular.substring(1) + "$",
"$1" + plural.substring(1)));
} else {
rules.add(new RegExpRule(Pattern.compile(Character.toUpperCase(singular.charAt(0)) + "(?i)"
+ singular.substring(1) + "$"), Character.toUpperCase(plural.charAt(0))
+ plural.substring(1)));
rules.add(new RegExpRule(Pattern.compile(Character.toLowerCase(singular.charAt(0)) + "(?i)"
+ singular.substring(1) + "$"), Character.toLowerCase(plural.charAt(0))
rules.add(new RegExpRule(toUpperCase(singular.charAt(0)) + "(?i)" + singular.substring(1) + "$",
toUpperCase(plural.charAt(0))
+ plural.substring(1)));
rules.add(new RegExpRule(toLowerCase(singular.charAt(0)) + "(?i)" + singular.substring(1) + "$",
toLowerCase(plural.charAt(0)) + plural.substring(1)));
}
}
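
A stand-alone sketch of the patterns that `irregular` builds may help when reading the hunk above. It reproduces the same string concatenation, but stores each rule as a plain `{regex, replacement}` pair; the class name `IrregularSketch`, the first-match lookup loop in `getPlural` and the sample word pairs are assumptions made for this example, since the hunk does not show how `TwoFormInflector` iterates its rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone sketch of the regexes produced by irregular(). The class name,
// the lookup loop and the sample words are illustrative assumptions.
final class IrregularSketch {
    // Each entry is {regex, replacement}.
    private final List<String[]> rules = new ArrayList<String[]>();

    void irregular(String singular, String plural) {
        if (singular.charAt(0) == plural.charAt(0)) {
            // Same first letter: capture it case-insensitively and echo it via $1.
            rules.add(new String[] {
                "(?i)(" + singular.charAt(0) + ")" + singular.substring(1) + "$",
                "$1" + plural.substring(1)});
        } else {
            // Different first letters: one case-sensitive rule per capitalization.
            rules.add(new String[] {
                Character.toUpperCase(singular.charAt(0)) + "(?i)" + singular.substring(1) + "$",
                Character.toUpperCase(plural.charAt(0)) + plural.substring(1)});
            rules.add(new String[] {
                Character.toLowerCase(singular.charAt(0)) + "(?i)" + singular.substring(1) + "$",
                Character.toLowerCase(plural.charAt(0)) + plural.substring(1)});
        }
    }

    // Assumed lookup strategy: the first rule whose pattern matches wins.
    String getPlural(String word) {
        for (String[] rule : rules) {
            Matcher matcher = Pattern.compile(rule[0]).matcher(word);
            if (matcher.find()) {
                return matcher.replaceFirst(rule[1]);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        IrregularSketch sketch = new IrregularSketch();
        sketch.irregular("child", "children");   // same first letter
        sketch.irregular("myself", "ourselves"); // different first letters
        System.out.println(sketch.getPlural("Grandchild"));
        System.out.println(sketch.getPlural("Myself"));
    }
}
```

In the second branch the `(?i)` flag is placed after the first character, so only the rest of the word is matched case-insensitively. That lets the upper-case and lower-case variants select a replacement whose first letter has the matching case.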
