Hyphenation
neural network
for Icelandic
🇮🇸

This is a JavaScript-based neural network that recognizes word boundaries in Icelandic compound words and nested compound words. By default it hyphenates on “orðskiptingar–vélmenni” rather than “orðskiptingarvél–menni”, which can in many situations aid the reader to parse the word.

Use it online here.

Documentation

This program adds soft hyphens to text. It can run in the browser or Node.js with the use of TensorFlow.js. The program is 700kB gzipped. Initialization takes about a second but the hyphenation itself is quick, hyphenating a 30,000 word book takes about 7 seconds. Hyphenation should be applied to text during a pre-processing step rather than being applied by end-users.

In the browser:

<script type="text/javascript" src="https://cdn.jsdelivr.net/gh/ylhyra/icelandic-hyphenation-neural/build/core.js"></script>

<script type="text/javascript">

/* Example 1: Hyphenate text */
IcelandicHyphenation
  .HyphenateText('forsætisráðherra')
  .then(text => {
    console.log(text);
  })

/* Example 2: Add hyphenation to a DOM element */
var element = document.getElementById('root')
IcelandicHyphenation.HyphenateElement(element)

</script>

Configuration

min_word_length
- Only hyphenate words that are this long.
- Default: 6.
min_left_letters
- Hyphenation must be at least this many letters from the left edge of the word.
- Default: 3.
min_right_letters
- Hyphenation must be at least this many letters from the right edge of the word.
- Default: 3.
min_subword_length
- A sub-word is a word inside the longer word, the subwords in "sjávar–líf–fræði" are sjávar, líf, fræði and líffræði. This particular option only applies to words in the middle, so here líf. If min_subword_length is set to 3, the hyphenation can be "sjávar–líf–fræði" (if the below option regarding secondary splits is also set low), if it's more than that it will be "sjávar–líffræði".
- Default: 3.
min_distance_from_a_primary_to_secondary_split
- In the word "for|sætis·ráð|herra–sumar·bú|staðurinn", primary splits are marked with "–", secondary splits are marked with "·", and tertiary splits are marked with "|". Here we have two sub-words: "forsætis**·ráðherra" and "sumar·**bústaðurinn". These words are too long still and we usually want to split them even further. If min_distance_from_a_primary_to_secondary_split is 5 or less, our hyphenated output will be "forsætis–ráðherra–sumar–bústaðurinn". If it is more than 8, the hyphenated output will be "forsætisráðherra–sumarbústaðurinn".
- Default: 7

Examples:

/* Returns „Jökul-á ó-fær“ */
const text = await IcelandicHyphenation.HyphenateText('Jökulá ófær', {
  min_left_letters: 1,
  min_right_letters: 1,
})

Methods

HyphenateText(input, options)
- Returns a promise. Returns text with hyphenations.
HyphenateDOM(element, options)
- Adds hyphenation directly to element. Returns nothing.

Accuracy

The model only splits words when it is reasonably sure that there should be a hyphenation. The odds of a given string containing a false positive hyphenation point is <0.2%, the odds of a word containing a false negative hyphenation point (not returning a hyphenation point where there should be one) is 15%. False negatives are most common in ambiguous words („torf-stunga“ vs. „torfs-tunga”).

Usage notes

Preventing hyphens from being copied

Credits

Built upon the HypheNN-de project by Markus Siemens (MIT).

License

Code:

Model: CC0 (public domain)

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
WIP/chrome-plugin		WIP/chrome-plugin
build		build
core		core
docs		docs
keras		keras
scripts		scripts
website		website
.eslintrc.js		.eslintrc.js
.gitignore		.gitignore
package-lock.json		package-lock.json
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Hyphenation
neural network
for Icelandic
🇮🇸

Documentation

Configuration

Methods

Accuracy

Usage notes

Credits

License

About

Languages

ylhyra/icelandic-hyphenation-neural

Folders and files

Latest commit

History

Repository files navigation

Hyphenation neural network for Icelandic 🇮🇸

Documentation

Configuration

Methods

Accuracy

Usage notes

Credits

License

About

Topics

Resources

Stars

Watchers

Forks

Languages

Hyphenation
neural network
for Icelandic
🇮🇸