This repository contains prebuild textual analysis functions (analyzers) which are composed of smaller modules (tokenizers), each tokenizer performs actions such as transforming, filtering and enriching word tokens.
Analyzers are available as functions and can be called like any regular function, the input is a single string and the output is also a single string:
var street = require('./analyzer/street')
var analyzer = street()
analyzer('main str s')
// Main Street South
Analyzers also accept a 'context' object which is available throughout the analysis pipeline:
var analyzer = street({ locale: 'de' })
analyzer('main str s')
// Main Strasse Sued
Tokenizers are intended to be used as part of an analyzer, but can also be used independently by calling Array.reduce
on an array of tokens:
var tokenizer = require('./tokenizer/diacritic')
[ 'žůžo', 'Cinématte' ].reduce( tokenizer, [] )
// [ 'zuzo', 'Cinematte' ]
Tokenizers are functions with the interface expected by Array.reduce.
In their simplest form a tokenizer is written as:
// a delete-all tokenizer emits no words
var tokenizer = function( res, word, pos, arr ){
// you must always return $res
return res
}
For a tokenizer to have no effect on the token stream it must res.push()
on to the response array each word it took in:
// a no-op tokenizer emits words verbatim as they were taken in
var tokenizer = function( res, word, pos, arr ){
// push the word on to the response array unmodified
res.push( word )
// you must always return $res
return res
}
A tokenizer can choose which words are pushed downstream, it can also modify words and push more than one word on to the response array:
// a split tokenizer cuts a string on word boudaries, producing multiple words
var tokenizer = function( res, word, pos, arr ){
// split the input word on word boundaries
var parts = word.split(/\b/g)
// push each part downstream
parts.forEach( function( part ){
res.push( part )
})
// you must always return $res
return res
}
Using these techniques, you can write tokenizers which delete, modify or create new words.
More advanced tokenizers require information about the context in which they were run, for example, knowing the locale of your input tokens might allow you to vary its functionality accordingly.
Context is provided to tokenizers by using Function.bind
to bind the context to the tokenizer. This information will then be available inside the tokenizer using the this
keyword:
// an abbreviation tokenizer converts the contracted form of a word to its equivalent expanded form
var tokenizer = function( res, word, pos, arr ){
// detect the input locale (or default to english)
var locale = this.locale || 'en'
if( 'str.' === word ){
switch( locale ){
case 'de':
// transform to German expansion
res.push( 'strasse' )
return res
case 'en':
// transform to English expansion
res.push( 'street' )
return res
}
}
// push the word on to the response array unmodified
res.push( word )
// you must always return $res
return res
}
You can then control the runtime context of the analyzer using Function.bind
:
var english = tokenizer.bind({ locale: 'en' })
[ 'str.' ].reduce( english, [] )
// [ 'street' ]
var german = tokenizer.bind({ locale: 'de' })
[ 'str.' ].reduce( german, [] )
// [ 'strasse' ]
there is an included CLI script which allows you to easily pipe in files for testing an analyzer:
# test a single input
$ node cli.js en street <<< "n foo st w"
North Foo Street West
# test multiple inputs
$ echo -e "n foo st w\nw 16th st" | node cli.js en street
North Foo Street West
West 16 Street
# test against the contents of a file
$ node cli.js en street < nyc.names
100 Avenue
100 Drive
100 Road
... etc
# test against openaddresses data
$ cut -d',' -f4 /data/oa/de/berlin.csv | sort | uniq | node cli.js de street
Aachener Strasse
Aalemannufer
Aalesunder Strasse
... etc
using the linux diff
command you can view a side-by-side comparison of the data before and after analysis:
$ diff \
--side-by-side \
--ignore-blank-lines \
--suppress-common-lines \
--width=100 \
--expand-tabs \
nyc.names \
<(node cli.js en street < nyc.names)
ZEBRA PL | Zebra Place
ZECK CT | Zeck Court
ZEPHYR AVE | Zephyr Avenue
... etc
units test are run with:
$ npm test
functional tests are run with:
$ npm run funcs