Phonetic encoders - bmpm, caverphone, soundex, metaphone

f1monkey/phonetic

Phonetic

A set of phonetic encoder implementations for Go.

Installation

To install:

$ go get -v github.com/f1monkey/phonetic

Usage

Soundex

The fastest algorithm in this library. Soundex encodes a word into a short phonetic code so that similar-sounding words with different spellings can be matched. It was developed for indexing English-language names. Wiki page.

Code example:

package main

import (
	"fmt"

	"github.com/f1monkey/phonetic/soundex"
)

func main() {
	e := soundex.NewEncoder()
	result := e.Encode("orange")
	fmt.Println(result)
	// prints: O652
}
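To show where a code such as O652 comes from, here is a minimal sketch of the classic American Soundex rules using only the standard library. `soundexSketch` is a hypothetical helper written for this illustration; it is not the library's implementation, handles ASCII letters only, and may differ from `soundex.NewEncoder()` on edge cases.

```go
package main

import (
	"fmt"
	"strings"
)

// soundexSketch is a hypothetical helper, not the library's implementation.
// It applies the classic American Soundex rules to ASCII letters.
func soundexSketch(word string) string {
	// digit maps a letter to its Soundex class; 0 means "not coded".
	digit := func(c byte) byte {
		switch c {
		case 'B', 'F', 'P', 'V':
			return '1'
		case 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z':
			return '2'
		case 'D', 'T':
			return '3'
		case 'L':
			return '4'
		case 'M', 'N':
			return '5'
		case 'R':
			return '6'
		}
		return 0
	}

	w := strings.ToUpper(word)
	var code []byte
	var prev byte // class of the previous letter that can block a repeat
	for i := 0; i < len(w) && len(code) < 4; i++ {
		c := w[i]
		if c < 'A' || c > 'Z' {
			continue // skip anything that is not an ASCII letter
		}
		d := digit(c)
		if len(code) == 0 {
			code = append(code, c) // the first letter is kept as is
			prev = d
			continue
		}
		if d != 0 && d != prev {
			code = append(code, d)
		}
		if c != 'H' && c != 'W' {
			prev = d // vowels reset prev; H and W do not separate repeats
		}
	}
	if len(code) == 0 {
		return ""
	}
	for len(code) < 4 {
		code = append(code, '0') // pad short codes with zeros
	}
	return string(code)
}

func main() {
	fmt.Println(soundexSketch("orange"))
	// prints: O652
	fmt.Println(soundexSketch("Robert"), soundexSketch("Rupert"))
	// prints: R163 R163
}
```

The code is always one letter followed by three digits, which is why differently spelled names such as Robert and Rupert collapse to the same key.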

Metaphone

Metaphone converts a word into a phonetic code that represents its pronunciation, so that words can be compared by how they sound rather than by how they are spelled. It was designed for English. Wiki page

Code example:

package main

import (
	"fmt"

	"github.com/f1monkey/phonetic/metaphone"
)

func main() {
	e := metaphone.NewEncoder()
	result := e.Encode("orange")
	fmt.Println(result)
	// prints: ORNJ
}

Cologne phonetics

Cologne phonetics (Kölner Phonetik) is a phonetic algorithm used for indexing German words by their sound, allowing for name and word matching in German language databases. Wiki page

Code example:

package main

import (
	"fmt"

	"github.com/f1monkey/phonetic/cologne"
)

func main() {
	e := cologne.NewEncoder()
	result := e.Encode("Großtraktor")
	fmt.Println(result)
	// prints: 47827427
}
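To make the digits in 47827427 less opaque, the following is a minimal sketch of the published Kölner Phonetik rules using only the standard library. `cologneSketch` is a hypothetical helper written for this illustration; it is not the library's implementation and may differ from `cologne.NewEncoder()` on edge cases. It assumes ß is folded to S and umlauts to their base vowels.

```go
package main

import (
	"fmt"
	"strings"
)

// cologneSketch is a hypothetical helper that follows the published
// Kölner Phonetik rules. It is not the library's implementation.
func cologneSketch(word string) string {
	// Normalize: upper-case, fold umlauts and ß, keep only A-Z.
	fold := strings.NewReplacer("Ä", "A", "Ö", "O", "Ü", "U", "ß", "S", "ẞ", "S")
	var letters []byte
	for _, r := range fold.Replace(strings.ToUpper(word)) {
		if r >= 'A' && r <= 'Z' {
			letters = append(letters, byte(r))
		}
	}

	// in reports whether c is one of the letters in set (false for c == 0).
	in := func(c byte, set string) bool {
		return c != 0 && strings.IndexByte(set, c) >= 0
	}

	var raw []byte
	for i, c := range letters {
		var prev, next byte // zero means "no neighbour"
		if i > 0 {
			prev = letters[i-1]
		}
		if i+1 < len(letters) {
			next = letters[i+1]
		}
		code := "" // letters without a case below (only H) are not coded
		switch {
		case in(c, "AEIJOUY"):
			code = "0"
		case c == 'B':
			code = "1"
		case c == 'P':
			code = "1"
			if next == 'H' {
				code = "3"
			}
		case in(c, "DT"):
			code = "2"
			if in(next, "CSZ") {
				code = "8"
			}
		case in(c, "FVW"):
			code = "3"
		case in(c, "GKQ"):
			code = "4"
		case c == 'C':
			code = "8"
			if i == 0 && in(next, "AHKLOQRUX") {
				code = "4"
			}
			if i > 0 && !in(prev, "SZ") && in(next, "AHKOQUX") {
				code = "4"
			}
		case c == 'X':
			code = "48"
			if in(prev, "CKQ") {
				code = "8"
			}
		case c == 'L':
			code = "5"
		case in(c, "MN"):
			code = "6"
		case c == 'R':
			code = "7"
		case in(c, "SZ"):
			code = "8"
		}
		raw = append(raw, code...)
	}

	// Collapse repeated digits, then drop every 0 that is not leading.
	var out []byte
	for i, d := range raw {
		if i > 0 && d == raw[i-1] {
			continue
		}
		if d == '0' && len(out) > 0 {
			continue
		}
		out = append(out, d)
	}
	return string(out)
}

func main() {
	fmt.Println(cologneSketch("Großtraktor"))
	// prints: 47827427
}
```

For "Großtraktor" the per-letter digits are 4 7 0 8 2 7 0 4 2 0 7; removing the non-leading zeros gives 47827427.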

Caverphone2

Caverphone2 is a phonetic algorithm for indexing and matching names. It targets English names and was originally tuned for New Zealand pronunciation. Wiki page

Code example:

package main

import (
	"fmt"

	"github.com/f1monkey/phonetic/caverphone2"
)

func main() {
	e := caverphone2.NewEncoder()
	result := e.Encode("orange")
	fmt.Println(result)
	// prints: ARNK111111
}

Beider-Morse

Beider-Morse Phonetic Matching (BMPM) is a phonetic algorithm for indexing and matching names in multiple languages. This package is a Go port of the original PHP library. It contains a large number of rules for transforming a word into its phonetic representation. The current implementation is relatively slow.

To reduce the size of the resulting binary, the three rulesets are split into separate packages:

  • github.com/f1monkey/phonetic/beidermorse - generic rules (for general usage)
  • github.com/f1monkey/phonetic/beidermorse/beidermorseash - Ashkenazi rules
  • github.com/f1monkey/phonetic/beidermorse/beidermorsesep - Sephardic rules

Each package contains two rulesets: approx (the default) and exact. To use the exact ruleset, pass the corresponding option to the encoder constructor (see the examples below).

Code examples:

  • generic ruleset with approx accuracy
     package main

     import (
     	"fmt"

     	"github.com/f1monkey/phonetic/beidermorse"
     )
    
     func main() {
     	// The error is ignored here for brevity.
     	encoder, _ := beidermorse.NewEncoder()
     	result := encoder.Encode("orange")
     	fmt.Println(result)
     	// prints: [orangi oragi orongi orogi orYngi Yrangi Yrongi YrYngi oranxi oronxi orani oroni oranii oronii oranzi oronzi urangi urongi]
     }
  • generic ruleset with exact accuracy
     package main

     import (
     	"fmt"

     	"github.com/f1monkey/phonetic/beidermorse"
     )
    
     func main() {
     	// The error is ignored here for brevity.
     	encoder, _ := beidermorse.NewEncoder(beidermorse.WithAccuracy(beidermorse.Exact))
     	result := encoder.Encode("orange")
     	fmt.Println(result)
     	// prints: [orange oranxe oranhe oranje oranZe orandZe]
     }
  • generic ruleset with exact accuracy, English language, and buffer reuse (to reduce GC pressure)
     package main

     import (
     	"fmt"

     	"github.com/f1monkey/phonetic/beidermorse"
     )
    
     func main() {
     	encoder, err := beidermorse.NewEncoder(
     		beidermorse.WithAccuracy(beidermorse.Exact),
     		beidermorse.WithLang(beidermorse.English),
     		beidermorse.WithBufferReuse(true),
     	)
     	if err != nil {
     		panic(err)
     	}
     	result := encoder.Encode("orange")
     	fmt.Println(result)
     	// prints: [orenk orenge orendS orendZe oronk oronge orondS orondZe orank orange orandS orandZe arenk arenge arendS arendZe aronk aronge arondS arondZe arank arange arandS arandZe]
     }
  • Ashkenazi ruleset with approx accuracy
     package main

     import (
     	"fmt"

     	// Import path as given in the package list above.
     	"github.com/f1monkey/phonetic/beidermorse/beidermorseash"
     )
    
     func main() {
     	// The error is ignored here for brevity.
     	encoder, _ := beidermorseash.NewEncoder()
     	result := encoder.Encode("orange")
     	fmt.Println(result)
     	// prints: [orangi orongi orYngi Yrangi Yrongi YrYngi oranzi oronzi orani oroni oranxi oronxi urangi urongi]
     }
  • Sephardic ruleset with approx accuracy
     package main

     import (
     	"fmt"

     	// Import path as given in the package list above.
     	"github.com/f1monkey/phonetic/beidermorse/beidermorsesep"
     )
    
     func main() {
     	// The error is ignored here for brevity.
     	encoder, _ := beidermorsesep.NewEncoder()
     	result := encoder.Encode("orange")
     	fmt.Println(result)
     	// prints: [uranzi uranz uranS uranzi uranz uranhi uranh]
     }
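
Unlike the other encoders, Beider-Morse produces a list of candidate codes per word, so a match is usually decided by checking whether two lists overlap. The sketch below shows that check with the standard library only. `sharesCode` is a hypothetical helper, and the snippet assumes `Encode` returns a `[]string`, as the printed output above suggests. The inputs are a few codes copied from the "orange" outputs above; in practice both lists would come from the same encoder applied to two different names.

```go
package main

import "fmt"

// sharesCode reports whether two code lists have at least one code in
// common. It is a hypothetical helper, not part of the library.
func sharesCode(a, b []string) bool {
	seen := make(map[string]struct{}, len(a))
	for _, code := range a {
		seen[code] = struct{}{}
	}
	for _, code := range b {
		if _, ok := seen[code]; ok {
			return true
		}
	}
	return false
}

func main() {
	// A few codes copied from the approx outputs for "orange" above.
	generic := []string{"orangi", "oragi", "orongi", "oranzi", "urangi"}
	ashkenazi := []string{"orangi", "orongi", "orYngi", "oranzi", "urangi"}
	sephardic := []string{"uranzi", "uranz", "uranS", "uranhi", "uranh"}

	fmt.Println(sharesCode(generic, ashkenazi))
	// prints: true
	fmt.Println(sharesCode(generic, sephardic))
	// prints: false
}
```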

Benchmarks

  • Soundex
     goos: linux
     goarch: amd64
     pkg: github.com/f1monkey/phonetic/soundex
     cpu: AMD Ryzen 9 6900HX with Radeon Graphics
     Benchmark_Encoder_Encode-16    	14173989	        99.21 ns/op	       8 B/op	       1 allocs/op
     PASS
     ok  	github.com/f1monkey/phonetic/soundex	1.497s
    
  • Metaphone
     goos: linux
     goarch: amd64
     pkg: github.com/f1monkey/phonetic/metaphone
     cpu: AMD Ryzen 9 6900HX with Radeon Graphics        
     Benchmark_Encoder_Encode-16    	 6451292	       267.1 ns/op	      48 B/op	       3 allocs/op
     PASS
     ok  	github.com/f1monkey/phonetic/metaphone	1.916s
    
  • Cologne phonetics
     goos: linux
     goarch: amd64
     pkg: github.com/f1monkey/phonetic/cologne
     cpu: AMD Ryzen 9 6900HX with Radeon Graphics
     Benchmark_Encoder_Encode-16    	 3737944	       374.8 ns/op	     104 B/op	       3 allocs/op
     PASS
     ok  	github.com/f1monkey/phonetic/cologne	1.729s
    
  • Caverphone2
     goos: linux
     goarch: amd64
     pkg: github.com/f1monkey/phonetic/caverphone2
     cpu: AMD Ryzen 9 6900HX with Radeon Graphics
     Benchmark_Encoder_Encode-16    	 1864532	       641.7 ns/op	      40 B/op	       3 allocs/op
     PASS
     ok  	github.com/f1monkey/phonetic/caverphone2	1.851s
    
  • Beider-Morse
     goos: linux
     goarch: amd64
     pkg: github.com/f1monkey/phonetic/beidermorse
     cpu: AMD Ryzen 9 6900HX with Radeon Graphics
     Benchmark_Encoder_Encode_En_Approx-16                	    5769	    219152 ns/op	   21264 B/op	     146 allocs/op
     Benchmark_Encoder_Encode_En_Exact-16                 	   13203	     82072 ns/op	    9199 B/op	      84 allocs/op
     Benchmark_Encoder_Encode_Ru_Approx-16                	   30060	     54323 ns/op	    6093 B/op	      48 allocs/op
     Benchmark_Encoder_Encode_Ru_Exact-16                 	   37522	     28353 ns/op	    2657 B/op	      26 allocs/op
    
    With buffer reuse:
     goos: linux
     goarch: amd64
     pkg: github.com/f1monkey/phonetic/beidermorse
     cpu: AMD Ryzen 9 6900HX with Radeon Graphics
     Benchmark_Encoder_Encode_BufferReuse_En_Approx-16    	   10000	    129346 ns/op	    6126 B/op	     130 allocs/op
     Benchmark_Encoder_Encode_BufferReuse_En_Exact-16     	   23198	     48813 ns/op	    2297 B/op	      72 allocs/op
     Benchmark_Encoder_Encode_BufferReuse_Ru_Approx-16    	   48902	     29909 ns/op	    1297 B/op	      41 allocs/op
     Benchmark_Encoder_Encode_BufferReuse_Ru_Exact-16     	   65834	     16260 ns/op	     485 B/op	      22 allocs/op
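
To put the two Beider-Morse tables side by side, the short program below computes the speedup and the reduction in bytes allocated per operation when buffer reuse is enabled. The numbers are copied from the benchmark output above; `speedup` and `savedPercent` are hypothetical helper names used only for this calculation.

```go
package main

import "fmt"

// row holds one benchmark case without and with buffer reuse.
type row struct {
	name             string
	plainNs, reuseNs float64 // ns/op
	plainB, reuseB   float64 // B/op
}

// speedup returns how many times faster the buffer-reuse run is.
func speedup(plainNs, reuseNs float64) float64 {
	return plainNs / reuseNs
}

// savedPercent returns the reduction in bytes per operation, in percent.
func savedPercent(plainB, reuseB float64) float64 {
	return 100 * (1 - reuseB/plainB)
}

func main() {
	rows := []row{
		{"En_Approx", 219152, 129346, 21264, 6126},
		{"En_Exact", 82072, 48813, 9199, 2297},
		{"Ru_Approx", 54323, 29909, 6093, 1297},
		{"Ru_Exact", 28353, 16260, 2657, 485},
	}
	for _, r := range rows {
		fmt.Printf("%-10s %.2fx faster, %.0f%% fewer bytes per op\n",
			r.name, speedup(r.plainNs, r.reuseNs), savedPercent(r.plainB, r.reuseB))
	}
}
```

On these figures, buffer reuse makes encoding roughly 1.7 to 1.8 times faster and cuts allocated bytes per operation by about 70 to 80 percent.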