dohliam/pingyam-rb

pingyam-rb - Cantonese romanization conversion in Ruby

This repository contains a Ruby library and example conversion tool that makes use of the open-licensed Pingyam Database to convert between 11 different Cantonese romanization systems and variants.

Features

  • Converts to and from any Cantonese romanization scheme (including IPA)
  • Can convert single and multiple words / whole lines of romanized text
  • Handles mixed input (non-Cantonese text is ignored)
  • Converter script ready to use on the command-line -- or include the library in your own code

Included romanization systems

In total 11 Cantonese romanization systems are available for conversion using this library. Each variant is identified by a number (0-10); this number is also used for identifying the "to" and "from" romanizations to use while converting text.

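| Index | Name | Chinese | Variant |
|-------|------|---------|---------|
| 0 | Yale | 耶魯拼音 | Tone numbers |
| 1 | Yale | | Tone diacritics |
| 2 | Cantonese Pinyin | 教院拼音 | |
| 3 | S.L. Wong | 黃錫凌 | Tone numbers |
| 4 | S.L. Wong | | Tone diacritics |
| 5 | International Phonetic Alphabet | 國際音標 | |
| 6 | Jyutping | 粵拼 | |
| 7 | Canton | 廣州拼音 | |
| 8 | Sidney Lau | 劉錫祥 | |
| 9 | Penkyamp | 粵語拼音字 | Tone numbers |
| 10 | Penkyamp | | Tone diacritics |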

Note: A modified 9-tone Yale system is used by default. However, this library includes a method to convert the Yale transcription to the more traditional 6-tone system (see below for details).

Requirements

This library makes use of the latest version of the Pingyam database, and expects a file called pingyambiu containing the conversion data to be located in a pingyam folder in the project root directory. There are a number of ways to do this:

  • Easiest method: Run the update_database.rb script to get the latest version of the database
    • Instructions: In the project root directory, enter the following command: ./update_database.rb
    • If the current version of the database is different from the one on your machine, your local copy will be updated
  • Download the file directly from the Pingyam project (https://github.com/kfcd/pingyam).
    • Make sure to create a directory called pingyam in the project root and copy the file to that directory
  • If you have git installed, you can clone the database into the root project folder using the following command: git clone https://github.com/kfcd/pingyam.git
  • Download the Pingyam project into a separate location and create a symlink to it named pingyam in the current project directory

There are no other requirements beyond a working version of Ruby.
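
If you want to confirm that the database is where the library expects it before loading anything, a check along these lines may help. It is only a sketch of the layout described above (a pingyam folder in the project root containing pingyambiu); the helper names database_path and database_present? are illustrative and are not part of lib_pingyam.

```ruby
# Illustrative helpers (not part of lib_pingyam): locate the database file
# at <project root>/pingyam/pingyambiu, as described in this section.
def database_path(root = Dir.pwd)
  File.join(root, "pingyam", "pingyambiu")
end

# True only if the path resolves to a regular file. Symlinks are followed,
# so a symlinked pingyam directory is accepted as well.
def database_present?(root = Dir.pwd)
  File.file?(database_path(root))
end

unless database_present?
  warn "Database not found at #{database_path}; try running ./update_database.rb"
end
```

Run it from the project root, or pass the root directory explicitly, e.g. database_present?("/path/to/pingyam-rb").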

Usage

This project can be used either as a library (lib_pingyam.rb) or as a command-line script (convert_pingyam.rb). Details for both types of usage can be found below.

lib_pingyam

To use the library, make sure to require the library file, e.g.:

require_relative 'lib_pingyam.rb'

Before you can convert text, you need to initialize a Converter object:

conv = Converter.new

By default, this initializes a conversion dictionary that works from Yale to any other romanization system.

To use a different source romanization system, just specify the corresponding index number as an argument when initializing the Converter object, e.g.:

conv = Converter.new(6)
# Converts from Jyutping to any other system
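
One way to picture what the source index does: each database entry can be thought of as a row of 11 spellings (one per system), and the dictionary is keyed on the column of the chosen source system. The sketch below is an assumption about the general idea, not the library's actual code or the real database format. The single row is the set of spellings of heung1 listed in this README, and build_dictionary is a hypothetical helper.

```ruby
# One hypothetical database row: the same syllable in systems 0..10.
# The real Pingyam file format may differ; this only illustrates the idea.
SAMPLE_ROWS = [
  ["heung1", "heūng", "hoeng1", "hœŋ¹", "'hœŋ", "hœːŋ˥",
   "hoeng1", "hêng1", "heung¹", "heong1", "heöng"]
].freeze

# Key every row by its spelling in the source system (hypothetical helper).
def build_dictionary(rows, source_index = 0)
  rows.to_h { |row| [row[source_index], row] }
end

from_jyutping = build_dictionary(SAMPLE_ROWS, 6)
puts from_jyutping["hoeng1"][5] # IPA column for the Jyutping key
```

Keying on a different column is all it takes to change the source system, which is why the index is supplied once, at initialization.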

You can then convert any string of text using the convert_line method, which takes a string and an integer representing the target romanization system as arguments:

pingyam = "Yale to Jyutping conversion: yut9 yu5 jyun2 wun6"
puts conv.convert_line(pingyam, 6)
# => Yale to Jyutping conversion: jyut6 jyu5 zyun2 wun6
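
The way non-Cantonese text passes through untouched can be sketched as a token-by-token lookup: anything found in the table is replaced, and everything else is left as-is. This is a simplified illustration, not the implementation in lib_pingyam.rb. The four-entry table only covers the example sentence, and convert_line_sketch is a hypothetical name.

```ruby
# Tiny stand-in table (Yale numerals => Jyutping), copied from the example above.
YALE_TO_JYUTPING = {
  "yut9"  => "jyut6",
  "yu5"   => "jyu5",
  "jyun2" => "zyun2",
  "wun6"  => "wun6"
}.freeze

# Split on whitespace but keep the separators, so spacing is preserved.
# Tokens missing from the table (e.g. English words) are returned unchanged.
def convert_line_sketch(line, table)
  line.split(/(\s+)/).map { |token| table.fetch(token.downcase, token) }.join
end

puts convert_line_sketch("Yale to Jyutping conversion: yut9 yu5 jyun2 wun6", YALE_TO_JYUTPING)
# => Yale to Jyutping conversion: jyut6 jyu5 zyun2 wun6
```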

Tip: If you provide 11 as the index number when converting, the string will be translated into all of the available systems sequentially, e.g.:

pingyam = "yut9 yu5 ping3 yam1 fong1 on3 yat7 laam4"
puts conv.convert_line(pingyam, 11)
# => yut9 yu5 ping3 yam1 fong1 on3 yat7 laam4 
# => yuht yúh ping yām fōng on yāt làahm 
# => jyt9 jy5 ping3 jam1 fong1 on3 jat7 laam4 
# => jyt⁹ jy⁵ pɪŋ³ jɐm¹ fɔŋ¹ ɔn³ jɐt⁷ lam⁴ 
# => _jyt ˏjy ¯pɪŋ 'jɐm 'fɔŋ ¯ɔn 'jɐt ˌlam 
# => jyːt˨ jyː˩˧ pʰɪŋ˧ jɐm˥ fɔːŋ˥ ɔːn˧ jɐt˥ laːm˨˩ 
# => jyut6 jyu5 ping3 jam1 fong1 on3 jat1 laam4 
# => yud6 yu5 ping3 yem1 fong1 on3 yed1 lam4 
# => yuet⁶ yue⁵ ping³ yam¹ fong¹ on³ yat¹ laam⁴ 
# => yeud6 yeu5 penk3 yamp1 fong1 on3 yat1 lam4 
# => yeùd yeú pênk yämp föng ôn yät lam

The Converter class has a built-in method for checking if a given string is a valid syllable in any of the available Cantonese romanization systems:

conv = Converter.new
# checks against syllables in Yale (numerals) by default

word = "heung1"
puts conv.check_syllable(word)
# => true

word = "heungg1"
puts conv.check_syllable(word)
# => false

To check syllables in any other romanization system, just specify it when initializing the Converter class:

conv = Converter.new(6)
# checks valid Jyutping syllables

word = "heung1"
puts conv.check_syllable(word)
# => false

word = "hoeng1"
puts conv.check_syllable(word)
# => true

Converting syllables

You can convert individual syllables using the convert_syllable method of the Converter class. This method requires two arguments: a string consisting of a single romanized syllable and an integer representing the index number of the target romanization system.

For example, to convert a syllable in Yale into IPA:

conv = Converter.new
p conv.convert_syllable("heung1", 5)
# => "hœːŋ˥"

To convert from a different source transcription system, just provide the corresponding index number when initializing the Converter object.

For example, to convert Jyutping into IPA:

conv = Converter.new(6)
p conv.convert_syllable("hoeng1", 5)
# => "hœːŋ˥"

If 11 is passed as the final argument to the convert_syllable method, it will return an array containing all of the possible transcriptions of the given syllable:

conv = Converter.new
p conv.convert_syllable("heung1", 11)
# => ["heung1", "heūng", "hoeng1", "hœŋ¹", "'hœŋ", "hœːŋ˥", "hoeng1", "hêng1", "heung¹", "heong1", "heöng"]
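
To make the returned array easier to read, it can be paired with the system names. The sketch below assumes the array is ordered by index (0-10), which is consistent with the Jyutping and IPA entries in the output above but is not stated explicitly. The array is pasted in rather than produced by the library so that the snippet runs on its own, and the SYSTEM_NAMES labels are ours.

```ruby
# Labels for indexes 0..10, following the table of included systems.
SYSTEM_NAMES = [
  "Yale (numbers)", "Yale (diacritics)", "Cantonese Pinyin",
  "S.L. Wong (numbers)", "S.L. Wong (diacritics)", "IPA", "Jyutping",
  "Canton", "Sidney Lau", "Penkyamp (numbers)", "Penkyamp (diacritics)"
].freeze

# Output of conv.convert_syllable("heung1", 11), copied from the example above.
all_forms = ["heung1", "heūng", "hoeng1", "hœŋ¹", "'hœŋ", "hœːŋ˥",
             "hoeng1", "hêng1", "heung¹", "heong1", "heöng"]

# Assumes the array follows index order 0..10.
labelled = SYSTEM_NAMES.zip(all_forms).to_h
puts labelled["Jyutping"]
# => hoeng1
```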

convert_pingyam

The convert_pingyam.rb file found in the root directory is a simple script that demonstrates the use of the lib_pingyam library. It allows for quick and easy conversion between arbitrary Cantonese romanization systems on the command-line.

Basic usage

./convert_pingyam.rb -i "This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6"
# => This is a test: yuht yúh ping yām jyún wuhn

The above example converts the Cantonese romanization in the provided sentence from Yale (with numerals) into Yale with diacritics. All of the text that is not recognizable as Cantonese romanization (e.g., all of the English text before the colon in the provided sentence) is ignored.

To convert the text into Jyutping instead, just provide the index number for Jyutping (i.e., 6 -- see list above) using the -t (--target) option:

./convert_pingyam.rb -i "This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6" -t 6
# => This is a test: jyut6 jyu5 ping3 jam1 zyun2 wun6

As can be seen, the text has now been converted into Jyutping romanization. Conversion into other systems is equally easy -- just replace 6 above with the index number of the system you wish to use for output.

To convert from a different source romanization system (e.g., to convert from Jyutping to Yale, or from S.L. Wong to Jyutping), provide the source system index number as a parameter using the -s (--source) option. The example below converts from Jyutping to Yale with diacritics:

./convert_pingyam.rb -i "This is a test: jyut6 jyu5 ping3 jam1 zyun2 wun6" -s 6 -t 1
# => This is a test: yuht yúh ping yām jyún wuhn

Checking input validity

Invalid romanization syllables can be identified using the -c (--check) option. This checks each word in the input string and outputs a list of words that are not recognizable as valid Cantonese syllables in the given romanization system:

./convert_pingyam.rb -i "This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6" -c
# => This
# => is
# => a
# => test:

The output in the above example contains words that are not valid syllables in Yale romanization (the default, since no other system was specified). To use a different romanization system, just provide the appropriate index number using the -s option. For example, the command below checks for invalid syllables in Jyutping:

./convert_pingyam.rb -i "This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6" -c -s 6
# => This
# => is
# => a
# => test:
# => Yut9
# => yu5
# => yam1

In the example above, the output contains, in addition to the English words, the Yale syllables Yut9, yu5, and yam1, because these are not valid syllables in Jyutping.
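
The check can be thought of as a filter that keeps every word not found in the set of valid syllables for the chosen system. The sketch below is a stand-in and not the script's actual code: VALID_JYUTPING holds only the three syllables from this example that the output above treats as valid Jyutping, and invalid_words is a hypothetical helper.

```ruby
require "set"

# Stand-in for the full list of valid Jyutping syllables; it only covers
# the words from this example that the -c output treats as valid.
VALID_JYUTPING = Set["ping3", "jyun2", "wun6"].freeze

# Return the words that are not valid syllables, in their original order.
def invalid_words(line, valid_syllables)
  line.split.reject { |word| valid_syllables.include?(word.downcase) }
end

puts invalid_words("This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6", VALID_JYUTPING)
```

With the stand-in set, this prints the same seven words as the command above, one per line.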

Modifying the output

The output transcription can be further modified using optional command-line flags, for example to convert regular tone numerals to superscript numerals (Unicode), or to revert to the traditional 6-tone Yale system.

  • Superscript numerals: Several romanization systems use numerals to indicate tones in Cantonese. These are often represented in superscript form to increase readability of romanized text. To use superscript numerals, use the -S (--superscript) option with any numeral-using transcription system. For example, this will convert siu2 chak7 si3 to siu² chak⁷ si³.
  • Yale normalization: To use the older 6-tone Yale transcription instead of the default 9-tone modified version, use the -Y (--yale) option. For example, this will convert yat7 jek8 kek9 to yat1 jek3 kek6.

These modifications can be combined -- the example below both normalizes the Yale transcription and converts the numerals to superscript:

./convert_pingyam.rb -i "yat7 jek8 kek9" -t 0 -YS
# => yat¹ jek³ kek⁶
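
Both modifications are simple text substitutions on the tone numeral at the end of each syllable. The following is a minimal sketch of the idea, not the library's own code: the function names are illustrative, and the 7/8/9 mapping is taken from the yat7 jek8 kek9 example above.

```ruby
# Superscript digits, indexed by the numeral they replace.
SUPERSCRIPT_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹".chars.freeze

# Map the entering tones 7, 8, 9 of the modified Yale system to 1, 3, 6.
# Only a digit that directly follows an ASCII letter and ends a word is touched.
def normalize_yale(text)
  text.gsub(/(?<=[a-z])[789]\b/i, "7" => "1", "8" => "3", "9" => "6")
end

# Replace a syllable-final tone numeral with its Unicode superscript form.
def superscript_tones(text)
  text.gsub(/(?<=[a-z])\d\b/i) { |digit| SUPERSCRIPT_DIGITS[digit.to_i] }
end

puts superscript_tones(normalize_yale("yat7 jek8 kek9"))
# => yat¹ jek³ kek⁶
```

Normalization has to run before the superscript step, since the first pattern only matches plain ASCII digits.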

Options

The following options can be provided to convert_pingyam.rb to control the conversion process:

  • -c, --check: Check if input contains invalid Cantonese romanization
  • -i, --input STRING: Input string to be converted
  • -f, --filename FILE: Provide file for conversion
  • -s, --source INDEX: Provide index number of romanization to convert from
  • -S, --superscript: Print tone numerals as superscript
  • -t, --target INDEX: Provide index number of romanization to convert into
  • -Y, --yale: Normalize Yale to 6-tone traditional system

To do

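  • Support for traditional 6-tone Yale (with numerals): available via the -Y option
  • Conversion of tone numbers to superscript: available via the -S option
  • Optional HTML output
  • Handle files and pipes as input (file input is available via the -f option)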

License

  • Romanization data: CC BY
  • All other code: MIT