More bug fixes and features #3

Open
wants to merge 149 commits into
base: master
@rhdunn rhdunn commented Jun 30, 2015

This makes the following additional improvements to cmudict-tools:

  • improved README.md file
  • support outputting the sphinx dictionary format
  • support unicode sorting (like air sorting, but using the Unicode Collation Algorithm for sensible ordering of accented characters)
  • --remove-stress -- remove stress markers from vowel phones (collapsing duplicate pronunciations)
  • --remove-context-entries -- remove WORD(CONTEXT) based entries
  • --remove-syllable-breaks -- remove syllable break (-) markers
  • -Wmissing-primary-stress -- warn on non-weak entries without a primary stress
  • -Wmultiple-primary-stress -- warn on entries with more than one primary stress
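
As a rough illustration of what `--remove-stress` is described as doing, the following Python sketch strips the stress digits from vowel phones and drops pronunciations that become identical afterwards. It is not the cmudict-tools implementation: the `remove_stress` function, the `(word, phones)` entry shape and the sample entries are assumptions made for this example.

```python
import re

# Hypothetical entry shape: (word, list of arpabet phones).
# Stress markers are assumed to be a trailing 0, 1 or 2 on vowel phones.
STRESS_MARKER = re.compile(r"[0-2]$")


def remove_stress(entries):
    """Strip stress digits, then drop pronunciations that are now duplicates."""
    seen = set()
    result = []
    for word, phones in entries:
        stripped = tuple(STRESS_MARKER.sub("", phone) for phone in phones)
        key = (word, stripped)
        if key in seen:
            continue  # same word and phones once stress is removed
        seen.add(key)
        result.append((word, list(stripped)))
    return result


# Illustrative entries only: two pronunciations that differ just in stress.
SAMPLE = [
    ("INSULT", ["IH2", "N", "S", "AH1", "L", "T"]),
    ("INSULT", ["IH1", "N", "S", "AH2", "L", "T"]),
    ("LOCH", ["L", "AA1", "K"]),
]
```

Calling `remove_stress(SAMPLE)` keeps a single `INSULT` entry, because both pronunciations reduce to the same phone sequence once the stress markers are gone.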

phonesets

  • incorporate the cmudict.0.7a.phones classification into the accents CSV files
  • arpabet: support the X phone (for Scottish loch)

metadata

  • support metadata=@T:name metadata declarations (where T is s for strings, i for integers and f for reals)
  • support for JSON-based metadata
  • implement a select=QUERY command to extract data from the dictionary (e.g. for porter stemmer tests)
  • use format=, accent=, phoneset=, order-from=, encoding= and sorting= file-based metadata when parsing cmudict files
  • various bugfixes for metadata handling

vim syntax files

  • support using - as a syllable break in the arpabet phoneset
  • highlighting support for the acronym files
  • automatic format detection for current known versions of cmudict (in the ftdetect file)
  • support for accent=, phoneset= and format= file-based metadata

This attempts to detect the dictionary format based on the filename.
It supports the following filenames:

| Filename                    | Format         |
|-----------------------------|----------------|
| cmudict.0.1 .. cmudict.0.6e | `weide` format |
| cmudict.dict                | `new` format   |
| cmudict.vp                  | `new` format   |
| cmudict*                    | `air` format   |

This matches the usage in the various cmudict releases and the
cmusphinx svn repository.
This makes it easier to share the descriptions with the VIM syntax
file and file-based metadata options.
This uses the `cmudict`, `cmudict-new` and `cmudict-weide` values
for `b:cmudict_format`.
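
The filename rules can be sketched in Python as follows. This only illustrates the matching order; the actual detection lives in the Vim ftdetect file. The `detect_format` function is hypothetical, the regular expression is an approximation of the `cmudict.0.1 .. cmudict.0.6e` range, and the function returns the format names from the filename list (`weide`, `new`, `air`) rather than the `b:cmudict_format` values.

```python
import os
import re

# Approximation of the cmudict.0.1 .. cmudict.0.6e range: a minor version
# from 1 to 6 with an optional letter suffix (assumption for this sketch).
WEIDE_NAME = re.compile(r"cmudict\.0\.[1-6][a-z]?")


def detect_format(filename):
    """Return 'weide', 'new' or 'air' for a cmudict filename, else None."""
    name = os.path.basename(filename)
    if WEIDE_NAME.fullmatch(name):
        return "weide"
    if name in ("cmudict.dict", "cmudict.vp"):
        return "new"
    if name.startswith("cmudict"):
        return "air"  # catch-all, so it must be checked last
    return None
```

The specific names are tested before the `cmudict*` catch-all, since every supported filename also matches that final pattern.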

This also updates the VIM syntax documentation in the README file
to use the Configuration Options documentation.

It is not possible to determine if a new buffer is intended to be
a cmudict file.

This uses the phone classifications from `cmudict.0.7a.phones` for
the consonant phones.

This checks for common stress patterns to identify errors:

  1. A word with only one vowel which is a weak (schwa) vowel.

     This is for weak forms like 'a', 'the' and 'had'.

  2. A non-weak form without a primary-stress vowel.

  3. A non-weak form with more than one primary-stress vowel.

NOTE: Weak form detection requires 'AX' to be used instead of 'AH0'.

In the missing-primary-stress check, allow entries like:

	SHHH  SH
	ZZZZ  Z

These should really be syllabic consonants, but that would introduce
non-standard phones that would only be used in isolated cases.
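
A minimal Python sketch of the stress checks described above, including the exception for vowel-less entries such as `SHHH`. This is not the cmudict-tools code: the `check_stress` function, the vowel set and the returned warning names are assumptions for illustration, and only `AX` is treated as a weak vowel, per the note above.

```python
# Assumed arpabet vowel set for this sketch; stress is a trailing 0, 1 or 2.
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AX", "AXR", "AY", "EH", "ER",
    "EY", "IH", "IX", "IY", "OW", "OY", "UH", "UW",
}
WEAK_VOWELS = {"AX"}  # weak (schwa) vowel, as required by the note above


def split_phone(phone):
    """Split 'AH1' into ('AH', '1'); phones without a marker give (phone, None)."""
    if phone[-1] in "012":
        return phone[:-1], phone[-1]
    return phone, None


def check_stress(phones):
    """Return a warning name for a pronunciation, or None if it looks fine."""
    vowels = [(base, stress) for base, stress in map(split_phone, phones)
              if base in VOWELS]
    if not vowels:
        return None  # entries like SHHH or ZZZZ have no vowel to stress
    if len(vowels) == 1 and vowels[0][0] in WEAK_VOWELS:
        return None  # weak form such as 'a' or 'the'
    primary = sum(1 for _, stress in vowels if stress == "1")
    if primary == 0:
        return "missing-primary-stress"
    if primary > 1:
        return "multiple-primary-stress"
    return None
```

For example, `check_stress(["SH"])` and `check_stress(["DH", "AX"])` both return `None`, while a pronunciation whose vowels all carry stress `0` is reported as `missing-primary-stress`.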
rhdunn and others added 30 commits April 23, 2016 13:09