More bug fixes and features #3

rhdunn · 2015-06-30T08:36:05Z

This makes the following additional improvements to cmudict-tools:

improved README.md file
support outputting the sphinx dictionary format
support unicode sorting (like air sorting, but using the Unicode Collation Algorithm for sensible ordering of accented characters)
--remove-stress -- remove stress markers from vowel phones (collapsing duplicate pronunciations)
--remove-context-entries -- remove WORD(CONTEXT) based entries
--remove-syllable-breaks -- remove syllable break (-) markers
-Wmissing-primary-stress -- warn on non-weak entries without a primary stress
-Wmultiple-primary-stress -- warn on entries with more than one primary stress
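
As a rough illustration of what `--remove-stress` describes (strip the stress markers, then collapse pronunciations that become identical), here is a minimal Python sketch. It is not the cmudict-tools implementation: the function names and the `(word, phones)` entry shape are hypothetical, and it assumes ARPAbet vowel phones carry a trailing stress digit (0, 1 or 2).

```python
import re

# Assumption: vowel phones end in a stress digit (e.g. "AH0", "IY1");
# consonant phones carry no digit.
STRESS_DIGIT = re.compile(r"[012]$")


def remove_stress(phones):
    """Return the phones with any trailing stress digit stripped."""
    return [STRESS_DIGIT.sub("", p) for p in phones]


def remove_stress_from_entries(entries):
    """Strip stress from (word, phones) pairs, dropping duplicates that result.

    The entry shape is hypothetical; it only serves this illustration.
    """
    seen = set()
    result = []
    for word, phones in entries:
        key = (word, tuple(remove_stress(phones)))
        if key not in seen:  # keep only the first copy of each pronunciation
            seen.add(key)
            result.append((word, list(key[1])))
    return result


entries = [
    ("record", ["R", "EH1", "K", "ER0", "D"]),
    ("record", ["R", "EH0", "K", "ER0", "D"]),  # differs only in stress
    ("record", ["R", "IH0", "K", "AO1", "R", "D"]),
]
for word, phones in remove_stress_from_entries(entries):
    print(word, " ".join(phones))
```

The first two entries differ only in stress, so they collapse into a single pronunciation once the digits are removed.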

phonesets

incorporate the cmudict.0.7a.phones classification into the accents CSV files
arpabet: support the X phone (for Scottish loch)

metadata

support metadata=@T:name metadata declarations (where T is s for strings, i for integers and f for reals)
support for JSON-based metadata
implement a select=QUERY command to extract data from the dictionary (e.g. for porter stemmer tests)
use format=, accent=, phoneset=, order-from=, encoding= and sorting= file-based metadata when parsing cmudict files
various bugfixes for metadata handling
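
The typed declarations above map a one-letter type code to a value type (`s` for strings, `i` for integers, `f` for reals). The sketch below only illustrates that conversion step; the `CONVERTERS` table and `convert_value` helper are hypothetical names, and the way cmudict-tools actually parses a `metadata=@T:name` declaration is not shown here.

```python
# Hypothetical mapping from the type code T to a Python converter.
CONVERTERS = {
    "s": str,    # strings
    "i": int,    # integers
    "f": float,  # reals
}


def convert_value(type_code, raw):
    """Convert the raw metadata text according to its declared type code."""
    try:
        convert = CONVERTERS[type_code]
    except KeyError:
        raise ValueError("unknown metadata type code: %r" % type_code)
    return convert(raw)


print(convert_value("i", "42"))
print(convert_value("f", "0.5"))
print(convert_value("s", "noun"))
```

Rejecting unknown type codes early keeps a mistyped declaration from silently being treated as a string.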

vim syntax files

support using - as a syllable break in the arpabet phoneset
highlighting support for the acronym files
automatic format detection for current known versions of cmudict (in the ftdetect file)
support for accent=, phoneset= and format= file-based metadata

This checks for common stress patterns to identify errors:

1. A word with only one vowel, where that vowel is weak (schwa). This covers weak forms like 'a', 'the' and 'had'.
2. A non-weak form without a primary-stressed vowel.
3. A non-weak form with more than one primary-stressed vowel.

NOTE: Weak form detection requires 'AX' to be used instead of 'AH0'.
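
The stress patterns described above can be sketched in Python roughly as follows. This is a simplified illustration, not the cmudict-tools code: `classify_stress` is a hypothetical name, the returned strings reuse the `-W` warning names from this pull request, and the sketch assumes stressed vowels end in 0, 1 or 2 while the schwa is written as a bare `AX`.

```python
def classify_stress(phones):
    """Return a warning name for a suspicious stress pattern, else None."""
    # Assumption: a vowel is either the bare schwa "AX" or a phone that ends
    # in a stress digit; everything else is treated as a consonant.
    vowels = [p for p in phones if p == "AX" or p.endswith(("0", "1", "2"))]

    # A single weak (schwa) vowel marks a weak form, which is exempt
    # from the primary stress checks.
    if vowels == ["AX"]:
        return None

    primary = sum(1 for p in vowels if p.endswith("1"))
    if primary == 0:
        return "missing-primary-stress"
    if primary > 1:
        return "multiple-primary-stress"
    return None


print(classify_stress(["DH", "AX"]))               # weak form
print(classify_stress(["HH", "AH0", "L", "OW1"]))  # one primary stress
print(classify_stress(["HH", "AH0", "L", "OW0"]))  # no primary stress
print(classify_stress(["HH", "AH1", "L", "OW1"]))  # two primary stresses
```

Entries with no vowel at all, such as a lone fricative, are not special-cased in this sketch and would be reported as missing a primary stress.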

rhdunn added 30 commits February 23, 2015 21:08

vim: support syllable breaks in the arpabet phoneset

f7063c2

vim: use windows-1252 as the default encoding

89be853

vim: support highlighting acronym files

3876893

vim: support file-based metadata detection

b9eaf40

README: Move the usage option value descriptions.

6786210

This makes it easier to share the descriptions with the VIM syntax file and file-based metadata options.

vim: use FORMAT values for the file format

eb0f043

This uses the `cmudict`, `cmudict-new` and `cmudict-weide` values for `b:cmudict_format`. This also updates the VIM syntax documentation in the README file to use the Configuration Options documentation.

README: use the FORMAT docs for the format metadata values

5c90cf9

README: Fix the table of contents

393fb95

cmudict-tools: support Windows line endings

e01abfe

vim: don't specify cmudict content for new buffers

903ffc8

It is not possible to determine if a new buffer is intended to be a cmudict file.

vim: reworking filetype detection -- 0.1-0.6d

da642ad

vim: pick up cmudict_format from the ftdetect file

89ba90e

vim: detect air-based cmudict files

2e81107

vim: acronym file detection

ba37d56

vim: cmudict-new dictionary detection

105330f

vim: support other cmudict formatted dictionaries

419f835

README: document the new format detection logic

b6630f9

Support type formatted metadata.

4b54a61

Implement a 'select' command.

ac29654

cmudict.py: fix relative paths for metadata

2733002

arpabet: support the X phoneme (for Scottish loch, etc.)

e6aaec3

Fix formatting metadata without a comment.

fb8ea16

README: fix a typo

ba733c8

Generate an error on metadata when no mapping is provided.

24a6e0b

Extend the consonant phoneme type classification.

9b5b4de

This uses the phone classifications from `cmudict.0.7a.phones` for the consonant phones.

Annotate syllabic consonants.

6e4ff43

Support classifying a phoneme's stress type.

cb30c47

Allow single fricative phonemes.

7dca698

In the missing-primary-stress check, allow entries like `SHHH SH` and `ZZZZ Z`. These should really be syllabic consonants, but that would introduce non-standard phones that would only be used in isolated cases.

rhdunn and others added 30 commits April 23, 2016 13:09

Add the pos-tags files to the install set.

8db0fbc

Support using named tagsets for context-format.

3f32b61

Rename SkosValidator to TagsetValidator.

2c5102b

Document the context-format file-based metadata.

e3fba43

README: Fix a typo.

71e3490

Correct the various mapping relationships.

7464492

festlex: update the v_p concept label

6b608f6

cainteoir: correct the Cainteoir Part-of-Speech concept namespace

fa68990

Map between Cainteoir and festlex concepts.

6e69b12

WP20: fix mapping conjunctions to the Part-of-Speech data model.

1cccf0b

WP20: classify existential there as a pronoun.

b9cf46f

WP20: classify of as a preposition.

751664c

WP20: make the verb concept a broad match to the POS verb as it is distinct to the modal verb concept.

b6d23d8

WP20: use skos:relatedMatch instead of skos:related for mapping wp20:in to the POS concepts.

7c4e6d5

upenn: fix the xsd namespace

9f653cb

Remove unused xsd namespace references.

36a91a9

festlex: split out 'vl' and 'y' into separate concepts, so each concept only has one skos:notation

0205540

festlex: better model the concepts that are typos using related and relatedMatch

c594155

Support loading concept mappings from SKOS metadata.

9db2027

Merge commit 'c59415506b55b1362388ffa7664ed47b65516f79'

0b5b20c

Support converting context tagsets (e.g. cainteoir to festlex).

91a604c

Factor out context mapping into a helper function.

36041ac

Implement a remove duplicate contexts option.

a1707ea

Fix printing key/value pairs when the value is a string.

ea83b80

Add standard GNU targets to the Makefile and building documentation.

419297b

Ignore *~ temporary files.

9524335

Don't count errors as words in the statistics.

7f3795c

Add adj@attr and adj@pred concepts to the cainteoir pos tagset.

0d5115c

Merge commit '0d5115c6bb246e6346bfacd21d650dde9228f711'

fae3f42

Let the user provide environment variables for custom prefix installation and python version

fe97e68
