Grammar support for Russian way names #102

Merged 2 commits on Oct 4, 2017
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,10 @@
# Change Log
All notable changes to this project will be documented in this file. For change log formatting, see http://keepachangelog.com/

## master

- Added grammatical cases support for Russian way names [#102](https://github.com/Project-OSRM/osrm-text-instructions/pull/102)

## 0.7.1 2017-09-26

- Added Castilian Spanish localization. [#163](https://github.com/Project-OSRM/osrm-text-instructions/pull/163)
@@ -73,7 +77,7 @@ All notable changes to this project will be documented in this file. For change

## 0.1.0 2016-11-17

- Improve chinese translation
- Improve Chinese translation
- Standardize capitalizeFirstLetter meta key
- Change instructions object customization to options.hooks.tokenizedInstruction

53 changes: 53 additions & 0 deletions Grammar.md
@@ -0,0 +1,53 @@
## Grammar support

Many languages, including most Slavic languages (Russian, Ukrainian, Polish, etc.), the Finnic languages (Finnish, Estonian) and others, have [grammatical cases](https://en.wikipedia.org/wiki/Grammatical_case), which OSRM Text Instructions can support too.
By default, street names are inserted into instructions exactly as they appear in OpenStreetMap, that is, in the [nominative case](https://en.wikipedia.org/wiki/Nominative_case).
To be grammatically correct, a street name should be inflected according to the target language's rules and the instruction's context before it is inserted.

Review comment (Member):
Are the various cases always regular and easy to determine from the nominative case? I noticed a few features have been tagged name:dative in OpenStreetMap – would it be desirable or necessary for OSRM to pass such tags along to OSRM Text Instructions?

Reply (Contributor Author, @yuryleb, Sep 30, 2017):
IMHO that's not a job for OSM: it's impossible to store all case variants in all languages for each name. Perhaps CLDR would be a better place (if it ever happens 😉)


Applying grammatical cases is not a simple or obvious task, because real-life languages are complex.
It is hard enough that, for example, none of the known native Russian navigation systems speak street names in their spoken route instructions at all.

Fortunately, street names draw on a restricted lexicon and follow regular naming rules, so the task can be solved relatively easily for this particular case.

### Implementation details

A fairly universal and simple solution is to transform street names with a prepared set of regular expressions grouped by grammatical case.
The required grammatical case is specified directly in the instruction's substitution variables:

- The `{way_name}` and `{rotary_name}` variables in translated instructions should have the required grammatical case name appended after a colon, for example `{way_name:accusative}`.
- The [languages/grammar](languages/grammar/) folder should contain a language-specific JSON file with regular expressions for each supported grammatical case:
```json
{
    "v5": {
        "accusative": [
            ["^ (\\S+)ая-(\\S+)ая [Уу]лица ", " $1ую-$2ую улицу "],
            ["^ (\\S+)ая [Уу]лица ", " $1ую улицу "],
            ...
```
- All such JSON files should be registered in the common [languages.js](languages.js).
- The instruction text formatter ([index.js](index.js) in this module) should:
  - check the `{way_name}` and `{rotary_name}` variables for an optional grammatical case after a colon, for example `{way_name:accusative}`

Review comment (Member):
I like this syntax – it leaves open the possibility of additional transformations like capitalization, which could replace the "capitalizeFirstLetter" meta property currently set on some localizations.

  - find the regular expression block for the target language and the specified grammatical case
  - call the standard [string replace with regular expression](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace) for each expression in the block, passing the result of each call to the next; before the first call, the original street name should be wrapped in whitespace, which makes matching whole words in names a bit simpler.
- String replacement with regular expressions is available in almost every other programming language, so this should not be a problem for other code that uses only the data from OSRM Text Instructions.
- If no regular expression matches the source name (for example, a name from a foreign country), the original name is returned unchanged. This is also the expected behavior of the standard [string replace with regular expression](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace). The same behavior is expected when the grammar JSON file, or the grammatical case inside it, is missing.
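
The steps above can be sketched as a small standalone function. This is a minimal illustration under stated assumptions, not the module's actual code: the `grammar` object holds only the single-adjective accusative rule from the JSON excerpt, and the names `grammar` and `applyGrammar` are made up for this example.

```javascript
// Hypothetical, trimmed-down grammar data: a single accusative rule taken
// from the excerpt above. A real grammar file would hold many more rules.
var grammar = {
    v5: {
        accusative: [
            ['^ (\\S+)ая [Уу]лица ', ' $1ую улицу ']
        ]
    }
};

// Apply every rule of the requested case in order, feeding each result into
// the next replacement. The name is returned unchanged when the case is not
// defined or when no rule matches.
function applyGrammar(name, grammarCase) {
    var rules = grammar.v5[grammarCase];
    if (!name || !rules) {
        return name;
    }
    // Surround the name with spaces so rules can anchor on whole words
    var padded = ' ' + name + ' ';
    rules.forEach(function(rule) {
        padded = padded.replace(new RegExp(rule[0]), rule[1]);
    });
    return padded.trim();
}

console.log(applyGrammar('Садовая улица', 'accusative')); // Садовую улицу
console.log(applyGrammar('Main Street', 'accusative'));   // Main Street (no rule matches)
console.log(applyGrammar('Садовая улица', 'dative'));     // Садовая улица (case not defined here)
```

Note that the two-word name used in the example section needs additional rules that are elided from the excerpt, so this sketch uses a single-adjective street name instead.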

### Example

The St. Petersburg street _"Большая Монетная улица"_ (roughly _Big Monetary Street_), after processing with the [Russian grammar rules](languages/grammar/ru.json), appears in instructions as follows:
- _"Turn left onto `{way_name}`"_ => `ru`:_"Поверните налево на `{way_name:accusative}`"_ => _"Поверните налево на Большую Монетную улицу"_
- _"Continue onto `{way_name}`"_ => `ru`:_"Продолжите движение по `{way_name:dative}`"_ => _"Продолжите движение по Большой Монетной улице"_
- _"Make a U-turn onto `{way_name}` at the end of the road"_ => `ru`:_"Развернитесь в конце `{way_name:genitive}`"_ => _"Развернитесь в конце Большой Монетной улицы"_
- _"Make a U-turn onto `{way_name}`"_ => `ru`:_"Развернитесь на `{way_name:prepositional}`"_ => _"Развернитесь на Большой Монетной улице"_
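
The substitution in these examples can be sketched as follows. This is an illustration only: the `forms` lookup table stands in for the regular-expression rules (its values are copied from the examples above), `fillTemplate` is a hypothetical helper name, and the token pattern is a simplified assumption rather than the module's own expression.

```javascript
// Inflected forms copied from the examples above; a stand-in for real rules.
var forms = {
    'Большая Монетная улица': {
        accusative: 'Большую Монетную улицу',
        dative: 'Большой Монетной улице',
        genitive: 'Большой Монетной улицы',
        prepositional: 'Большой Монетной улице'
    }
};

// Replace {token} and {token:case} placeholders in a template.
// Unknown tokens are left untouched; unknown names or cases fall back
// to the original (nominative) name.
function fillTemplate(template, tokens) {
    return template.replace(/\{(\w+)(?::(\w+))?\}/g, function(match, tag, grammarCase) {
        var name = tokens[tag];
        if (typeof name === 'undefined') {
            return match;
        }
        var inflected = forms[name] && grammarCase && forms[name][grammarCase];
        return inflected || name;
    });
}

var tokens = {way_name: 'Большая Монетная улица'};
console.log(fillTemplate('Поверните налево на {way_name:accusative}', tokens));
console.log(fillTemplate('Продолжите движение по {way_name:dative}', tokens));
console.log(fillTemplate('Развернитесь в конце {way_name:genitive}', tokens));
```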

### Design goals

- __Cross platform__ - uses the same data-driven approach as OSRM Text Instructions
- __Test suite__ - includes a [prepared test](test/grammar_tests.js) that checks the available expressions automatically, plus an easily extendable pattern for testing language-specific names
- __Customization__ - can be extended to other languages simply by adding new regular expression blocks to the [grammar support](languages/grammar/) folder and adding the necessary grammatical case labels to `{way_name}` and other variables in the translated instructions

### Notes

- The Russian regular expressions are based on the [Garmin Russian TTS voices update](https://github.com/yuryleb/garmin-russian-tts-voices) project; see the [file with regular expressions that are applied to source text before it is pronounced by TTS](https://github.com/yuryleb/garmin-russian-tts-voices/blob/master/src/Pycckuu__Milena%202.10/RULESET.TXT).

Review comment (Member):
Thank you for dedicating this rule set to the public domain!

- There is another module with grammar support, [jquery.i18n](https://github.com/wikimedia/jquery.i18n), but unfortunately its handling of grammatical cases is very poor and is meant to work with single words only.
- It would also be great to get street names in the target language rather than only from the default OSM `name` tag: several multilingual countries maintain `name:<lang>` names for their streets. However, this should be addressed in the [OSRM engine](https://github.com/Project-OSRM/osrm-backend) first.

Review comment (Member):
This is a good idea – please file an issue in osrm-backend.

4 changes: 4 additions & 0 deletions Readme.md
@@ -8,6 +8,10 @@ OSRM Text Instructions transforms [OSRM](http://www.project-osrm.org/) route res

OSRM Text Instructions has been translated into [several languages](https://github.com/Project-OSRM/osrm-text-instructions/tree/master/languages/translations/). Please help us add support for the languages you speak [using Transifex](https://www.transifex.com/project-osrm/osrm-text-instructions/).

OSRM Text Instructions supports [grammatical cases](https://github.com/Project-OSRM/osrm-text-instructions/tree/master/Grammar.md) for street names in [some languages](https://github.com/Project-OSRM/osrm-text-instructions/tree/master/languages/grammar/).

Customization of grammatical cases and other translated strings after they come from [Transifex](https://www.transifex.com/project-osrm/osrm-text-instructions/) is handled by [override scripts](https://github.com/Project-OSRM/osrm-text-instructions/tree/master/languages/overrides/).

[![NPM](https://nodei.co/npm/osrm-text-instructions.png)](https://npmjs.org/package/osrm-text-instructions/)

### Design goals
35 changes: 31 additions & 4 deletions index.js
@@ -1,5 +1,6 @@
var languages = require('./languages');
var instructions = languages.instructions;
var grammars = languages.grammars;

module.exports = function(version, _options) {
    var opts = {};
@@ -104,7 +105,6 @@ module.exports = function(version, _options) {
switch (type) {
case 'use lane':
laneInstruction = instructions[language][version].constants.lanes[this.laneConfig(step)];

if (!laneInstruction) {
// If the lane combination is not found, default to continue straight
instructionObject = instructions[language][version]['use lane'].no_lanes;
@@ -199,10 +199,37 @@ module.exports = function(version, _options) {

            return this.tokenize(language, instruction, replaceTokens);
        },
        grammarize: function(language, name, grammar) {
            // Apply grammar rules, if any, to the way/rotary name
            if (name && grammar && grammars && grammars[language] && grammars[language][version]) {
                var rules = grammars[language][version][grammar];
                if (rules) {
                    // Wrap the original name in spaces so the rules' regular expressions are simpler to write
                    var n = ' ' + name + ' ';

Review comment (Member):
It's currently possible for clients to manipulate a step's name before compilation. For example, the Mapbox Directions API currently inserts SSML markup around numbers within the name so that speech synthesizers pronounce them more casually. I think that could potentially interfere with this feature. That would be a good argument for implementing #52, so that clients can reliably manipulate the name after grammaticalization.

Reply (Contributor Author):
No problem, as long as clients don't damage the {way_name:*} variables. For example, the "Restore names highlighting" PR adds HTML tags around these variables and everything works as in the snapshot above.
But if Mapbox Directions inserts SSML tags directly in place of {way_name}, that looks wrong. For example, Nuance uses \tn=address\ tags in its Vocalizer voices to mark the start of address text, specifically so that numbers and known street abbreviations are pronounced as part of an address, but it doesn't change the original address/name text.

Reply (Contributor Author, @yuryleb, Sep 30, 2017):
But actually it's no problem even if Mapbox catches and replaces {way_name} and skips {way_name:accusative} etc.: that still lets the later "grammarization" work 😉 SSML tags really matter only for English addresses; Russian and many other languages have no special requirements for them.

Reply (Member, @1ec5, Oct 2, 2017):
I’m referring to a JavaScript port of mapbox/mapbox-navigation-ios#552 for the Directions API’s (not yet documented) voice_instructions parameter. Essentially we’re working around a problem where Amazon Polly assumes an address is abbreviated, whereas OSM never abbreviates road names. The workaround is to wrap only numbers in <say-as interpret-as="address"> tags, since we still want numbers to be optimized for speech even if the rest of the words can’t be treated as part of an address. For the JavaScript port, we’re changing the value of way_name itself, so that by the time compile() runs, way_name contains SSML code. 😬

I’ll be the first to admit it’s a hack, but I wanted to point this out because clients may be inclined to do similar things to the value of way_name unless we advise against doing so in this project’s documentation. If this PR forces us to implement #52, that’d be a wonderful outcome in my opinion. 😉

/cc @allierowan

                    var flags = grammars[language].meta.regExpFlags || '';
                    rules.forEach(function(rule) {
                        var re = new RegExp(rule[0], flags);
                        n = n.replace(re, rule[1]);
                    });

                    return n.trim();
                }
            }

            return name;
        },
        tokenize: function(language, instruction, tokens) {
            var output = Object.keys(tokens).reduce(function(memo, token) {
                return memo.replace('{' + token + '}', tokens[token]);
            }, instruction)
            // Keep this function's context for use in the inline function below (no arrow functions in ES5)
            var that = this;
            var output = instruction.replace(/\{(\w+):?(\w+)?\}/g, function(token, tag, grammar) {
                var name = tokens[tag];
                if (typeof name !== 'undefined') {
                    return that.grammarize(language, name, grammar);
                }

                // Return unknown token unchanged
                return token;
            })
                .replace(/ {2}/g, ' '); // remove excess spaces

            if (instructions[language].meta.capitalizeFirstLetter) {
12 changes: 10 additions & 2 deletions languages.js
@@ -1,4 +1,4 @@
// Load all language files excplicitely to allow integration
// Load all language files explicitly to allow integration
// with bundling tools like webpack and browserify
var instructionsDe = require('./languages/translations/de.json');
var instructionsEn = require('./languages/translations/en.json');
@@ -19,6 +19,8 @@ var instructionsUk = require('./languages/translations/uk.json');
var instructionsVi = require('./languages/translations/vi.json');
var instructionsZhHans = require('./languages/translations/zh-Hans.json');

// Load all grammar files
var grammarRu = require('./languages/grammar/ru.json');

// Create a list of supported codes
var instructions = {
@@ -42,7 +44,13 @@ var instructions = {
    'zh-Hans': instructionsZhHans
};

// Create list of supported grammar
var grammars = {
    'ru': grammarRu
};

module.exports = {
    supportedCodes: Object.keys(instructions),
    instructions: instructions
    instructions: instructions,
    grammars: grammars
};