Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grammar support for Russian way names #102

Merged
merged 2 commits into from
Oct 4, 2017
Merged

Conversation

yuryleb
Copy link
Contributor

@yuryleb yuryleb commented Apr 6, 2017

New feature for OSRM Text Instructions - grammatical cases support for way names. This is quite important for many languages (Russian, Ukrainian, Polish, Bulgarian, Finnish, Estonian and many others) - otherwise translated strings with inserted way names (in nominative case as in OSM) look grammatically incorrect in most cases.

This feature extends {way_name} and {rotary_name} in instructions with optional grammar case label and so requires update of other applications used OSRM Text Instructions data (Ruby #26 and Swift #53).

Unfortunately this proposal is incompatible with Transifex (see #101) and so can't be easily merged. This PR is supposed mostly for proposed changes review in details and to illustrate this problem for public.

@bsudekum bsudekum requested review from 1ec5 and freenerd April 6, 2017 16:10
@freenerd
Copy link
Member

Thanks @yuryleb. As mentioned before, we need to do some more thinking here.

Blockers seem to be:

@yuryleb yuryleb changed the title [WIP] Grammar support for way names (for Russian instructions as example) Grammar support for Russian way names Sep 27, 2017
@yuryleb
Copy link
Contributor Author

yuryleb commented Sep 27, 2017

@freenerd, @1ec5, @oxidase, finally I solved to add grammatical case labels in new languages\overrides\ru.js script to post-process Transifex localization data - translated strings parsing seems quite simple and could be easily patched for new context phrases or even re-used for other languages if it will be necessary.

I don't think Transifex or any other localization service will allow arbitrary keywords expansion. Actually languages\overrides post-processing used by OSRMTI for final customization looks the better idea.

So let's meet the first route instructions with proper grammatical cases at least in Russian 🎉
osrmti-grammar

BTW other OSRMTI-related projects (osrm-text-instructions.swift, osrm-text-instructions.java, etc?) should also be updated to expect {way_name:*} and {rotary_name:*} variables at least or completely support grammatical cases (with way described in Grammar.md) at last.

Copy link
Member

@1ec5 1ec5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I like the general approach. It looks like there's a to-do or two in here, so I'll hold off on a final review until you're ready.


Many languages - all Slavic (Russian, Ukrainian, Polish, Bulgarian, etc), Finnic (Finnish, Estonian) and others - have [grammatical case feature](https://en.wikipedia.org/wiki/Grammatical_case) that could be supported in OSRM Text Instructions too.
Originally street names are being inserted into instructions as they're in OSM map - in [nominative case](https://en.wikipedia.org/wiki/Nominative_case).
To be grammatically correct, street names should be changed according to target language rules and instruction context before insertion.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the various cases always regular and easy to determine from the nominative case? I noticed a few features have been tagged name:dative in OpenStreetMap – would it be desirable or necessary for OSRM to pass such tags along to OSRM Text Instructions?

Copy link
Contributor Author

@yuryleb yuryleb Sep 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it's not deal of OSM - it's impossible to store all case variants in all languages for each name. Perhaps CLDR will be the better place (if will 😉 )


- Russian regular expressions are based on [Garmin Russian TTS voices update](https://github.com/yuryleb/garmin-russian-tts-voices) project; see [file with regular expressions to apply to source text before pronouncing by TTS](https://github.com/yuryleb/garmin-russian-tts-voices/blob/master/src/Pycckuu__Milena%202.10/RULESET.TXT).
- There is another grammar-supporting module - [jquery.i18n](https://github.com/wikimedia/jquery.i18n) - but unfortunately it has very poor implementation in part of grammatical case applying and is supposed to work with single words only.
- Actually it would be great to get street names also in target language not from default OSM `name` only - there are several multi-lingual countries supporting several `name:<lang>` names for streets. But this the subject to address to [OSRM engine](https://github.com/Project-OSRM/osrm-backend) first.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea – please file an issue in osrm-backend.

```
- All such JSON files should be registered in common [languages.js](languages.js)
- Instruction text formatter ([index.js](index.js) in this module) should:
- check `{way_name}` and `{rotary_name}` variables for optional grammar case after colon: `{way_name:accusative}`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this syntax – it leaves open the possibility of additional transformations like capitalization, which could replace the "capitalizeFirstLetter" meta property currently set on some localizations.


### Notes

- Russian regular expressions are based on [Garmin Russian TTS voices update](https://github.com/yuryleb/garmin-russian-tts-voices) project; see [file with regular expressions to apply to source text before pronouncing by TTS](https://github.com/yuryleb/garmin-russian-tts-voices/blob/master/src/Pycckuu__Milena%202.10/RULESET.TXT).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for dedicating this rule set to the public domain!

var rules = grammars[language][version][grammar];
if (rules) {
// Pass original name to rules' regular expressions enclosed with spaces for simplier parsing
var n = ' ' + name + ' ';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's currently possible for clients to manipulate a step's name before compilation. For example, the Mapbox Directions API currently inserts SSML markup around numbers within the name so that speech synthesizers pronounce them more casually. I think that could potentially interfere with this feature. That would be a good argument for implementing #52, so that clients can reliably manipulate the name after grammaticalization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, if clients will not damage {way_name:*} vars - for example, "Restore names highlighting" PR adds HTML tags around these vars and all works as on snapshot above.
But if Mapbox Directions inserts SSML tags right instead {way_name}, this looks wrong - for example, Nuance uses \tn=address\ tags in its Vocalizer voices to specify start of address text specially to pronounce numbers and known street abbreviations as address but doesn't change original address/name text.

Copy link
Contributor Author

@yuryleb yuryleb Sep 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But actually it's no problem, even if Mapbox will catch and replace {way_name} and skip {way_name:accusative} etc - this allows further "grammarization" to work 😉 SSML tags are actually important only for English addresses, Russian & many others have no special requirements for them.

Copy link
Member

@1ec5 1ec5 Oct 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m referring to a JavaScript port of mapbox/mapbox-navigation-ios#552 for the Directions API’s (not yet documented) voice_instructions parameter. Essentially we’re working around a problem where Amazon Polly assumes an address is abbreviated, whereas OSM never abbreviates road names. The workaround is to wrap only numbers in <say-as interpret-as="address"> tags, since we still want numbers to be optimized for speech even if the rest of the words can’t be treated as part of an address. For the JavaScript port, we’re changing the value of way_name itself, so that by the time compile() runs, way_name contains SSML code. 😬

I’ll be the first to admit it’s a hack, but I wanted to point this out because clients may be inclined to do similar things to the value of way_name unless we advise against doing so in this project’s documentation. If this PR forces us to implement #52, that’d be a wonderful outcome in my opinion. 😉

/cc @allierowan

index.js Outdated
// Pass original name to rules' regular expressions enclosed with spaces for simplier parsing
var n = ' ' + name + ' ';
rules.forEach(function(rule) {
var re = new RegExp(rule[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these rules be case insensitive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually no (and no for g option for all replaces not only for the first match), at least for Russian OSM street names this is unnecessary - there are many working street names validators those keep them compliant to Russian street names agreement. And also all current grammar regular expressions take into account potentially capital first word letter.

But for general case it looks useful to add some meta block into grammar JSON to specify required regular expression options explicitly.

['Третье Транспортное кольцо', 'genitive', 'Третьего Транспортного кольца'],
['Третье Транспортное кольцо', 'prepositional', 'Третьем Транспортном кольце']
]
// TODO add your language grammar tests to call grammarize() and check result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what the code below does, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the place to add other language's test strings to check other language's grammar when it will be.

@1ec5
Copy link
Member

1ec5 commented Oct 4, 2017

Thanks @yuryleb! I’m going to merge this PR now, but feel free to iterate on the feature in separate PRs if you have any additional changes to make.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants