
Localization Proposal

fred-wang edited this page Feb 21, 2013 · 11 revisions

Localization Proposal -- Draft

Goals

The main goal is to make it possible to present MathJax's user interface elements in languages other than English. This includes things like the MathJax menu, the About MathJax dialog, the loading messages, and the various error messages produced by the input jax. This document describes a proposal for the underlying code and data structures for implementing this in MathJax.

The code must be able to handle the following:

  • expressions with substitution values (e.g., "file xxx not found")
  • plural forms (e.g. "loaded xx file" versus "loaded xx files")
  • number localization (e.g. "100%" versus "۱۰۰٪")
  • multiple forms for a word (e.g., "Post" as a verb versus "Post" as a noun)
  • HTML-snippets as defined in MathJax (since many dialogs are constructed from these)
  • fallback to English when translations are not available
  • translations for dynamically loaded components
  • components that may not all come from the same location
  • third-party translations

The mechanism for specifying the selected language has yet to be determined, but the page author should be able to give a default language, and users should be able to override that if they choose.

Overview

A new Localization object will be added to the MathJax variable to handle localization functions. This will include the data needed for the translations into the selected language, the methods to be called for obtaining those translations, and the methods needed for loading and registering translations.

Currently all messages used in MathJax are in English, and the text of these messages is usually hard-coded as literal strings at the locations where the messages are used. (Some messages are constructed on the fly from smaller pieces. These messages may need to be handled differently to allow for easier translation.) This is convenient since it is easy to see what message will be produced at any particular point, but in order to allow MathJax to be localized, these strings will need to be replaced by function calls that obtain the translation appropriate for the selected language.

One approach would be to use these message strings as the keys for looking up the translations, but this would make it harder to modify the English messages if rewording were required, or if spelling errors were found. Instead, each message will have an ID string that will be used to identify the phrase so that the English can be changed without requiring all the translation files to be modified to reflect the change. This also has the advantage that the same word or phrase, when used in different ways, can have different identifiers, so "Post" as a verb and "Post" as a noun can be translated differently, if necessary.

Getting a Translated String

The basic means of obtaining the string to use for a message to display to the user is to call the _() method of the MathJax.Localization object, passing the string id and the English phrase. For example,

MathJax.Message.Set("Typesetting complete");

could be replaced by

MathJax.Message.Set(_("TC","Typesetting complete"));

where "TC" is the identifier for the message "Typesetting complete", and provided you have defined

var _ = function () {return MathJax.Localization._.apply(MathJax.Localization,arguments)}

earlier. (Since most of MathJax is defined within a function closure, making such function shortcuts is straightforward.)

The advantage of having both the identifier and the English string together is that

  1. You still can see the actual English message at the location in the code where it is used.
  2. The English version is available to use as a fallback if the phrase has not been translated into the selected language.
  3. The English translation doesn't need to be loaded separately (i.e., you don't need to load two language files, the selected one, plus English for fallback, and English users won't need to download any language files at all).
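As a rough illustration of the lookup-with-fallback behavior described above, the following sketch uses a stand-in `Localization` object rather than the real `MathJax.Localization`. The flat `strings` layout (locale, then id) and the French text are assumptions made only for this example.

```javascript
// Hypothetical stand-in for MathJax.Localization (not the real object).
var Localization = {
  locale: "en",
  // Assumed layout for this sketch: locale -> message id -> translation.
  strings: {
    fr: { TC: "Composition terminée" }   // illustrative translation only
  },
  // Return the translation for id, or the English phrase as the fallback.
  _: function (id, phrase) {
    var data = this.strings[this.locale];
    if (data && Object.prototype.hasOwnProperty.call(data, id)) return data[id];
    return phrase;
  }
};

// Local shortcut, written the same way as in the proposal.
var _ = function () {return Localization._.apply(Localization, arguments)};

Localization.locale = "fr";
var translated = _("TC", "Typesetting complete");  // found in the French data
var fallback = _("fnf", "File not found");         // no French entry, so English is used
```

Because the English phrase travels with the call, no English data file is consulted at any point.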

Ids and Domains

Using short identifiers can lead to collisions if not handled carefully. To help avoid this, we introduce identifier domains that are used to isolate collections of identifiers for one component of MathJax from those for another component. For example, each input jax could have its own domain, as could each extension. This means you only have to worry about collisions within your own domain, and so can more easily manage the uniqueness of the ids in use.

To use a domain with your id, pass _() an array consisting of the domain and the id in place of the id. For example, the TeX input jax could use

TEX.Error(_(["TeX","mb"],"Missing Close Brace"));

to get the message with id "mb" in the domain "TeX". Note that the local definition for _() within the TeX input jax could be

var _ = function (id) {return MathJax.Localization._.apply(MathJax.Localization,[ ["TeX",id] ].concat([].slice.call(arguments,1)))};

in which case the message above could become

TEX.Error(_("mb","Missing Close Brace"));

This lets you avoid having to repeat the domain within every call to _() in the input jax. (It would also be possible for TEX.Error() to call _() for you, but see below for information about obtaining the translation data.)

The default domain is "_".
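The domain handling described above might look roughly like the following. This is a sketch under assumptions: the `Localization` object is a stand-in, the locale/domain/id nesting is simplified, and `domainShortcut` is a hypothetical helper that is not part of the proposal.

```javascript
// Hypothetical stand-in for MathJax.Localization; the data layout is assumed.
var Localization = {
  locale: "fr",
  strings: {
    fr: {
      "_": { fnf: "Fichier %1 non trouvé" },
      TeX: { mb: "Accolade de fermeture manquante" }
    }
  },
  // id is either a string (default domain "_") or a [domain, id] pair.
  _: function (id, phrase) {
    var domain = "_";
    if (Array.isArray(id)) { domain = id[0]; id = id[1]; }
    var table = (this.strings[this.locale] || {})[domain] || {};
    return Object.prototype.hasOwnProperty.call(table, id) ? table[id] : phrase;
  }
};

// Hypothetical helper: build a local _() that always uses one domain.
function domainShortcut(domain) {
  return function (id) {
    var args = [[domain, id]].concat([].slice.call(arguments, 1));
    return Localization._.apply(Localization, args);
  };
}

var _ = domainShortcut("TeX");
var message = _("mb", "Missing Close Brace");   // looked up as ["TeX","mb"]
```

The same id can then exist in two domains without conflict, since each lookup is confined to one domain's table.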

Substitutions

Many messages need to include words that are not available until run time (like file names, or a token that is causing an error, etc.). To include such values in a message, pass the values to _() following the main message string, and use %1, %2, etc., within the message to indicate where to put the additional strings. For example

MathJax.Message.Set(_("fnf","File %1 not found",file));

or

TEX.Error(_("sw","'%1' seen where '%2' was expected",token,delimiter));

Note that the extra arguments can be used in any order (in particular, a translation may put them in a different order), so

TEX.Error(_("sw","'%2' was expected where '%1' was seen",token,delimiter));

would also be valid.

Although it would be rare to need more than 9 additional parameters, you can use %10, %11, etc., to get the 10th, 11th, and so on. If you need a parameter to be followed directly by a digit, use %{1}0 rather than %10.

A % followed by a non-number (and not matching %\{(\d+|plural:%\d+([+-]\d+|(%\{\d+\}|%.|.)*?)\} as a regular expression) generates just the character following the percent, so %% is a literal %, and %: would generate just :.
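A minimal sketch of these substitution rules follows. The function name `substitute` is hypothetical, and plural escapes are deliberately left out, so this is only an approximation of what `_()` would do after it has found the translated string.

```javascript
// Replace %1, %2, ..., %{1}, and %<char> escapes in a message.
// Sketch only: %{plural:...} escapes are not handled here.
function substitute(message, args) {
  return message.replace(/%(\d+|\{(\d+)\}|.)/g, function (match, body, braced) {
    if (braced !== undefined) return String(args[parseInt(braced, 10) - 1]);
    if (/^\d+$/.test(body)) return String(args[parseInt(body, 10) - 1]);
    return body;   // %% gives %, and %: gives :
  });
}

var notFound = substitute("File %1 not found", ["a.js"]);
var reordered = substitute("'%2' was expected where '%1' was seen", ["x", "}"]);
var percent = substitute("%{1}0%%", [5]);   // parameter 1 followed by a literal 0 and %
```

Parameters are 1-based in the message but 0-based in the argument array, hence the `- 1` in both branches.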

Plural Forms

If a message must be represented differently depending on a particular numeric value (say to distinguish between "1 file loaded" and "2 files loaded"), the word or words that need changing can be encoded using a special escape sequence, as in the following

MathJax.Message.Set(_("fl","%1 %{plural:%1|file|files} loaded",n));

The %{plural:%1|file|files} specifies the value that controls the plural form (here %1 indicates the first argument after the message string, which is n in the example), and the | characters separate the variations. This format allows the message to depend on multiple values, as in

MathJax.Message.Set(_("fl", "%1 %{plural:%1|file|files} loaded and %2 %{plural:%2|image|images} displayed.", n, m));

Note that the translation string may contain such constructs even if the original English one doesn't. For example

_("alone","We are %1 in this family but alone in this World.", n)

could be translated into French by

"Nous sommes %1 dans cette famille mais %{plural:%1|seul|seuls} en ce monde."

Note that if one of the options for the plural forms requires a literal close brace, it can be quoted with a percent:

%{plural:%1|One {only%}|Two {or more%}}

would produce One {only} when the first argument is 1, and Two {or more} otherwise. If a string needs to include a literal string that looks like one of these selectors, the original % can be quoted. So "%%{plural:%%1|A|B}" would be the literal string %{plural:%1|A|B}.
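To make the escape syntax concrete, here is a sketch of a formatter that handles parameters, plural escapes, and percent quoting in one left-to-right pass. The names `format` and `englishPlural` are hypothetical, the plural function is assumed to return a 1-based form index, and leaving an escape unexpanded when the index is out of range is an assumption of this sketch. Nested plural escapes are not supported.

```javascript
// Assumed English rule: form 1 when n is 1, form 2 otherwise.
function englishPlural(n) { return n == 1 ? 1 : 2; }

// Expand %N, %{N}, %{plural:%N|a|b} and %<char> escapes (sketch only).
function format(message, args, plural) {
  var out = "", i = 0, m;
  while (i < message.length) {
    var rest = message.slice(i);
    if (message.charAt(i) !== "%") {
      out += message.charAt(i); i += 1;
    } else if ((m = /^%\{plural:%(\d+)\|/.exec(rest))) {
      // Split the options on unquoted "|" up to the first unquoted "}".
      var options = [""], j = i + m[0].length;
      while (j < message.length) {
        var c = message.charAt(j), last = options.length - 1;
        var braced = /^%\{\d+\}/.exec(message.slice(j));
        if (braced) { options[last] += braced[0]; j += braced[0].length; }
        else if (c === "%") { options[last] += message.slice(j, j + 2); j += 2; }
        else if (c === "|") { options.push(""); j += 1; }
        else if (c === "}") { j += 1; break; }
        else { options[last] += c; j += 1; }
      }
      var index = plural(Number(args[m[1] - 1]));   // 1-based form index
      if (index >= 1 && index <= options.length) {
        out += format(options[index - 1], args, plural);
      } else {
        out += message.slice(i, j);   // assumption: leave the escape visible
      }
      i = j;
    } else if ((m = /^%(\d+)/.exec(rest)) || (m = /^%\{(\d+)\}/.exec(rest))) {
      out += String(args[m[1] - 1]); i += m[0].length;
    } else {
      out += message.charAt(i + 1); i += 2;   // %% gives %, %} gives }
    }
  }
  return out;
}

var one = format("%1 %{plural:%1|file|files} loaded", [1], englishPlural);
var many = format("%1 %{plural:%1|file|files} loaded", [3], englishPlural);
var quoted = format("%{plural:%1|One {only%}|Two {or more%}}", [1], englishPlural);
```

The chosen option is formatted recursively, so an option may itself contain %1 or a quoted %}.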

fred: The treatment for plurals is to use the value n of the argument %1 (an arbitrary real number) to determine which variation to use. In English, the singular form is used if n=1, and the plural form is used for any other value of n. The rules can be much more complex in other languages, so for consistency with other formats, and to use something localizers are already familiar with, we will follow the CLDR rules.

Each language has mnemonic names for its plural forms and a way to map n to these names. For example:

  • English, n=1 maps to the singular form "one" and other values to the plural form "other".
  • French also has two plural forms, but the mapping is different, 0 <= n < 2 maps to the singular form "one" and other values to the plural form "other".
  • Welsh has six plural forms: "zero" (n=0), "one" (n=1), "two" (n=2), "few" (n=3), "many" (n=6), "other".
  • Polish has four plural forms: "one" (n=1), "few" (n mod 10 is 2, 3 or 4 and n mod 100 is not 12, 13, 14), "many" (n is not 1 and n mod 10 is 0 or 1, or n mod 10 is 5, 6, 7, 8, 9, or n mod 100 is 12, 13, 14) and "other".
  • and so on...

It is up to the localizers to ensure that all the forms for their languages are specified in the translated strings. The mapping from n to the form index will be implemented in the localization data (see below) of each language. If the index is out-of-range (perhaps because a plural rule was forgotten), the plural rule is ignored so that localizers can realize the mistake. For example, the default is the English mapping

plural: function(n) {
  if (n == 1) return 1;
  return 2;
}

while the French and Polish would be

// French: 0 <= n < 2 selects "one", anything else "other"
plural: function(n) {
  if (0 <= n && n < 2) return 1;
  return 2;
}

// Polish: "one", "few", "many", "other"
plural: function(n) {
  if (n == 1) return 1;
  if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 12 || n % 100 > 14)) return 2;
  if ((n % 10 >= 0 && n % 10 <= 1) || (n % 10 >= 5 && n % 10 <= 9) || (n % 100 >= 12 && n % 100 <= 14)) return 3;
  return 4;
}
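The following sketch shows how such per-language functions could be used to pick a form. The `pluralRules` table and `chooseForm` helper are hypothetical, as is returning null when too few forms were supplied; the proposal stores each function in that language's localization data and only says that an out-of-range index causes the plural rule to be ignored.

```javascript
// Plural selectors keyed by locale (hypothetical table for this sketch).
// Each returns a 1-based form index, following the rules listed above.
var pluralRules = {
  en: function (n) { return n == 1 ? 1 : 2; },
  fr: function (n) { return (0 <= n && n < 2) ? 1 : 2; },
  pl: function (n) {
    var d = n % 10, h = n % 100;
    if (n == 1) return 1;
    if (d >= 2 && d <= 4 && (h < 12 || h > 14)) return 2;
    if ((d >= 0 && d <= 1) || (d >= 5 && d <= 9) || (h >= 12 && h <= 14)) return 3;
    return 4;
  }
};

// Pick the form for n, or null when the translator supplied too few forms.
function chooseForm(locale, n, forms) {
  var index = pluralRules[locale](n);
  return (index >= 1 && index <= forms.length) ? forms[index - 1] : null;
}

var polishForms = ["plik", "pliki", "plików"];   // "other" form deliberately omitted
var few = chooseForm("pl", 22, polishForms);     // 22 ends in 2 and is not 12..14
var many = chooseForm("pl", 12, polishForms);    // 12 is in the 12..14 exception
var missing = chooseForm("pl", 1.5, polishForms); // maps to form 4, which is absent
```

Returning a distinguishable value for the missing form is one way a localizer could notice the gap.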

Below is Davide's initial proposal

The usual treatment for plurals is that the value after the colon is treated as an index into the array of options separated by vertical bars, and if the index is outside the range of the choices, the last choice is used. So

_("om","%{plural:%1|One|Many}",n)

would return One if n is 1, and Many if n is anything else (including 0 or negative numbers).

MathJax.Message.Set(_("fl",[n,"%1 file loaded","%1 files loaded"],n));

If you need a different value for 0, for example, you could use something like

MathJax.Message.Set(_("fl","%{plural:%1+1|No files|%1 file|%1 files} loaded",n));

That is, the specification for the value matches %(\d+)([+-]\d+)? as a regular expression, and the sum of the two numbers is used as the index into the array of choices. The second number acts as a "shift" that determines what the index is for the initial choice (note that it can be negative, as in %1-3).
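The index-with-shift rule from this initial proposal can be sketched as below. The helper name `pickOption` is hypothetical, and integer argument values are assumed.

```javascript
// Resolve a selector such as "%1" or "%1+1" against the arguments and pick
// an option; an out-of-range index falls back to the last option.
function pickOption(selector, options, args) {
  var m = /^%(\d+)([+-]\d+)?$/.exec(selector);
  if (!m) throw new Error("Bad plural selector: " + selector);
  var index = Number(args[m[1] - 1]) + Number(m[2] || 0);   // value plus shift
  if (index < 1 || index > options.length) index = options.length;
  return options[index - 1];
}

var choices = ["No files", "%1 file", "%1 files"];
var zero = pickOption("%1+1", choices, [0]);   // 0 + 1 selects the first option
var seven = pickOption("%1+1", choices, [7]);  // 8 is out of range, so the last option
```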

Some languages have a more complex means of determining forms. For instance, Polish has different forms for 1, 2 through 4, 5 through 21, 22 through 24, 25 through 31, and so on (see the GNU gettext documentation for more examples). So the plural escape must be more complex for these languages. One approach would be to allow the language files to provide their own routine that implements the selection of the form. The routine would be passed the value and the array and would return the proper one. That way, any special treatment could be done on a language-by-language basis. Alternatively, there could be data describing the value-to-index transformation needed for the language.

Numbers

Numbers must be localized in some languages, e.g., to use Arabic digits. As with plural forms, the localization data will contain a "number" function to do that conversion. MathJax will call this function when substituting numeric arguments. For example, for the French localization:

number: function (n)
{
   // convert first, so that numeric arguments are handled as well as strings
   return String(n).replace(".", ",");
}

will use a comma instead of a period as the decimal separator. Then _("sum","%1 + %2 = %3", 5.3, 2.45, 7.75) will be localized into "5,3 + 2,45 = 7,75". See https://github.com/wikimedia/jquery.i18n/blob/master/src/jquery.i18n.language.js#L670 for other languages to consider.
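A sketch of how the number hook could be applied during substitution is given below. The `fr` object and the `substituteNumbers` function are hypothetical, and only plain %1-style parameters are handled.

```javascript
// Hypothetical French localization data: only the number() hook is shown.
var fr = {
  number: function (n) { return String(n).replace(".", ","); }
};

// Substitute %1, %2, ... and pass numeric arguments through number().
function substituteNumbers(message, args, data) {
  return message.replace(/%(\d+)/g, function (match, k) {
    var value = args[k - 1];
    return typeof value === "number" ? data.number(value) : String(value);
  });
}

var sum = substituteNumbers("%1 + %2 = %3", [5.3, 2.45, 7.75], fr);
var mixed = substituteNumbers("%1: %2", ["v1.0", 0.5], fr);   // strings are left alone
```

Only arguments that are actual numbers are localized, so a string such as a file name keeps its periods.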

HTML Snippets

A number of the dialogs used in MathJax are defined using HTML snippets, which allow you to encode an HTML DOM fragment using JavaScript objects. These can include things like bold and italic indicators, as well as other styling or layout. While it is possible to break these into pieces to pass to _() separately, it may be better to allow the translator to translate the complete snippet, so that styling and layout can be properly adjusted for the target language. Thus _() allows a complete HTML snippet in place of the message string (and will return an HTML snippet rather than a string literal). E.g.,

MathJax.HTML.Element("span",{},_("dtn",["Do this ",["b",null,["now!"]]]));

would get the translation for the snippet (that is effectively Do this <b>now!</b>) and put it in a <span>.

Note that parameter substitution and plural form substitution are performed on the strings of the snippet that will become text in the DOM fragment that is generated from the snippet.
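One way to apply substitutions only to the text parts of a snippet is to walk it recursively, as in this sketch. It assumes each snippet item is either a string or an array of the form [tag, attributes, contents], and `mapSnippetText` is a hypothetical helper name.

```javascript
// Apply fn to every text string in a snippet, leaving tags and attributes alone.
function mapSnippetText(snippet, fn) {
  return snippet.map(function (item) {
    if (typeof item === "string") return fn(item);
    // item is [tag, attributes, contents]; contents is itself a snippet.
    var copy = item.slice();
    if (Array.isArray(copy[2])) copy[2] = mapSnippetText(copy[2], fn);
    return copy;
  });
}

var snippet = ["File ", ["b", null, ["%1"]], " not found"];
var filled = mapSnippetText(snippet, function (text) {
  return text.replace(/%1/g, "a.js");
});
```

The original snippet is not modified, which matters if the same translated snippet is reused with different arguments.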

Specifying a Form

Some words or phrases may be used in more than one way, and these may require different translations. For example, "Post" may be used as a verb as a button label, while "Post" as a noun could refer to a blog post. These may need to be translated into different words or phrases in another language. Since a translator will be presented with the same word ("Post") in both cases, the translator may need more help in determining how the word will be used. Special comments can be used preceding the line containing the _() call to inform the translator of any such ambiguities. This may be especially important for short or single-word phrases. The format of the comments has yet to be worked out, but the data will be collected by the program that collects the data for the translation files (see the section on translation files below). For example

// Translation::form: noun
_("pn","Post")

or

// Translation::form: verb
_("pv","Post")

Note that the id is different for these two, so there will be two values for the translator; the form tells the translator how the word is used. The value for form can be anything that will help the translator figure out how best to translate the word, e.g.,

// Translation::form: column name
_("pcol","Post")

[Additional meta-data could be supplied this way, but I'm not sure what that might be. It's good to be flexible, however, as I suspect we will find that the situation is more complicated once we get some actual languages involved.]

The Localization Data

The MathJax.Localization object holds the data for the various translations, as well as the service routines for adding to the translations, and retrieving translations.

Methods

The methods in MathJax.Localization include:

_(id,message[,form][,arguments])
The function described in detail above that returns the translated string for a given id.
setLocale(locale)
Sets the selected locale to the given one, e.g. MathJax.Localization.setLocale("fr");
addTranslation(locale,domain,def)
Defines (or adds to) the translation data for the given locale and domain. The def is the definition to be merged with the current translation data (if it exists) or to be used as the complete definition (if not). The data format is described below.
fontFamily()
Get the font-family needed to display text in the selected language. Returns null if no special font is required.
plural(n)
The method that returns the correct plural form for the value n. See the [CLDR rules](http://unicode.org/cldr/charts/supplemental/language_plural_rules.html) above.
number(n)
The method that returns the localized version of the string n representing a number.

Properties

locale
The currently selected locale, e.g., "fr". This is set by the setLocale() method, and should not be modified by hand.
directory
The URL for the localization data files. This can be overridden for individual languages or domains (see below). The default is [MathJax]/localization.
strings
This is the main data structure that holds the translation strings. It consists of an entry for each language that MathJax knows about, e.g., there would be an entry with key `fr` whose value is the data for the French translation. Initially, these simply reference the files that define the translation data, which MathJax will load when needed. After the file is loaded, they will contain the translation data as well. This is described in more detail below.

Translation Data

Each language has its own data in the MathJax.Localization.strings structure. This structure holds data about the translation, plus the translated strings for each domain.

A typical example might be

fr: {
  version: "1.0",
  directory: "[MathJax]/localization/fr",    // optional
  file: "fr.js",                             // optional
  isLoaded: true,                            // set when loaded
  font: "...",                               // optional
  meta: {
    translator: "...",                       // other metadata could be added
  },
  plural: function (n,str) {...},            // optional implementation of plural forms
  domains: {
    hub: {
      version: "1.0",
      file: "http://somecompany.com/MathJax/localization/fr/hub.js",  // optional
      isLoaded: true,
      strings: {
        fnf: "File '%1' not found",
        fl: ["%1 file loaded","%1 files loaded"],
        ...
      }
    },
    TeX: {
      ...
    },
    "_": {
      ...
    },
    ...
  }
}

The fields have the following meanings:

version
The version of the translation data.
directory
An optional value that can be used to override the directory where the translation files for this language are stored. The default is to add the locale identifier to the end of `MathJax.Localization.directory`, so the value given in the example above is the default value, and could be omitted.
file
The name of the file containing the translation data for this language. The default is the locale identifier with .js appended, so the value given in the example above is the default value, and could be omitted.
isLoaded
This is set to true when MathJax has loaded the data for this language. Typically, when a language is registered with MathJax, the data file isn't loaded at that point. It will be loaded when it is first needed, and when that happens, this value is set.
font
This is a font-family (or list of font-families) that should be used when text in this language is displayed. If not present, then no special font is needed.
meta
This is an object that contains the meta-data about the translation. Such information can include the name of the translator, the date of the translation, etc. [This may not be needed in the data itself, so perhaps this could be in comments instead.]
plural
This is an optional function that implements the selection of the plural form given an integer value and an array of plural forms. If not present, the default plural selector is used (which returns the n-th element of the array if n is within the range of the array, and the last element otherwise).
domains
This is an object that contains the translation strings for this language, grouped by domain. Each domain has an entry, and its value is an object that contains the translation strings for that domain. The format is described in more detail below.

Domain Data

Each domain for which there are translations has an entry in the locale's domains object. These store the following information:

version
The version of the data for this domain
file
If the domain data is stored in a separate file from the rest of the language's data (e.g., a third-party extension that is not stored on the CDN may have translation data that is provided by the third party), this property tells where to obtain the translation data. In the example above, the data is provided by another company via a complete URL. The default value is the locale's directory with the domain name appended and .js appended to that.
isLoaded
This is set to true when the data file has been loaded.
strings
This is an object that contains the actual translated strings. The keys are the message identifiers described in the section on "Getting a Translated String" above, and the values are the translations, or arrays of translations (see the section on "Plural Forms" above), or translated HTML snippets (see the section on "HTML Snippets" above).
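The default locations described for the directory and file fields could be resolved along these lines. The three helper functions are hypothetical, and `loc` stands in for the `MathJax.Localization` object.

```javascript
// Directory for a locale: its own override, or the global directory plus the locale.
function localeDirectory(loc, locale) {
  var data = loc.strings[locale] || {};
  return data.directory || loc.directory + "/" + locale;
}

// Main data file for a locale: the file name defaults to "<locale>.js".
function localeFile(loc, locale) {
  var data = loc.strings[locale] || {};
  return localeDirectory(loc, locale) + "/" + (data.file || locale + ".js");
}

// Data file for one domain: its own URL, or "<directory>/<domain>.js".
function domainFile(loc, locale, domain) {
  var data = loc.strings[locale] || {};
  var dom = (data.domains || {})[domain] || {};
  return dom.file || localeDirectory(loc, locale) + "/" + domain + ".js";
}

var loc = {
  directory: "[MathJax]/localization",
  strings: {
    fr: {
      domains: {
        hub: { file: "http://somecompany.com/MathJax/localization/fr/hub.js" },
        TeX: {}
      }
    }
  }
};
var mainFile = localeFile(loc, "fr");
var texFile = domainFile(loc, "fr", "TeX");
var hubFile = domainFile(loc, "fr", "hub");
```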

Registering a Translation

Typically, for languages stored on the CDN, MathJax will register the language with a call like

MathJax.Localization.addTranslation("fr",null,{});

which will create an fr entry in the localization data that will be tied to the [MathJax]/localization/fr directory, and the [MathJax]/localization/fr/fr.js file. That directory could contain individual files for the various domains, or the fr.js file could contain combined data that includes the most common domains, leaving only the lesser-used domains in separate files.

An example fr.js file could be

MathJax.Localization.addTranslation("fr",null,{
  version: "1.0",
  meta: {
    translator: "Joe Green"
  },
  domains: {
    "_": {},
    TeX: {},
    Menu: {}
  }
});

This would declare that there are translation files for the _, TeX, and Menu domains, and that these will be loaded individually from their default file names in the default directory of [MathJax]/localization/fr. Other domains will not be translated unless they register themselves via a command like

MathJax.Localization.addTranslation("fr","Zoom",{});

in which case the domain's data file will be loaded automatically when needed.

One could preload translation strings by including them in the fr.js file:

MathJax.Localization.addTranslation("fr",null,{
  version: "1.0",
  meta: {
    translator: "Joe Green"
  },
  domains: {
    "_": {
      isLoaded: true,
      strings: {
        'fnf': "Fichier `%1` non trouvé",
        ...
      }
    },
    TeX: {
      isLoaded: true,
      strings: {
        'mcb': "Accolade de fermeture manquante",
        ...
      }
    },
    Menu: {}
  }
});

Here the _ and TeX strings are preloaded, while the Menu strings will be loaded on demand.

A third party extension could include

MathJax.Localization.addTranslation("fr","myExtension",{
  file: "http://myserver.com/MathJax/localization/myExtension/fr.js"
});

to add French translations for the myExtension domain (used by the extension) so that they would be obtained from the third-party server when needed.

A third party could provide a translation for a language not covered by the MathJax CDN by using

MathJax.Localization.addTranslation("ko",null,{
  directory: "http://mycompany.com/MathJax/localization/ko"
});

and providing a ko.js file in their MathJax/localization/ko directory that defines the details of their translation. If the Korean (ko) locale is selected, MathJax will load http://mycompany.com/MathJax/localization/ko/ko.js and any other domain files when they are needed.
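The merging behavior of addTranslation() might be sketched as follows. This uses a stand-in `Localization` object and a hypothetical recursive `merge` helper; file loading and the isLoaded bookkeeping are left out.

```javascript
// Hypothetical stand-in for MathJax.Localization with only the merge logic.
var Localization = {
  strings: {},
  // Merge def into the data for locale, or for one of its domains.
  addTranslation: function (locale, domain, def) {
    var data = this.strings[locale];
    if (!data) data = this.strings[locale] = {domains: {}};
    var target = data;
    if (domain) {
      if (!data.domains) data.domains = {};
      target = data.domains[domain] = data.domains[domain] || {};
    }
    merge(target, def);
  }
};

// Recursively copy src into dst; plain objects are merged, while strings,
// arrays (plural lists, HTML snippets) and functions are replaced.
function merge(dst, src) {
  for (var key in src) {
    if (!Object.prototype.hasOwnProperty.call(src, key)) continue;
    var value = src[key];
    if (value && typeof value === "object" && !Array.isArray(value)) {
      if (!dst[key] || typeof dst[key] !== "object") dst[key] = {};
      merge(dst[key], value);
    } else {
      dst[key] = value;
    }
  }
  return dst;
}

Localization.addTranslation("fr", null, {version: "1.0", domains: {"_": {}, TeX: {}}});
Localization.addTranslation("fr", "Zoom", {});
Localization.addTranslation("fr", "TeX", {
  isLoaded: true,
  strings: {mcb: "Accolade de fermeture manquante"}
});
```

Later calls add to the existing entry rather than replacing it, so a language file and a separately loaded domain file can both contribute to the same locale.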

The Translation Files

In order to make working with MathJax data convenient for translators, we will need to provide the translation strings in one of the standard formats, like .po for example. We will need to decide on a format (or perhaps formats) that we want to be able to provide for translators. The examples below will use .po, but any format could be chosen.

The usual approach is to have a program that scans the code for the _() calls and builds the data file from that, and that should work with MathJax as well. The .po format supports the domain approach, as well as plural forms, and the idea of multiple forms (e.g., verb versus noun). HTML snippets should be translated into HTML strings, so that

["Do it ",["b",null,["now!"]]]

would become

"Do it <b>now!</b>"

for translation. The translator would produce an HTML version of the phrase with tags in the proper place (which will be translated back into an HTML snippet for use in MathJax at a later point).
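The snippet-to-string direction of this conversion can be sketched as below. It assumes items of the form [tag, attributes, contents] and ignores attributes entirely, which a real converter would need to handle; `snippetToHTML` and `escapeText` are hypothetical names.

```javascript
// Escape the characters that are special in HTML text.
function escapeText(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Turn an HTML snippet into an HTML string (attributes are ignored here).
function snippetToHTML(snippet) {
  return snippet.map(function (item) {
    if (typeof item === "string") return escapeText(item);
    var tag = item[0], contents = item[2] || [];
    return "<" + tag + ">" + snippetToHTML(contents) + "</" + tag + ">";
  }).join("");
}

var html = snippetToHTML(["Do it ", ["b", null, ["now!"]]]);
```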

There are two complications to automating the collection of the strings needing translation. The first is that the use of local definitions for _() hides the use of the domain, so there will need to be special processing to obtain the domains. It may be possible to recognize the local definition (if we use a common syntax for that) so that the domain can be handled automatically. Alternatively, one could use special comments to mark the domain regions so that the collection program will be able to handle the domains properly. The latter is probably more reliable, but takes extra steps to be sure to include the comments.

The second issue is if _() is called from within a routine that has the message strings passed to it (so that the message passed to _() is not a string literal). This would be the case, for example, if TEX.Error() was made to call _() for you. Such shorthands are very convenient (and reduce the code size), so it would be good to be able to accommodate this case as well. One approach would be to use comments again to tell the collector program what other functions to treat like _().

The collector will be run over the various MathJax components that have message strings, and produce an individual .po file for each component. These can be combined to make one large .po file for translators, or translators could handle them individually. Certainly in the case of third-party extensions, their files will be translated separately.
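A very rough sketch of such a collector is shown below. It assumes that each _() call has a string-literal id and message on a single line, and that a form comment applies only to the line immediately following it. The comment format itself has not been settled, so the pattern used here is just the one from the examples above, and `collectMessages` is a hypothetical name. Domains are not handled.

```javascript
// Scan source text for _("id","message") calls and preceding form comments.
function collectMessages(source) {
  var entries = [], form = null;
  source.split("\n").forEach(function (line) {
    var comment = /^\s*\/\/\s*Translation::form:\s*(.+?)\s*$/.exec(line);
    if (comment) { form = comment[1]; return; }
    var call = /\b_\(\s*"([^"]*)"\s*,\s*"((?:[^"\\]|\\.)*)"/g, m;
    while ((m = call.exec(line))) {
      entries.push({id: m[1], message: m[2], form: form});
    }
    form = null;   // a form comment applies only to the next line
  });
  return entries;
}

var source = [
  '// Translation::form: noun',
  '_("pn","Post")',
  'MathJax.Message.Set(_("TC","Typesetting complete"));'
].join("\n");
var entries = collectMessages(source);
```

A line-based regular expression like this would miss calls split across lines or built from variables, which is the second complication described above.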

Once the translations are obtained (as new .po files), we need a second program to turn these into the .js files that MathJax needs, in the formats described above. We may want to have control files that tell which domains to combine in the main language file, and which to make as individual domain files.
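The final step of that second program, writing out a domain file, could look something like this sketch. Parsing the .po input is omitted; `domainFileSource` is a hypothetical name, and the strings are assumed to be already available as an object mapping ids to translations.

```javascript
// Produce the text of a .js domain file that registers the given strings.
function domainFileSource(locale, domain, strings) {
  var def = {isLoaded: true, strings: strings};
  return "MathJax.Localization.addTranslation(" +
    JSON.stringify(locale) + "," + JSON.stringify(domain) + "," +
    JSON.stringify(def, null, 2) + ");\n";
}

var fileText = domainFileSource("fr", "TeX", {
  mcb: "Accolade de fermeture manquante"
});
```

Using JSON.stringify for the data keeps quoting and escaping of the translated strings correct without any hand-written escaping rules.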
