Duplicate translations with different XLIFF unit IDs are not maintained #224

asmecher · 2019-09-16T23:35:41Z

This is a knock-on issue from #220. (I hope you won't regret adding that already!)

An XLIFF file may contain several units with the same translation content... (silly example but you get the point)

<unit id="am.permanently">
  <segment>
    <source>am</source>
    <target>soy</target>
  </segment>
</unit>
<unit id="am.temporarily">
  <segment>
    <source>am</source>
    <target>estoy</target>
  </segment>
</unit>

However, when loading the translation from the XLIFF file, only one gets loaded into $translations. This is because the two will have the same ID, based only on the source text ("am"), without considering the unit ID.

Thus trying to save the loaded XLIFF again will drop one of the two entries.

The text was updated successfully, but these errors were encountered:

asmecher · 2019-09-16T23:36:33Z

@oscarotero, I am happy to code up a solution, but as with #220 I'd like a little guidance about the best option because I'm not familiar with how the translation IDs are used elsewhere in the library.

oscarotero · 2019-09-17T21:08:34Z

Mmm, I don't know how to solve this.

On the one hand, the unit id is not unique, on the other hand, the source is not unique either. I guess the uniqueness of a translation in Xliff is a sort of combination of unit id + source?

asmecher · 2019-09-17T21:18:32Z

I guess the uniqueness of a translation in Xliff is a sort of combination of unit id + source?

I don't think XLIFF has any requirements of uniqueness involving the source text. The XLIFF 2.0 spec does talk about uniqueness of IDs, but that's pretty much it. For example, by my read, each unit (=Translation class instance) can be uniquely identified by a combination of unit ID and file ID (see section 3.1).

I think it'll be necessary to inform the mechanism for ID generation (currently the static generateID function) based on the format of the file being sourced.

oscarotero · 2019-09-17T22:11:40Z

In gettext, we have the following structure:

domain
context
original

Right now, the equivalent in Xliff would be:

domain (the id of the file)
context (the <node category="context"> element)
original (the <source> element)

A possible solution could be to get the id of the unit as the msgid, instead the source, and store the source as a comment. So in a exported .po file, we could get:

# XLIFF_SOURCE: source-element
msgid "unit-id"
msgstr "translation"

asmecher · 2019-09-17T23:25:19Z

@oscarotero, but wouldn't that be a radical departure from either/both the PO and XLIFF standards? My understanding of .po files is that msgid should always be the source language, likewise source elements in XLIFF.

oscarotero · 2019-09-18T08:16:25Z

I was thinking in how to use these translations in the html templates. Using the unit-id as the msgid, you can get the translations by that id:

<h1><?= __('unit-id') ?></h1>

But, at the same time, if you scan this template searching for gettext entries, this translation will have "unit-id" string as the original untranslated string. So, finally, we can assume that the unit id and the source are the same thing (or at least, share the same purpose: to identify a translation).

I don't know, it's hard to fit the Xliff format into gettext without lossing data. I'm open to other possible workaround.

asmecher · 2019-09-18T15:12:30Z

@oscarotero, we already have large parts of an internationalization toolset in place, including a caching layer that'll go in front of Gettext and tools in our PHP code and Smarty templates for getting translations. What I'm hoping to use Gettext for is specifically its XLIFF load/save features -- i.e. just the XLIFF extractor and generator, but not the Translator class. My operations on the translations always deal with the full list of translations -- i.e. foreach($translations as $translation) {...}

We're introducing Gettext in an attempt to move from our own bespoke XML files to the XLIFF standard. We're using symbolic IDs instead of English-language text in our source code, hence the use of the unit ID in XLIFF. Our goal is to facilitate the use of commodity translation tools like POEdit and Weblate, thus we need the XLIFF content to be usable within those tools (and I think using msgid or <source> with the unit ID will break them).

I think the purpose of generateID etc. is to make it easy for the Translator class to fetch a translation quickly, correct? I can see two options:

Disentangle the loading of the translations (the fromString method in the Extractor) from the indexing/deduplication of the translations (currently in Translations::offsetSet). Currently there's no way to load or save translation files without going through the Translations class, which has built-in resolution of what it considers to be duplicates. (Maybe the indexing/deduplication tools belong in Translator rather than Translations?)
Alternatively, support a richer translation ID that doesn't just consider the segment and source text. Looking up translations by source text alone could continue to return the first match, but going through a list of all translations (foreach) would no longer be missing translations.

oscarotero · 2019-09-18T21:41:33Z

Maybe the indexing/deduplication tools belong in Translator rather than Translations?

Translation has the getId() and the static generateId() methods, so the deduplication belongs to Translation. This value is important not only to search translations but also to merge different Translations instances.

The right solution could be add a setId() method, so you could choose another different way to define the ids/deduplication and use generateId internally as a fallback. The only problem is Translations::find that currently depends on this method to find the translations. Removing or changing this method is a breaking change, and I don't want to release a major version right now. And if you define a different id, you can't find a specific translation by context + original. Using a foreach is discarded for performance reasons.

Maybe we could add the Translation::setId() method with a warning that using this method makes imposible to search the translation in a Translations instance using the context + original.

And because Xliff generator use this method to generate ids, include the warning here too.

…bering

asmecher · 2019-09-19T01:16:23Z

Thanks, @oscarotero, I submitted this according to the suggestions you made: #225

I think there is still one issue to resolve, as noted in the PR comment.

#224 Allow for setting custom ID to avoid translation clobbering

asmecher · 2019-09-19T21:12:16Z

From your earlier comment:

A possible solution could be to get the id of the unit as the msgid, instead the source, and store the source as a comment. So in a exported .po file, we could get:
# XLIFF_SOURCE: source-element
msgid "unit-id"
msgstr "translation"

My reaction was...

@oscarotero, but wouldn't that be a radical departure from either/both the PO and XLIFF standards?

...but it turns out many projects actually do use PO files written this way (with symbolic IDs in the msgid), and it's supported in translation tools like Weblate (known there as "Monolingual gettext").

Just documenting something I learned in the hopes it'll be useful to someone else :)

ShawnMilo · 2022-01-19T18:27:14Z

I'm maintaining XLIFF tools for a free site: https://xliff.tools/

Does the "ID Replacement" tool do what's required? If not, I'm interested in learning more about the issue and possibly implementing it.

Thanks!

asmecher mentioned this issue Sep 16, 2019

Replace bespoke translation toolset with more standards-based options pkp/pkp-lib#4779

Closed

asmecher changed the title ~~Duplicate translations with the same XLIFF unit ID are not maintained~~ Duplicate translations with different XLIFF unit IDs are not maintained Sep 16, 2019

asmecher added a commit to asmecher/Gettext that referenced this issue Sep 19, 2019

php-gettext#224 Allow for setting custom ID to avoid translation clob…

6889463

…bering

asmecher added a commit to asmecher/Gettext that referenced this issue Sep 19, 2019

php-gettext#224 Allow for setting custom ID to avoid translation clob…

d068418

…bering

asmecher added a commit to asmecher/Gettext that referenced this issue Sep 19, 2019

php-gettext#224 Add explanation of math in test

9fffadb

asmecher added a commit to asmecher/Gettext that referenced this issue Sep 19, 2019

php-gettext#224 Make use of unit IDs as translation IDs optional

3384000

oscarotero added a commit that referenced this issue Sep 19, 2019

Merge pull request #225 from asmecher/xliff-set-ids

241899f

#224 Allow for setting custom ID to avoid translation clobbering

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate translations with different XLIFF unit IDs are not maintained #224

Duplicate translations with different XLIFF unit IDs are not maintained #224

asmecher commented Sep 16, 2019 •

edited

Loading

asmecher commented Sep 16, 2019

oscarotero commented Sep 17, 2019

asmecher commented Sep 17, 2019

oscarotero commented Sep 17, 2019

asmecher commented Sep 17, 2019

oscarotero commented Sep 18, 2019 •

edited

Loading

asmecher commented Sep 18, 2019

oscarotero commented Sep 18, 2019

asmecher commented Sep 19, 2019

asmecher commented Sep 19, 2019

ShawnMilo commented Jan 19, 2022

Duplicate translations with different XLIFF unit IDs are not maintained #224

Duplicate translations with different XLIFF unit IDs are not maintained #224

Comments

asmecher commented Sep 16, 2019 • edited Loading

asmecher commented Sep 16, 2019

oscarotero commented Sep 17, 2019

asmecher commented Sep 17, 2019

oscarotero commented Sep 17, 2019

asmecher commented Sep 17, 2019

oscarotero commented Sep 18, 2019 • edited Loading

asmecher commented Sep 18, 2019

oscarotero commented Sep 18, 2019

asmecher commented Sep 19, 2019

asmecher commented Sep 19, 2019

ShawnMilo commented Jan 19, 2022

asmecher commented Sep 16, 2019 •

edited

Loading

oscarotero commented Sep 18, 2019 •

edited

Loading