Replies: 22 comments
-
I like the json-ld approach of @{lang-code}. I actually had this in the original version of simple data format (but it got removed in the quest for simplicity). While i18n seems good, I do wonder whether Occam's razor for standards should also be applied here: "how essential is this, and how many potential users will care about this feature?"
-
I agree that it could be omitted, but that decision should then be mentioned in the standard or a FAQ:
-
I'm starting to think we could at least mention the idea of using @ style stuff ...
-
I actually quite like this, but I would focus more on l10n than i18n, especially since we're very likely to add foreign keys soon (issue #23). That would mean everybody could point to the same dataset, which could include many locales (translations). What I'm thinking is something like a new optional field for the datapackage specification: The form I'm thinking is something like:
{
  "name": "dataset-identifier",
  "...": "...",
  "resources": [
    {
      "name": "resource-identifier",
      "schema" : { "..." : "..." },
      "..." : "..."
    }
  ],
  "..." : "...",
  "alternativeResources" : {
    "resource-identifier": {
      "is-IS" : {
        "path": "/data/LC_messages/is_IS.csv",
        "format": "csv",
        "mediatype": "text/csv",
        "encoding": "<default utf8>",
        "bytes": 10000000,
        "hash": "<md5 hash of file>",
        "modified": "<iso8601 date>",
        "sources": "<source for this file>",
        "licenses": "<inherits from resource or datapackage>"
      },
      "de-DE" : { "..." : "..." },
      "..." : "..."
    }
  },
  "..." : "..."
}
At the moment I'm thinking the translations would be files with the exact same schema (so things are duplicated), because that makes both translation (copy this file and translate the values you want) and implementation (want the Romanian version? just fetch this resource instead) easier. I'm reluctant to calling However that just opens up a new problem: how to standardise "locales/alternativeResources" identifiers? So maybe it's enough to just stick with locales as identifiers and stick to BCP 47. If people decide to create a jargonless version of a dataset, then that would be a different dataset (with its own l10n). So we could just call it
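To make the "just fetch this resource instead" idea concrete, here is a minimal sketch of how a client might pick a resource for a locale. It assumes the alternativeResources layout from the example above. The function name resolve_resource, the merge of the alternative entry over the original resource (so that schema, licenses and so on are inherited), and the fallback from "de-DE" to "de" are illustrative assumptions, not part of any spec.

```python
from typing import Optional


def resolve_resource(descriptor: dict, resource_name: str,
                     locale: Optional[str] = None) -> dict:
    """Return the resource to load for `locale`, or the original resource.

    Hypothetical helper: the lookup and inheritance rules here are
    assumptions made for illustration only.
    """
    # Find the original (untranslated) resource by name.
    base = next(r for r in descriptor["resources"]
                if r["name"] == resource_name)
    if locale is None:
        return base

    alternatives = descriptor.get("alternativeResources", {}).get(
        resource_name, {})

    # Simplified lookup: try the exact tag, then drop trailing subtags
    # ("de-DE" -> "de"). Matching is case-sensitive in this sketch.
    tag = locale
    while tag:
        if tag in alternatives:
            # Assumed inheritance: properties missing from the
            # alternative (e.g. schema) come from the original resource.
            return {**base, **alternatives[tag]}
        tag = tag.rpartition("-")[0]

    # No translation available: fall back to the original resource.
    return base


package = {
    "name": "dataset-identifier",
    "resources": [
        {"name": "resource-identifier", "path": "data/original.csv"}
    ],
    "alternativeResources": {
        "resource-identifier": {
            "is-IS": {"path": "/data/LC_messages/is_IS.csv"}
        }
    },
}

icelandic = resolve_resource(package, "resource-identifier", "is-IS")
print(icelandic["path"])
```

Because the translated file shares the schema of the original, the caller can read either file with the same code; only the path (and file-level properties such as bytes and hash) differ.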
-
@tryggvib How often do people actually translate an entire dataset? Is it quite common?
-
I think this applies mostly to smaller datasets used with foreign keys. These could be datasets with the names of all countries in the world, so you can point to them instead of having them only in English, classification datasets like the ones I mention, etc. (I think this is the biggest use case). I also think this is beneficial for datasets created in a non-English-speaking country that you want to make comparable to other datasets, for example as part of some global data initiative: you would translate the dataset into English and make that available too. That way the dataset is available in two languages. As a side note, it might be interesting to start some project to make dataset translations simpler ;)
-
Hi @tryggvib @rgrp, I found this thread while searching for i18n in datapackage.json. The most common use case is probably that people will want to describe their dataset in more than a single language. However, we've also found some cases where a full dataset is translated into multiple languages. Looking at JSON-LD's @language attribute, it seems there are three options available (http://www.w3.org/TR/json-ld/#string-internationalization)
or
or
The first seems to have the best backwards compatibility.
-
To summarize my experience with translations: the translation is on multiple levels: metadata translation and data translation. The metadata translation is simpler:
Having the localization in the main file might be handy for the package reader; however, it makes providing additional translations harder: one has to edit the file or have a tool that combines multiple metadata specifications into one file. A much better solution is to have metadata translations as separate objects/files, for example
Data translation is slightly different. The localized data can be provided in multiple formats:
The question is: which cases would we like to handle? All of them? Only certain ones? How the translation is handled technically during the data analysis process depends on the case. The most relevant tables to be localized are the dimension tables, therefore I'm going to use them as an example.
As for specification requirements:
As for the denormalized translation: do we want to provide a "logical" column name or the original name? For example, the columns might be In the Cubes framework we are using the denormalized translation and hiding the original column names (stripping the locale column extension), so the reports work regardless of the language used. The reports even work when a localized column is added to a non-localized dataset later. But Cubes is a metadata-heavy framework.
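As a rough illustration of the "logical column name" idea for denormalized translations, here is a minimal sketch. The underscore-plus-locale suffix convention (name_en, name_sk), the sample row and the function name localize_row are all hypothetical and chosen only for this example; this is not how Cubes itself is implemented.

```python
def localize_row(row: dict, locale: str, locales: set) -> dict:
    """Expose one locale of a denormalized row under logical column names.

    Assumes (hypothetically) that localized columns are named
    "<logical>_<locale>", e.g. "name_en" and "name_sk".
    """
    result = {}
    for column, value in row.items():
        logical, sep, suffix = column.rpartition("_")
        if sep and suffix in locales:
            # Localized column: keep it only for the requested locale,
            # stripping the locale suffix from its name.
            if suffix == locale:
                result[logical] = value
        else:
            # Non-localized column: keep as-is unless a localized
            # column has already supplied this logical name.
            result.setdefault(column, value)
    return result


row = {"code": "SK", "name_en": "Slovakia", "name_sk": "Slovensko"}
print(localize_row(row, "sk", {"en", "sk"}))
```

With this kind of mapping, a report that refers to the logical column "name" keeps working whichever locale is selected, and also when a dataset has only the plain "name" column.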
-
@pwalsh @danfowler this is one to look at again.
-
@rgrp related: my long-standing pull request, which deals with i18n in the resources themselves: #190
-
@pwalsh I know - I still feel we should do metadata first then data.
-
I agree that starting with meta-data is a good idea. My humble suggestion is that each localizable string in datapackage.json could take two forms:
(For the sake of simplicity, I also think that we could limit this to only apply for the For example:
-
Since we do lots of "string or object" type patterns in the Data Package specs generally, I'm partial to the suggestion made by @akariv. However, it could get complicated real quick if someone tries to apply this liberally to any string located anywhere on the One way to counter that is to limit translatable fields explicitly, but that kind of goes against the flexibility of the family of Data Package specifications in general. I'd suggest something that follows on from the pattern I suggest for data localisation here Where:
I also think that the distinction between localisation and translation is important, and would again suggest the same concept as I suggest for data, here. Note that this is not some invention: the pattern I'm suggesting is heavily influenced by my work with translation and localisation using Django, and probably is quite consistent with other web frameworks. Example:
-
@pwalsh two comments:
-
On the first point, user-specified fields on Data Package are part of the design of the spec, and with the way the family of specs works, I do think it would be unusual to explicitly say only specific fields are translatable. On the second point: yes, it would result in a lot of clutter. I guess we have to decide if we are optimising for human reading of the spec too. An alternate approach would be to group everything by language, which would at least be an ordered type of clutter :).
-
(What I meant was not that only these two fields are translatable, but that only for them the spec specifies a method for translating, and other user-specified fields may use a different scheme, although on second thought that may not be the best practice.) As for readability, I think that is definitely a factor (as someone said: "JSON is readable as simple text making it amenable to management and processing using simple text tools"). And your suggestion does improve things in terms of clutter, but it somehow doesn't feel right to me to separate the original value from the translation.
-
@akariv yes, it is not a simple problem to solve. Maybe we should be optimising for cases of a handful of translations, say 2-5 languages, while acknowledging that we might expect, say, 2-5 translatable properties on a given package?
-
So, I've thought quite a bit about this and I generally agree with @akariv's approach:
I've updated the main description of the issue with a relatively full spec based on this. Welcome comments from @frictionlessdata/specs-working-group
Research
-
@rufuspollock agreed. In my opinion, we do need I prefer the array and the special treatment of the first element in the array, as per my pattern. Another approach, like in Django for example, is
-
@rufuspollock let's schedule this for v1.1 - there are lots of changes for v1 and they should settle before we introduce translations, esp. as the proposal here uses the dynamic type pattern we moved away from in v1.
-
@pwalsh agreed.
-
Hi, any news here (or only later, in v1.1)? In case a real-life example is useful to this discussion ... My approach (while there is no v1.1) at datasets-br/state-codes's Hum... the interpretation was "language of the descriptions (!and CSV textual contents)". If some field or descriptor needs to use another language, I use a suffix
-
How should the standard support titles, descriptions and data fields in languages other than English?
Proposal (Nov 2016)
An internationalised field:
Summary:
Each localizable string in datapackage.json could take two forms:
Not all properties would be localizable for now. For the sake of simplicity, we limit this to only the following properties:
Default Language
You can define the default language for a data package using a lang attribute:
The default language if none is specified is English (?).
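The "string or object" idea combined with a default language could be read along these lines. This is a sketch only: it assumes the object form maps language tags to strings, that lang holds a single language tag, and that "en" is used when lang is absent (the proposal itself marks the English default with a question mark). The helper name localized is made up for the example.

```python
def localized(descriptor: dict, prop: str, language=None):
    """Return the value of a localizable property for `language`.

    Assumed value forms (not confirmed by the spec text above):
      - a plain string, returned unchanged;
      - an object mapping language tags to strings.
    """
    value = descriptor.get(prop)
    if value is None or isinstance(value, str):
        return value

    # Assumed default: the package's `lang`, else English.
    default = descriptor.get("lang", "en")
    for tag in (language, default):
        if tag is not None and tag in value:
            return value[tag]
    return None


package = {
    "lang": "es",
    "title": {"es": "Presupuesto", "en": "Budget"},
    "description": "Plain, untranslated description",
}

print(localized(package, "title"))          # uses the package's lang
print(localized(package, "title", "en"))    # explicit language
print(localized(package, "description", "en"))
```

A plain string keeps working unchanged, which is what makes this form backwards compatible with existing descriptors.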