Add NaiveFallback #2686
Comments
I created a small script that takes all locale data for a given key, hashes the content, and deduplicates it down to the smallest number of data files that covers all keys. In case of |
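The script itself is not included in the thread. As a rough sketch of the idea only, the grouping step could look like the following, assuming each locale's payload for one key is available as raw bytes. The function name `group_by_content` and the sample payloads are hypothetical, and a real tool should compare the bytes themselves on a hash match to rule out collisions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Groups locales whose payload for a single data key is byte-identical.
/// Returns a map from content hash to the sorted locales sharing that content.
/// (Hypothetical helper, not part of ICU4X.)
pub fn group_by_content(entries: &[(&str, &[u8])]) -> BTreeMap<u64, Vec<String>> {
    let mut groups: BTreeMap<u64, Vec<String>> = BTreeMap::new();
    for (locale, payload) in entries {
        // Hash the payload bytes; identical payloads land in the same group.
        let mut hasher = DefaultHasher::new();
        payload.hash(&mut hasher);
        groups
            .entry(hasher.finish())
            .or_default()
            .push(locale.to_string());
    }
    for locales in groups.values_mut() {
        locales.sort();
    }
    groups
}
```

Calling it with `en` and `en-US` carrying the same bytes and `fr` carrying different bytes yields two groups, so two data files would be enough for the three locales.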
I assume that this deduplication is already happening for values. |
tl;dr, locale fallback is very, very complicated. I have a plan for how to solve the size and speed issues. It's something we can improve incrementally. |
I don't think this is the same as #834. That one is about including locales based on allowlist, this is about deduplicating. |
It's the same in the sense that these problems can be solved by making DataExporter smarter. We can either pre-populate locales or strip them if they have duplicates. Note that "naive fallback" is already supported; see the no_data constructor on FallbackProvider. |
Back on my computer so I can write a more complete response…
It has some cost in binary size, but not too much. The data size cost is about 10 kB.
We do deduplicate data; we store essentially just a pointer for these extra locales. It does mean that lookup is slower if we use binary search. This can be mitigated by shipping locales as separate language packs or by adopting ZeroHashMap for locale lookup (#2579).
This is exactly what our current algorithm is doing when you run it in no-data mode. You add data for it to handle scripts and parent locales correctly.
We don't remove script without language because that's a well documented footgun. The only exception is for collation data.
Again, this is what we already do. I guess what I'm missing is, what are you suggesting that your "naive fallback" is improving on? Your wish list is basically what I've already implemented. You can see full details in flexible vertical fallback (which I'm pretty sure you've reviewed before). Note that the algorithm itself is deceptively simple. It boils down to naive subtag removal, but with additional support for extension keywords. |
Another point on this topic. I think there have traditionally been two ways of looking at locales in software, and ICU4X adds a third:
For case (1), these clients are already eating a large cost by shipping all locales, so shipping a little extra code and data to help them with fallback is a small price percentage-wise. I hear you about the performance hit, but there are ways to solve this without changing the fallback algorithm. For case (2), fallbacking should be pre-computed in datagen based on the desired set of locales such that we don't need to ship any fallbacking code or data, not even a "naive" version. For case (3), we would likely employ a hybrid approach where desired locales are pre-computed in their respective language packs, but clients can opt in to the runtime locale fallback in order to get the full behavior of case (1). |
My claim is that storing 500 locale strings when only 100 are needed takes up a substantial portion of the postcard payload, and it adds cost for the component constructor, which has to select from them. My hope is that we can do better. If you think that the fallback is not costly, I'd love to see the size of the postcard files when we use the full fallback to deduplicate and reduce the number of keys to the minimum. Can we do that now? If so, how? |
It's improving on In case of That's a substantial payload decrease, memory cost reduction and I suspect constructor perf win. |
Yeah, the empty pointers are a significant source of data size and especially lookup speed issues in certain keys including number format and date format. This is a known issue in datagen. The empty pointers should be calculated and removed at datagen time, and a subset may be added back if compile-time fallback resolution is requested.
Nit: Let's be careful on language here. |
I guess what I'm trying to say is, I'm thinking from the angle that the bug in datagen, the one where empty locale pointers are unnecessarily generated, which I've known has existed since basically the inception of datagen, will be fixed. Fixing it requires some care and design, but it is totally fixable in a dot release. My above claims about fallback being a fairly low cost relative to the cost of adding all locales are based on the world in which the bug in datagen is fixed. |
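To make the datagen-side fix concrete, here is a minimal sketch of what removing redundant locale entries could look like. It is not the datagen implementation: it assumes a simplified parent rule (drop the last subtag, then `und`), payloads as raw bytes, and hypothetical helper names `parent`, `inherited`, and `strip_redundant`.

```rust
use std::collections::BTreeMap;

/// Simplified parent rule (an assumption): drop the last subtag, ending at "und".
fn parent(locale: &str) -> Option<&str> {
    if locale == "und" {
        return None;
    }
    match locale.rfind('-') {
        Some(idx) => Some(&locale[..idx]),
        None => Some("und"),
    }
}

/// Payload the locale would resolve to at runtime if its own entry were absent.
fn inherited<'a>(data: &'a BTreeMap<String, Vec<u8>>, locale: &str) -> Option<&'a Vec<u8>> {
    let mut current = parent(locale);
    while let Some(ancestor) = current {
        if let Some(payload) = data.get(ancestor) {
            return Some(payload);
        }
        current = parent(ancestor);
    }
    None
}

/// Keeps only the entries whose payload differs from what fallback would find anyway.
pub fn strip_redundant(data: &BTreeMap<String, Vec<u8>>) -> BTreeMap<String, Vec<u8>> {
    data.iter()
        .filter(|(locale, payload)| inherited(data, locale) != Some(*payload))
        .map(|(locale, payload)| (locale.clone(), payload.clone()))
        .collect()
}
```

With entries for `und`, `en`, `en-US` (same bytes as `en`), `en-GB` (different bytes), and `fr` (same bytes as `und`), only `en`, `en-GB`, and `und` survive. The stripped locales still resolve to the same content, but only if the runtime applies the same fallback rule that was assumed when stripping.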
Gotcha. I think we agree. I have a slight concern about what you're proposing as |
OK, sure, I could see a third LocaleFallbacker constructor that narrows the likely subtags data to only the multi-script languages. You can get rid of a few K's of data this way. It would cause locales like "de-Latn-LI" to fail (it should be able to find "de-LI"), so it should be used only in cases where you know that the script is only ever specified in multi-script languages like zh and sr. I guess an issue here is that languages may change from single-script to multi-script in future CLDR releases, as has happened in the past. |
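The "de-Latn-LI" case can be illustrated with a toy model. This is not the LocaleFallbacker API: it assumes the likely subtags data is reduced to a plain language-to-default-script table, and `drop_default_script` is a hypothetical helper. The default scripts used (de: Latn, sr: Cyrl, zh: Hans) match CLDR likely subtags.

```rust
use std::collections::HashMap;

/// Drops the script subtag when the table lists it as the default script
/// for the language; otherwise returns the locale unchanged.
/// (Toy model of script handling, not the ICU4X implementation.)
pub fn drop_default_script(locale: &str, default_scripts: &HashMap<&str, &str>) -> String {
    let parts: Vec<&str> = locale.split('-').collect();
    match parts.as_slice() {
        // The second subtag is only removed if it equals the known default script.
        [lang, script, rest @ ..] if default_scripts.get(lang) == Some(script) => {
            let mut out = vec![*lang];
            out.extend_from_slice(rest);
            out.join("-")
        }
        _ => locale.to_string(),
    }
}
```

With a full table that includes `de`, "de-Latn-LI" normalizes to "de-LI" and can find data stored under that locale. With a table narrowed to multi-script languages such as `sr` and `zh`, nothing says that Latn is the default for `de`, so the script stays and "de-LI" is never tried.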
The nice thing is that this is all totally configurable since locale fallback is its own separate component in the data provider. If someone comes up with a better fallback mechanism, we can plug it right in. |
By the way, I see 18 multi-script languages in CLDR 41. The majority are Cyrl/Latn/Arab hybrids (it's not just Serbian); the rest are scattered in Africa, the Middle East, India, and China.
Discuss with: |
We have an approximation of this now. With #5114 you can pass a |
I don't want to take action on this until we have a client who would clearly benefit from it relative to other size and speed optimizations we have available elsewhere. As @robertbastian said, we do have one way to achieve this end. |
In relation to #2683.
The current fallback mechanism is quite costly in binary size and data payload to support. In order to enable customers to use ICU4X without it, we by default do not deduplicate data that would rely on runtime fallbacking.
I believe we can (in a true Rust fashion!) resolve this dichotomy by introducing The Third Way between "no fallback" and "full fallback". I dubbed it "naive fallback".
Naive fallback works only one way - minimizing tags, and contains a very short list of exceptions.
The algorithm works like this:
language-script-region pair.
5.1 If it is, use language-script from that exception
5.2 If not, remove script.
und
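Several of the steps did not survive in the text above, so the following is only a sketch of one possible reading, not the proposed algorithm itself. It assumes the exception table maps a (language, region) pair to the script to fall back through, that variants and extensions are ignored, and that subtags are otherwise removed from right to left until `und`. The table entries and the names `EXCEPTIONS`, `parse`, `join_subtags`, and `naive_fallback_chain` are hypothetical.

```rust
/// Hypothetical exception table: (language, region, script to fall back through).
const EXCEPTIONS: &[(&str, &str, &str)] = &[
    ("zh", "TW", "Hant"),
    ("zh", "HK", "Hant"),
    ("sr", "ME", "Latn"),
];

/// Splits a locale into (language, script, region); anything else is ignored.
fn parse(locale: &str) -> (String, Option<String>, Option<String>) {
    let mut parts = locale.split('-');
    let lang = parts.next().unwrap_or("und").to_string();
    let (mut script, mut region) = (None, None);
    for part in parts {
        match part.len() {
            4 if script.is_none() && region.is_none() => script = Some(part.to_string()),
            2 | 3 if region.is_none() => region = Some(part.to_string()),
            _ => break, // variants and extensions are out of scope for this sketch
        }
    }
    (lang, script, region)
}

fn join_subtags(lang: &str, script: &Option<String>, region: &Option<String>) -> String {
    let mut out = lang.to_string();
    for subtag in script.iter().chain(region.iter()) {
        out.push('-');
        out.push_str(subtag);
    }
    out
}

/// Lists the locales tried in order, always ending in "und".
pub fn naive_fallback_chain(locale: &str) -> Vec<String> {
    let (lang, mut script, mut region) = parse(locale);
    let mut chain = Vec::new();
    loop {
        if lang == "und" && script.is_none() && region.is_none() {
            break;
        }
        chain.push(join_subtags(&lang, &script, &region));
        if let Some(r) = region.take() {
            // Region removed; if no script was given, consult the exception table.
            if script.is_none() {
                script = EXCEPTIONS
                    .iter()
                    .find(|(l, reg, _)| *l == lang && *reg == r)
                    .map(|(_, _, s)| s.to_string());
            }
        } else if script.is_some() {
            script = None;
        } else {
            break;
        }
    }
    chain.push("und".to_string());
    chain
}
```

Under these assumptions, `naive_fallback_chain("zh-TW")` goes through `zh-Hant` before `zh`, while `naive_fallback_chain("en-US")` is plain subtag removal: `en-US`, `en`, `und`.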
This will cater to exceptions in sr and zh, but not much more. For everything else it will just cut off from right to left and eventually fall back on und.
The algorithm is super small, the data is super small (maybe even baked in by default?) and if used in datagen+runtime allows us to cut out a huge portion of locales, which in turn reduces the number of keys in the key table in the data payload.
This reduction has two benefits: