-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TimeZoneIdMapper to icu_timezone #4774
Conversation
I'm fine with this landing as is but in the long run I want to make sure it is incredibly obvious which of these many mappers is right for the use case, via docs and naming and deprecation. |
Co-authored-by: Robert Bastian <[email protected]>
Co-authored-by: Robert Bastian <[email protected]>
I addressed @robertbastian's feedback except the open question about the naming of |
Ok I think this is mergeable but I'm still unsure about the name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a few comments and suggestions, most with a common theme: most developers won't know how many IDs there are and how long they are, so providing some indication of the scale of the data may help developers make more informed perf and scale tradeoffs.
Feel free to ignore this feedback if it's not the right level of detail for Rust docs.
components/timezone/src/ids.rs
Outdated
|
||
/// Returns the canonical, normalized IANA ID of the given BCP-47 ID. | ||
/// | ||
/// This function performs a slow linear search. If this is problematic, consider one of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's helpful to add context that "slow" here means searching through a list of a few hundred short strings. For users who don't know how many IANA IDs there are, that context may be helpful.
So I'd suggest quantifying the number of BCP-47 IDs to provide context.
/// This function performs a slow linear search. If this is problematic, consider one of the | |
/// This function performs a slow linear search. Note that there are fewer than 500 BCP-47 IDs (typically growing 1-3 per year) so a linear search may be acceptable for many use cases. But if this is problematic, consider one of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I'm assuming that BCP-47 IDs are only for canonical IANA names not all IANA names. I realized after I reviewed that I wasn't 100% sure about that assumption.
components/timezone/src/ids.rs
Outdated
/// 1. [`TimeZoneIdMapperBorrowed::canonicalize_iana()`] | ||
/// is faster if you have an IANA ID. | ||
/// 2. [`TimeZoneIdMapperWithFastCanonicalizationBorrowed::canonical_iana_from_bcp47()`] | ||
/// is faster, but it requires loading additional data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above, can we quantify how much additional data? Many environments would be OK with 10K but not 10MB, for example, and developers who are unfamiliar with time zones may not know the scale of the data involved.
Similarly, some context about the CPU cost of loading that data would also be helpful. Is it a difficult index to create?
components/timezone/src/ids.rs
Outdated
/// | ||
/// There is only one canonical name, which is "America/Indiana/Indianapolis". The | ||
/// *canonicalization* operation returns the canonical name. You should canonicalize if you | ||
/// need to compare time zones for equality or display the name to the user. Note that the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// need to compare time zones for equality or display the name to the user. Note that the | |
/// need to compare time zones for equality. |
"or display the name to the user" is probably bad advice. Instead, users should generally see the normalized ID that was originally provided. By avoiding canonicalizing user-provided IDs, this insulates programs from renames like Kiev=>Kyiv so that existing code and data continues to work as-is with the old ID.
There may be some cases where auto-updating programs to the latest ID is desired behavior, but from our experience in ECMAScript these cases seem to be in the minority.
Note that "user" in this case assumes a technically-savvy user, because actual end-users should see the localized name, not the ID.
TimeZoneBcp47Id, TimeZoneError, | ||
}; | ||
|
||
/// A mapper between IANA time zone identifiers and BCP-47 time zone identifiers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For developers unfamiliar with the details of the IANA Time Zone Database (which is most of them!), it may be helpful to understand the scale of the timezone ID data, in order to help them understand the perf and RAM consequences of various mapping options. Here's one possible way to do it. Feel free to ignore if this kind of info is not appropriate here.
/// A mapper between IANA time zone identifiers and BCP-47 time zone identifiers. | |
/// A mapper between IANA time zone identifiers and BCP-47 time zone identifiers. | |
/// | |
/// There are about 600 IANA time zone identifiers, and fewer than 500 BCP-47 | |
/// time zone identifiers. | |
/// | |
/// BCP-47 time zone identifiers are 8 ASCII characters or less and currently | |
/// average 5.1 characters long. Current IANA time zone identifiers are less than | |
/// 40 ASCII characters and average 14.2 characters long. | |
/// | |
/// These lists grow very slowly; in a typical year, 2-3 new identifiers are added. |
components/timezone/src/ids.rs
Outdated
/// - "America/Indianapolis" | ||
/// - "US/East-Indiana" | ||
/// | ||
/// There is only one canonical name, which is "America/Indiana/Indianapolis". The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're using "name" and "identifier" interchangeably. I'd suggest using only one term. My preference would be for "identifier" but "name" (for IANA, dunno about BCP-47) is also valid.
components/timezone/src/ids.rs
Outdated
/// There is only one canonical name, which is "America/Indiana/Indianapolis". The | ||
/// *canonicalization* operation returns the canonical name. You should canonicalize if you | ||
/// need to compare time zones for equality or display the name to the user. Note that the | ||
/// canonical name can change over time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might help to provide an example.
/// canonical name can change over time. | |
/// canonical name can change over time, for example the identifier Europe/Kiev | |
/// was renamed to the newly-added identifier Europe/Kyiv in 2022. |
components/timezone/src/ids.rs
Outdated
/// Normalization is a data-driven operation because there are no algorithmic casing rules that | ||
/// work for all IANA time zone identifiers. | ||
/// | ||
/// Normalization is a cheap operation, but canonicalization might be expensive. If you need |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to quantify "expensive" here, or at least clarify that "expensive" may not actually be that expensive relative to actual expensive operations? There's only 600 IDs, after all.
One more thought: the index of a time zone id is <16 bits. Are there ever cases where we'd want to expose that index? In particular I'm asking because for canonical IDs, there's a <= 8-byte key available with the BCP-47 ID which can fit into one 64-bit register or memory location. But for non-canonicalized IANA IDs, the entire string (up to 34 characters, AFAIK) needs to be stored. Is this OK? |
Not a bad idea, but the data model doesn't currently have the concept of an index to a non-canonical identifier. I'm not really sure how to do that. It wasn't part of the design. |
I think I addressed all of @justingrant's suggestions. |
@justingrant says I can merge this PR since I have integrated his feedback. |
@echeran, this has already been approved as you can see above. I merged main and ran datagen. I need another approval in order to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rslgtm
Fixes #4031
Replaces #4548
This new all-in-one type supports all of the time zone ID operations we need, with some operations being asymptotically faster than others.
Not sure if we want to deprecate the old ones or keep them around, at least IanaBcp47RoundTripMapper. It's not completely obsolete since it has different performance characteristics.