Add TimeZoneIdMapper to icu_timezone #4774

sffc · 2024-04-04T08:31:21Z

Fixes #4031
Replaces #4548

This new all-in-one type supports all of the time zone ID operations we need, with some operations being asymptotically faster than others.

Not sure if we want to deprecate the old ones or keep them around, at least IanaBcp47RoundTripMapper. It's not completely obsolete since it has different performance characteristics.

Manishearth · 2024-04-05T22:51:29Z

I'm fine with this landing as is but in the long run I want to make sure it is incredibly obvious which of these many mappers is right for the use case, via docs and naming and deprecation.

utils/zerotrie/src/cursor.rs

components/timezone/src/ids.rs

Co-authored-by: Robert Bastian <[email protected]>

sffc · 2024-04-15T18:21:36Z

I addressed @robertbastian's feedback except the open question about the naming of NormalizedIana.

sffc · 2024-04-18T23:20:22Z

OK, I think this is mergeable, but I'm still unsure about the name TimeZoneIdMapperWithFastCanonicalization.

sffc · 2024-04-18T23:22:12Z

Flagging @justingrant for an API docs review.

See:

justingrant

I added a few comments and suggestions, most with a common theme: most developers won't know how many IDs there are and how long they are, so providing some indication of the scale of the data may help developers make more informed perf and scale tradeoffs.

Feel free to ignore this feedback if it's not the right level of detail for Rust docs.

justingrant · 2024-04-23T18:38:25Z

components/timezone/src/ids.rs

+
+    /// Returns the canonical, normalized IANA ID of the given BCP-47 ID.
+    ///
+    /// This function performs a slow linear search. If this is problematic, consider one of the


It's helpful to add context that "slow" here means searching through a list of a few hundred short strings. For users who don't know how many IANA IDs there are, that context may be helpful.

So I'd suggest quantifying the number of BCP-47 IDs to provide context.

Suggested change

-    /// This function performs a slow linear search. If this is problematic, consider one of the
+    /// This function performs a slow linear search. Note that there are fewer than 500 BCP-47 IDs (typically growing 1-3 per year) so a linear search may be acceptable for many use cases. But if this is problematic, consider one of the

Note that I'm assuming that BCP-47 IDs are only for canonical IANA names not all IANA names. I realized after I reviewed that I wasn't 100% sure about that assumption.
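To make the scale argument in this comment concrete, here is a minimal sketch of the two lookup strategies being weighed. It is not the ICU4X implementation: the table, the function names `canonical_iana_linear` and `build_index`, and the four sample rows are assumptions chosen for illustration, and the real data is stored differently.

```rust
use std::collections::HashMap;

/// Toy table of (BCP-47 ID, canonical IANA ID) pairs. Illustrative only:
/// the real data set has a few hundred entries and a different layout.
const TOY_TABLE: &[(&str, &str)] = &[
    ("gblon", "Europe/London"),
    ("jptyo", "Asia/Tokyo"),
    ("usind", "America/Indiana/Indianapolis"),
    ("uslax", "America/Los_Angeles"),
];

/// Hypothetical helper: linear scan over the table.
/// Cost is O(n) comparisons of very short strings, with no extra data.
fn canonical_iana_linear(bcp47: &str) -> Option<&'static str> {
    TOY_TABLE
        .iter()
        .find(|(id, _)| *id == bcp47)
        .map(|(_, iana)| *iana)
}

/// Hypothetical alternative: pay once to build an index (extra memory),
/// then answer each lookup in O(1) on average.
fn build_index() -> HashMap<&'static str, &'static str> {
    TOY_TABLE.iter().copied().collect()
}
```

With fewer than 500 entries, the linear version compares at most a few hundred strings of 8 bytes or less per call, which is why it may be acceptable for many callers; the indexed version only pays off when lookups are frequent enough to justify the additional data.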

justingrant · 2024-04-23T18:41:04Z

components/timezone/src/ids.rs

+    /// 1. [`TimeZoneIdMapperBorrowed::canonicalize_iana()`]
+    ///    is faster if you have an IANA ID.
+    /// 2. [`TimeZoneIdMapperWithFastCanonicalizationBorrowed::canonical_iana_from_bcp47()`]
+    ///    is faster, but it requires loading additional data.


Similar to above, can we quantify how much additional data? Many environments would be OK with 10K but not 10MB, for example, and developers who are unfamiliar with time zones may not know the scale of the data involved.

Similarly, some context about the CPU cost of loading that data would also be helpful. Is it a difficult index to create?

justingrant · 2024-04-23T18:46:12Z

components/timezone/src/ids.rs

+///
+/// There is only one canonical name, which is "America/Indiana/Indianapolis". The
+/// *canonicalization* operation returns the canonical name. You should canonicalize if you
+/// need to compare time zones for equality or display the name to the user. Note that the


Suggested change

-/// need to compare time zones for equality or display the name to the user. Note that the
+/// need to compare time zones for equality.

"or display the name to the user" is probably bad advice. Instead, users should generally see the normalized ID that was originally provided. By avoiding canonicalizing user-provided IDs, this insulates programs from renames like Kiev=>Kyiv so that existing code and data continues to work as-is with the old ID.

There may be some cases where auto-updating programs to the latest ID is desired behavior, but from our experience in ECMAScript these cases seem to be in the minority.

Note that "user" in this case assumes a technically-savvy user, because actual end-users should see the localized name, not the ID.
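The distinction this comment relies on (keep the normalized ID the user supplied, canonicalize only for comparison) can be sketched as follows. This is a toy model under stated assumptions, not the `TimeZoneIdMapper` API: the alias table and the functions `normalize`, `canonicalize`, and `same_zone` are hypothetical, and the table simply follows the Kiev to Kyiv rename mentioned above without asserting which spelling any particular data release treats as canonical.

```rust
/// Toy alias table of (normalized IANA ID, canonical IANA ID) pairs.
/// Illustrative only; the rows are assumptions, not real mapper data.
const TOY_ALIASES: &[(&str, &str)] = &[
    ("America/Indiana/Indianapolis", "America/Indiana/Indianapolis"),
    ("America/Indianapolis", "America/Indiana/Indianapolis"),
    ("US/East-Indiana", "America/Indiana/Indianapolis"),
    ("Europe/Kiev", "Europe/Kyiv"),
    ("Europe/Kyiv", "Europe/Kyiv"),
];

/// Normalization: fix the casing but keep the identifier the caller chose.
fn normalize(input: &str) -> Option<&'static str> {
    TOY_ALIASES
        .iter()
        .find(|(normalized, _)| normalized.eq_ignore_ascii_case(input))
        .map(|(normalized, _)| *normalized)
}

/// Canonicalization: map any alias to the single canonical identifier.
fn canonicalize(input: &str) -> Option<&'static str> {
    TOY_ALIASES
        .iter()
        .find(|(normalized, _)| normalized.eq_ignore_ascii_case(input))
        .map(|(_, canonical)| *canonical)
}

/// Equality check: compare canonical forms; unknown IDs never compare equal.
fn same_zone(a: &str, b: &str) -> bool {
    match (canonicalize(a), canonicalize(b)) {
        (Some(x), Some(y)) => x == y,
        _ => false,
    }
}
```

In this model a program would store and echo back `normalize("europe/kiev")`, which stays `Europe/Kiev`, and call `same_zone` only when it needs to decide whether two stored IDs refer to the same zone.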

justingrant · 2024-04-23T20:17:55Z

components/timezone/src/ids.rs

+    TimeZoneBcp47Id, TimeZoneError,
+};
+
+/// A mapper between IANA time zone identifiers and BCP-47 time zone identifiers.


For developers unfamiliar with the details of the IANA Time Zone Database (which is most of them!), it may be helpful to understand the scale of the timezone ID data, in order to help them understand the perf and RAM consequences of various mapping options. Here's one possible way to do it. Feel free to ignore if this kind of info is not appropriate here.

Suggested change

-/// A mapper between IANA time zone identifiers and BCP-47 time zone identifiers.
+/// A mapper between IANA time zone identifiers and BCP-47 time zone identifiers.
+///
+/// There are about 600 IANA time zone identifiers, and fewer than 500 BCP-47
+/// time zone identifiers.
+///
+/// BCP-47 time zone identifiers are 8 ASCII characters or less and currently
+/// average 5.1 characters long. Current IANA time zone identifiers are less than
+/// 40 ASCII characters and average 14.2 characters long.
+///
+/// These lists grow very slowly; in a typical year, 2-3 new identifiers are added.

justingrant · 2024-04-23T20:20:03Z

components/timezone/src/ids.rs

+/// - "America/Indianapolis"
+/// - "US/East-Indiana"
+///
+/// There is only one canonical name, which is "America/Indiana/Indianapolis". The


You're using "name" and "identifier" interchangeably. I'd suggest using only one term. My preference would be for "identifier" but "name" (for IANA, dunno about BCP-47) is also valid.

justingrant · 2024-04-23T20:30:01Z

components/timezone/src/ids.rs

+/// There is only one canonical name, which is "America/Indiana/Indianapolis". The
+/// *canonicalization* operation returns the canonical name. You should canonicalize if you
+/// need to compare time zones for equality or display the name to the user. Note that the
+/// canonical name can change over time.


It might help to provide an example.

Suggested change

-/// canonical name can change over time.
+/// canonical name can change over time, for example the identifier Europe/Kiev
+/// was renamed to the newly-added identifier Europe/Kyiv in 2022.

justingrant · 2024-04-23T20:30:49Z

components/timezone/src/ids.rs

+/// Normalization is a data-driven operation because there are no algorithmic casing rules that
+/// work for all IANA time zone identifiers.
+///
+/// Normalization is a cheap operation, but canonicalization might be expensive. If you need


Would it make sense to quantify "expensive" here, or at least clarify that "expensive" may not actually be that expensive relative to actual expensive operations? There's only 600 IDs, after all.

justingrant · 2024-04-23T22:33:59Z

One more thought: the index of a time zone id is <16 bits. Are there ever cases where we'd want to expose that index?

In particular I'm asking because for canonical IDs, there's a <= 8-byte key available with the BCP-47 ID which can fit into one 64-bit register or memory location. But for non-canonicalized IANA IDs, the entire string (up to 34 characters, AFAIK) needs to be stored. Is this OK?
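The "fits into one 64-bit register" observation can be illustrated with a short sketch. It assumes only what the comment states (BCP-47 time zone IDs are at most 8 ASCII bytes); the functions `pack_bcp47` and `unpack_bcp47` are hypothetical and do not use or describe the ICU4X types.

```rust
/// Hypothetical helper: pack a BCP-47 time zone ID (1 to 8 ASCII bytes,
/// no NUL bytes) into a single u64, padding the unused bytes with zeros.
fn pack_bcp47(id: &str) -> Option<u64> {
    let bytes = id.as_bytes();
    if bytes.is_empty() || bytes.len() > 8 || !id.is_ascii() || bytes.contains(&0) {
        return None;
    }
    let mut buf = [0u8; 8];
    buf[..bytes.len()].copy_from_slice(bytes);
    // A fixed byte order keeps the key stable across platforms.
    Some(u64::from_le_bytes(buf))
}

/// Hypothetical inverse: recover the ID by reading up to the first zero byte.
fn unpack_bcp47(key: u64) -> String {
    let buf = key.to_le_bytes();
    let len = buf.iter().position(|&b| b == 0).unwrap_or(8);
    String::from_utf8_lossy(&buf[..len]).into_owned()
}
```

A key built this way can be compared, hashed, or stored as a plain integer. No equivalent trick applies to a non-canonical IANA ID, which is too long for 8 bytes, and that is the asymmetry the question is pointing at.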

sffc · 2024-04-24T01:52:13Z

> One more thought: the index of a time zone id is <16 bits. Are there ever cases where we'd want to expose that index?
>
> In particular I'm asking because for canonical IDs, there's a <= 8-byte key available with the BCP-47 ID which can fit into one 64-bit register or memory location. But for non-canonicalized IANA IDs, the entire string (up to 34 characters, AFAIK) needs to be stored. Is this OK?

Not a bad idea, but the data model doesn't currently have the concept of an index to a non-canonical identifier. I'm not really sure how to do that. It wasn't part of the design.

sffc · 2024-04-24T16:08:37Z

I think I addressed all of @justingrant's suggestions.

sffc · 2024-05-17T16:54:31Z

@justingrant says I can merge this PR since I have integrated his feedback.

sffc · 2024-05-17T17:47:19Z

@echeran, this has already been approved as you can see above. I merged main and ran datagen. I need another approval in order to merge.

echeran

rslgtm

sffc added 3 commits April 4, 2024 01:27

Add baked data for IanaToBcp47MapV2Marker

1532c70

Add new TimeZoneIdMapper type

f4845a9

Add test in datagen

4f3bbb4

sffc requested review from robertbastian, Manishearth, nordzilla and a team as code owners April 4, 2024 08:31

sffc removed request for a team, robertbastian and nordzilla April 4, 2024 08:31

sffc added 5 commits April 4, 2024 01:36

fmt

1dfdc26

features

4ea5be8

Merge branch 'main' into iana-canon-2

60316b3

Update ids.rs

90824c9

Update ids.rs

107173c

Manishearth previously approved these changes Apr 5, 2024

robertbastian reviewed Apr 8, 2024

Reduce allocations and DRY

4b2516a

sffc dismissed Manishearth’s stale review via 4b2516a April 15, 2024 07:54

sffc and others added 7 commits April 15, 2024 11:15

Update utils/zerotrie/src/cursor.rs

bf11327

Co-authored-by: Robert Bastian <[email protected]>

Update components/timezone/src/ids.rs

8886642

Co-authored-by: Robert Bastian <[email protected]>

Improve examples; rename function

b9bc044

impl Deref for TimeZoneBcp47Id

87f339c

Safety

66355bb

Apply bf11327 to call site

3da0592

Return NormalizedIana

8fab067

sffc requested a review from robertbastian April 15, 2024 18:21

Add TimeZoneIdMapperWithFastCanonicalization

28ee49b

sffc added 6 commits April 18, 2024 15:17

Add constructor Diplomat attr

92f4e7e

impl Default

fe60770

Document Normalization vs Canonicalization

c817256

Line length

4692100

rm NormalizedIana

2647dcd

fmt

4e9d5ee

sffc requested a review from robertbastian April 18, 2024 22:59

robertbastian previously approved these changes Apr 19, 2024

sffc added 2 commits April 22, 2024 19:14

Merge branch 'main' into iana-canon-2

0a5c746

Clippy

33015b8

sffc dismissed robertbastian’s stale review via 33015b8 April 23, 2024 02:16

justingrant reviewed Apr 23, 2024

Review feedback

50f5b1e

fmt

63d5b8b

Merge branch 'main' into iana-canon-2

ef3f8b0

datagen

c74ce44

sffc requested a review from echeran May 17, 2024 17:46

Manishearth approved these changes May 23, 2024

robertbastian approved these changes May 23, 2024

echeran approved these changes May 23, 2024

robertbastian merged commit d2c4f63 into unicode-org:main May 23, 2024
30 checks passed

sffc deleted the iana-canon-2 branch May 23, 2024 16:58

sffc mentioned this pull request May 23, 2024

Implement IANA time zone ID normalization #4548

Closed
