-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ucd/category] Provisionally add unic-ucd-category #31
Conversation
components/ucd/category/src/lib.rs
Outdated
|
||
use self::GeneralCategory::*; | ||
|
||
// TODO: potentially change these to match pattern matching. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love to see a comparison here. There's a large possibility that the two could be optimized to the same thing or that the pattern match is more performant. I feel that this form is much clearer; for comparison:
fn is_cased_letter(&self) -> bool {
[UppercaseLetter, LowercaseLetter, TitlecaseLetter].contains(self)
}
versus
fn is_cased_letter(&self) -> bool {
match *self {
UppercaseLetter | LowercaseLetter | TitlecaseLetter => true,
_ => false,
}
}
There's an even bigger difference when more symbols are used, such as in is_punctuation
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The later is kind of guaranteed to be the fastest in Rust. The former may optimize to the same performance, but that's taking a chance.
If you don't like the second form, then maybe this is a good time that we start using matches
crate. It gives you the match
feature in a macro call. It's a very stable little package, so is fine to depend on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
matches
has a very nice representation. I've pulled it in for this sub-crate.
components/ucd/category/src/lib.rs
Outdated
impl GeneralCategory { | ||
/// Lu | Ll | Lt | ||
/// | ||
/// Abbreviated: LC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments are probably close to useless, but they're how tr44 describes them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not useless at all! Very good for documentation! And maybe one day we find a way to let people use the shorter names alongside the longer ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use UppercaseLetter as Lu
? In any case it's probably best to "force" use of the descriptive names.
components/ucd/category/src/lib.rs
Outdated
} | ||
} | ||
|
||
fn bsearch_range_value_table( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fn is copied almost verbaitim from ucd/normal/decomposition_type.rs
. Is it a good candidate for extracting into a helper? I suspect something like a ucd-util
that contains the most commonly used impls for searching the commonly used table formats would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that's the plan. If you want to put them in the utils crates, please do so in a different diff to make it easier to review.
That said, we're thinking of replacing the python code with https://github.com/BurntSushi/rucd, so we can use other Rust packages during table generation, like https://github.com/behnam/rust-phf. That's why I haven't spent much time on the cleaning up the current table code helpers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually experimenting with porting the python scripts into Rust (to use in cargo buildscripts), partially because I'm just interested, and partially because it could potentially work. And I'm enjoying it (lets me exercise understanding Python code and get more practice writing Rust, so it's not wasted effort. And it's a problem space I wanted to dip my toes into.
Assume rucd works for the generation, it'd definitely be best to use that, once its ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note there's one really big small problem with using PHF macros: rust-lang/cargo#1359
PHF is really slow without release optimizations for large maps.
Other than that, it'd definitely be a plus to be able to use a PHF map where it would be appropriate.
extern crate unic_ucd_category; | ||
|
||
|
||
//#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, need to re-enable this (had it disabled before the python build script was edited).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
tools/gen_ucd_tables.py
Outdated
# default to ET | ||
(0x20A0, 0x20CF, "ET"), # Currency Symbols | ||
(0x20A0, 0x20CF, "ET"), # Currency Symbols |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These format changes must have been done automatically by my ide (bad IDE! only touch VCS changed text unless told otherwise!); I can change them back, and probably should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, would be great to have them in one column. Don't worry if the IDE is bothering too much.
general_category_mapping = get_general_category_mapping() | ||
|
||
with open(join(dir, 'general_category.rsv'), "w") as values_file: | ||
rustout.emit_table( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variation of rustout.emit_table
feels like it will be used a lot for the simple properties; would it be worth making a version with this default print_fun
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine either way. (See my comment above about dropping the Python code completely.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, @CAD97! Thanks for working on it!
Generally, everything's in good order. There are few nits and organization matters that I've commented inline. We can land it soon, afterwards.
About the API, the design looks great. https://github.com/swgillespie/unicode-categories/blob/master/src/lib.rs is taking a different approach, which is not that performant.
components/ucd/category/Cargo.toml
Outdated
repository = "https://github.com/behnam/rust-unic/" | ||
license = "MIT/Apache-2.0" | ||
keywords = ["text", "unicode"] | ||
description = "UNIC - Unicode Character Database" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to update the description
field.
components/ucd/category/Cargo.toml
Outdated
license = "MIT/Apache-2.0" | ||
keywords = ["text", "unicode"] | ||
description = "UNIC - Unicode Character Database" | ||
readme = "README.md" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop this if we don't have a README.md
on this crate. (The small crates don't have one yet.)
components/ucd/category/src/lib.rs
Outdated
@@ -0,0 +1,313 @@ | |||
// Copyright 2012-2015 The Rust Project Developers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is only needed ifyou are copying content from the Rust project, in which case should have the same years as the original file. Is this the case here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, I don't actually know how that sneaked in.
components/ucd/category/src/lib.rs
Outdated
//! | ||
//! Unicode General Category. | ||
//! | ||
//! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first thing people see on the package page on crates.io, and it's a very basic and useful character property. Would be great to have a bit more text here about it.
And don't forget to link to the definition of the property on unicode.org. (See links on other packages for formatting.)
components/ucd/category/src/lib.rs
Outdated
/// | ||
/// * <http://unicode.org/reports/tr44/#General_Category_Values> | ||
#[derive(Clone, Copy, Debug, PartialEq, Eq)] | ||
pub enum GeneralCategory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recently started putting all type-level abstraction in their own module. So, Age
enum and its various implementations and tests are in age.rs
. And, char
/str
helper traits go into traits.rs
, and only a couple of needed things get pub use
d here in lib.rs
from different modules.
It has shown to be a good way to keep unrelated matters separate. Would be great to do the same here.
PS. I'll start writing these in CONTRIBUTING.md
. (#36)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think I've seen that guideline in a few places (and mostly agree with it): lib.rs
should really just be a list of [pub] mod
and pub use
.
extern crate unic_ucd_category; | ||
|
||
|
||
//#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
tools/gen_ucd_tables.py
Outdated
@@ -28,12 +28,12 @@ | |||
from common import path, memoize | |||
from unicode_utils import is_surrogate | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Python's PEP-8 recommends two empty lines here, so would be great to keep it.
You can just run the .cargo/format.sh
script to get it fixed if you have the CLI tool installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't have autopep8 installed previously (don't write too much python and PyCharm/IDEA usually do a decent job of formatting along pep8). Now I do.
Strangely, it's not changing this. I went ahead and changed it back manually.
tools/gen_ucd_tables.py
Outdated
# default to ET | ||
(0x20A0, 0x20CF, "ET"), # Currency Symbols | ||
(0x20A0, 0x20CF, "ET"), # Currency Symbols |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, would be great to have them in one column. Don't worry if the IDE is bothering too much.
general_category_mapping = get_general_category_mapping() | ||
|
||
with open(join(dir, 'general_category.rsv'), "w") as values_file: | ||
rustout.emit_table( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine either way. (See my comment above about dropping the Python code completely.)
components/ucd/Cargo.toml
Outdated
@@ -18,3 +18,4 @@ unic-ucd-bidi = { path = "bidi/", version = "0.4.0" } | |||
unic-ucd-core = { path = "core/", version = "0.4.0" } | |||
unic-ucd-normal = { path = "normal/", version = "0.4.0" } | |||
unic-ucd-utils = { path = "utils/", version = "0.4.0" } | |||
unic-ucd-category = { path = "category/", version = "0.0.1" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I think you can start with 0.4.0
. There won't be a release of it, but it allows the code work and we will update it to 0.5.0
for the next general release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌
... such that it appears first in the docs
👌 we should be good now! |
Wait, I need to add myself to (And I'm pretty sure we should be fetching from http://www.unicode.org/Public/10.0.0/ucd/ and not http://www.unicode.org/Public/UCD/latest/ucd/ ) |
This is another issue we need to address sometime soon. I want the update script to take the version of unicode to download the files for, instead of always getting the latest. The same for the IDNA update script. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! All looks good! Let's get it in!
My use cases mainly draw from DerivedCoreProperties, one of which the main builders for is General_Category. And the best way to learn is to dive right in, so:
This updates
gen_ucd_tables.py
and adds the crate with enum for ucd-category. The design was mostly ad-hoc based on other crates in UNIC, but I expect that something can be improved. At the very least, my placeholder version0.0.1
will probably be changed, and unic-ucd and unic bumped as well.I skipped over name in the list of planned ucd crates because there are more interesting compression opportunites available there, such as a prefix trie or such, and I'm not quite prepared to dive into how other crates have done it quite yet.