-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move Unicode-related functions to new Unicode stdlib package #25021
Conversation
Nice, thanks for tackling this! |
stdlib/Unicode/docs/src/index.md
Outdated
|
||
```@docs | ||
Unicode.is_assigned_char | ||
Unicode.normalize_string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about renaming is_assigned_char
to isassigned
and generalizing it to strings the same way the rest are – i.e. by all
reduction over the characters in the string. Similarly, we could rename normalize_string
to normalize
now that it's in the Unicode
module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added them in the commit deprecating isnumber
in favor of isnumeric
. Though note that the names clash with corresponding functions in Base, so they always need to be qualified. I guess that's OK.
isassigned
was already restricted to Char
arguments. But I noticed isascii
accepts strings directly, which is inconsistent with other functions. Time to deprecate it too?
While I agree that having a small base is valuable, I'm not sure that these changes make sense. Unicode is really complicated, and if people have to go to stdlib to get something that works, while base has things that work ~most of the time with ascii, I think the result will likely be a bunch of code that doesn't deal with unicode well. |
@oscardssmith, I think you're misinterpreting the situation here. Base strings fully support Unicode and always have. This is a collection of utility functions to query Unicode properties about characters and strings. The functions all have to do with querying character classes, computing text widths, string normalization, and changing character cases. These are all the things that might change from one version of Unicode to another whereas what's in base is not going to change – i.e. the basic mechanics of UTF-8. So it makes sense to separate it out since you might want to change the version of Unicode that your program supports independent of the version of Julia you're using, whereas nothing in base is specific to any particular version of Unicode. |
Missing deprecations? |
66b5e9c
to
4a804b4
Compare
I've managed to get deprecations to work. It's somewhat tricky:
Finally, there's a problem with the manual: functions appear as |
Can't we get rid of the old deprecations? They're either from a previous release, in which case they're due to be deleted anyway, or they're from this release cycle, in which case we don't need to keep them if they're superseded by newer deprecations. |
4a804b4
to
8070033
Compare
8070033
to
189e2a7
Compare
Indeed, this deprecation was introduced in 0.6 so it should be fine to remove it. I've added a commit, at the same time as new deprecations for BTW, I'm not sure what's the convention to document functions in the stdlib. I've added the |
c63baff
to
e2d1088
Compare
Keep them under Base.Unicode since they are needed inside Base, but stop exporting them to Base since they would conflict with the deprecations. Base.Unicode is the new name for Base.UTF8proc, but including a few more functions.
To match the new name of the module.
isnumeric() is consistent with Python and Rust (but not Go), and less easy to confuse with isdigit(). Improve documentation to make confusion less easy. Also fix a few uses where isdigit() is more appropriate than isnumber().
e2d1088
to
756936a
Compare
Would have been nice to have Compat functionality (ie. |
I don't think we generally add Compat support before merging PRs, especially since the priority is to merge breaking PRs this week before the feature freeze. But indeed it's going to be needed. |
julia> using Unicode
julia> isassigned
WARNING: both Unicode and Base export "isassigned"; uses of it in module Main must be qualified
ERROR: UndefVarError: isassigned not defined I don't think export |
Yeah, probably. I'm not sure either what exporting |
This should probably have defined the functions in the Unicode stdlib instead of just reexporting existing functions. Otherwise e.g. the documentation will write these functions as |
I've mentioned that problem above, and couldn't find a solution. At least |
Ah, and the functions are not defined in Base because I liked that, it's because they are used all over the place in Base. |
We could do module Unicode
isnumeric(args...) = Base.isnumeric(args...)
...
end and move the docs to document the functions in the Unicode module |
This is how I did it with CRC32c: https://github.com/JuliaLang/julia/blob/master/stdlib/CRC32c/src/CRC32c.jl (but that was fewer functions so easier to do). |
OK, see #25902. |
As decided at #14347. It turns out most character category predicates and similar functions are used somewhere in Base, so it does not sound possible to move them to the standard library (except for a few particular functions). So I kept these in Base but unexported them, and reexported them from Unicode (which is otherwise an empty module). I've also moved some functions to utf8proc.jl for clarity and because it created problems during bootstrap due to attempts to import functions in UTF8proc which had not been defined yet in Base.
I wasn't sure how many functions we want to move to the stdlib. For example, should
isascii
,isdigit
andisxdigit
be moved, given how simple they are, and given that they are not really Unicode-related?