re-implement Base Unicode features using utf8proc_jll #381

stevengj · 2023-11-05T01:05:15Z

Closes #377.

stevengj · 2023-11-05T01:18:31Z

(It's maybe 1% faster on tests/benchmark.jl compared to main.)

codecov · 2023-11-05T01:22:14Z

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (cfdac5f) 96.57% compared to head (1506668) 96.64%.
Report is 8 commits behind head on main.

❗ Current head 1506668 differs from pull request most recent head c820724. Consider uploading reports for the commit c820724 to get more accurate results

Files	Patch %	Lines
src/unicode.jl	98.11%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #381      +/-   ##
==========================================
+ Coverage   96.57%   96.64%   +0.06%     
==========================================
  Files          14       15       +1     
  Lines        4179     4198      +19     
==========================================
+ Hits         4036     4057      +21     
+ Misses        143      141       -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…epath

stevengj · 2023-11-05T02:04:16Z

Looks like utf8proc_jll version 2.9.0 doesn't work with Julia < 1.6 due to changes in BinaryBuilder or something?

How badly do you want to support ancient Julia versions?

c42f · 2023-11-05T11:32:52Z

How badly do you want to support ancient Julia versions

Personally? I'd prefer to keep supporting them but I don't have a super strong opinion as long as we support 1.6 and above. But tooling maintainers did want it to work on older versions I think. @davidanthoff @pfitzseb ?

If we can't do it easily, can we continue to support older versions by calling into Base.is_id_char() for now?

stevengj · 2023-11-05T18:40:01Z

If we can't do it easily, can we continue to support older versions by calling into Base.is_id_char() for now?

That will silently give different results on older versions of Julia, which also seems bad?

(If you want to go that route, it would be easier to simply lower the version requirement for utf8proc_jll?)

Though I think JLL's were not supported at all before Julia 1.3 or so? I really don't see how to maintain compatibility with Julia 1.0 without keeping a separate branch around with a different Project.toml that omits the utf8proc_jll requirement.

c42f · 2023-11-06T03:21:15Z

I really don't see how to maintain compatibility with Julia 1.0 without keeping a separate branch around with a different Project.toml that omits the utf8proc_jll requirement.

Oof. Yes of course - the lack of optional dependencies in early Julia versions makes this really awkward. Hmm.

Now I'm looking at porting the necessary parts of utf8proc.c to Julia. The library doesn't look large or scary and I'm tempted to give it a go.

pfitzseb · 2023-11-06T10:40:18Z

For the VS Code extension, any non-Julia dependencies are a little annoying because we're trying to vendor everything. If porting utf8proc to Julia is feasible, then that'd be ideal I think.

stevengj · 2023-11-06T14:37:06Z

Now I'm looking at porting the necessary parts of utf8proc.c to Julia. The library doesn't look large or scary and I'm tempted to give it a go.

Yes, it shouldn't be too bad if you write a little script to convert the data tables in utf8proc_data.c (from C arrays to Julia arrays), something like:

utf8proc_data = read("utf8proc_data.c", String)
jl_data = replace(utf8proc_data,
    # arrays
    ",\n"=> ", ", "};"=>"]",
    r"static const [A-Za-z0-9_]+ ([A-Za-z0-9_]+)\[\] = \{\n" => s"const \1 = [ ",

   # utf8proc_property constructors
   "{" => "Property(", "}," => "),",

    # constants
    "UINT16_MAX" => typemax(UInt16))
write("utf8proc_data.jl", jl_data)

along with copying the definitions of UTF8PROC_CATEGORY_CC etcetera.

Note that Julia does not support bitfield structs, so you'd have to emulate them for re-defining utf8proc_property_t unless you want to waste a lot of memory for full-width fields.

That being said, this seems like a lot of effort (and an ongoing maintenance headache since utf8proc has to be updated every year for new Unicode versions) for very little gain. You don't really gain much from Julia's flexibility since the unicode stuff is all going to be concretely typed, and it's not clear to me that there is much to be gained (for the functions we use) by e.g. inlining or constant propagation. It's an awful lot of hassle to go to in order to avoid "a little annoyance" for vscode, and to make it easier to support julia < 1.6 versions that fewer and fewer people care about.

pfitzseb · 2023-11-06T15:43:12Z

It's an awful lot of hassle to go to in order to avoid "a little annoyance" for vscode, and to make it easier to support julia < 1.6 versions that fewer and fewer people care about.

That's fair.

"A little annoyance" was me understating the issue, because we don't currently have a good way to vendor binary dependencies. That doesn't mean that we can't figure it out or that anyone should go do the work of porting utf8proc, of course.

stevengj · 2023-11-06T16:33:10Z

Maybe there is a straightforward way to make JuliaSyntax optionally use the version of utf8proc that is linked into julia itself?

This would presumably help the "vendoring" (whatever that is) situation with vscode, and it would also be good for the version of JuliaSyntax that is bundled with julia (so that it doesn't load two identical versions of utf8proc into memory).

Does some Pkg expert (e.g. @vchuravy or @IanButterworth) know a good way to do this that would be friendly with the way JuliaSyntax will be bundled with Julia? Preferences.jl?

c42f · 2023-11-07T03:34:11Z

"A little annoyance" was me understating the issue

Yea I've have a feeling it could be quite a problem for you. It's troubling that porting (parts of?) utf8proc might actually be the simplest way out of this situation. It seems like taking extreme measures.

Oh, but if you're vendoring JuliaSyntax and other libs, you're already messing with their Project.toml's, yes? So in principle one can modify those to not depend on utf8proc_jll on older Julia VERSION? Or is that not true? Is the vendored stuff for the VSCode plugin shared between Julia versions?

this seems like a lot of effort (and an ongoing maintenance headache since utf8proc has to be updated every year for new Unicode versions)

Mm it seems like buying into integration, testing, bugfixing and ongoing maintenance which is a chunk of duplicated effort :( The initial porting is a lot less scary

For vendoring into Base I think we're in an ok situation - in that case JuilaSyntax isn't a stdlib, it's actually a submodule of Base. So we don't depend on the Project.toml at all. We could branch on something - at toplevel I guess - to detect that situation and use Julia's internal utf8proc in that case.

davidanthoff · 2023-11-07T06:54:40Z

if you're vendoring JuliaSyntax and other libs, you're already messing with their Project.toml's

No, we are not messing with anything inside the packages that we are using in the LS, we just ship them as part of the extension, and we have a different project/manifest combo for each Julia version for the entire project, but the packages themselves are left as-is. We keep some packages on older versions on older Julia versions, but that really only works if their public API doesn't change...

Also, just for background: this is clearly a major pain point... My hope is that at some point we will get support for statically compiling the LS in Julia. At that point we ship a binary version of the LS with the extension, and we could drop support for running the LS on older Julia versions. I would still want it to be able to analyze code that targets old Julia versions, but overall that is a much, much simpler problem. So my current plan is to continue to go through all these hoops to support running on old Julia versions and hope that the static compile story materializes at some point...

c42f · 2023-11-16T03:45:19Z

Just FYI, I've not forgotten this. I'm currently working on porting parts of libutf8proc. I've decided to start with porting the data_generation.rb script there from ruby to Julia. That will familiarize me with what's going on a bit in that library. If nothing else comes of this at least I hope the libutf8proc table generation will be cleaner and not depend on both Julia and ruby :)

KristofferC · 2023-11-16T11:07:05Z

One thing I don't understand.

JuliaSyntax is in the sysimage
This PR will hence cause utf8proc_jll to be in the sysimage.
The version of utf8proc_jll used will thus no longer be independently upgradable and now be tied to the Julia version
So how is JuliaSyntax made independent of the julia version by this PR?

c42f · 2023-11-17T06:21:07Z

So how is JuliaSyntax made independent of the julia version by this PR?

JuliaSyntax is vendored into the sysimage as part of Base, it's not a stdlib. In that context, it's certainly dependent on the version of utf8proc shipped with Julia (and it wouldn't use utf8proc_jll - it'd use the builtin library).

However, it's also separately installable as a normal package for use by tooling. In this use case it'd be independent of Julia's VERSION.

As a non-stdlib, the vendored copy's only public API is Meta.parse()

KristofferC · 2023-11-17T06:40:02Z

and it wouldn't use utf8proc_jll - it'd use the builtin library).

Okay, that answers it.

c42f · 2024-01-17T06:04:36Z

fyi I'm still plugging away at the upstream implications of a pure-Julia unicode library...

https://github.com/c42f/UnicodeNext.jl

I think this could be generally useful though it does feel a bit marginal in terms of payoff for the effort 🤷‍♀️

Here I've put tokenization-related functionality like `is_id_start_char()` in the Tokenize module.

c42f · 2024-01-23T06:28:17Z

Ok, I've pushed an update here which makes this PR use my experimental UnicodeNext library.

Current issues:

This breaks CI because that library isn't registered yet. To get this to work you need to Pkg.develop https://github.com/c42f/UnicodeNext.jl
When vendored into Base it might be nice to instead depend on Base.Unicode and avoid the extra dependency and sysimage cost of UnicodeNext unless(/until?) we can figure out how to make UnicodeNext the implementation of unicode in Base.

Thoughts? This would

Solve any issues with dependencies way back to Julia 1.0 (sigh)
Allow JuliaSyntax to understand the latest unicode versions

stevengj added 3 commits November 4, 2023 20:43

initial port of Base.Unicode

85f64dd

port Base.is_id_char and Base.is_id_start_char

22e3c6a

optimize ascii case

e34e93d

revert accidental change to test

2676e1a

stevengj added 4 commits November 4, 2023 21:22

rm fixme for 🏳️‍🌈 test

c6101d0

rm obsolete arg

5a69b0f

slight consolidation/optimization of common isascii(c), c % UInt8 cod…

7fcbdd0

…epath

cleanup using u8_str macro

3ebeaf8

stevengj added 2 commits November 4, 2023 22:09

pin utf8proc_jll to a specific minor version

8f55073

clarify comments

1506668

c42f mentioned this pull request Dec 10, 2023

Port ruby data_generator.rb to Julia JuliaStrings/utf8proc#258

Merged

Use UnicodeNext for unicode functionality

c820724

Here I've put tokenization-related functionality like `is_id_start_char()` in the Tokenize module.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

re-implement Base Unicode features using utf8proc_jll #381

re-implement Base Unicode features using utf8proc_jll #381

stevengj commented Nov 5, 2023

stevengj commented Nov 5, 2023

codecov bot commented Nov 5, 2023 •

edited

Loading

stevengj commented Nov 5, 2023 •

edited

Loading

c42f commented Nov 5, 2023 •

edited

Loading

stevengj commented Nov 5, 2023 •

edited

Loading

c42f commented Nov 6, 2023

pfitzseb commented Nov 6, 2023

stevengj commented Nov 6, 2023 •

edited

Loading

pfitzseb commented Nov 6, 2023

stevengj commented Nov 6, 2023 •

edited

Loading

c42f commented Nov 7, 2023

davidanthoff commented Nov 7, 2023

c42f commented Nov 16, 2023

KristofferC commented Nov 16, 2023

c42f commented Nov 17, 2023 •

edited

Loading

KristofferC commented Nov 17, 2023

c42f commented Jan 17, 2024

c42f commented Jan 23, 2024

re-implement Base Unicode features using utf8proc_jll #381

Are you sure you want to change the base?

re-implement Base Unicode features using utf8proc_jll #381

Conversation

stevengj commented Nov 5, 2023

stevengj commented Nov 5, 2023

codecov bot commented Nov 5, 2023 • edited Loading

Codecov Report

stevengj commented Nov 5, 2023 • edited Loading

c42f commented Nov 5, 2023 • edited Loading

stevengj commented Nov 5, 2023 • edited Loading

c42f commented Nov 6, 2023

pfitzseb commented Nov 6, 2023

stevengj commented Nov 6, 2023 • edited Loading

pfitzseb commented Nov 6, 2023

stevengj commented Nov 6, 2023 • edited Loading

c42f commented Nov 7, 2023

davidanthoff commented Nov 7, 2023

c42f commented Nov 16, 2023

KristofferC commented Nov 16, 2023

c42f commented Nov 17, 2023 • edited Loading

KristofferC commented Nov 17, 2023

c42f commented Jan 17, 2024

c42f commented Jan 23, 2024

codecov bot commented Nov 5, 2023 •

edited

Loading

stevengj commented Nov 5, 2023 •

edited

Loading

c42f commented Nov 5, 2023 •

edited

Loading

stevengj commented Nov 5, 2023 •

edited

Loading

stevengj commented Nov 6, 2023 •

edited

Loading

stevengj commented Nov 6, 2023 •

edited

Loading

c42f commented Nov 17, 2023 •

edited

Loading