-
-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
re-implement Base Unicode features using utf8proc_jll #381
base: main
Are you sure you want to change the base?
Conversation
(It's maybe 1% faster on tests/benchmark.jl compared to |
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #381 +/- ##
==========================================
+ Coverage 96.57% 96.64% +0.06%
==========================================
Files 14 15 +1
Lines 4179 4198 +19
==========================================
+ Hits 4036 4057 +21
+ Misses 143 141 -2 ☔ View full report in Codecov by Sentry. |
Looks like utf8proc_jll version 2.9.0 doesn't work with Julia < 1.6 due to changes in BinaryBuilder or something? How badly do you want to support ancient Julia versions? |
Personally? I'd prefer to keep supporting them but I don't have a super strong opinion as long as we support 1.6 and above. But tooling maintainers did want it to work on older versions I think. @davidanthoff @pfitzseb ? If we can't do it easily, can we continue to support older versions by calling into |
That will silently give different results on older versions of Julia, which also seems bad? (If you want to go that route, it would be easier to simply lower the version requirement for utf8proc_jll?) Though I think JLL's were not supported at all before Julia 1.3 or so? I really don't see how to maintain compatibility with Julia 1.0 without keeping a separate branch around with a different Project.toml that omits the |
Oof. Yes of course - the lack of optional dependencies in early Julia versions makes this really awkward. Hmm. Now I'm looking at porting the necessary parts of utf8proc.c to Julia. The library doesn't look large or scary and I'm tempted to give it a go. |
For the VS Code extension, any non-Julia dependencies are a little annoying because we're trying to vendor everything. If porting utf8proc to Julia is feasible, then that'd be ideal I think. |
Yes, it shouldn't be too bad if you write a little script to convert the data tables in utf8proc_data = read("utf8proc_data.c", String)
jl_data = replace(utf8proc_data,
# arrays
",\n"=> ", ", "};"=>"]",
r"static const [A-Za-z0-9_]+ ([A-Za-z0-9_]+)\[\] = \{\n" => s"const \1 = [ ",
# utf8proc_property constructors
"{" => "Property(", "}," => "),",
# constants
"UINT16_MAX" => typemax(UInt16))
write("utf8proc_data.jl", jl_data) along with copying the definitions of Note that Julia does not support bitfield structs, so you'd have to emulate them for re-defining That being said, this seems like a lot of effort (and an ongoing maintenance headache since utf8proc has to be updated every year for new Unicode versions) for very little gain. You don't really gain much from Julia's flexibility since the unicode stuff is all going to be concretely typed, and it's not clear to me that there is much to be gained (for the functions we use) by e.g. inlining or constant propagation. It's an awful lot of hassle to go to in order to avoid "a little annoyance" for vscode, and to make it easier to support julia < 1.6 versions that fewer and fewer people care about. |
That's fair. "A little annoyance" was me understating the issue, because we don't currently have a good way to vendor binary dependencies. That doesn't mean that we can't figure it out or that anyone should go do the work of porting utf8proc, of course. |
Maybe there is a straightforward way to make JuliaSyntax optionally use the version of utf8proc that is linked into This would presumably help the "vendoring" (whatever that is) situation with vscode, and it would also be good for the version of JuliaSyntax that is bundled with Does some Pkg expert (e.g. @vchuravy or @IanButterworth) know a good way to do this that would be friendly with the way JuliaSyntax will be bundled with Julia? Preferences.jl? |
Yea I've have a feeling it could be quite a problem for you. It's troubling that porting (parts of?) utf8proc might actually be the simplest way out of this situation. It seems like taking extreme measures. Oh, but if you're vendoring JuliaSyntax and other libs, you're already messing with their Project.toml's, yes? So in principle one can modify those to not depend on utf8proc_jll on older Julia
Mm it seems like buying into integration, testing, bugfixing and ongoing maintenance which is a chunk of duplicated effort :( The initial porting is a lot less scary For vendoring into Base I think we're in an ok situation - in that case JuilaSyntax isn't a stdlib, it's actually a submodule of |
No, we are not messing with anything inside the packages that we are using in the LS, we just ship them as part of the extension, and we have a different project/manifest combo for each Julia version for the entire project, but the packages themselves are left as-is. We keep some packages on older versions on older Julia versions, but that really only works if their public API doesn't change... Also, just for background: this is clearly a major pain point... My hope is that at some point we will get support for statically compiling the LS in Julia. At that point we ship a binary version of the LS with the extension, and we could drop support for running the LS on older Julia versions. I would still want it to be able to analyze code that targets old Julia versions, but overall that is a much, much simpler problem. So my current plan is to continue to go through all these hoops to support running on old Julia versions and hope that the static compile story materializes at some point... |
Just FYI, I've not forgotten this. I'm currently working on porting parts of libutf8proc. I've decided to start with porting the data_generation.rb script there from ruby to Julia. That will familiarize me with what's going on a bit in that library. If nothing else comes of this at least I hope the libutf8proc table generation will be cleaner and not depend on both Julia and ruby :) |
One thing I don't understand.
|
However, it's also separately installable as a normal package for use by tooling. In this use case it'd be independent of Julia's As a non-stdlib, the vendored copy's only public API is |
Okay, that answers it. |
fyi I'm still plugging away at the upstream implications of a pure-Julia unicode library... https://github.com/c42f/UnicodeNext.jl I think this could be generally useful though it does feel a bit marginal in terms of payoff for the effort 🤷♀️ |
Here I've put tokenization-related functionality like `is_id_start_char()` in the Tokenize module.
Ok, I've pushed an update here which makes this PR use my experimental Current issues:
Thoughts? This would
|
Closes #377.