Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize generated tables ; clean up unicode.py #73

Merged
merged 1 commit into from
Apr 16, 2015
Merged

optimize generated tables ; clean up unicode.py #73

merged 1 commit into from
Apr 16, 2015

Conversation

kwantam
Copy link
Contributor

@kwantam kwantam commented Apr 16, 2015

There was an easy opportunity to better optimize the tables generated
by unicode.py. Not sure why I didn't catch this long ago, but in any
case now the tables are substantially smaller and should maybe improve
performance slightly.

There was also some dead code sitting in unicode.py that I pulled out.

There was an easy opportunity to better optimize the tables generated
by unicode.py. Not sure why I didn't catch this long ago, but in any
case now the tables are substantially smaller and should maybe improve
performance slightly.

There was also some dead code sitting in unicode.py that I pulled out.
@rust-highfive
Copy link

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @huonw (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way Github handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see CONTRIBUTING.md for more information.

@huonw
Copy link
Member

huonw commented Apr 16, 2015

Could you give a high-level summary of how the tables changed? (github isn't showing the diff and I'm finding it hard to visualise from just the python changes.)

@BurntSushi
Copy link
Member

My suspicion is that this script was once used to generate other Unicode data that wasn't used by regex (e.g., east asian character widths?). So this is probably trimming it back down?

@kwantam I'm still interested in a high level summary though. I think I wrote the original version of this over a year ago for regex stuff only, but I no longer understand the code.

@kwantam
Copy link
Contributor Author

kwantam commented Apr 16, 2015

Quick background: the property tables that unicode.py generates are sorted arrays of non-overlapping ordered pairs of chars, (x, y), x<=y. When matching against a certain property (say, with \p{Alphabetic}), the regex code does a binary search in the corresponding table to identify whether there is a pair that contains a given character.

Now, in some cases, the tables that were previously generated contained ordered pairs that should have been merged together. For example, derived_property::Alphabetic_table previously contained a sequence like this:

...
    ('\u{d8}', '\u{f6}'), ('\u{f8}', '\u{1ba}'), ('\u{1bb}', '\u{1bb}'),
    ('\u{1bc}', '\u{1bf}'), ('\u{1c0}', '\u{1c3}'), ('\u{1c4}', '\u{293}'),
    ('\u{294}', '\u{294}'), ('\u{295}', '\u{2af}'), ('\u{2b0}', '\u{2c1}'),
...

Note that, for example, ('\u{f8}', '\u{1ba}'), ('\u{1bb}', '\u{1bb}') can be collapsed into a single range, because 0x1bb = 0x1ba + 1. In fact, the above sequence, in the new version of the table, is just

...
    ('\u{d8}', '\u{f6}'), ('\u{f8}', '\u{2c1}')
...

Net, unicode.rs goes from 360kB to 309kB after this change.

In case you're wondering, the reason this was happening is because of the way that the Unicode-provided tables were formatted. The function that imports these tables, load_properties(), was implicitly assuming that all sequential ranges in the tables were collapsed, but this isn't the case. The very small addition, among all the cuts in unicode.py, that fixes this, is at the end of the load_properties(), function:

for prop in props:
    props[prop] = group_cat(ungroup_cat(props[prop]))

@BurntSushi I think the reason unicode.py is no longer recognizable as the script you wrote is because at some point I merged your original code into the script that generated the rest of tables.rs for libunicode. That was probably the same patch that made libregex use tables from libunicode rather than having its own tables.

Anyhow, as you say, a lot of the stuff that this patch removes from unicode.py is dead code that was previously used to generate tables for other functions in libunicode.

@alexcrichton
Copy link
Member

Nice! Thanks @kwantam!

alexcrichton added a commit that referenced this pull request Apr 16, 2015
optimize generated tables ; clean up unicode.py
@alexcrichton alexcrichton merged commit aaa28c4 into rust-lang:master Apr 16, 2015
kwantam added a commit to kwantam/rust that referenced this pull request Apr 18, 2015
Apply optimization described in
rust-lang/regex#73 (comment)
to rust's copy of `unicode.py`.

This shrinks librustc_unicode's tables.rs from 479kB to 456kB,
and should improve performance slightly for related operations
(e.g., is_alphabetic(), is_xid_start(), etc).

In addition, pull in fix from @dscorbett's commit
d25c39f86568a147f9b7080c25711fb1f98f056a in regex, which
makes `load_properties()` more tolerant of whitespace
in the Unicode tables. (This fix does not result in any
changes to tables.rs, but could if the Unicode tables
change in the future.)
bors added a commit to rust-lang/rust that referenced this pull request Apr 18, 2015
Apply optimization described in
rust-lang/regex#73 (comment)
to rust's copy of `unicode.py`.

This shrinks librustc_unicode's tables.rs from 479kB to 456kB,
and should improve performance slightly for related operations
(e.g., is_alphabetic(), is_xid_start(), etc).

In addition, pull in fix from @dscorbett's commit
d25c39f86568a147f9b7080c25711fb1f98f056a in regex, which
makes `load_properties()` more tolerant of whitespace
in the Unicode tables. (This fix does not result in any
changes to tables.rs, but could if the Unicode tables
change in the future.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants