optimize generated tables ; clean up unicode.py #73

kwantam · 2015-04-16T07:07:52Z

There was an easy opportunity to better optimize the tables generated
by unicode.py. Not sure why I didn't catch this long ago, but in any
case now the tables are substantially smaller and should maybe improve
performance slightly.

There was also some dead code sitting in unicode.py that I pulled out.

There was an easy opportunity to better optimize the tables generated by unicode.py. Not sure why I didn't catch this long ago, but in any case now the tables are substantially smaller and should maybe improve performance slightly. There was also some dead code sitting in unicode.py that I pulled out.

rust-highfive · 2015-04-16T07:08:04Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @huonw (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way Github handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see CONTRIBUTING.md for more information.

huonw · 2015-04-16T08:35:04Z

Could you give a high-level summary of how the tables changed? (github isn't showing the diff and I'm finding it hard to visualise from just the python changes.)

BurntSushi · 2015-04-16T10:58:59Z

My suspicion is that this script was once used to generate other Unicode data that wasn't used by regex (e.g., east asian character widths?). So this is probably trimming it back down?

@kwantam I'm still interested in a high level summary though. I think I wrote the original version of this over a year ago for regex stuff only, but I no longer understand the code.

kwantam · 2015-04-16T16:24:06Z

Quick background: the property tables that unicode.py generates are sorted arrays of non-overlapping ordered pairs of chars, (x, y), x<=y. When matching against a certain property (say, with \p{Alphabetic}), the regex code does a binary search in the corresponding table to identify whether there is a pair that contains a given character.

Now, in some cases, the tables that were previously generated contained ordered pairs that should have been merged together. For example, derived_property::Alphabetic_table previously contained a sequence like this:

...
    ('\u{d8}', '\u{f6}'), ('\u{f8}', '\u{1ba}'), ('\u{1bb}', '\u{1bb}'),
    ('\u{1bc}', '\u{1bf}'), ('\u{1c0}', '\u{1c3}'), ('\u{1c4}', '\u{293}'),
    ('\u{294}', '\u{294}'), ('\u{295}', '\u{2af}'), ('\u{2b0}', '\u{2c1}'),
...

Note that, for example, ('\u{f8}', '\u{1ba}'), ('\u{1bb}', '\u{1bb}') can be collapsed into a single range, because 0x1bb = 0x1ba + 1. In fact, the above sequence, in the new version of the table, is just

...
    ('\u{d8}', '\u{f6}'), ('\u{f8}', '\u{2c1}')
...

Net, unicode.rs goes from 360kB to 309kB after this change.

In case you're wondering, the reason this was happening is because of the way that the Unicode-provided tables were formatted. The function that imports these tables, load_properties(), was implicitly assuming that all sequential ranges in the tables were collapsed, but this isn't the case. The very small addition, among all the cuts in unicode.py, that fixes this, is at the end of the load_properties(), function:

for prop in props:
    props[prop] = group_cat(ungroup_cat(props[prop]))

@BurntSushi I think the reason unicode.py is no longer recognizable as the script you wrote is because at some point I merged your original code into the script that generated the rest of tables.rs for libunicode. That was probably the same patch that made libregex use tables from libunicode rather than having its own tables.

Anyhow, as you say, a lot of the stuff that this patch removes from unicode.py is dead code that was previously used to generate tables for other functions in libunicode.

alexcrichton · 2015-04-16T16:49:10Z

Nice! Thanks @kwantam!

optimize generated tables ; clean up unicode.py

@dscorbett

Apply optimization described in rust-lang/regex#73 (comment) to rust's copy of `unicode.py`. This shrinks librustc_unicode's tables.rs from 479kB to 456kB, and should improve performance slightly for related operations (e.g., is_alphabetic(), is_xid_start(), etc). In addition, pull in fix from @dscorbett's commit d25c39f86568a147f9b7080c25711fb1f98f056a in regex, which makes `load_properties()` more tolerant of whitespace in the Unicode tables. (This fix does not result in any changes to tables.rs, but could if the Unicode tables change in the future.)

@dscorbett

Apply optimization described in rust-lang/regex#73 (comment) to rust's copy of `unicode.py`. This shrinks librustc_unicode's tables.rs from 479kB to 456kB, and should improve performance slightly for related operations (e.g., is_alphabetic(), is_xid_start(), etc). In addition, pull in fix from @dscorbett's commit d25c39f86568a147f9b7080c25711fb1f98f056a in regex, which makes `load_properties()` more tolerant of whitespace in the Unicode tables. (This fix does not result in any changes to tables.rs, but could if the Unicode tables change in the future.)

rust-highfive assigned huonw Apr 16, 2015

alexcrichton added a commit that referenced this pull request Apr 16, 2015

Merge pull request #73 from kwantam/master

aaa28c4

optimize generated tables ; clean up unicode.py

alexcrichton merged commit aaa28c4 into rust-lang:master Apr 16, 2015

kwantam mentioned this pull request Apr 18, 2015

optimize Unicode tables rust-lang/rust#24560

Merged

kwantam unassigned huonw Sep 21, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimize generated tables ; clean up unicode.py #73

optimize generated tables ; clean up unicode.py #73

kwantam commented Apr 16, 2015

rust-highfive commented Apr 16, 2015

huonw commented Apr 16, 2015

BurntSushi commented Apr 16, 2015

kwantam commented Apr 16, 2015

alexcrichton commented Apr 16, 2015

optimize generated tables ; clean up unicode.py #73

optimize generated tables ; clean up unicode.py #73

Conversation

kwantam commented Apr 16, 2015

rust-highfive commented Apr 16, 2015

huonw commented Apr 16, 2015

BurntSushi commented Apr 16, 2015

kwantam commented Apr 16, 2015

alexcrichton commented Apr 16, 2015