Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change RegExps to work on 16-bit wide characters on all platforms. Removes wchar_t refs #566

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

tcr
Copy link
Member

@tcr tcr commented Oct 11, 2014

This removes the use of "mbtowcs", replacing it with an explicit UCS-2 function and and hsregex patch that fixes its characters to 16-bits wide.

@tcr tcr changed the title [NRY] Remove use of mbtowcs in hsregex. Change RegExps to work on 16-bit wide characters on all platforms. Removes wchar_t refs Oct 12, 2014
@tcr tcr force-pushed the tcr-mbtowcs branch 2 times, most recently from 625aecc to debeeb0 Compare October 12, 2014 05:01
@tcr
Copy link
Member Author

tcr commented Oct 13, 2014

Will dig through the comments and update this patch. (Github's interface is really confusing me with its comments UX today.)

@tcr
Copy link
Member Author

tcr commented Oct 14, 2014

@natevw Updated. To recap, hsregex is blindly using 16-bit wide code points and so we can seamlessly get away with UTF16, because we'll never have a non-surrogate pair originating from a colony-internal string representation. (Hence the function name tm_str_to_utf16.)

Also I printed out your string diagram and pasted it on our wall.

@tcr tcr force-pushed the tcr-mbtowcs branch 2 times, most recently from 15bb22e to 383455c Compare October 14, 2014 02:12
@natevw
Copy link
Contributor

natevw commented Oct 14, 2014

Looking much better, although I think you should change the signature (and relevant internals) to const uint16_t ** const dstptr — this not only will turn *((uint16_t *) (buf + buf_pos)) = uchar; into a much clearer buf[buf_pos] = (uint16_t) uchar; but makes it clear to callers that it is native endian (i.e. a chunk of 16-bit ints) rather than a ??? endian chunk of bytes.

@tcr
Copy link
Member Author

tcr commented Oct 15, 2014

@natevw I've changed the function definition to take a tm_endian_t argument, as the buffer methods do. Hopefully this will be useful with the Buffer encoding options as well.

@natevw
Copy link
Contributor

natevw commented Oct 17, 2014

@tcr That works, sorry for not reviewing sooner. R+ assuming it actually works in practice too ;-)

natevw added a commit that referenced this pull request Nov 26, 2014
…uffer methods need, this way #566 will be able to share our tm_str_to_utf16 implemenattion
natevw added a commit that referenced this pull request Nov 26, 2014
…uffer methods need, this way #566 will be able to share our tm_str_to_utf16 implemenattion
tm-rampart added a commit that referenced this pull request Dec 4, 2014
Found this larking about.

NOTE: this is actually the only (current!) user of tm_endian_t, but it has been left since e.g. #645 and/or #566 will want it
natevw added a commit that referenced this pull request Dec 15, 2014
…uffer methods need, this way #566 will be able to share our tm_str_to_utf16 implemenattion
natevw added a commit that referenced this pull request Feb 20, 2015
…uffer methods need, this way #566 will be able to share our tm_str_to_utf16 implemenattion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants