Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

faster isascii(::Char) #48281

Closed
wants to merge 2 commits into from
Closed

faster isascii(::Char) #48281

wants to merge 2 commits into from

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Jan 14, 2023

The previous implementation of isascii(c::Char), by @StefanKarpinski in #24999, was bswap(reinterpret(UInt32, c)) < 0x80. However, I noticed that base/io.jl is using a nicer trick: c ≤ '\x7f', which oddly enough is also by @StefanKarpinski in #24999. The latter seems to compile to one fewer instruction because it omits the bswap.

@StefanKarpinski, any comment here?

The PR also cleans up io.jl to use isascii(c) rather than c ≤ '\x7f', since they should now be equivalent.

@stevengj stevengj added performance Must go faster strings "Strings!" labels Jan 14, 2023
@matthias314
Copy link
Contributor

The new version looks cleaner, but are you sure that it's faster? On master I've tried

s = String(' ':'~')
@btime all(isascii, $s)

and the run time is the same as before.

@stevengj
Copy link
Member Author

The new version looks cleaner, but are you sure that it's faster?

Not sure; it's really hard to benchmark such a simple function. However, I'm operating on the principle that 1 cmp instruction is better than 1 cmp + 1 bswap.

Also, it's weird to have two different versions of the isascii check in our codebase — seems like we should settle on one, and we might as well settle on the one with fewer instructions.

@stevengj
Copy link
Member Author

Actually, it looks like the c ≤ '\x7f' is simpler because it is simply wrong for malformed Char

julia> c = reinterpret(Char, 0x7f000000 - 0x01)
'\x7e\xff\xff\xff': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> c  '\x7f'
true

julia> isascii(c)
false

So, this actually seems like a bug in io.jl.

I'll close this PR then, change io.jl to use isascii, and add a test.

@stevengj stevengj closed this Jan 15, 2023
@giordano giordano deleted the stevengj-patch-3 branch January 15, 2023 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants