Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser vulnerable to Trojan Source attack #11392

Open
straight-shoota opened this issue Nov 1, 2021 · 18 comments
Open

Parser vulnerable to Trojan Source attack #11392

straight-shoota opened this issue Nov 1, 2021 · 18 comments
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. security topic:compiler:parser

Comments

@straight-shoota
Copy link
Member

A recently released paper titled Trojan Source: Invisible Vulnerabilities demonstrates an attack against source code. It uses Unicode bi-directional overrides to disguise the meaning of code to a human reader. This can lead to seemingly harmless code introducing malicious behaviour.

Crystal demonstration

The following code demonstrates a stretched-string attack in Crystal:

access_level = "user"
if access_level != "user‮ ⁦# Check if admin⁩ ⁦"
  puts "You are an admin!"
end

https://carc.in/#/r/c6ka

The following code demonstrates a commenting-out attack in Crystal:

access_level = "user"
if access_level != "none‮⁦" # Check if admin⁩⁦" && access_level != "user
  puts "You are an admin!"
end

https://carc.in/#/r/c6kh

They looks mostly unsuspicious. You wouldn't expect either to print anything. But both programs actually print You are an admin! despite access_level = "user".

The second lines of each program's source code contain a number of Unicode control characters for bi-directional overrides. This is what the parser reads:

# stretched-string attack
if access_level != "user\u202E \u2066# Check if admin\u2069 \u2066"
# commenting-out attack
if access_level != "none\u202E\u2066"# Check if admin\u2069\u2066" && access_level != "user

The only indicator that something might be off is the syntax highlighting, which should be pretty resistant to being fooled.
Github has already introduced a feature that shows a warning when bi-directional overrides are detected in a file: https://github.blog/changelog/2021-10-31-warning-about-bidirectional-unicode-text/

Mitigation

This vulnerability can be defended easily by disallowing bi-directional control characters in source code.
In many locations, such control characters are already a syntax error. But they are currently valid in comments and string literals. Those are the typical spots for most languages.

However, Crystal's parser currently even accepts Unicode control characters in identifiers, including bi-directional override characters. Restricting the allowed character set in general is another problem and tracked in #11216.

I propose to change the language specification and lexer rules such that valid Crystal source code must not contain any bi-directional control characters, regardles of location.

A more fine-grained approach would be possible as well, but this should be unnecessary considering there are little to no legitimate use cases for bidirectional control characters in Crystal source code (but for some specific exceptions mentioned in the following section).

Workarounds

Bi-directional override characters are legitimate contents for string literals. Instead of encoding them directly in the source code, a proper workaround is to use escape sequences for that.

Bi-directional overrides can technically be legitimate in comments if you want to mix languages with different directions in the comment text. That does not seem like a very important use case, though.

Still, as a further enhancement, bi-directional overrides could potentially be allowed in comments and possibly other locations such as string literals as long as they are fully enclosed inside the comment or literal.

The general vulnerability is tracked as CVE-2021-42574.

@straight-shoota straight-shoota added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler:parser security labels Nov 1, 2021
@straight-shoota straight-shoota added this to the 1.2.2 milestone Nov 1, 2021
@naqvis
Copy link
Contributor

naqvis commented Nov 2, 2021

I propose to change the language specification and lexer rules such that valid Crystal source code must not contain any bi-directional control characters, regardles of location.

I will argue against this proposal as this is too aggressive. We might take insights from how other languages are handling such cases. Below is an excerpt from how Rust team is doing that:

Mitigations

We will be releasing Rust 1.56.1 today, 2021-11-01, with two new
deny-by-default lints detecting the affected codepoints, respectively in string
literals and in comments. The lints will prevent source code files containing
those codepoints from being compiled, protecting you from the attack.

If your code has legitimate uses for the codepoints we recommend replacing them
with the related escape sequence. The error messages will suggest the right
escapes to use.

@asterite
Copy link
Member

asterite commented Nov 2, 2021

Isn't that what Rust is doing, except that they suggest the escape codes to use?

@straight-shoota
Copy link
Member Author

straight-shoota commented Nov 2, 2021

I don't see much difference, actually. AFAIK there should be no possibilty of such a character being valid anywhere in Crystal source code outside of string literals or comments. As mentioned, in those contexts we could employ a more sophisticated analysis and allow bi-directional control overwrites if they're fully enclosed in the respective literal or comment.

But that is an additional feature and requires more discussion on how to implement.

The primary focus is to apply a defence against this kind of source code modification.
Making bi-directional control characters invalid anywhere in a Crystal source file ensures that. With a more fine-grained approach we would need to make sure to identify and cover all vulnerable tokens. The lexer is not very restrictive in many places where you would expect it to be (see #11216). So this would be a real challenge and probably end up with a pretty similar result.

@drhuffman12
Copy link

FWIW, see also python approach for unicode handling for key words: https://stackoverflow.com/questions/41757886/how-can-i-recognise-non-printable-unicode-characters-in-python

@fmease
Copy link

fmease commented Nov 3, 2021

I don't see much difference, actually.

The difference between the proposed hard ban of these control characters in Crystal and Rust's implementation is the fact that the latter is a lint which throws an error by default (deny-by-default) but it can be overwritten by the user to not raise any error or to raise only a warning. This can be done at the level of a whole program but also at the level of individual modules, single functions or merely specific expressions. Just FYI.

@straight-shoota
Copy link
Member Author

Yes, I understand Rust uses a different implementation approach. But I see little practical difference. We could consider adding an option ignore or warn instead of failing. In Crystal, however, we do not have a lint system which could offer fine-grained control for where it applies. I figure a global option would be less useful.

But is there even a reasonably valid use case for having bi-directional control characters in your Crystal source code? What would make you disable the bi-direction control character validation?

@Fryguy
Copy link
Contributor

Fryguy commented Nov 3, 2021

A even better question is whether or not there's any Crystal code out there with bi-directional control characters. Seems extremely unlikely that there is actual, legitimate, usage of this.

@beta-ziliani
Copy link
Member

I'd say we take the most aggressive approach, and if we find the community is having legitimate uses of these characters, then we make an opt-out option. (We can do the opt-out right away, but this is more work to maintain.)

@beta-ziliani
Copy link
Member

After some back and forth thinking, we decided to give us a time to discuss this properly, and not to rush it in the coming 1.2.2 version.

In a way, we can see three vectors for this type of attack in a language: string literals, identifiers, and comments. With Crystal having only CR-terminating comments, it's not clear that there can be an actual attack using comments, so let's focus on the other two vectors.

When it comes to identifiers, this is already discussed in #11216, and there are more Unicode characters than the bidi control ones that are problematic. I don't think there is much of a debate about that we need to properly handle these.

As for string literals, an option could be to let the format tool to turn bidi control chars into their escape sequence. That will allow us to not be restrictive, and just impose it as a formatting standard.

@HertzDevil
Copy link
Contributor

HertzDevil commented Jan 9, 2023

I created a PR that warns on those unescaped characters, now that parser warnings are available. Then the follow-up would be escaping them whenever possible, which includes more things than string literals. For demonstration purposes, assume that 😂 is U+202E. Then the following:

"😂"
%(😂)
:😂
:"😂"
`😂`

def foo(😂 x)
end

foo(😂: 1)

FOO😂 = 1
%w(😂)

would be formatted into:

"\u202E"
%(\u202E)
:"\u202E" # note the addition of quotation marks
:"\u202E"
`\u202E`

def foo("\u202E" x)
end

foo("\u202E": 1)

FOO😂 = 1 # unchanged because escape sequences are unavailable here
%w(😂)    # ditto

#12740 is another example where warnings + formatter are used to phase out deprecated syntax in the language, so we should follow suit here.

@straight-shoota
Copy link
Member Author

Format + warning sounds good to me.

@straight-shoota
Copy link
Member Author

straight-shoota commented Jan 10, 2023

To make it clear for everyone: We're not talking about replacing emojis. 😂 is just a stand-in for a bi-directional control character (U+202E for example).

@straight-shoota
Copy link
Member Author

@HertzDevil What about comments?

# 😂

Should that become this?

# \u202E

Comment's don't actively support unicode escape sequences, but then it's just comments 🤷

@beta-ziliani
Copy link
Member

but isn't the problem in this example about bidi chars in comments?
That said, it'd be nice if we can still allow certain things like:

p "hello world" # مرحبا بالعالم

@HertzDevil
Copy link
Contributor

HertzDevil commented Jan 10, 2023

I think that if all the problematic characters to the left of # are escaped, then the rendering of the non-comment part is never affected and an attack isn't possible, but I don't have a proof for this yet. If that is the case we don't need to escape anything within comments.

We cannot unconditionally escape all those characters in Crystal code blocks within comments, as that might produce invalid code. And then there are #<loc:...> pragmas which syntactically behave like /* ... */ comments in other languages if one disregards its effects on stack traces; the first argument is a string that ignores escape sequences and can even span across multiple lines.

In any case, even if we don't escape the comments automatically, the compiler and formatter shall warn on them every time anyway. By the way, here is an example of an attack that involves no string literals at all, only comments, made possible because these control characters are allowed in identifiers:

user_id‮⁦ = 1000
if user_id‮⁦ # Check if admin⁩⁦ != 1000
  puts "You are an admin!"
end

The escaped form is:

user_id\u202E\u2066 = 1000
if user_id\u202E\u2066 # Check if admin\u2069\u2066 != 1000
  puts "You are an admin!"
end

@straight-shoota
Copy link
Member Author

straight-shoota commented Jan 10, 2023

For identifiers, I think the compiler should eventually reject bidi control characters (ref #11216).
But warning should be good as a start.

@HertzDevil
Copy link
Contributor

There is now a draft Unicode Technical Standard for this: https://www.unicode.org/reports/tr55/#Spoofing-bidi

@naqvis
Copy link
Contributor

naqvis commented Sep 6, 2023

There is now a draft Unicode Technical Standard for this: https://www.unicode.org/reports/tr55/#Spoofing-bidi

Nice to see 😃

The solution is not to forbid the directional formatting characters; indeed the Ada example above does not use these. This document instead makes two recommendations.

First, source code editors should display source code according to its lexical structure, as described in Section 4.1, Bidirectional Ordering.

Second, computer languages should allow for the insertion of directional formatting characters as described in Section 3.2, Whitespace and Syntax, and implementers should provide tools that automatically remove spurious directional formatting characters, and insert the correct ones, as described in Section 5.2, Conversion to Plain Text.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. security topic:compiler:parser
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants