Parser vulnerable to Trojan Source attack #11392

straight-shoota · 2021-11-01T16:55:53Z

A recently released paper titled Trojan Source: Invisible Vulnerabilities demonstrates an attack against source code. It uses Unicode bi-directional overrides to disguise the meaning of code to a human reader. This can lead to seemingly harmless code introducing malicious behaviour.

Crystal demonstration

The following code demonstrates a stretched-string attack in Crystal:

access_level = "user"
if access_level != "user‮ ⁦# Check if admin⁩ ⁦"
  puts "You are an admin!"
end

https://carc.in/#/r/c6ka

The following code demonstrates a commenting-out attack in Crystal:

access_level = "user"
if access_level != "none‮⁦" # Check if admin⁩⁦" && access_level != "user
  puts "You are an admin!"
end

https://carc.in/#/r/c6kh

They looks mostly unsuspicious. You wouldn't expect either to print anything. But both programs actually print You are an admin! despite access_level = "user".

The second lines of each program's source code contain a number of Unicode control characters for bi-directional overrides. This is what the parser reads:

# stretched-string attack
if access_level != "user\u202E \u2066# Check if admin\u2069 \u2066"
# commenting-out attack
if access_level != "none\u202E\u2066"# Check if admin\u2069\u2066" && access_level != "user

The only indicator that something might be off is the syntax highlighting, which should be pretty resistant to being fooled.
Github has already introduced a feature that shows a warning when bi-directional overrides are detected in a file: https://github.blog/changelog/2021-10-31-warning-about-bidirectional-unicode-text/

Mitigation

This vulnerability can be defended easily by disallowing bi-directional control characters in source code.
In many locations, such control characters are already a syntax error. But they are currently valid in comments and string literals. Those are the typical spots for most languages.

However, Crystal's parser currently even accepts Unicode control characters in identifiers, including bi-directional override characters. Restricting the allowed character set in general is another problem and tracked in #11216.

I propose to change the language specification and lexer rules such that valid Crystal source code must not contain any bi-directional control characters, regardles of location.

A more fine-grained approach would be possible as well, but this should be unnecessary considering there are little to no legitimate use cases for bidirectional control characters in Crystal source code (but for some specific exceptions mentioned in the following section).

Workarounds

Bi-directional override characters are legitimate contents for string literals. Instead of encoding them directly in the source code, a proper workaround is to use escape sequences for that.

Bi-directional overrides can technically be legitimate in comments if you want to mix languages with different directions in the comment text. That does not seem like a very important use case, though.

Still, as a further enhancement, bi-directional overrides could potentially be allowed in comments and possibly other locations such as string literals as long as they are fully enclosed inside the comment or literal.

The general vulnerability is tracked as CVE-2021-42574.

The text was updated successfully, but these errors were encountered:

naqvis · 2021-11-02T06:21:31Z

I propose to change the language specification and lexer rules such that valid Crystal source code must not contain any bi-directional control characters, regardles of location.

I will argue against this proposal as this is too aggressive. We might take insights from how other languages are handling such cases. Below is an excerpt from how Rust team is doing that:

Mitigations

We will be releasing Rust 1.56.1 today, 2021-11-01, with two new
deny-by-default lints detecting the affected codepoints, respectively in string
literals and in comments. The lints will prevent source code files containing
those codepoints from being compiled, protecting you from the attack.

If your code has legitimate uses for the codepoints we recommend replacing them
with the related escape sequence. The error messages will suggest the right
escapes to use.

asterite · 2021-11-02T09:05:13Z

Isn't that what Rust is doing, except that they suggest the escape codes to use?

straight-shoota · 2021-11-02T09:05:27Z

I don't see much difference, actually. AFAIK there should be no possibilty of such a character being valid anywhere in Crystal source code outside of string literals or comments. As mentioned, in those contexts we could employ a more sophisticated analysis and allow bi-directional control overwrites if they're fully enclosed in the respective literal or comment.

But that is an additional feature and requires more discussion on how to implement.

The primary focus is to apply a defence against this kind of source code modification.
Making bi-directional control characters invalid anywhere in a Crystal source file ensures that. With a more fine-grained approach we would need to make sure to identify and cover all vulnerable tokens. The lexer is not very restrictive in many places where you would expect it to be (see #11216). So this would be a real challenge and probably end up with a pretty similar result.

drhuffman12 · 2021-11-02T23:35:47Z

FWIW, see also python approach for unicode handling for key words: https://stackoverflow.com/questions/41757886/how-can-i-recognise-non-printable-unicode-characters-in-python

fmease · 2021-11-03T16:37:24Z

I don't see much difference, actually.

The difference between the proposed hard ban of these control characters in Crystal and Rust's implementation is the fact that the latter is a lint which throws an error by default (deny-by-default) but it can be overwritten by the user to not raise any error or to raise only a warning. This can be done at the level of a whole program but also at the level of individual modules, single functions or merely specific expressions. Just FYI.

straight-shoota · 2021-11-03T16:53:16Z

Yes, I understand Rust uses a different implementation approach. But I see little practical difference. We could consider adding an option ignore or warn instead of failing. In Crystal, however, we do not have a lint system which could offer fine-grained control for where it applies. I figure a global option would be less useful.

But is there even a reasonably valid use case for having bi-directional control characters in your Crystal source code? What would make you disable the bi-direction control character validation?

Fryguy · 2021-11-03T18:18:41Z

A even better question is whether or not there's any Crystal code out there with bi-directional control characters. Seems extremely unlikely that there is actual, legitimate, usage of this.

beta-ziliani · 2021-11-03T18:33:15Z

I'd say we take the most aggressive approach, and if we find the community is having legitimate uses of these characters, then we make an opt-out option. (We can do the opt-out right away, but this is more work to maintain.)

beta-ziliani · 2021-11-04T14:23:12Z

After some back and forth thinking, we decided to give us a time to discuss this properly, and not to rush it in the coming 1.2.2 version.

In a way, we can see three vectors for this type of attack in a language: string literals, identifiers, and comments. With Crystal having only CR-terminating comments, it's not clear that there can be an actual attack using comments, so let's focus on the other two vectors.

When it comes to identifiers, this is already discussed in #11216, and there are more Unicode characters than the bidi control ones that are problematic. I don't think there is much of a debate about that we need to properly handle these.

As for string literals, an option could be to let the format tool to turn bidi control chars into their escape sequence. That will allow us to not be restrictive, and just impose it as a formatting standard.

HertzDevil · 2023-01-09T12:43:54Z

I created a PR that warns on those unescaped characters, now that parser warnings are available. Then the follow-up would be escaping them whenever possible, which includes more things than string literals. For demonstration purposes, assume that 😂 is U+202E. Then the following:

"😂"
%(😂)
:😂
:"😂"
`😂`

def foo(😂 x)
end

foo(😂: 1)

FOO😂 = 1
%w(😂)

would be formatted into:

"\u202E"
%(\u202E)
:"\u202E" # note the addition of quotation marks
:"\u202E"
`\u202E`

def foo("\u202E" x)
end

foo("\u202E": 1)

FOO😂 = 1 # unchanged because escape sequences are unavailable here
%w(😂)    # ditto

#12740 is another example where warnings + formatter are used to phase out deprecated syntax in the language, so we should follow suit here.

straight-shoota · 2023-01-09T13:59:34Z

Format + warning sounds good to me.

straight-shoota · 2023-01-10T13:34:45Z

To make it clear for everyone: We're not talking about replacing emojis. 😂 is just a stand-in for a bi-directional control character (U+202E for example).

straight-shoota · 2023-01-10T13:36:14Z

@HertzDevil What about comments?

# 😂

Should that become this?

# \u202E

Comment's don't actively support unicode escape sequences, but then it's just comments 🤷

beta-ziliani · 2023-01-10T14:24:56Z

but isn't the problem in this example about bidi chars in comments?
That said, it'd be nice if we can still allow certain things like:

p "hello world" # مرحبا بالعالم

HertzDevil · 2023-01-10T15:28:54Z

I think that if all the problematic characters to the left of # are escaped, then the rendering of the non-comment part is never affected and an attack isn't possible, but I don't have a proof for this yet. If that is the case we don't need to escape anything within comments.

We cannot unconditionally escape all those characters in Crystal code blocks within comments, as that might produce invalid code. And then there are #<loc:...> pragmas which syntactically behave like /* ... */ comments in other languages if one disregards its effects on stack traces; the first argument is a string that ignores escape sequences and can even span across multiple lines.

In any case, even if we don't escape the comments automatically, the compiler and formatter shall warn on them every time anyway. By the way, here is an example of an attack that involves no string literals at all, only comments, made possible because these control characters are allowed in identifiers:

user_id‮⁦ = 1000
if user_id‮⁦ # Check if admin⁩⁦ != 1000
  puts "You are an admin!"
end

The escaped form is:

user_id\u202E\u2066 = 1000
if user_id\u202E\u2066 # Check if admin\u2069\u2066 != 1000
  puts "You are an admin!"
end

straight-shoota · 2023-01-10T16:21:39Z

For identifiers, I think the compiler should eventually reject bidi control characters (ref #11216).
But warning should be good as a start.

HertzDevil · 2023-09-05T21:24:14Z

There is now a draft Unicode Technical Standard for this: https://www.unicode.org/reports/tr55/#Spoofing-bidi

naqvis · 2023-09-06T04:27:58Z

There is now a draft Unicode Technical Standard for this: https://www.unicode.org/reports/tr55/#Spoofing-bidi

Nice to see 😃

The solution is not to forbid the directional formatting characters; indeed the Ada example above does not use these. This document instead makes two recommendations.

First, source code editors should display source code according to its lexical structure, as described in Section 4.1, Bidirectional Ordering.

Second, computer languages should allow for the insertion of directional formatting characters as described in Section 3.2, Whitespace and Syntax, and implementers should provide tools that automatically remove spurious directional formatting characters, and insert the correct ones, as described in Section 5.2, Conversion to Plain Text.

straight-shoota added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler:parser security labels Nov 1, 2021

straight-shoota added this to the 1.2.2 milestone Nov 1, 2021

straight-shoota mentioned this issue Nov 1, 2021

Fix disallow Unicode bi-directional control characters #11393

Closed

beta-ziliani removed this from the 1.2.2 milestone Nov 4, 2021

straight-shoota mentioned this issue Nov 6, 2021

String#inspect should escape non-printable characters #11406

Closed

joshuapassos mentioned this issue Nov 6, 2021

Parser vulnerable to Trojan Source attack dotnet/fsharp#12352

Open

straight-shoota mentioned this issue Dec 1, 2021

Formatter: Escape non-printable characters in literals #11520

Merged

straight-shoota mentioned this issue Jan 3, 2022

Revert "Restrict identifier grammar" #11687

Merged

HertzDevil mentioned this issue Jan 9, 2023

Warn on unescaped Unicode bi-directional control characters #12923

Open

This was referenced Jan 10, 2023

Add Block Comment Syntax #1401

Closed

Formatter should normalize symbol literals and similar fragments #12938

Open

HertzDevil mentioned this issue Feb 13, 2023

Formatter: escape bi-directional control characters within strings #13067

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parser vulnerable to Trojan Source attack #11392

Parser vulnerable to Trojan Source attack #11392

straight-shoota commented Nov 1, 2021

naqvis commented Nov 2, 2021

Mitigations

asterite commented Nov 2, 2021

straight-shoota commented Nov 2, 2021 •

edited

Loading

drhuffman12 commented Nov 2, 2021

fmease commented Nov 3, 2021

straight-shoota commented Nov 3, 2021

Fryguy commented Nov 3, 2021

beta-ziliani commented Nov 3, 2021

beta-ziliani commented Nov 4, 2021

HertzDevil commented Jan 9, 2023 •

edited

Loading

straight-shoota commented Jan 9, 2023

straight-shoota commented Jan 10, 2023 •

edited

Loading

straight-shoota commented Jan 10, 2023

beta-ziliani commented Jan 10, 2023

HertzDevil commented Jan 10, 2023 •

edited

Loading

straight-shoota commented Jan 10, 2023 •

edited

Loading

HertzDevil commented Sep 5, 2023

naqvis commented Sep 6, 2023

Parser vulnerable to Trojan Source attack #11392

Parser vulnerable to Trojan Source attack #11392

Comments

straight-shoota commented Nov 1, 2021

Crystal demonstration

Mitigation

Workarounds

naqvis commented Nov 2, 2021

Mitigations

asterite commented Nov 2, 2021

straight-shoota commented Nov 2, 2021 • edited Loading

drhuffman12 commented Nov 2, 2021

fmease commented Nov 3, 2021

straight-shoota commented Nov 3, 2021

Fryguy commented Nov 3, 2021

beta-ziliani commented Nov 3, 2021

beta-ziliani commented Nov 4, 2021

HertzDevil commented Jan 9, 2023 • edited Loading

straight-shoota commented Jan 9, 2023

straight-shoota commented Jan 10, 2023 • edited Loading

straight-shoota commented Jan 10, 2023

beta-ziliani commented Jan 10, 2023

HertzDevil commented Jan 10, 2023 • edited Loading

straight-shoota commented Jan 10, 2023 • edited Loading

HertzDevil commented Sep 5, 2023

naqvis commented Sep 6, 2023

straight-shoota commented Nov 2, 2021 •

edited

Loading

HertzDevil commented Jan 9, 2023 •

edited

Loading

straight-shoota commented Jan 10, 2023 •

edited

Loading

HertzDevil commented Jan 10, 2023 •

edited

Loading

straight-shoota commented Jan 10, 2023 •

edited

Loading