Parser vulnerable to Trojan Source attack #12352

joshuapassos · 2021-11-06T21:38:48Z

A recent paper, Trojan Source: Invisible Vulnerabilities, demonstrates an attack against source code. It uses Unicode bidirectional overrides to mislead a human reader about the meaning of the code.

Repro steps

let access_level = "user"

[<EntryPoint>]
let main _ =
  if access_level <> "user‮⁦ (* Check if admin *)⁩⁦" then
    printf "You are an admin.\n"
  0

Only by selecting the text of the conditional with the mouse is it possible to see that something is different.

Here I have an example to reproduce the problem

Expected behavior

Maybe a compiler error with a message such as "Invalid Unicode character".

Actual behavior

You are an admin.

Known workarounds

I don't know

Related information
Crystal lang discussion about this: crystal-lang/crystal#11392
Site about the problem: https://trojansource.codes/
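
For readers who cannot see the hidden characters in the repro, here is a minimal Python sketch that makes them visible. The helper name `reveal` is hypothetical, and the code points in `literal` are an assumption about what the condition contains (the override and isolate sequence described in the paper), not a byte-exact copy of the snippet above.

```python
import unicodedata

# Bidirectional formatting characters discussed in the Trojan Source paper.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def reveal(text):
    """Replace each bidi control with a visible <U+XXXX NAME> tag."""
    out = []
    for ch in text:
        if ch in BIDI_CONTROLS:
            out.append("<U+{:04X} {}>".format(ord(ch), unicodedata.name(ch)))
        else:
            out.append(ch)
    return "".join(out)

# Assumed content of the string literal in the repro, written with escapes.
literal = "user\u202e\u2066 (* Check if admin *)\u2069\u2066"
print(reveal(literal))
```

Printed this way, the "comment" is visibly part of the string literal, which is why the comparison against "user" does not behave as it reads.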


smoothdeveloper · 2021-11-06T22:25:58Z

I'd like to have a CI step that ensures the code making it into FSC and derivatives doesn't contain any such hidden Unicode characters.

KevinRansom · 2021-11-06T23:54:54Z

@joshuapassos

It is certainly a novel attack, and one that open source repos are going to be particularly sensitive to ... we will certainly have to figure out how to react to it.

My naive first thoughts are that we need to:

1. Disallow bi-directional control characters in source code.
2. Enable an escaping syntax that allows them to be used in the places listed below. The escaped bi-directional code will be rendered literally in the generated binary, or as escaped XML in the emitted doc comments.
   - String literals
   - Comments (because of doc comments)
   - `` identifiers
3. Add a CI step that simply scans the files for any of these bidirectional control characters in the source code, and fails when it finds one.
4. Obviously this is a breaking change, so we will need a switch to turn the behavior off, which we can remove after a reasonable period of time.
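
The CI scan described in the list above could look roughly like the following Python sketch. It is an illustration only, not anything the F# repository uses: the function names `find_bidi`, `scan`, and `main` are hypothetical, and the set of characters is assumed to be the nine embedding, override, and isolate controls named in the paper.

```python
from pathlib import Path

# Assumed set of characters to reject: LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI.
BIDI_CONTROLS = frozenset("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

def find_bidi(text):
    """Yield (line_number, column, code_point) for each bidi control in text."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield line_no, col, "U+{:04X}".format(ord(ch))

def scan(paths):
    """Print every finding in the given files and return how many there were."""
    findings = 0
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for line_no, col, cp in find_bidi(text):
            print("{}:{}:{}: bidirectional control character {}".format(path, line_no, col, cp))
            findings += 1
    return findings

def main(argv):
    """Return a process exit code: 1 if anything was found, otherwise 0."""
    return 1 if scan(argv) else 0

# A CI wrapper would end with: sys.exit(main(sys.argv[1:]))
```

A non-zero exit code is what makes the CI step fail. The opt-out switch from the last list item is not modelled here.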

/cc @dsyme , @vzarytovskii , @jonsequitur , @brettfo , @smoothdeveloper

Happypig375 · 2021-11-07T07:28:19Z

Oh the slippery slope of banning more and more Unicode characters from source code.

Figure 1. Using a zero-width space.

let access_level = "user"
if access_level <> "user" then
    printf "You are an admin."

You are an admin.

Figure 2. The homoglyphic attack as outlined in the same paper, here we use U+0435 CYRILLIC SMALL LETTER IE

let access_level = "user"
if access_level <> "usеr" then
    printf "You are an admin."

You are an admin.

Figure 3. Another homoglyphic attack, here we have U+1D5BE Mathematical Sans-Serif Small E

let access_level = "user"
if access_level <> "us𝖾r" then
    printf "You are an admin."

You are an admin.
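
To show how the three figures differ from a checker's point of view, here is a rough Python heuristic. It is a sketch under stated assumptions, not a real confusable detector: the function names are hypothetical, and the first word of each character's Unicode name stands in for the Script property, which Python's standard library does not expose.

```python
import unicodedata

def scripts_used(text):
    """Return the first word of the Unicode name of every letter in text.

    This is a rough stand-in for the Script property (LATIN, CYRILLIC,
    MATHEMATICAL, ...), not the real thing.
    """
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in text if ch.isalpha()}

def has_format_chars(text):
    """True if text contains invisible format characters (category Cf)."""
    return any(unicodedata.category(ch) == "Cf" for ch in text)

def looks_suspicious(text):
    """Flag mixed scripts, invisible format characters, or NFKC-unstable text."""
    return (
        len(scripts_used(text)) > 1
        or has_format_chars(text)
        or unicodedata.normalize("NFKC", text) != text
    )

print(looks_suspicious("user"))            # plain ASCII
print(looks_suspicious("us\u200ber"))      # Figure 1: zero-width space
print(looks_suspicious("us\u0435r"))       # Figure 2: Cyrillic IE
print(looks_suspicious("us\U0001D5BEr"))   # Figure 3: mathematical sans-serif e
```

A heuristic like this also flags plenty of legitimate text, for example a Russian sentence that mentions a Latin identifier. That is the same trade-off this comment is pointing at.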

Let's go ban all these characters in source! I can't wait for me having to escape all the special characters when pasting online text. Oh, resource files and text configurations will still face the same attacks anyways. Let's ban these characters from all text readers and config readers too. These characters can burn in hell! /s

The proper way to prevent these is

Not using strings (Use DUs, where duplicate definitions will be apparent) and defining identifiers for the same strings
Testing and code coverage

Untested code invites bugs, and such so-called attacks can easily be prevented with proper architecture and testing.
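
The "don't compare strings" advice can be sketched as follows. Python has no discriminated unions, so an `Enum` is used here as the closest analogue; in F# the equivalent would be a DU along the lines of `type AccessLevel = User | Admin`. The names `AccessLevel`, `parse_level`, and `is_admin` are hypothetical.

```python
from enum import Enum

class AccessLevel(Enum):
    USER = "user"
    ADMIN = "admin"

def parse_level(raw):
    """Convert untrusted text once, at the boundary; unknown values raise ValueError."""
    return AccessLevel(raw)

def is_admin(level):
    # Comparing members means the string literal is not repeated at each check.
    return level is AccessLevel.ADMIN

print(is_admin(parse_level("user")))
try:
    parse_level("us\u0435r")  # Cyrillic IE instead of Latin e
except ValueError as err:
    print("rejected:", err)
```

The look-alike string never reaches the comparison: it fails at the single place where text is converted to a typed value, which a test is likely to cover.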

smoothdeveloper · 2021-11-07T09:04:08Z

> Not using strings (Use DUs, where duplicate definitions will be apparent) and defining identifiers for the same strings

There are enough magic strings in the compiler, and probably no easy way to figure out whether they could create issues like those showcased in that paper.

To me, it seems good enough as a first measure to have a sanity check in CI that rules out those tricks in current and future code. I think fixing the compiler itself is a separate issue, and what @KevinRansom outlined above gives a good overview of an approach that can be refined once it is set in motion.

Step 3 should be a priority and the rest can probably wait to see how the industry is adjusting.

As an aside, there is also this paper, which caused a similar stir: https://github.com/QiushiWu/qiushiwu.github.io/raw/main/papers/OpenSourceInsecurity.pdf

I'm all in favor of adding static checks and the like, but not so much of an approach based on "gauging contribution safety based on what the owner knows about the contributor" (which the paper kind of hints at as a way to increase security). Contributions should be accepted from anyone, without the contributor needing to disclose anything, purely on the basis of merit and how well the change fits the evolution of the language.

> Let's go ban all these characters in source!

I understand the concern, as a person also expressing myself beyond the realm of ASCII :)

dsyme · 2021-11-08T12:58:05Z

I'd like to see GitHub roll out a solution across the entire platform.

smoothdeveloper · 2021-11-08T14:28:49Z

If getting upstream support works just like that, I'll add, I'd like to see peace on earth :)

vzarytovskii · 2021-11-08T14:47:53Z

> I'd like to see GitHub roll out a solution across the entire platform.

Yeah, it makes sense for tooling to solve it in the first place IMO; "fixing" it in the compiler will mean a (potentially) huge breaking change. I'm not saying we should do nothing, though.

I guess the first step should be a CI leg that checks PRs for control characters.

dsyme · 2021-11-08T14:55:49Z

Oh I've no problem with solving it in the compiler and we should. Absolutely no one uses these characters in source code and if they do I don't care if we force them to use \UNNNN characters or give a special command-line flag.

But the problem is so systemic that it needs to be solved in GitHub too; they should immediately scan all of GitHub for these characters and emit dependabot issue warnings.

Happypig375 · 2021-11-08T15:01:43Z

@dsyme Well I'd hate to have certain characters be banned just because they can be abused. You close one avenue of attack, you leave lots more open, this is not fixing anything, unless you ban all the characters as in my comment too.

Happypig375 · 2021-11-08T15:06:01Z

@smoothdeveloper I am okay with individual code base owners banning certain characters, anyone can enforce any coding standard. However to ban characters on the compiler level is another thing - even tabs are allowed in string literals and comments.

dsyme · 2021-11-08T16:38:27Z

> @dsyme ...this is not fixing anything, unless you ban all the characters as in my comment too.

To be clear, such characters would be allowed in strings using \UNNNN format. That seems entirely reasonable to me.

I'd have no particular problem with banning them in comments, though they probably aren't as significant there.
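
As a rough illustration of the escaping idea, the Python sketch below rewrites raw bidirectional controls as `\uXXXX` escapes, the four-digit form that F# string literals accept for BMP characters. The helper name `escape_bidi` is hypothetical. The sketch blindly rewrites whatever text it is given, so it is only meaningful for the contents of regular string literals, not for verbatim or triple-quoted strings or for comments.

```python
# LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI (all in the Basic Multilingual Plane).
BIDI_CONTROLS = frozenset("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

def escape_bidi(source):
    """Replace raw bidi controls with \\uXXXX escapes, leaving other text as-is."""
    return "".join(
        "\\u{:04x}".format(ord(ch)) if ch in BIDI_CONTROLS else ch
        for ch in source
    )

print(escape_bidi('"user\u202e\u2066 (* Check if admin *)\u2069\u2066"'))
```

After such a rewrite the literal renders in the same order in every editor, because no directional formatting character is left for the display layer to act on.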

KevinRansom · 2021-11-08T17:51:25Z

There is an official Microsoft response to this issue, I provide the text here:

Microsoft was made aware of this issue prior to the disclosure, and MSRC made the recommendation that this be addressed in future releases of our code editors. Per our bug bar (https://aka.ms/windowsbugbar) and internal policies, this did not warrant an immediate security update.

Teams are aware of this issue, and they are working on improvements to the code editing process while limiting impact to customers that may have legitimate uses for these characters based on the languages used on their systems.

KevinRansom · 2021-11-08T17:59:32Z

MSRC is the Microsoft Security Response Center: https://www.microsoft.com/en-us/msrc

Happypig375 · 2021-11-14T13:42:32Z

@dsyme So are we requiring Cyrillic characters or Greek characters be escaped or not?

zanaptak · 2021-11-14T16:14:21Z

I hope there is not a knee-jerk reaction to ban or require escaping of certain Unicode characters. That strikes me as a narrow, English-language-centric viewpoint.

Surely there is legitimate use in strings when writing applications that will be viewed by non-English users. And coders, while obliged to use English keywords, might prefer to add comments in their native script. Escaping is not a good alternative; it hurts readability, especially in the case of the bidirectional codes in question -- imagine having to write all your strings/comments backwards because you are unable to embed the necessary codes.

Perhaps a better and broader approach is to have proper Unicode handling overall, rather than the incomplete support currently in place (for example reliance on legacy System.Char methods that don't understand non-BMP characters - #9600), and having intelligent syntax rules based on real understanding of Unicode semantics.
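
The non-BMP point can be made concrete with a small Python sketch. .NET strings are sequences of UTF-16 code units, so a character such as U+1D5BE is stored as two surrogates, and any check that looks at one code unit at a time never sees a letter. The helper names are hypothetical, and this only models the encoding; it does not call any .NET API.

```python
def utf16_units(text):
    """Return the UTF-16 code units that would represent text."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def is_surrogate(unit):
    """True for code units in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= unit <= 0xDFFF

units = utf16_units("\U0001D5BE")  # MATHEMATICAL SANS-SERIF SMALL E
print([hex(u) for u in units])
print(all(is_surrogate(u) for u in units))
```

Classifying the character correctly requires combining the pair back into a single code point first, which is the kind of Unicode-aware handling this comment asks for.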

Happypig375 · 2021-11-15T01:51:12Z

If we set aside the homoglyph attack and only care about code being moved across string terminators, we can just ban unterminated bidirectional control sequences across string termination, as mentioned in the paper.
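
A check for unterminated bidirectional controls inside a single literal might look like the Python sketch below. It is a simplified model of the pairing rules (embeddings and overrides are closed by PDF, isolates by PDI, and a PDI also ends anything opened inside its isolate), not a full implementation of the Unicode Bidirectional Algorithm. The function name and the example literal are assumptions.

```python
EMBED_OPENERS = frozenset("\u202a\u202b\u202d\u202e")  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = frozenset("\u2066\u2067\u2068")      # LRI, RLI, FSI
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def has_unterminated_bidi(literal):
    """True if literal opens an embedding, override, or isolate it never closes."""
    stack = []  # the closer each still-open control is waiting for
    for ch in literal:
        if ch in EMBED_OPENERS:
            stack.append(PDF)
        elif ch in ISOLATE_OPENERS:
            stack.append(PDI)
        elif ch == PDI:
            # Closes the nearest open isolate and everything opened inside it.
            if PDI in stack:
                while stack.pop() != PDI:
                    pass
        elif ch == PDF:
            # Closes the innermost embedding/override, but never crosses an isolate.
            if stack and stack[-1] == PDF:
                stack.pop()
    return bool(stack)

# Assumed content of the literal from the original repro: rejected.
print(has_unterminated_bidi("user\u202e\u2066 (* Check if admin *)\u2069\u2066"))
# A properly closed right-to-left isolate: accepted.
print(has_unterminated_bidi("label: \u2067abc\u2069"))
```

Under this rule, well-formed right-to-left text in strings and comments stays legal, and only controls whose effect would leak past the end of the literal are rejected.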

Serentty · 2021-11-25T19:44:44Z

> > @dsyme ...this is not fixing anything, unless you ban all the characters as in my comment too.
>
> To be clear, such characters would be allowed in strings using \UNNNN format. That seems entirely reasonable to me.
>
> I'd have no particular problem with banning them in comments, though probably aren't as significant there.

Directional overrides are probably more useful in comments than in identifiers, because comments are exactly the kind of place where you’re likely to be mixing scripts, compared to an identifier, which will probably be all in one script for ease of typing. But I do think that the whole “Trojan source” thing is a bit of an imagined security concern.

T-Gro · 2022-11-07T18:02:13Z

This is mainly a tooling issue.
Especially for open source projects (like F#), seeing this in GitHub review windows is the missing piece.

Example of how VS Code shows this:

T-Gro · 2022-11-07T18:03:08Z

joshuapassos added the Bug label Nov 6, 2021

dsyme added Feature Request and removed Bug labels Mar 3, 2022

dsyme added the Area-Compiler-Syntax lexfilter, indentation and parsing label Apr 20, 2022

vzarytovskii added this to F# Compiler and Tooling Jun 17, 2022

vzarytovskii moved this to Not Planned in F# Compiler and Tooling Jun 17, 2022

vzarytovskii added this to the Backlog milestone Oct 19, 2022

vzarytovskii added the Needs-Triage label Oct 19, 2022

T-Gro added Tracking-External and removed Needs-Triage labels Nov 7, 2022

OwnageIsMagic mentioned this issue Oct 1, 2023

Do not use Unicode aware API and CurrentCulture in compiler #16066

Closed
