Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser vulnerable to Trojan Source attack #12352

Open
joshuapassos opened this issue Nov 6, 2021 · 19 comments
Open

Parser vulnerable to Trojan Source attack #12352

joshuapassos opened this issue Nov 6, 2021 · 19 comments
Labels
Milestone

Comments

@joshuapassos
Copy link
Contributor

joshuapassos commented Nov 6, 2021

Recently paper called Trojan Source: Invisible Vulnerabilities demonstrates an attack against source code. It uses Unicode bi-direcional overrides to misguide the meaning of code to a human reader.

Repro steps

let access_level = "user"

[<EntryPoint>]
let main _ =
  if access_level <> "user‮⁦ (* Check if admin *)⁩⁦" then
    printf "You are an admin.\n"
  0

Only selecting text with mouse over condicional that is possible see some different thing.

Here I have an example to reproduce the problem

Expected behavior

Maybe compiler error which message Invalid unicode character

Actual behavior

You are an admin.

Known workarounds

I don't know

Related information
Crystal lang discussion about this: crystal-lang/crystal#11392
Site about the problem: https://trojansource.codes/

@smoothdeveloper
Copy link
Contributor

I'd like to have a CI step that rules out that code making it into FSC and derivatives, don't have any such unicode hidden thing.

@KevinRansom
Copy link
Member

@joshuapassos

It is certainly a novel attack, and one that open source repo's are going to be particularly sensitive to ... we will certainly have to figure out how to react to it.

My naiive first thoughts are we need to ensure that:

  1. Disallow bi-directional control characters in source code

  2. Enable an escaping syntax, that allows them to be applied in, the escaped bi-direction code will be rendered literally in the generated binary, or as escaped xml in the emitted doc comments

    • String literals
    • Comments (because of doc comments)
    • `` identifiers
  3. A CI step that simply scans the files looking for any of these bidirectional identifiers in the source code, and fails when it finds one.

  4. Obviously this is a breaking change so we will need a switch to turn the behavior off, which we can remove after a reasonable period of time.

/cc @dsyme , @vzarytovskii , @jonsequitur , @brettfo , @smoothdeveloper

@Happypig375
Copy link
Member

Oh the slippery slope of banning more and more Unicode characters from source code.

Figure 1. Using a zero-width space.

let access_level = "user"
if access_level <> "user​" then
    printf "You are an admin."
You are an admin.

Figure 2. The homoglyphic attack as outlined in the same paper, here we use U+0435 CYRILLIC SMALL LETTER IE

let access_level = "user"
if access_level <> "usеr" then
    printf "You are an admin."
You are an admin.

Figure 3. Another homoglyphic attack, here we have U+1D5BE Mathematical Sans-Serif Small E

let access_level = "user"
if access_level <> "us𝖾r" then
    printf "You are an admin."
You are an admin.

Let's go ban all these characters in source! I can't wait for me having to escape all the special characters when pasting online text. Oh, resource files and text configurations will still face the same attacks anyways. Let's ban these characters from all text readers and config readers too. These characters can burn in hell! /s

The proper way to prevent these is

  • Not using strings (Use DUs, where duplicate definitions will be apparent) and defining identifiers for the same strings
  • Testing and code coverage

Untested code invite bugs, and such so-called attacks can easily be prevented with proper architecture and testing.

@smoothdeveloper
Copy link
Contributor

Not using strings (Use DUs, where duplicate definitions will be apparent) and defining identifiers for the same strings

there are enough magic strings in the compiler, and probably no easy way to figure out if they could create issues such as showcase of that paper.

To me, it seems good enough as first measure, that we have sanity check in the CI to rule out that the current and future code exhibit those tricks; I think fixing the compiler itself is another issue, and things @KevinRansom has outlined above gives a good overview of an approach that can be refined once it is set in motion.

Step 3 should be a priority and the rest can probably wait to see how the industry is adjusting.

Aside, there is also this paper: https://github.com/QiushiWu/qiushiwu.github.io/raw/main/papers/OpenSourceInsecurity.pdf that caused similar stir.

I'm all in favor of adding static checks & all, but not so much about approach around "gauging contributions safety based on what owner knows about contributor" (which the paper kind of hints towards, as increasing security); contributions should be taken from whoever, without need for contributor to disclose anything, just on basis of merrit / adequation of the change with the evolution of the language.

Let's go ban all these characters in source!

I understand the concern, as a person also expressing myself beyond the realm of ASCII :)

@dsyme
Copy link
Contributor

dsyme commented Nov 8, 2021

I'd like to see GitHub roll out a solution across the entire platform.

@smoothdeveloper
Copy link
Contributor

If getting upstream support works just like that, I'll add, I'd like to see peace on earth :)

@vzarytovskii
Copy link
Member

I'd like to see GitHub roll out a solution across the entire platform.

Yeah, it makes sense for tooling to solve it in first place IMO, "fixing" it in compiler will mean a (potential) huge breaking change. I'm not saying we should do nothing though.

I guess first step should be a CI leg which will be checking a PR for control characters.

@dsyme
Copy link
Contributor

dsyme commented Nov 8, 2021

Oh I've no problem with solving it in the compiler and we should. Absolutely no one uses these characters in source code and if they do I don't care if we force them to use \UNNNN characters or give a special command-line flag.

But the problem is so systemic it needs to be solved in GitHub too, who should immediately scan all of GitHub for these characters and emit dependabot issue warnings

@Happypig375
Copy link
Member

@dsyme Well I'd hate to have certain characters be banned just because they can be abused. You close one avenue of attack, you leave lots more open, this is not fixing anything, unless you ban all the characters as in my comment too.

@Happypig375
Copy link
Member

@smoothdeveloper I am okay with individual code base owners banning certain characters, anyone can enforce any coding standard. However to ban characters on the compiler level is another thing - even tabs are allowed in string literals and comments.

@dsyme
Copy link
Contributor

dsyme commented Nov 8, 2021

@dsyme ...this is not fixing anything, unless you ban all the characters as in my comment too.

To be clear, such characters would be allowed in strings using \UNNNN format. That seems entirely reasonable to me.

I'd have no particular problem with banning them in comments, though probably aren't as significant there.

@KevinRansom
Copy link
Member

There is an official Microsoft response to this issue, I provide the text here:

Microsoft was made aware of this issue prior to the disclosure, and MSRC made the recommendation that this be addressed in future releases of our code editors. Per our bug bar (https://aka.ms/windowsbugbar) and internal policies, this did not warrant an immediate security update.

Teams are aware of this issue, and they are working on improvements to the code editing process while limiting impact to customers that may have legitimate uses for these characters based on the languages used on their systems.

@KevinRansom
Copy link
Member

MSRC is- The Microsoft Security Response Center - https://www.microsoft.com/en-us/msrc

@Happypig375
Copy link
Member

@dsyme So are we requiring Cyrillic characters or Greek characters be escaped or not?

@zanaptak
Copy link
Contributor

zanaptak commented Nov 14, 2021

I hope there is not a kneejerk reaction to ban or require escaping of certain Unicode characters. That strikes me as a narrow English language centric viewpoint.

Surely there is legitimate use in strings when writing applications that will be viewed by non-English users. And coders, while obliged to use English keywords, might prefer to add comments in their native script. Escaping is not a good alternative; it hurts readability, especially in the case of the bidirectional codes in question -- imagine having to write all your strings/comments backwards because you are unable to embed the necessary codes.

Perhaps a better and broader approach is to have proper Unicode handling overall, rather than the incomplete support currently in place (for example reliance on legacy System.Char methods that don't understand non-BMP characters - #9600), and having intelligent syntax rules based on real understanding of Unicode semantics.

@Happypig375
Copy link
Member

If we don't care about the homoglyphic attack and simply switching code across string termination, we can just ban unterminated bidirectionality across string termination as mentioned in the paper.

@Serentty
Copy link

@dsyme ...this is not fixing anything, unless you ban all the characters as in my comment too.

To be clear, such characters would be allowed in strings using \UNNNN format. That seems entirely reasonable to me.

I'd have no particular problem with banning them in comments, though probably aren't as significant there.

Directional overrides are probably more useful in comments than in identifiers, because comments are exactly the kind of place where you’re likely to be mixing scripts, compared to an identifier, which will probably be all in one script for ease of typing. But I do think that the whole “Trojan source” thing is a bit of an imagined security concern.

@dsyme dsyme added Feature Request and removed Bug labels Mar 3, 2022
@dsyme dsyme added the Area-Compiler-Syntax lexfilter, indentation and parsing label Apr 20, 2022
@vzarytovskii vzarytovskii moved this to Not Planned in F# Compiler and Tooling Jun 17, 2022
@vzarytovskii vzarytovskii added this to the Backlog milestone Oct 19, 2022
@T-Gro
Copy link
Member

T-Gro commented Nov 7, 2022

This is mainly a tooling issue.
Especially for open source projects (like F#), seeing this in GitHub review windows is the missing piece.

Example how VS Code shows this:

@T-Gro
Copy link
Member

T-Gro commented Nov 7, 2022

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: New
Development

No branches or pull requests

9 participants