Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitize multiple spaces in display names to protect against some security concerns #5703

Merged
merged 1 commit into from
Oct 11, 2024

Conversation

pfrazee
Copy link
Collaborator

@pfrazee pfrazee commented Oct 11, 2024

Adds a display name sanitation rule which reduces runs of spaces into a single space.

Before After
CleanShot 2024-10-10 at 20 05 44@2x CleanShot 2024-10-10 at 20 05 29@2x

With apologies to rem.

@arcalinea arcalinea temporarily deployed to fix-display-name-sanitization - social-app PR #5703 October 11, 2024 03:09 — with Render Destroyed
Copy link
Contributor

@haileyok haileyok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works on both native and web

Copy link

Old size New size Diff
7.89 MB 7.89 MB 32 B (0.00%)

@pfrazee pfrazee merged commit 7677bb3 into main Oct 11, 2024
6 checks passed
@pfrazee pfrazee deleted the fix-display-name-sanitization branch October 11, 2024 05:14
@DavidBuchanan314
Copy link
Contributor

It's possible to bypass this regex by interleaving zero-width codepoints (e.g. U+200B) with whitespace, for example: " ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​"

const CHECK_MARKS_RE = /[\u2705\u2713\u2714\u2611]/gu
const CONTROL_CHARS_RE =
  /[\u0000-\u001F\u007F-\u009F\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/g
const MULTIPLE_SPACES_RE = /[\s][\s]+/g
console.log("hello ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ world".replace(CHECK_MARKS_RE, '').replace(CONTROL_CHARS_RE, '').replace(MULTIPLE_SPACES_RE, ' ').trim())

@pfrazee
Copy link
Collaborator Author

pfrazee commented Oct 11, 2024

@DavidBuchanan314 appreciate it #5729

@JamesKoenig
Copy link

What was the security concern?

@pfrazee
Copy link
Collaborator Author

pfrazee commented Oct 12, 2024

Using whitespace to push the domain handle off screen so you can impersonate people

@JamesKoenig
Copy link

Since this is a change to the social-app frontend, usernames with tons of spaces will still be stored on the backend, no?
Meaning other clients using the service (or using the at protocol at large) would still have to face the same sorts of concerns?

if that's the case the api docs for app-bsky-actor-get-profile already state that the username could be 640 characters long, which should be enough of a warning for username length to be truncated appropriately, but it might be a good idea to have a style guide (if I missed one), that tells third party client developers to always have the domain handle visible (given that impersonation is a concern).

@oestradiol
Copy link

oestradiol commented Oct 13, 2024

@pfrazee still managed to bypass it, btw: test-acc.bsky.social

I think trying to validate like this is not the best idea since there are so many damn characters that can be used to do this (here's a regex I made to validate spaces, although not every character might fit this scenario since our use case was slightly different, albeit similar: ensureMultilineValid.ts#L2)

Display names are so problematic... my suggestion is follow what we're doing there and truncate the display name rather than the handle: psky-atp/client@dd48b6a

For reference, the character I used in the bypass is Hangul Filler, but I bet there are other characters such as Mongolian Vowel Separator which might also work. Some of these characters probably can't even be just replaced out without breaking some stuff, considering that they have legitimate use cases in some languages.

@pfrazee
Copy link
Collaborator Author

pfrazee commented Oct 14, 2024

Thanks so much @oestradiol. The only reason I haven't gone with that approach is because that kind of layout control is wildly finicky across all the react-native platforms, but it sounds like we just have to accept reality and attack it from that direction.

Really appreciate the pointers. You win again, unicode.

@oestradiol
Copy link

You're welcome, Paul. Good luck everyone, you got this! <3

@JamesKoenig
Copy link

JamesKoenig commented Oct 14, 2024

@pfrazee

Really appreciate the pointers. You win again, unicode.

Please bear with me to the end.

The design of Javascript purposefully excludes whitespace and control characters outside of a certain scope, as outlined in ECMA-262 12.2 "White space", Note 2:

NOTE 2 Other than for the code points listed in Table 36, ECMAScript WhiteSpace intentionally excludes all code points that have the Unicode “White_Space” property but which are not classified in general category “Space_Separator” (“Zs”)

NB also the following paragraph from 12.3 "Line Terminators":

Line terminators are included in the set of white space code points that are matched by the \s class in regular expressions

ECMA-262 Annex F notes:

12.2: In ECMAScript 2016, Unicode 8.0.0 or higher is mandated, as opposed to ECMAScript 2015 which mandated Unicode 5.1. In particular, this caused U+180E MONGOLIAN VOWEL SEPARATOR, which was in the Space_Separator (Zs) category and thus treated as whitespace in ECMAScript 2015, to be moved to the Format (Cf) category (as of Unicode 6.3.0). This causes whitespace-sensitive methods to behave differently. For example, "\u180E".trim().length was 0 in previous editions, but 1 in ECMAScript 2016 and later. Additionally, ECMAScript 2017 mandated always using the latest version of the Unicode Standard

The conclusion then would be that there are intentional limitations of what \s is intended to capture, which a pathological user uploading arbitrary unicode strings may take advantage of, which makes attempt to stop them futile.

However

Javascripts regexp engine, in unicode aware mode (eg. /.+/u) has a different tool we can use to address this: the character class escape \p{}, (ECMA-262 22.2.1 Patterns, in the bnf under CharacterClassEscape). For a list of categories one can consult https://www.unicode.org/reports/tr44/#GC_Values_Table (though the paragraph above it is salient and a good reference). A more interactive version that more easily presents which characters can be seen in those categories is https://www.compart.com/en/unicode/category .

Given that the code already attempts to strip out control characters with a rather daunting capture class range:

/[\u0000-\u001F\u007F-\u009F\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/g

And that all of the above problematic characters are in the unicode control class category, If no control characters are desired, lines 8-9 could be replaced with:

const CONTROL_CHARS_RE = /\p{C}/gu

if some control chars are desired I would highly recommend whitelisting those in:

const CONTROL_CHARS_RE = /\p{C}/gu  //node automatically orders these gu, which is cute, so I pronounce it guh
const CONTROL_CHARS_WHITELIST_RE = /[\u0600-\u08E2]/ //e.g. Arabic format characters, g is not needed for now, but can be used
const MULTIPLE_SPACES_RE = /\s{2,}/g //two or more captures of whitespace
///[...]
    return str
      .replace(CHECK_MARKS_RE, '')
      .replace(CONTROL_CHARS_RE, (reMatchStr) =>
              reMatchStr.match(CONTROL_CHARS_WHITELIST_RE) ? reMatchStr : '' )
      .replace(MULTIPLE_SPACES_RE, ' ')
      .trim()

If \s, being defined by ecma language needs, and the unicode separator categories differ, /\s{2,}/g above could be replaced with /\p{Z}{2,}/gu to showcase explicit intention, though it is harder to immediately understand intent and should have a comment.

The above replace function argument can be defined elsewhere to improve readability and formatting. The basic structure is that every control char match is checked to see if it's permitted, and if it is, passed through. Because it breaks down to a ternary, the truthy String.prototype.match call could be replaced with a Set.prototype.has call, where a set of what control characters are permissible could be outlined as follows:

const controlCharsWhitelist = new Set([
 "\uXXXX", //a reason why this is allowed
 // ...
]);

As all the zero width and nonprinting spaces listed in the ticket's feedback are all found in the control character category, the rough solution here provided would likely fit your needs.

Any questions are welcome, or feedback on the presentational or design aspects outlined here.

I personally disagree with the approach of truncation in this PR and in the above code, and think you should use the layout control option, however.

@oestradiol
Copy link

oestradiol commented Oct 15, 2024

@JamesKoenig wow, thanks for the valuable input! I didn't know about that, and it's also gonna help me in my application.
I managed to clean up my regex a lot by replacing most of it with simply \p{C}\p{Z}\p{M}. That being said, I still think this is not enough for Bluesky, because for some reason some characters such as U+3164, U+115F and U+1160 are classified as... letters..? While doing my research, I found this source, if that helps at all.

Here's the final regex I'll be using, if anyone ever comes across this:
/[\p{C}\p{Z}\p{M}\u{115F}\u{1160}\u{2800}\u{3164}\u{FFA0}\u{FFFC}\u{1D159}]/gu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants