-
Notifications
You must be signed in to change notification settings - Fork 12.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect binary files by checking first 256 code units for 0xFFFD #57008
Conversation
@typescript-bot test top200 @typescript-bot perf test this |
Heya @jakebailey, I've started to run the tarball bundle task on this PR at 4e80546. You can monitor the build here. |
Heya @jakebailey, I've started to run the regular perf test suite on this PR at 4e80546. You can monitor the build here. Update: The results are in! |
Heya @jakebailey, I've started to run the parallelized Definitely Typed test suite on this PR at 4e80546. You can monitor the build here. Update: The results are in! |
Heya @jakebailey, I've started to run the diff-based user code test suite on this PR at 4e80546. You can monitor the build here. Update: The results are in! |
Heya @jakebailey, I've started to run the public perf test suite on this PR at 4e80546. You can monitor the build here. Update: The results are in! |
Heya @jakebailey, I've started to run the diff-based top-repos suite on this PR at 4e80546. You can monitor the build here. Update: The results are in! |
Hey @jakebailey, I've packed this into an installable tgz. You can install it for testing by referencing it in your
and then running There is also a playground for this build and an npm module you can use via |
@jakebailey Here are the results of running the user test suite comparing There were infrastructure failures potentially unrelated to your change:
Otherwise... Something interesting changed - please have a look. Details
|
@jakebailey Here they are:
tscComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
Developer Information: |
@jakebailey Here they are:
tscComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
tsserverComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
startupComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
Developer Information: |
Hey @jakebailey, the results of running the DT tests are ready. |
@jakebailey Here are the results of running the top-repos suite comparing Something interesting changed - please have a look. Details
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now do cryptocurrency .ts files.
One thing I'm wondering is if we should just search the first 80 characters or something and say it's binary if any of them are I'm trying to find how other tools detect that things are binary (e.g. vim and ripgrep can both tell). |
That's what I (indirectly) suggested here: #56516 (comment) I'm pretty sure the "first 80 characters" method is the way it's typically done (e.g. in text editors, to determine if they should open a hex view). The Windows API even has a function for it IIRC. |
NFT types when |
That or I could just check for |
Yeah, check the first 80 chars and if any is a replacement char, treat as binary |
src/compiler/scanner.ts
Outdated
@@ -1799,7 +1799,21 @@ export function createScanner(languageVersion: ScriptTarget, skipTrivia: boolean | |||
// position 0 because UTF-8 decode will fail and produce U+FFFD. | |||
// If that happens, just issue one error and refuse to try to scan further; | |||
// this is likely a binary file that cannot be parsed | |||
if (ch === CharacterCodes.replacementCharacter) { | |||
let isBinary = ch === CharacterCodes.replacementCharacter; | |||
// See if this is an MPEG Transport Stream, where every 188th byte is "G" and the rest is garbage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is an MPEG transport stream guaranteed to be a multiple of 188 bytes long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, unless it's malformed. But, I actually think I should dump this PR and instead just check the first 80 or 256 characters for replacementCharacter instead. Curious what people think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably fine, but my concern is really more the case that at any arbitrary position, when scan
is called, you will look ahead 188 characters.
If you had the byte position and the raw content length, you could avoid doing any of this, but because we're operating on something parsed into UTF-16, I don't think you have that option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only happens at pos == 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed that, thanks!
Heya @jakebailey, I'm starting to run the public perf test suite on this PR at 4ee2758. Hold tight - I'll update this comment with the log link once the build has been queued. |
Bot is a liar @typescript-bot perf test this |
Heya @jakebailey, I've started to run the public perf test suite on this PR at 4ee2758. You can monitor the build here. Update: The results are in! |
Heya @jakebailey, I've started to run the regular perf test suite on this PR at 4ee2758. You can monitor the build here. Update: The results are in! |
@jakebailey Here they are:
tscComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
Developer Information: |
@jakebailey Here they are:
tscComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
tsserverComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
startupComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
Developer Information: |
@typescript-bot perf test public |
Heya @jakebailey, I've started to run the tarball bundle task on this PR at eb2dcc7. You can monitor the build here. |
Heya @jakebailey, I've started to run the public perf test suite on this PR at eb2dcc7. You can monitor the build here. Update: The results are in! |
Hey @jakebailey, I've packed this into an installable tgz. You can install it for testing by referencing it in your
and then running There is also a playground for this build and an npm module you can use via |
@jakebailey Here they are:
tscComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
Developer Information: |
Oops, wrong comment @typescript-bot perf test |
Heya @jakebailey, I've started to run the regular perf test suite on this PR at eb2dcc7. You can monitor the build here. Update: The results are in! |
@jakebailey Here they are:
tscComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
tsserverComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
startupComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
Developer Information: |
@jakebailey this seems to affect JavaScript files that include a replacement character. Example: https://github.com/micromark/micromark/blob/8b08d16891f4d6801eb0c0e990bd7b87836c0826/test/io/misc/zero.js#L9. Is this intentional? Or a bug? Should I create an issue? Cross-reference micromark/micromark#166 |
Ugh, not super thrilled that people are quite literally putting the replacement character into their files instead of escaping it; that's going to be indistinguishable from actually broken files. I'm not sure if it's "intentional" because I really did not think people would be doing that. So far the linked issues appear to be cases where it's possible to just use I guess it could deserve a dedicated issue rather than discussing on the PR. Not doing this means we have to come up with some other, new way to handle this :( |
Well, you can always use an escape for all unicode characters. Many algos in the HTML spec also result in REPLACEMENT CHARACTER. I don’t expect most people to write them in files but there are APIs that generate them, so there will be test files. In my experience with GH internals, they have some heuristic with control characters and lone surrogates to infer whether something is binary. Could that be viable? |
I don't want to make TS slower for every other file in the world, or forego doing other more impactful work, for the sake of a preference in the small number of scenarios where people need to write text encoding testcases. We struggle with this too -- people are always asking us to normalize |
Fixes #56516
Fixes #21136