Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

string_decoder: make write after end to reset #16594

Closed
wants to merge 4 commits into from

Conversation

thefourtheye
Copy link
Contributor

Fixes: #16564

When StringDecoder's end is called, it is no longer supposed to wait
for the data. If a write call is made after end, then the decoder
has to be flushed and treated as a brand new write request.

This patch also introduces a new StringDecoder#reset method, which
simply resets all the internal data.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

Fixes: nodejs#16564

When StringDecoder's `end` is called, it is no longer supposed to wait
for the data. If a `write` call is made after `end`, then the decoder
has to be flushed and treated as a brand new write request.

This patch also introduces a new StringDecoder#reset method, which
simply resets all the internal data.
@nodejs-github-bot nodejs-github-bot added the string_decoder Issues and PRs related to the string_decoder subsystem. label Oct 29, 2017
@@ -82,3 +82,13 @@ Returns a decoded string, ensuring that any incomplete multibyte characters at
the end of the `Buffer` are omitted from the returned string and stored in an
internal buffer for the next call to `stringDecoder.write()` or
`stringDecoder.end()`.

### stringDecoder.reset([encoding])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit: it seems this should go before the stringDecoder.write(), ABC-wise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

StringDecoder.prototype.write = function(buf) {
if (this._closed === true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

=== true is unnecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we explicitly compare strictly against the value, the performance would be better right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not with TurboFan, no. I asked about this in another PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, I think this is the previous discussion in question: #16397 (comment)

@jasnell jasnell added the semver-minor PRs that contain new features and should be released in the next minor version. label Oct 30, 2017
@mscdex
Copy link
Contributor

mscdex commented Oct 30, 2017

What if we do something simpler, like overwrite at least .write() (and possibly .end()) in .end(). Perhaps that would have less overhead? .write() is going to be called more times than .end() in general.

@thefourtheye
Copy link
Contributor Author

@mscdex In that case, users would have to create a new StringDecoder object every time end is called, right? They cannot reuse it. Also, that would break a lot of buggy code which is already out there I think.

@mscdex
Copy link
Contributor

mscdex commented Oct 30, 2017

@thefourtheye The basic concern still stands though, .write() is hotter than .end(), so if we can come up with some way to avoid adding stuff to .write() that would be ideal (unless benchmarks show that the current solution does not cause a performance regression).

@@ -66,6 +66,16 @@ substitution characters appropriate for the character encoding.
If the `buffer` argument is provided, one final call to `stringDecoder.write()`
is performed before returning the remaining input.

### stringDecoder.reset([encoding])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is an implementation detail users don't have to know about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is, there is no way for the users to reset the string decoder, right? They have to create a new instance if needed. This will enable reusability.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new way to reset the state each time was not requested as far as I read the issue. So I also prefer not to expose the reset function. If I am correct the following should work.

const { StringDecoder } = require('string_decoder');
const decoder = new StringDecoder('utf8');

decoder.write(Buffer.from([0xE2, 0x82])); // => ''
decoder.end(); // => '�'
decoder.write(Buffer.of(0x61)); // => 'a'

StringDecoder.prototype.write = function(buf) {
if (this._closed === true)
this.reset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this logic be in StringDecoder.prototype.end instead? Make the per-encoding this.end switching in the constructor actually set to an internal symbol-named property, and use a shared StringDecoder.prototype.end that calls that internal encoding-specific function and clean up after itself. Would probably help with write performance too if StringDecoders of different encodings (e.g. UTF-8 which does not have an own property end and UTF-16LE that does) are used simultaneously.

Copy link
Contributor

@jridgewell jridgewell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems all very complicated. Why isn't StringDecoder#end just doing:

this.lastNeed = 0;
this.lastTotal = 0;

The buffer is already "unsafe" and access to it is guarded, so we don't need to sanitize it. And reusing a constructor as a call seems fraught.

@mscdex
Copy link
Contributor

mscdex commented Oct 31, 2017

This seems all very complicated. Why isn't StringDecoder#end just doing:

One reason is backwards compatibility.

@BridgeAR
Copy link
Member

Ping @thefourtheye

@thefourtheye
Copy link
Contributor Author

I have updated the PR to make sure that write doesn't have much any changes. PTAL.

function end(buf) {
let result = '';

if (this.encoding === 'utf16le')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might consider using a switch here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack.

let result = '';

if (this.encoding === 'utf16le')
result = utf16End.call(this, buf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're the only ones calling these methods directly, then we should be able to avoid the overhead of .call() and just pass an instance as another argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack.

else
result = simpleEnd.call(this, buf);

this.reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to use this.reset(this.encoding) to avoid the switch in the constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still cannot get rid of the switch in the constructor, right?

Copy link
Contributor

@mscdex mscdex Dec 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, but this.reset(this.encoding) would only hit that first if in the constructor, which might be better.

]);

function end(buf) {
const result = (endMappings.get(this.encoding) || simpleEnd)(this, buf);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks "elegant" but using a switch case for the encoding and calling the function directly is probably faster.

@@ -66,6 +66,16 @@ substitution characters appropriate for the character encoding.
If the `buffer` argument is provided, one final call to `stringDecoder.write()`
is performed before returning the remaining input.

### stringDecoder.reset([encoding])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new way to reset the state each time was not requested as far as I read the issue. So I also prefer not to expose the reset function. If I am correct the following should work.

const { StringDecoder } = require('string_decoder');
const decoder = new StringDecoder('utf8');

decoder.write(Buffer.from([0xE2, 0x82])); // => ''
decoder.end(); // => '�'
decoder.write(Buffer.of(0x61)); // => 'a'

@targos
Copy link
Member

targos commented Jan 15, 2018

ping @thefourtheye

@BridgeAR
Copy link
Member

Ping again @thefourtheye

jridgewell added a commit to jridgewell/node that referenced this pull request Feb 1, 2018
This resets the StringDecoder's state after calling `#end`. Further
writes to the decoder will act as if it were a brand new instance,
allowing simple reuse.

Refs: nodejs#16594
Fixes: nodejs#16564
@thefourtheye thefourtheye deleted the string-decoder-fix branch February 1, 2018 09:05
addaleax pushed a commit that referenced this pull request Feb 2, 2018
This resets the StringDecoder's state after calling `#end`. Further
writes to the decoder will act as if it were a brand new instance,
allowing simple reuse.

PR-URL: #18494
Fixes: #16564
Refs: #16594
Reviewed-By: Tiancheng "Timothy" Gu <[email protected]>
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
MylesBorins pushed a commit that referenced this pull request Feb 20, 2018
This resets the StringDecoder's state after calling `#end`. Further
writes to the decoder will act as if it were a brand new instance,
allowing simple reuse.

PR-URL: #18494
Fixes: #16564
Refs: #16594
Reviewed-By: Tiancheng "Timothy" Gu <[email protected]>
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
MylesBorins pushed a commit that referenced this pull request Feb 21, 2018
This resets the StringDecoder's state after calling `#end`. Further
writes to the decoder will act as if it were a brand new instance,
allowing simple reuse.

PR-URL: #18494
Fixes: #16564
Refs: #16594
Reviewed-By: Tiancheng "Timothy" Gu <[email protected]>
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
MylesBorins pushed a commit that referenced this pull request Feb 21, 2018
This resets the StringDecoder's state after calling `#end`. Further
writes to the decoder will act as if it were a brand new instance,
allowing simple reuse.

PR-URL: #18494
Fixes: #16564
Refs: #16594
Reviewed-By: Tiancheng "Timothy" Gu <[email protected]>
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
MylesBorins pushed a commit that referenced this pull request Mar 20, 2018
This resets the StringDecoder's state after calling `#end`. Further
writes to the decoder will act as if it were a brand new instance,
allowing simple reuse.

PR-URL: #18494
Fixes: #16564
Refs: #16594
Reviewed-By: Tiancheng "Timothy" Gu <[email protected]>
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
MylesBorins pushed a commit that referenced this pull request Mar 28, 2018
This resets the StringDecoder's state after calling `#end`. Further
writes to the decoder will act as if it were a brand new instance,
allowing simple reuse.

PR-URL: #18494
Fixes: #16564
Refs: #16594
Reviewed-By: Tiancheng "Timothy" Gu <[email protected]>
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
MylesBorins pushed a commit that referenced this pull request Mar 30, 2018
This resets the StringDecoder's state after calling `#end`. Further
writes to the decoder will act as if it were a brand new instance,
allowing simple reuse.

PR-URL: #18494
Fixes: #16564
Refs: #16594
Reviewed-By: Tiancheng "Timothy" Gu <[email protected]>
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
MayaLekova pushed a commit to MayaLekova/node that referenced this pull request May 8, 2018
This resets the StringDecoder's state after calling `#end`. Further
writes to the decoder will act as if it were a brand new instance,
allowing simple reuse.

PR-URL: nodejs#18494
Fixes: nodejs#16564
Refs: nodejs#16594
Reviewed-By: Tiancheng "Timothy" Gu <[email protected]>
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-minor PRs that contain new features and should be released in the next minor version. string_decoder Issues and PRs related to the string_decoder subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

StringDecoder#end doesn't flush state
10 participants