Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expanded whitespace test case to catch more inefficiencies and updated code to optimize. #5

Merged
merged 7 commits into from
Jul 3, 2022
9 changes: 8 additions & 1 deletion index.js
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,14 @@ function trimLine(value, start, end) {
}

if (!end) {
value = value.replace(/[ \t]+$/, '')
let endIndex = value.length - 1
let char = value.charAt(endIndex)
while (char === ' ' || char === '\t') {
endIndex--
char = value.charAt(endIndex)
}

value = value.slice(0, endIndex + 1)
}

return value
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, it seems like a bad idea to use regexes for this. At least, it’s hard to read. Can you check if something like this works?

let start = 0
let end = value.length - 1

while (start < end) {
    let code = value.charAt(start);
    if (code === 9 || code === 32) {
        start++
    } else {
        break;
    }
}

while (end > start) {
  let code = value.charAt(start);
  if (code === 9 || code === 32) {
    end--;
  } else {
    break;
  }
}

return value.slice(start, end);

(formatted to match project)
(perhaps good to check for some lines that just include whitespace, and line that include nothing at all)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the regex approach where readability is concerned. It means people don't have to look up the char codes, and I feel it clarifies the intent over the above code.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your opinion, but I disagree. Can you change it?
Regexes are generally slow. Character codes, especially with parsing projects such as all of unified, are common and searchable.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose the follow as the function. Please still consider it pseudocode, however, I did test it and it seems to work

/**
 * @param {string} value
 * @param {boolean} start
 * @param {boolean} end
 * @returns {string}
 */
function trimLine(value, start, end) {
    let startIndex = 0
    let endIndex = value.length

    if (!start) {
        let code = value.charCodeAt(startIndex)

        while (code === 9 || code === 32) {
            startIndex++
            code = value.charCodeAt(startIndex)
        }
    }

    if (!end) {
        let code = value.charCodeAt(endIndex - 1)

        while (code === 9 || code === 32) {
            endIndex--
            code = value.charCodeAt(endIndex - 1)
        }
    }

    return endIndex > startIndex ? value.slice(startIndex, endIndex) : ''
}

This:

  • Implements faster trimming for !start as well, meaning that regexes are no longer needed
  • Uses charCodeAt which is faster that charAt, and no longer needs small strings. I understand that these codes might be new to you, and hence you do not prefer them, but I consider them common enough in parsing, in the 100s of projects I am maintaining, that I strongly prefer them.
  • Changes value only once, without reassigning it, or even not at all for empty lines. Reassigning a parameter is slow, because JavaScript “links” arguments[0] and value together. Not slicing at all for empty lines is likely also fast in edge cases of large blank lines.

All reasons why this should be faster.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using constants for the character codes to make it more readable:

const FOO = 9

//...


while (code === FOO || code === BAR) {

And I agree with not using regex, both for readability, performance, and also to avoid the risk of ReDoS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using constants for the character codes to make it more readable:

const FOO = 9

//...


while (code === FOO || code === BAR) {

And I agree with not using regex, both for readability, performance, and also to avoid the risk of ReDoS.

Just saw this, I think it resolves my concern about readability without compromising performance. I'll make the constants now.

Expand Down
29 changes: 29 additions & 0 deletions test.js
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,35 @@ test('efficiency', (t) => {
}, 0)
})

t.test('flanking whitespace', (t) => {
const timeoutId = setTimeout(() => {
t.fail('did not pass in 10ms')
}, 10)

t.deepEqual(trimLines(whitespace + '\na\n' + whitespace), '\na\n')

setTimeout(() => {
clearTimeout(timeoutId)
t.end()
}, 0)
})

t.test('internalized whitespace ', (t) => {
42shadow42 marked this conversation as resolved.
Show resolved Hide resolved
const timeoutId = setTimeout(() => {
t.fail('did not pass in 30ms')
}, 30)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this 30 instead of 10?
Should all be 30, and mean “fast enough”?

Copy link
Contributor Author

@42shadow42 42shadow42 Jul 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this test is slower because it's the edge case that originally caused problems. It might pass in 10ms with the fully non-regex version though I'd have to try it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually on second test, it appears it does pass in 10 ms, I must have accidentally tested 1ms or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a third revision it appears they all need 20-30ms, they are intermittently failing with 10. Good call.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we’re not testing xms. We’re testing: fast enough. Compared to alternative ways which result in like a second or more.
Timeouts mainly, but any timing tests in general are really hard to get right.
Older Node versions or Windows for example will make everything slightly slower. Because these numbers are really small, they will leed to randomly breaking CIs.


t.deepEqual(
trimLines('\na' + whitespace + 'b\n'),
'\na' + whitespace + 'b\n'
)

setTimeout(() => {
clearTimeout(timeoutId)
t.end()
}, 0)
})

t.test('whitespace around line', (t) => {
const timeoutId = setTimeout(() => {
t.fail('did not pass in 10ms')
Expand Down