Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expanded whitespace test case to catch more inefficiencies and updated code to optimize. #5

Merged
merged 7 commits into from
Jul 3, 2022

Conversation

42shadow42
Copy link
Contributor

No description provided.

@42shadow42 42shadow42 changed the title Expanded whitespace test case to catch more inefficiences and updated " Expanded whitespace test case to catch more inefficiencies and updated code to optimize. Jun 26, 2022
index.js Outdated Show resolved Hide resolved
index.js Outdated
const regex = /[ \t]/
let i = value.length
while (regex.test(value.charAt(--i)));
return value.slice(0, i + 1)
}

return value
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, it seems like a bad idea to use regexes for this. At least, it’s hard to read. Can you check if something like this works?

let start = 0
let end = value.length - 1

while (start < end) {
    let code = value.charAt(start);
    if (code === 9 || code === 32) {
        start++
    } else {
        break;
    }
}

while (end > start) {
  let code = value.charAt(start);
  if (code === 9 || code === 32) {
    end--;
  } else {
    break;
  }
}

return value.slice(start, end);

(formatted to match project)
(perhaps good to check for some lines that just include whitespace, and line that include nothing at all)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the regex approach where readability is concerned. It means people don't have to look up the char codes, and I feel it clarifies the intent over the above code.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your opinion, but I disagree. Can you change it?
Regexes are generally slow. Character codes, especially with parsing projects such as all of unified, are common and searchable.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose the follow as the function. Please still consider it pseudocode, however, I did test it and it seems to work

/**
 * @param {string} value
 * @param {boolean} start
 * @param {boolean} end
 * @returns {string}
 */
function trimLine(value, start, end) {
    let startIndex = 0
    let endIndex = value.length

    if (!start) {
        let code = value.charCodeAt(startIndex)

        while (code === 9 || code === 32) {
            startIndex++
            code = value.charCodeAt(startIndex)
        }
    }

    if (!end) {
        let code = value.charCodeAt(endIndex - 1)

        while (code === 9 || code === 32) {
            endIndex--
            code = value.charCodeAt(endIndex - 1)
        }
    }

    return endIndex > startIndex ? value.slice(startIndex, endIndex) : ''
}

This:

  • Implements faster trimming for !start as well, meaning that regexes are no longer needed
  • Uses charCodeAt which is faster that charAt, and no longer needs small strings. I understand that these codes might be new to you, and hence you do not prefer them, but I consider them common enough in parsing, in the 100s of projects I am maintaining, that I strongly prefer them.
  • Changes value only once, without reassigning it, or even not at all for empty lines. Reassigning a parameter is slow, because JavaScript “links” arguments[0] and value together. Not slicing at all for empty lines is likely also fast in edge cases of large blank lines.

All reasons why this should be faster.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using constants for the character codes to make it more readable:

const FOO = 9

//...


while (code === FOO || code === BAR) {

And I agree with not using regex, both for readability, performance, and also to avoid the risk of ReDoS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using constants for the character codes to make it more readable:

const FOO = 9

//...


while (code === FOO || code === BAR) {

And I agree with not using regex, both for readability, performance, and also to avoid the risk of ReDoS.

Just saw this, I think it resolves my concern about readability without compromising performance. I'll make the constants now.

test.js Outdated Show resolved Hide resolved
index.js Outdated Show resolved Hide resolved
index.js Outdated
const regex = /[ \t]/
let i = value.length
while (regex.test(value.charAt(--i)));
return value.slice(0, i + 1)
}

return value
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the regex approach where readability is concerned. It means people don't have to look up the char codes, and I feel it clarifies the intent over the above code.

test.js Outdated Show resolved Hide resolved
index.js Outdated Show resolved Hide resolved
index.js Outdated Show resolved Hide resolved
@42shadow42 42shadow42 requested a review from wooorm June 29, 2022 11:10
@42shadow42
Copy link
Contributor Author

I'll make both these changes this weekend. I think the code sample you provided is not equivalent, but I can find something similar I'm sure.

@wooorm
Copy link
Owner

wooorm commented Jul 1, 2022

Thank you <3
At least a slice is missing, but yeah: perhaps some more. Let me know if I can help you!

@42shadow42
Copy link
Contributor Author

Thank you <3 At least a slice is missing, but yeah: perhaps some more. Let me know if I can help you!

I'm not sure what you mean, can you clarify and/or show me what you mean?

@wooorm
Copy link
Owner

wooorm commented Jul 2, 2022

  1. My above code was pseudo code
  2. You said some stuff was missing
  3. I quickly looked and thought at least a slice was indeed missing
  4. Now that I look again, there is a slice that I thought was missing. But there might be other things missing!

Let me know if you need help figuring this out

index.js Outdated
const regex = /[ \t]/
let i = value.length
while (regex.test(value.charAt(--i)));
return value.slice(0, i + 1)
}

return value
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose the follow as the function. Please still consider it pseudocode, however, I did test it and it seems to work

/**
 * @param {string} value
 * @param {boolean} start
 * @param {boolean} end
 * @returns {string}
 */
function trimLine(value, start, end) {
    let startIndex = 0
    let endIndex = value.length

    if (!start) {
        let code = value.charCodeAt(startIndex)

        while (code === 9 || code === 32) {
            startIndex++
            code = value.charCodeAt(startIndex)
        }
    }

    if (!end) {
        let code = value.charCodeAt(endIndex - 1)

        while (code === 9 || code === 32) {
            endIndex--
            code = value.charCodeAt(endIndex - 1)
        }
    }

    return endIndex > startIndex ? value.slice(startIndex, endIndex) : ''
}

This:

  • Implements faster trimming for !start as well, meaning that regexes are no longer needed
  • Uses charCodeAt which is faster that charAt, and no longer needs small strings. I understand that these codes might be new to you, and hence you do not prefer them, but I consider them common enough in parsing, in the 100s of projects I am maintaining, that I strongly prefer them.
  • Changes value only once, without reassigning it, or even not at all for empty lines. Reassigning a parameter is slow, because JavaScript “links” arguments[0] and value together. Not slicing at all for empty lines is likely also fast in edge cases of large blank lines.

All reasons why this should be faster.

test.js Outdated Show resolved Hide resolved
t.test('internalized whitespace ', (t) => {
const timeoutId = setTimeout(() => {
t.fail('did not pass in 30ms')
}, 30)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this 30 instead of 10?
Should all be 30, and mean “fast enough”?

Copy link
Contributor Author

@42shadow42 42shadow42 Jul 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this test is slower because it's the edge case that originally caused problems. It might pass in 10ms with the fully non-regex version though I'd have to try it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually on second test, it appears it does pass in 10 ms, I must have accidentally tested 1ms or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a third revision it appears they all need 20-30ms, they are intermittently failing with 10. Good call.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we’re not testing xms. We’re testing: fast enough. Compared to alternative ways which result in like a second or more.
Timeouts mainly, but any timing tests in general are really hard to get right.
Older Node versions or Windows for example will make everything slightly slower. Because these numbers are really small, they will leed to randomly breaking CIs.

@42shadow42
Copy link
Contributor Author

42shadow42 commented Jul 2, 2022

Pushed hopefully the last revision, you are right your function works. However the linters requested changes to use codePointAt instead of charCodeAt, I think the numbers remain the same for space and tab so I left the numbers. On another note, I think I finally understand how the function works fully now and if I understand correctly the confusion I had was caused by the start and end boolean values being inverted, I changed start and end so that when start is true it trims the beginning of the string of whitespace, and when end it true it trims the end of the string. I feel like this makes the functions more clear. Let me know if you disagree.

Copy link
Owner

@wooorm wooorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between start and end as they previously were, was that start meant at the start of the whole string, and end meant at the end of the whole string.
With your change, they mean different things: start means to trim the start of a line (which must not happen at the start of the string), and end means to trim the end of a line (which must not happen at the end of the string). Either is fine with me.

@wooorm wooorm merged commit ca1b0bf into wooorm:main Jul 3, 2022
@wooorm
Copy link
Owner

wooorm commented Jul 3, 2022

Thanks for your continued work, released in 3.0.1!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants