Skip to content
This repository has been archived by the owner on Sep 21, 2023. It is now read-only.

Put excess characters from regexp matching back into the buffer #2

Merged

Conversation

krnowak
Copy link

@krnowak krnowak commented Jan 13, 2016

Put excess unmatched characters back into the buffer

The regexp.FindReaderSubmatchIndex() function may read more characters than it is necessary from the stream, thus possibly making subsequent regex expects to fail, like in following scenario:

String in stream is

prefix: foo
prefix: bar

A regexp we expect is (?m)^prefix: (.+)\s??$, the final \s?? is for matching the \r character in the \r\n newline

The first call to the expectRegexFind function may result in matches = ["prefix: foo", "foo"], output = "prefix: foo\npr" and err = nil. The output has the beginning of the second line in it.

The second call will then result in matches = [], output = "efix: bar\n" and err = "ExpectRegex didn't ..."

func TestRegexFindNoExcessBytes(t *testing.T) {
t.Logf("Testing Regular Expressions returning output with no excess strings")
repeats := 100
seqCmd := fmt.Sprintf("`seq 1 %d`", repeats)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also do a test with seq -s ' ' 1 100? I'd like to make sure that two matches on the same line work fine.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmh, I guess I meant echo -n in the next line, rather than seq -s ' '.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean something like matching the a \d+ b regexp in "a 1 b a 2 b a 3 c" string? Sure, I can test it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would be good.

@alban
Copy link

alban commented Jan 13, 2016

Thanks for fixing this :)

@alban
Copy link

alban commented Jan 13, 2016

LGTM

@krnowak krnowak force-pushed the krnowak/regexp-matching-fixes branch from c183861 to 83e29f6 Compare January 14, 2016 09:36
@krnowak
Copy link
Author

krnowak commented Jan 14, 2016

Fixed the code putting bytes back into the stream (it should put the bytes back after the whole match, not after the last match). Updated the tests. And hopefully wrote a better commit message now.

The expectRegexOutput uses the regexp.FindReaderSubmatchIndex function
to do the matching. As its docs say, it may read more characters than
it is necessary from the stream. This may be problematic if the regex
we use can match chunks of the stream that come right after
another. This is because the function might match the first chunk and
read into the middle of the second chunk, so the subsequent call to
the function will match the third chunk, not the second. For example:

data: "one two three"
regexp: "\w{3,}"

The first call to the function may return "one" as a match and "one
tw" as an output. We can see that the output has the beginning of the
second word which comes after the matched first word. This part is not
available anymore in the stream, thus the following call to the
function will return "three" as a match (instead of "two") and "o
three" as an output.

To fix it, we put the excess characters coming after the whole match
back into the stream, so the subsequent calls to the function may
start their matching from the place when last whole match ended.
@krnowak krnowak force-pushed the krnowak/regexp-matching-fixes branch from 83e29f6 to 3bc350a Compare January 14, 2016 09:55
t.Fatalf("Failed to get the match number %d: %v", i, err)
}
if len(matches) != 2 {
t.Fatalf("Expected only 2 matches, got %d", len(matches))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// matches contain an index pair for each match
t.Fatalf("Expected only 1 match, got %d", len(matches)/2)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No no, the ExpectRegexFindWithOutput returns matches as a slice of strings. The underlying function returns a slice of ints being index pairs, but that's an implementation detail.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we expect the matches slice to have the whole match (like "prefix: 1 line") and a submatch for the number only ("1").

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmhokay.

LGTM then.

@alban
Copy link

alban commented Jan 14, 2016

small nit, otherwise LGTM

@alban
Copy link

alban commented Jan 14, 2016

@steveej who has merge-powers on this repository?

@jonboulle
Copy link

I do

jonboulle added a commit that referenced this pull request Jan 14, 2016
Put excess characters from regexp matching back into the buffer
@jonboulle jonboulle merged commit aefa734 into coreos:master Jan 14, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants