
SimpleCharStream expandBuffer may corrupt buffer content #26

Closed
szarnekow opened this issue May 4, 2020 · 23 comments
@szarnekow

While working with the excellent https://github.com/javaparser/javaparser, which uses this fork of JavaCC along with the option to use the modern templates without Unicode handling, I ran into lexing issues with long comments/strings when a certain buffer expansion behaviour was triggered.

@szarnekow
Author

Can be reproduced with this test case in ConditionParserTest when the JAVA_UNICODE option is disabled in the Conditions.jj.

  @Test
  public void testBufferExpansion () throws ParseException
  {
    char[] a = new char[2048 - 4 /* open + close of the comment */];
    Arrays.fill(a, 'a');
    char[] b = new char[4096 - 4 + 1 /* force the buffer to expand and wrap around */];
    Arrays.fill(b, 'b');
    _test (String.format("/*%s*/\n/*%s*/\nT || F", String.valueOf(a), String.valueOf(b)), true);
  }

I managed to fix this in the AbstractCharStream.template but the maven build does not seem to pick that up. Do you have any hint?

@phax phax self-assigned this May 4, 2020
@phax phax added the question label May 4, 2020
@phax
Collaborator

phax commented May 4, 2020

If you paste the fix in this project, I'm happy to apply it, if I understand the issue ;-)
Currently external templates are not supported... Where did you put AbstractCharStream.template?

@szarnekow
Author

Reverting the fix for #23 fixes the lexing error for me. I'll push a reproducible example in a minute.

szarnekow added a commit to szarnekow/ParserGeneratorCC that referenced this issue May 4, 2020
Signed-off-by: Sebastian Zarnekow <[email protected]>
phax added a commit that referenced this issue May 4, 2020
@phax
Collaborator

phax commented May 4, 2020

What Java version are you using? I am testing with 1.8 atm

@szarnekow
Author

So am I. Why are you asking?

@phax
Collaborator

phax commented May 4, 2020

I thought it might be an issue with a later version. Basically I was confused because you said "I changed it and it worked", and I changed it and it didn't work. But with all the discussion in #27 I now know what you mean.
I need to dive into AbstractCharStream and really try to understand what happens in there - cannot be that hard ;-)

@MysterAitch

I can confirm that we have test cases on the JavaParser project which fail when using ph-javacc-maven-plugin v4.1.3 but pass with v4.1.2 (the difference seemingly being parser-generator-cc v1.1.2 vs v1.1.1).

Based on my step-throughs in a debugger, the difference is definitely stream related. Specifically where a long token is encountered (this is where I got somewhat lost, though!). For that reason, I presume that this issue describes the root cause. Happy to provide direct links to these if desired! :)

For now we've rolled back to v4.1.2 of the maven plugin (as of our release yesterday).

@phax
Collaborator

phax commented May 12, 2020

Thanks for the info. I am currently swamped with work but definitely want to tackle it. As stated above, I need to understand the details of the template in order to fix it properly...

@MysterAitch

MysterAitch commented May 12, 2020

No worries -- I completely understand!

Until very recently, we were on v3.0.0 of the maven plugin (and whatever version dependencies that has), so no particular rush needed on my account - unless anything new comes to light, we can just sit on v4.1.2 (/v1.1.1) for a while or even just rollback to v3.0.0! :)

@szarnekow seems to have more expertise than I do, based on what I see of his comments above, but if I do revisit this I'll be sure to pass on any new information I uncover.

@szarnekow
Author

What I did to track this down was basically to diff the original JavaCC templates against the ones here, with a special focus on more recent changes.
One of the more recent changes was changing the way the offsets are maintained. Apparently the test coverage for the simple streams wasn't good enough for this kind of change.
Personally I was willing to work on this - added a test and a local fix. Unfortunately the fix wasn't picked up by the build and that's where I got lost.

@MysterAitch

Personally I was willing to work on this - added a test and a local fix. Unfortunately the fix wasn't picked up by the build and that's where I got lost.

One of the things I've noticed and been caught out by when working with JP is that sometimes it takes multiple builds for changes to take effect -- once for the relevant builders/generators to overwrite their own code, and a second time to run tests against this new version.

Maybe you've already taken this into account? Apologies for the distraction if you have, or if it's not relevant when working with JavaCC itself!

@phax
Collaborator

phax commented May 13, 2020

The dependency here is: a release of PGCC requires an updated release of the "ph-javacc-maven-plugin", which is in turn used by the next version of PGCC. If both versions are on SNAPSHOT I can test dependencies between them.
I saw a JavaCC fix with some Lists of Lists and that looked a bit like overkill to me. And since I was not aware of the implications, I decided not to use it. I am now on it, fixing some issues as I go...

@szarnekow
Author

About the lists-of-lists thing: it does look like a lot of code indeed. You can achieve almost the same performance characteristics by moving away from a constant buffer increase of 2048 chars to an approach that simply doubles the buffer size on each increase.
I've a local patch but was reluctant to push that to my fork because of the build issues.
Let me know if you are interested.
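For illustration, the two growth strategies being compared here can be sketched outside the template. This is a hedged sketch, not PGCC code; `expansionsNeeded` is a hypothetical helper counting how many expansions (each implying a full array copy) are needed to reach a target capacity:

```java
// Sketch only: compares a constant +2048 growth step with doubling.
// Doubling gives amortized O(1) appends; a constant step means far more copies.
public class GrowthSketch {

    // Number of expansions (each one a full buffer copy) needed to grow an
    // initial buffer to at least 'needed' chars.
    static int expansionsNeeded(int initial, int needed, boolean doubling) {
        int size = initial;
        int count = 0;
        while (size < needed) {
            size = doubling ? size * 2 : size + 2048;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Starting from the default 4096-char buffer, reaching 1 MiB of token text:
        System.out.println(expansionsNeeded(4096, 1 << 20, false)); // 510 constant-step copies
        System.out.println(expansionsNeeded(4096, 1 << 20, true));  // 8 doublings
    }
}
```

The numbers show why doubling "achieves almost the same performance characteristics" with much simpler bookkeeping than a list-of-lists approach.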

@phax
Collaborator

phax commented May 13, 2020

Sure I am interested.
I am also seeing the "2048" constant used in the code and am not sure yet why it is needed... In the meantime I'm trying to comment and understand...

@phax
Collaborator

phax commented May 13, 2020

@szarnekow wait before you invest time in it. I think I am close to resolving it....

@phax
Collaborator

phax commented May 13, 2020

This is also related to #21

@MysterAitch

MysterAitch commented May 13, 2020

I've just seen the comment added to the commit linked above -- if my (very limited) understanding of that code is correct, this seems to be related to circular array / circular buffer functionality.

  • Normally with an array, if you "consume" the first n values, the remaining values are shifted back to fill the gap that remains, so that index 0 is always the "front"/"first" value.

  • With a circular array, the start/end indices shift as tokens get consumed (and once you reach the end of the array, the "next" token requires looping around back to index 0).

Possibly a better explanation:
https://www.geeksforgeeks.org/circular-array/

Assuming this is true, if the array were always of length 2048 then fine... but normally the 2048 must not be hard-coded (instead, it should be the length of the array).

Edit: It would also be prudent to check at which point the index is incremented -- depending on how/when values change, choosing > vs >= in the check may lead to subtle bugs (given that array[array.length] is out of bounds).
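As a minimal illustration of the wrap-around behaviour described above (a sketch only; the field names here are invented and are not the AbstractCharStream fields):

```java
// Minimal circular char buffer illustrating wrap-around reads and writes.
// Sketch for illustration; not the AbstractCharStream implementation.
public class RingBuffer {
    private final char[] buf;
    private int head = 0;  // next index to read
    private int tail = 0;  // next index to write
    private int count = 0; // number of stored chars

    RingBuffer(int capacity) {
        buf = new char[capacity];
    }

    void put(char c) {
        if (count == buf.length)
            throw new IllegalStateException("buffer full");
        buf[tail] = c;
        // Wrap using the array length, never a hard-coded constant.
        tail = (tail + 1) % buf.length;
        count++;
    }

    char get() {
        if (count == 0)
            throw new IllegalStateException("buffer empty");
        char c = buf[head];
        head = (head + 1) % buf.length;
        count--;
        return c;
    }
}
```

Filling a capacity-4 buffer, consuming two chars, then writing two more makes `tail` wrap from index 3 back to 0 and 1 while `head` keeps advancing -- no shifting of the remaining values ever happens.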

@phax
Collaborator

phax commented May 13, 2020

I think I get the circular array so far.
Maybe 2048 is like "50% of the buffer size"?
Is that an optimization that says "if the token starts in the second half, fill the buffer starting from 0"?
In that case I would replace "2048" with "bufsize / 2".
Any input?

@MysterAitch what index are you specifically referring to?

@MysterAitch

MysterAitch commented May 13, 2020

I seem to have gotten a little carried away in trying to give a concrete answer rather than "maybe it might be x...". Here are some notes/thoughts thus far.

Sorry this is not clearer/more succinct, but I need to step away for a while now and don't have time to edit it down further. Hope it is at least a little helpful!

tldr... It is definitely a circular buffer in use (rationale and pointers to key parts quoted below), but I'm still not able to figure out what the 2048 refers to... Maybe it is just the buffer size increment and got omitted from this commit?

I think I get the circular array so far.
Maybe 2048 is like "50% of the buffer size"?
Is that an optimization that says "if the token starts in the second half, fill the buffer starting from 0"?
In that case I would replace "2048" with "bufsize / 2".
Any input?

@MysterAitch what index are you specifically referring to?

It is the comment on line 220 of AbstractCharStream.template that I'd noticed. It is also quoted below, but with what I believe is more readable indentation and with optional braces included around single-line code blocks.

Reading a bit more of the context (it's within the fillBuff() method), it seems that this happens when values are being read from the source file and values are being inserted into the buffer (populating the buffer).

While I'm leaning towards 2048 being the presumed length of the buffer, without stepping through with a debugger or reviewing the git history of that file, I'm unsure whether 2048 refers to the presumed size of the buffer or, as you suggest, a good place to begin resizing the array (akin to a hashtable-based collection's load factor).

    if (maxNextCharInd == available) {
        if (available == bufsize) {
            // ?What is the reason for this?
            if (tokenBegin > 2048) {
                // tokenBegin is always assigned as 0, -1, or bufpos
                // ...presumably this means that we've reached the end of the underlying array and need to "wrap around" back to buffer[0] ...?
                bufpos = 0;
                maxNextCharInd = 0;
                available = tokenBegin;
            } else if (tokenBegin < 0) {
                // Same as above, but don't edit 'available'..?
                bufpos = 0;
                maxNextCharInd = 0;
            } else {
                // Expand the buffer, but don't worry about wrapping around..?
                expandBuff(false);
            }
        } else if (available > tokenBegin) {
            available = bufsize;
        } else if ((tokenBegin - available) < 2048) {
            // This jumps out as a key place to inspect -- expanding the buffer size..
            expandBuff(true);
        } else {
            available = tokenBegin;
        }
    }

See also the comments I've added below to the array copy parts of expandBuff (lines 171-174):

      // src, srcPos, dest, destPos, length
      // copy the "tail end" to the "start" (index 0) of the new buffer array 
      System.arraycopy(buffer, tokenBegin, newbuffer, 0, nPreservedChars);
      // copy the remaining "wrap around" content of the buffer from the start of the original buffer (starting at srcPos index 0) 
      System.arraycopy(buffer, 0, newbuffer, nPreservedChars, bufpos); 
      // swap the new buffer in place of the old buffer
      buffer = newbuffer;

There is also this part of readChar (lines 335-340).

      // Equivalent to:
      // bufpos = (bufpos + 1) % bufsize;
      // note that bufsize might not necessarily be equivalent to buffer.length (assuming that bufsize refers to the number of USED slots within the array, while buffer.length refers to the number of possible spaces - some of which might be empty/null/whatever)

      // lines 335-340:
      ++bufpos;
      if (bufpos == bufsize)
      {
        // Buffer overflow
        bufpos = 0;
      }

This is my understanding of the purpose of each of the variables.

All in all though, I'm inclined to suggest that it might be easier to just make use of / delegate to java.nio.CharBuffer

    // lines 208-210

    // Increase buffer size
    bufsize = nNewBufSize; // The number of array positions (i.e. buffer.length)
    available = nNewBufSize; // The number of unoccupied array positions
    tokenBegin = 0; // The first array index (of `buffer`) that the current token starts

I include the below for completeness, from an earlier revision of this comment before I'd come to the firm conclusion that a circular buffer is used.

If it is a circular buffer, then when looping back to index 0 the number used in the comparison must always be wherever the end of the underlying array is (i.e. its length, or length - 1, depending on when the index gets incremented and which operator is used).

Consider this image -- I've adapted it from one of the data structures resources I use (not originally created by me though!). In particular, note that it shows an array of length 100 (0 to 99) and a case where the next token spans the range (of the underlying array) of 98-1 (four array positions: 98, 99, 0, 1).

In this case, wrapping around back to buffer[0] will only ever happen when we reach the array's length.

[Image: circular buffer diagram, array of length 100 with a token wrapping from index 98 to index 1]

Edit: There is also this webpage which describes things much more clearly than the geeksforgeeks link (lots more pictures!):

http://www.mathcs.emory.edu/~cheung/Courses/171/Syllabus/8-List/array-queue2.html

phax added a commit that referenced this issue May 13, 2020
@phax
Collaborator

phax commented May 13, 2020

Thanks a lot for your input.
I cross-checked and the default buffer size was always 4096, so 2048 would be half of it.
Did some changes in that direction (bufsize/2) and it looks good...
I think I'll leave it as it is now and build PGCC 1.1.3 and ph-javacc-maven-plugin 4.1.4.

Afterwards, a solution with CharBuffer definitely sounds interesting.
Also the differentiation between "modern" and "non-modern" Java code may be questionable. Especially because the "JavaCharStream" does some pretty weird shit using its own buffering sigh

@szarnekow
Author

The second buffering mechanism in JavaCharStream was probably necessary to make sure that unicode code points are not split by accident in the normal buffer flow.
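One plausible illustration of that concern (an assumption about the motivation, not taken from the JavaCharStream source): in Java, a supplementary code point occupies two chars, so a buffer refill boundary that falls between the two units must not treat the first unit as a complete character:

```java
// Sketch: a supplementary code point (here U+1F600) spans two UTF-16 code units.
// A buffer refill that stops after the first unit holds only half a character.
public class SurrogateSketch {
    public static void main(String[] args) {
        char[] units = Character.toChars(0x1F600); // two Java chars
        System.out.println(units.length);                         // 2
        System.out.println(Character.isHighSurrogate(units[0]));  // true
        System.out.println(Character.isLowSurrogate(units[1]));   // true
    }
}
```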

@MysterAitch

Thanks a lot for your input.
I cross-checked and the default buffer size was always 4096, so 2048 would be half of it.

At the risk of patronising (not intentionally!!), note that "half the initial length" (always 2048) is subtly different from bufsize/2 ("half the current length" -- which could be e.g. 4096 or 8192 or whatever if the buffer doubles in size a few times).

...Which is also subtly different from "the buffer growth increment" (which might coincidentally/arbitrarily be the same as half of the initial/current length, depending on the algorithm chosen).

It is probably inconsequential in this case to use them interchangeably, but it is worth bearing in mind when debugging further or storing these values in a named static final field.
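The distinction can be made concrete. In this hypothetical sketch (the names are illustrative, not from the template), `HALF_INITIAL` stays at 2048 forever while "half the current length" tracks the growing buffer:

```java
// Sketch: "half the initial size" is a compile-time constant, while
// "half the current size" grows with the buffer. Names are illustrative only.
public class ThresholdSketch {
    static final int INITIAL_BUFSIZE = 4096;
    static final int HALF_INITIAL = INITIAL_BUFSIZE / 2; // always 2048

    // "Half the current length" after n doublings of the initial buffer.
    static int halfCurrent(int doublings) {
        return (INITIAL_BUFSIZE << doublings) / 2;
    }

    public static void main(String[] args) {
        System.out.println(HALF_INITIAL);    // 2048, regardless of growth
        System.out.println(halfCurrent(0));  // 2048 -- identical only before any expansion
        System.out.println(halfCurrent(2));  // 8192 -- diverges once the buffer has doubled
    }
}
```

The two quantities coincide only before the first expansion, which is exactly why a hard-coded 2048 can look correct in small tests and still misbehave once the buffer grows.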

Afterwards, a solution with CharBuffer definitely sounds interesting.
Also the differentiation between "modern" and "non-modern" Java code may be questionable. Especially because the "JavaCharStream" does some pretty weird shit using its own buffering sigh

Heh, I think I'll leave that can of worms for you and @szarnekow to work through 😉

I'm appreciating that you're both looking into this though --- thanks! 😄

@phax
Collaborator

phax commented May 13, 2020

Closing this for now - thank you all for the effort you put into this. PGCC 1.1.3 is on its way to Maven Central and the Maven plugin will follow soon.
