-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SimpleCharStream expandBuffer may corrupt buffer content #26
Comments
Can be reproduced with this test case in ConditionParserTest when the JAVA_UNICODE option is disabled in the Conditions.jj.
I managed to fix this in the AbstractCharStream.template but the maven build does not seem to pick that up. Do you have any hint? |
if you paste the fix in this project, I'm happy to update it, if I understand the issue ;-) |
Reverting the fix for #23 fixes the lexing error for me. I'll push a reproducible example in a minute. |
Signed-off-by: Sebastian Zarnekow <[email protected]>
What Java version are you using? I am testing with 1.8 atm |
So am I. Why are you asking? |
I thought it is maybe an issue with a later version. Basically I was confused because you said "I changed it and it workd", and I changed it and it didn't work. But with all the discussion in #27 I now know what you mean. |
I can confirm that we have test cases on the JavaParser project which fail when using Based on my step-throughs in a debugger, the difference is definitely stream related. Specifically where a long token is encountered (this is where I got somewhat lost, though!). For that reason, I presume that this issue describes the root cause. Happy to provide direct links to these if desired! :) For now we've rolled back to v4.1.2 of the maven plugin (as of our release yesterday). |
Thanks for the info. I am curreently swamped with work but definitively want to tackle it. As stated above, I need to understand the details of the template in order to fix it properly.... |
No worries -- I completely understand! Until very recently, we were on v3.0.0 of the maven plugin (and whatever version dependencies that has), so no particular rush needed on my account - unless anything new comes to light, we can just sit on v4.1.2 (/v1.1.1) for a while or even just rollback to v3.0.0! :) @szarnekow seems to have more expertise than I based on the what I see of his comments above, but if I do revisit this I'll be sure to pass on any new information I uncover. |
What I did to track this down was basically a diff of the original java-cc templates to the ones here with a special focus on more recent changes. |
One of the things I've noticed and been caught out by when working with JP is that sometimes it takes multiple builds for changes to take effect -- once for the relevant builders/generators to overwrite its own code, and the second time is running tests against this new version. Maybe you've already taken this into account? Apologies for the distraction if you have, or if it's not relevant when working with JavaCC itself! |
The dependency here is: that a release of PGCC requires an updated release of the "ph-javacc-maven-plugin" which is in turn used by the next version of PGCC. If both versions are on SNAPSHOT I can test dependencies between them. |
About the lists of lists thing: It looks like a lot of stuff indeed. You can achieve almost the same performance characteristics with moving away from a constant buffer increase of 2048 chars to an approach that simply doubles the buffer size on each increase. |
Sure I am interested. |
@szarnekow wait before you invest time in it. I think I am close to resolving it.... |
This is also related to #21 |
I've just seen the comment added to the commit linked above -- if my (very limited) understanding of that code is correct, this seems to be related to circular array / circular buffer functionality.
Possibly a better explanation: Assuming that this is true, if the array is always of length 2048 then fine.... but the 2048 must not normally be hard-coded (instead, it should be the length of the array). Edit: It would also be prudent to check at which point the index is incremented -- depending on how/when values change, choosing |
I think I get the circular array so far. @MysterAitch what index are you specifically referring to? |
I seem to have gotten a little carried away in trying to give a concrete answer rather than "maybe it might be x...". Here are some notes/thoughts thus far. Sorry this is not clearer/more succinct, but I need to step away for a while now and don't have time to edit it down further. Hope it is at least a little helpful! tldr... It is definitely a circular buffer in use (rationale and pointers to key parts quoted below), but I'm still not able to figure out what the 2048 refers to... Maybe it is the just the buffer size increase and got omitted from this commit?
It is the comment on line 220 of Reading a bit more of the context (it's within the While I'm leaning towards 2048 being the presumed length of the buffer, without stepping through with a debugger or the git history of that file I'm unsure if 2048 refers to the presumed size of the buffer or, as you suggest, a good place to begin with resizing the array (akin to a hashtable-based collection's load factor). if (maxNextCharInd == available) {
if (available == bufsize) {
// ?What is the reason for this?
if (tokenBegin > 2048) {
// tokenBegin is always assigned as 0, -1, or bufpos
// ...presumably this means that we've reached the end of the underlying array and need to "wrap around" back to buffer[0] ...?
bufpos = 0;
maxNextCharInd = 0;
available = tokenBegin;
} else if (tokenBegin < 0) {
// Same as above, but don't edit 'available'..?
bufpos = 0;
maxNextCharInd = 0;
} else {
// Expand the buffer, but don't worry about wrapping around..?
expandBuff(false);
}
} else if (available > tokenBegin) {
available = bufsize;
} else if ((tokenBegin - available) < 2048) {
// This jumps out as a key place to inspect -- expanding the buffer size..
expandBuff(true);
} else {
available = tokenBegin;
}
} See also the comments I've added below to the array copy parts of // src, srcPos, dest, destPos, length
// copy the "tail end" to the "start" (index 0) of the new buffer array
System.arraycopy(buffer, tokenBegin, newbuffer, 0, nPreservedChars);
// copy the remaining "wrap around" content of the buffer from the start of the original buffer (starting at srcPos index 0)
System.arraycopy(buffer, 0, newbuffer, nPreservedChars, bufpos);
// swap the new buffer in place of the old buffer
buffer = newbuffer; There is also this part of // Equivalent to:
// bufpos = (bufpos + 1) % bufsize;
// note that bufsize might not necessarily be equivalent to buffer.length (assuming that bufsize refers to the number of USED slots within the array, while buffer.length refers to the number of possible spaces - some of which might be empty/null/whatever)
// lines 335-340:
++bufpos;
if (bufpos == bufsize)
{
// Buffer overflow
bufpos = 0;
} This is my understanding of the purpose of each of the variables. All in all though, I'm inclined to suggest that it might be easier to just make use of / delegate to java.nio.CharBuffer // lines 208-210
// Increase buffer size
bufsize = nNewBufSize; // The number of array positions (i.e. buffer.length)
available = nNewBufSize; // The number of unoccupied array positions
tokenBegin = 0; // The first array index (of `buffer`) that the current token starts I include the below for completeness, from an earlier revision of this comment before I'd come to the firm conclusion that a circular buffer is used. If it is a circular buffer then looping back to index 0 the number used in the comparison must always be wherever the end of the underlying array is (i.e. its Consider this image -- I've adapted it from one the data structures resources I use (not originally created by me though!). In particular, note that it shows an array of length 100 (0 to 99) and a case where the next token is In this case, all wrapping around back to Edit: There is also this webpage which describes things much more clearly than the geeksforgeeks link (lots more pictures!): http://www.mathcs.emory.edu/~cheung/Courses/171/Syllabus/8-List/array-queue2.html |
Thanks alot for your input. Afterwards a solution with CharBuffer definitively sounds interesting. |
The second buffering mechanism in JavaCharStream was probably necessary to make sure that unicode code points are not split by accident in the normal buffer flow. |
At the risk of patronising (not intentionally!!), note that "Half the initial length" (always 2048) is subtly different to ...Which is also subtly different to "the buffer increase size/rate" (which, coincidentally/arbitrarily might be the same as half of the initial/current length if that's the algorithm chosen). It is probably inconsequential in this case to use them interchangeably, but worth bearing in mind if needing to debug further or storing these values in a named static final field.
Heh, I think I'll leave that can of worms for you and @szarnekow to work through 😉 I'm appreciating that you're both looking into this though --- thanks! 😄 |
Closing this for now - thank you all for the effort you put into this. PGCC 1.1.3 is on its way to Maven Central and the Maven Plugin will follow soon |
While working with the excellent https://github.com/javaparser/javaparser, that is using this fork of javacc along with the option to use the modern templates without Unicode handling, I ran into lexing issues with long comments / strings when certain buffer expansion behaviour was triggered.
The text was updated successfully, but these errors were encountered: