Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parser when not replacing entities and treating char references as entities #176

Merged
merged 3 commits into from
Nov 11, 2023

Conversation

gnodet
Copy link
Contributor

@gnodet gnodet commented Jul 13, 2023

The problem is especially visible when configuring woodstox with round-tripping.
In this mode, char references and entities are not properly reported.

@cowtowncoder
Copy link
Member

@gnodet Could you explain the problem in bit more detail (unless there's issue matching PR that has it)?
I know there are unit tests added (great!) but it's been a while so it'll take a while for me to remember logic wrt configuration settings.

@gnodet
Copy link
Contributor Author

gnodet commented Jul 19, 2023

@gnodet Could you explain the problem in bit more detail (unless there's issue matching PR that has it)?
I know there are unit tests added (great!) but it's been a while so it'll take a while for me to remember logic wrt configuration settings.

So my understanding is that:

  • round tripping tries to report every single event so that a reader/writer pair can write the exact xml that is read
  • this mode sets replaceEntityRefs to false (so that they are kept in the &xxx; form) and treatCharRefsAsEnts to true (to also keep them in the &xxx; form)

However, there are a few combinations that do not work well when external entities or char entities are found...

@cowtowncoder
Copy link
Member

Ah ok, thank you. That makes sense.

So treatCharRefsAsEnts is not taking effect in all combinations?

@gnodet
Copy link
Contributor Author

gnodet commented Jul 19, 2023

Ah ok, thank you. That makes sense.

So treatCharRefsAsEnts is not taking effect in all combinations?

It's actually worse than that. It can cause wrong data to be returned when using getLocalName() for example. That's how I found the problem. This is shown in the test https://github.com/FasterXML/woodstox/pull/176/files#diff-47d3ee19006ff84cab043d2aeb042ff71d39c0ba4693aad07e822cddffa8631dR39, where the call to getLocalName() returns foo instead of oelig if you remove the fixes from this PR.

And yes, treatCharRefsAsEnts is also not taking effect when not replacing entities. This is shown in the test at line https://github.com/FasterXML/woodstox/pull/176/files#diff-0cacc0085dd6a986502abe8aacc2a6391f3093c33bcbfa38a5eb99f4de649fa0R127 which fails, the call to next() returning CHARACTERS instead of ENTITY_REFERENCE.

@cowtowncoder
Copy link
Member

cowtowncoder commented Nov 4, 2023

@gnodet Apologies for very slow follow up here. I'd be happy to merge this now.

But just one minor thing: if I haven't gotten CLA from you for Woodstox, I'd need it from:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

(yes, it's under Jackson repo but we use same CLA for Woodstox too).
It's easiest usually to print, fill & sign, scan/photo, email info at fasterxml dot com.
This only needs to be done once, so if you had sent one earlier please let me know.
Otherwise looking forward to merging this contribution.

Thank you again!

EDIT: note, I originally indicated cla at ... but realized that email alias is not set up -- CLA needs to be (re-)sent to info at fasterxml dot com instead.

@cowtowncoder
Copy link
Member

Quick note wrt ^^^ -- CLA actually needs to go to info at fasterxml dot com.
So @gnodet if you sent one already it'd need to be re-sent. Apologies for mix up here.

@gnodet
Copy link
Contributor Author

gnodet commented Nov 10, 2023

Resent to the correct address.

@cowtowncoder cowtowncoder added this to the 6.6.0 milestone Nov 11, 2023
@cowtowncoder cowtowncoder merged commit 9d698ab into FasterXML:master Nov 11, 2023
@cowtowncoder
Copy link
Member

Thank you again @gnodet ! I am hoping to get #174 (for #165) merged and can then release 6.6.0.

cowtowncoder added a commit that referenced this pull request Nov 11, 2023
@gnodet gnodet deleted the rountrip-char-refs branch November 17, 2023 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants