No way to enforce node content to be parsed "as-is" #272

Closed
atomobianco opened this issue Oct 19, 2017 · 4 comments

atomobianco commented Oct 19, 2017

I am facing a problem parsing many XML documents that contain nodes with HTML code.
Here is an example:

<?xml version="1.0" encoding="UTF-8"?>
<note>
  <p>This text could work fine</p>
  <p>And here comes the <span class="bold">piece</span> that gives problems</p>
</note>

The problem is that com.databricks.spark.xml.parsers.StaxXmlParserUtils#currentStructureAsString treats the closing </span> tag as the end of the enclosing element and stops there.
As a consequence, my documents are only partially parsed.

I expected that using a custom schema and declaring a node as StringType would force all text inside it to be read as a string, without further XML interpretation. Any suggestions on how I could make this the case?

m-stafford commented Aug 7, 2018

I think I've landed on the solution to this: bluelabsio#1
Still a work in progress as I test it out on my dataset, but I think it makes sense.

The problem seemed to be that the nested tag (<span> in this case) wasn't being parsed recursively by currentStructureAsString due to the prior parsing of the text before the tag, so upon encountering the closing tag it was interpreted as the closing tag to the <p> tag.
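
The depth-tracking idea described above can be sketched in a few lines. This is a minimal Python illustration using the standard-library SAX parser, not the spark-xml code and not the patch in bluelabsio#1; the names `InnerXmlCollector` and `inner_xml` are hypothetical. It only treats a closing tag as the end of the outer element once the nesting depth returns to zero, and re-serializes nested tags into the collected string.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class InnerXmlCollector(xml.sax.ContentHandler):
    """Collect the inner markup of every <row_tag> element as one string."""

    def __init__(self, row_tag):
        super().__init__()
        self.row_tag = row_tag
        self.depth = 0      # 0 means "not inside a row_tag element"
        self.parts = []     # pieces of the element currently being collected
        self.results = []   # one string per completed row_tag element

    def startElement(self, name, attrs):
        if self.depth == 0:
            if name == self.row_tag:
                self.depth = 1
                self.parts = []
            return
        # Nested tag: write it back out as text and go one level deeper.
        attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        self.parts.append(f"<{name}{attr_text}>")
        self.depth += 1

    def characters(self, content):
        if self.depth > 0:
            self.parts.append(escape(content))

    def endElement(self, name):
        if self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            # Only now has the closing tag of the outer element been reached.
            self.results.append("".join(self.parts))
        else:
            self.parts.append(f"</{name}>")


def inner_xml(document, row_tag):
    """Return the inner markup of each row_tag element in document."""
    handler = InnerXmlCollector(row_tag)
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.results
```

Called as `inner_xml(document, "p")` on the example from the first comment, this yields two strings, and the second one still contains the `<span class="bold">piece</span>` markup instead of being cut off at `</span>`.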

m-stafford commented Aug 8, 2018

Added one more fix to the above patch to handle the skipChildren function. Now it seems to work for all the cases I throw at it. I think it needs additional testing before being useful though. @HyukjinKwon, that's the case, right?

To be clear, this does not fix the inference problem when it comes to tags with child tags but also data. I'm gonna try and figure that out next as time allows.

srowen (Collaborator) commented Dec 22, 2018

That's not valid XML, if the <span> tag isn't meant to be interpreted as a tag. You'd need to escape it as an XML entity.
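
For readers who control how the files are produced, the escaping suggested here could look like the following sketch. It uses Python's standard library only; the variable names are illustrative and the fragment is the one from the first comment. Once escaped, the HTML is plain character data, so a parser sees a `<p>` element with text and no child elements.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# HTML fragment that should be stored as text, not parsed as XML structure.
html_fragment = 'And here comes the <span class="bold">piece</span> that gives problems'

# escape() replaces &, < and > with the corresponding XML entities.
escaped = escape(html_fragment)
document = f"<note><p>{escaped}</p></note>"

# After parsing, the element has no children and its text is the original HTML.
p = ET.fromstring(document).find("p")
recovered = p.text
child_count = len(p)
```

Wrapping the fragment in a CDATA section is another standard way to get the same effect, assuming the fragment does not itself contain `]]>`.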

srowen closed this as completed Dec 22, 2018

atomobianco (Author) commented

It might not be valid, but it's a real-world scenario. I am parsing government files with similar XML:

...
<CONTENU>Lorem<br/>ipsum<br/></CONTENU>

and I can't get a column with the entire content "Lorem ipsum" or "Lorem
ipsum
"
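
When the files cannot be changed, one possible workaround is to flatten such mixed content before or outside of spark-xml. The sketch below is an assumption-laden illustration in plain Python, not a spark-xml feature; `text_with_breaks` is a hypothetical helper that joins the text of an element and turns each `<br/>` into a separator of your choice.

```python
import xml.etree.ElementTree as ET


def text_with_breaks(element, break_tag="br", separator="\n"):
    """Concatenate all text under element, replacing break_tag with separator."""
    parts = []

    def walk(node):
        if node.text:
            parts.append(node.text)
        for child in node:
            if child.tag == break_tag:
                parts.append(separator)
            else:
                walk(child)
            # Text that follows the child's closing tag belongs to the parent.
            if child.tail:
                parts.append(child.tail)

    walk(element)
    return "".join(parts)


contenu = ET.fromstring("<CONTENU>Lorem<br/>ipsum<br/></CONTENU>")
with_newlines = text_with_breaks(contenu)
with_spaces = text_with_breaks(contenu, separator=" ").strip()
```

Here `with_newlines` keeps one line break per `<br/>`, while `with_spaces` gives the single-line form "Lorem ipsum".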
