No way to enforce node content to be parsed "as-is" #272

atomobianco · 2017-10-19T14:30:46Z

I am facing a problem parsing many XML documents that contain nodes with HTML code.
Here is an example:

<?xml version="1.0" encoding="UTF-8"?>
<note>
  <p>This text could work fine</p>
  <p>And here comes the <span class="bold">piece</span> that gives problems</p>
</note>

The problem is that com.databricks.spark.xml.parsers.StaxXmlParserUtils#currentStructureAsString treats </span> as the closing element of the node and stops there.
As a consequence, my documents are only partially parsed.

I expected that, with a custom schema declaring a node as StringType, everything inside that node would be read as a string, without further XML interpretation. Any suggestions on how to get this behavior?

m-stafford · 2018-08-07T17:17:35Z

I think I've landed on the solution to this: bluelabsio#1
Still a work in progress as I test it out on my dataset, but I think it makes sense.

The problem seems to be that the nested tag (<span> in this case) isn't parsed recursively by currentStructureAsString, because the text before the tag has already been consumed. When the parser then reaches the closing </span>, it interprets it as the closing tag of the enclosing <p>.
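
To illustrate the idea (this is not the actual patch, and not spark-xml code), here is a minimal Python sketch using the standard library's expat bindings. It keeps a depth counter so that only the end tag matching the opening <p> finishes the capture; nested tags are written back into the string. The function name capture_inner_markup is hypothetical.

```python
import xml.parsers.expat
from xml.sax.saxutils import escape, quoteattr


def capture_inner_markup(xml_text, target):
    """Return the inner markup of every <target> element as a string.

    Illustrative only: mimics "read this node as-is" by tracking nesting
    depth, so a nested closing tag never ends the capture early.
    """
    results = []
    buf = []
    depth = 0  # 0 = outside target, 1 = directly inside, >1 = inside a child

    def start(name, attrs):
        nonlocal depth
        if depth == 0:
            if name == target:
                depth = 1
                buf.clear()
            return
        # A child element inside the target: go one level deeper and
        # re-serialize its start tag into the buffer.
        depth += 1
        attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        buf.append(f"<{name}{attr_text}>")

    def end(name):
        nonlocal depth
        if depth == 0:
            return
        depth -= 1
        if depth == 0:
            # Only now have we reached the end tag of the target itself.
            results.append("".join(buf))
        else:
            # Note: empty elements such as <br/> come back as <br></br>.
            buf.append(f"</{name}>")

    def chars(data):
        if depth > 0:
            buf.append(escape(data))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return results


doc = (
    "<note>"
    "<p>This text could work fine</p>"
    '<p>And here comes the <span class="bold">piece</span> that gives problems</p>'
    "</note>"
)
for fragment in capture_inner_markup(doc, "p"):
    print(fragment)
```

Without the depth counter, the first end tag seen after the capture starts (</span>) would be mistaken for the end of <p>, which is the truncation described above.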

m-stafford · 2018-08-08T01:36:00Z

Added one more fix to the above patch to handle the skipChildren function. Now it seems to work for all the cases I throw at it. I think it needs additional testing before being useful though. @HyukjinKwon, that's the case, right?

To be clear, this does not fix the schema inference problem for tags that contain both child tags and text. I'm gonna try and figure that out next as time allows.

srowen · 2018-12-22T02:59:40Z

That's not valid XML, if the <span> tag isn't meant to be interpreted as a tag. You'd need to escape it as an XML entity.
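
As a rough sketch of what escaping means here, assuming you control how the files are produced: the Python snippet below (standard library only, not spark-xml) escapes the HTML fragment before embedding it, so any XML parser sees a single text node instead of a child element.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

html_fragment = 'And here comes the <span class="bold">piece</span> that gives problems'

# escape() turns &, < and > into &amp;, &lt; and &gt;
doc = f"<note><p>{escape(html_fragment)}</p></note>"
print(doc)

p = ET.fromstring(doc).find("p")
# The parser unescapes the entities, so the text comes back unchanged
# and <p> has no child elements.
print(p.text == html_fragment, len(p))
```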

atomobianco · 2022-10-18T15:42:01Z

It might not be valid, but it's a real-world scenario. I am parsing government files with similar XML:

...
<CONTENU>Lorem<br/>ipsum<br/></CONTENU>

and I can't get a column with the entire content "Lorem ipsum" or "Lorem
ipsum
"
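
For reference, this is the result I am after, sketched in plain Python with the standard library rather than spark-xml. The helper name text_with_breaks is made up for illustration, and treating every <br/> as a separator is my assumption about these files.

```python
import xml.etree.ElementTree as ET


def text_with_breaks(elem, br_tag="br", sep="\n"):
    """Flatten mixed content to text, replacing each <br/> with `sep`."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == br_tag:
            parts.append(sep)
        else:
            # Recurse into any other inline element and keep its text.
            parts.append(text_with_breaks(child, br_tag, sep))
        # Text that follows the child, still inside the parent element.
        parts.append(child.tail or "")
    return "".join(parts)


elem = ET.fromstring("<CONTENU>Lorem<br/>ipsum<br/></CONTENU>")
print(repr(text_with_breaks(elem)))                   # 'Lorem\nipsum\n'
print(repr(text_with_breaks(elem, sep=" ").strip()))  # 'Lorem ipsum'
```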

srowen closed this as completed Dec 22, 2018
