No way to enforce node content to be parsed "as-is" #272
Comments
I think I've landed on the solution to this: bluelabsio#1 The problem seemed to be that the nested tag ( |
Added one more fix to the above patch to handle the To be clear, this does not fix the inference problem when it comes to tags with child tags but also data. I'm gonna try and figure that out next as time allows. |
That's not valid XML, if the |
It might not be valid, but it's a real-world scenario. I am parsing government files with similar XML
and I can't get a column with the entire content "Lorem ipsum" or "Lorem |
I am facing a problem parsing many XML documents that contain nodes with HTML code.
Here is an example:
The problem is that com.databricks.spark.xml.parsers.StaxXmlParserUtils#currentStructureAsString finds a closing element in </span> and will stop there. As a consequence I have documents that are not completely parsed.
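To make the truncation concrete, here is a minimal sketch in plain Python, independent of spark-xml. It is not the library's code: the `capture` function, its `track_depth` flag, and the sample document are all hypothetical and exist only for illustration. It walks the same kind of event stream a StAX reader produces and shows the difference between stopping at the first closing tag and stopping only at the matching one.

```python
import xml.parsers.expat


def capture(document, tag, track_depth):
    """Return the content of the first <tag> element as a string.

    track_depth=False stops at the first closing tag seen inside <tag>;
    track_depth=True stops only at the closing tag that matches <tag>.
    Hypothetical helper for illustration only, not part of spark-xml.
    """
    parts = []
    state = {"depth": 0, "done": False}  # depth 0 means "outside <tag>"

    def on_start(name, attrs):
        if state["done"]:
            return
        if state["depth"] == 0:
            if name == tag:
                state["depth"] = 1
            return
        state["depth"] += 1
        parts.append("<%s>" % name)  # attributes omitted for brevity

    def on_end(name):
        if state["done"] or state["depth"] == 0:
            return
        if not track_depth or state["depth"] == 1:
            state["done"] = True  # stop collecting
            return
        state["depth"] -= 1
        parts.append("</%s>" % name)

    def on_chars(data):
        if state["depth"] > 0 and not state["done"]:
            parts.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_chars
    parser.Parse(document, True)
    return "".join(parts)


sample = "<doc><body>Lorem <span>ipsum</span> dolor</body></doc>"
truncated = capture(sample, "body", track_depth=False)
complete = capture(sample, "body", track_depth=True)
```

With the sample above, `truncated` ends right after the text inside the span, while `complete` keeps everything up to the closing tag of the body element. Counting nesting depth is what allows the second variant to ignore the inner closing tag.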
I was expecting that, by using a custom schema and declaring a node as StringType, all text inside it would be forced to be a string, without further XML interpretation. Any suggestions on how I could make this the case?
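For reference, the behaviour I am after looks roughly like the sketch below. It uses only Python's standard library rather than spark-xml, so it is an illustration of the expected result and not a fix; the `inner_xml` helper and the sample document are hypothetical, and it assumes the input is well-formed XML.

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    """Serialize everything between an element's start and end tags.

    Hypothetical helper: child elements are kept as markup instead of
    being interpreted as nested structure.
    """
    pieces = [element.text or ""]
    for child in element:
        # tostring() also emits the child's tail text, so the text that
        # follows each nested tag is preserved.
        pieces.append(ET.tostring(child, encoding="unicode"))
    return "".join(pieces)


root = ET.fromstring("<doc><body>Lorem <span>ipsum</span> dolor</body></doc>")
body_as_string = inner_xml(root.find("body"))
```

Here `body_as_string` holds the whole content of the body node, nested tag included, which is the value I would expect to see in a StringType column.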