Configuring Woodstox II ‐ Stax2 Properties

winfriedgerlach edited this page Nov 24, 2024 · 4 revisions

<-- back to previous chapter Configuring Woodstox I ‐ Basic Stax Properties

Another way to configure: profiles

As mentioned earlier, the standard Stax way of configuring anything is through the factories, using the setProperty(name, value) method. This applies to Stax2 as well.

But there is also another mechanism, “profiles”: groups of settings that set configuration defaults to optimize for a specific goal. The methods are named configureFor[Goal], for example “configureForSpeed”.

XMLInputFactory2 has the following profile-configuration methods:

  • configureForConvenience: enable features that should simplify handling: enable coalescing, report all text segments as CHARACTERS (and not CDATA), enable P_PRESERVE_LOCATION
  • configureForLowMemUsage: try to reduce the amount of memory retained during processing by disabling coalescing (which allows the parser to report smaller segments) and disabling P_PRESERVE_LOCATION
  • configureForRoundTripping: try to preserve event information as much as possible, so that writing events straight back out does not alter physical aspects of the XML: disable coalescing, preserve the distinction between CHARACTERS and CDATA, disable automatic entity expansion (so entities may be written out)
  • configureForSpeed: try to minimize the performance overhead of options: disable coalescing, disable P_PRESERVE_LOCATION; enable intern()ing of both element/attribute names and namespace URIs
  • configureForXmlConformance: enable features required to conform to the XML 1.x specification: namespaces, DTD processing
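To make the profile idea concrete: with Woodstox on the classpath, applying a profile is a single call such as `((XMLInputFactory2) factory).configureForSpeed()`. The sketch below is a hand-written approximation of what that profile is described as doing in the list above, expressed as individual `setProperty` calls; it is not Woodstox's own implementation. The names `SpeedProfileSketch`, `setIfSupported` and `newSpeedTunedFactory` are hypothetical, and the property-name strings are assumptions (what I understand the `XMLInputFactory2` constants of the same name to hold), so prefer the constants when Stax2 is available. Each Stax2 property is guarded with `isPropertySupported`, which lets the code run unchanged (skipping those properties) on a Stax implementation that does not know them.

```java
import javax.xml.stream.XMLInputFactory;

final class SpeedProfileSketch {
    // Assumed string values of the XMLInputFactory2 constants of the same name.
    static final String P_PRESERVE_LOCATION = "org.codehaus.stax2.preserveLocation";
    static final String P_INTERN_NAMES = "org.codehaus.stax2.internNames";
    static final String P_INTERN_NS_URIS = "org.codehaus.stax2.internNsUris";

    /** Sets a property only if the factory recognizes it; returns whether it was set. */
    static boolean setIfSupported(XMLInputFactory f, String name, Object value) {
        if (!f.isPropertySupported(name)) {
            return false;
        }
        f.setProperty(name, value);
        return true;
    }

    /** Approximates the settings listed for configureForSpeed. */
    static XMLInputFactory newSpeedTunedFactory() {
        XMLInputFactory f = XMLInputFactory.newFactory();
        // Plain Stax property: report text in smaller, non-merged segments.
        f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        // Stax2 properties: applied only where the implementation supports them.
        setIfSupported(f, P_PRESERVE_LOCATION, Boolean.FALSE);
        setIfSupported(f, P_INTERN_NAMES, Boolean.TRUE);
        setIfSupported(f, P_INTERN_NS_URIS, Boolean.TRUE);
        return f;
    }
}
```

Individual properties can still be changed on the returned factory afterwards, which is the usual way to start from a profile and then adjust one or two settings.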

XMLOutputFactory2 has the following profile-configuration methods:

  • configureForRobustness: enable both validation and repairing options to try to ensure that output is valid, even if changes are needed (for example, in rare cases comment contents may need to be split if the caller tries to output a sequence of two hyphens; or, for CDATA, two ] characters)
  • configureForXmlConformance: enable all validation options to try to prevent any potential well-formedness problems (for example, with respect to namespace bindings), but not all repairing options
  • configureForSpeed: optimizes for output performance: will disable validation operations that require scanning over contents; in a way opposite of conformance/robustness profiles.

Stax2 configuration properties

Using a profile sets values for multiple properties (sometimes both plain Stax and Stax2 properties), but it is always possible to set individual properties directly as well. Let’s have a look at which Stax2-extension properties exist and are supported by Woodstox. Note: most are Boolean-valued; I only mention the type if it is something other than Boolean.

XMLInputFactory2 specifies the following Stax2 properties (along with the default values Woodstox uses):

  • P_AUTO_CLOSE_INPUT (default: false): if enabled, XMLStreamReader will automatically close underlying input source when reader is closed; if disabled will not do so. Stax 1.0 specification mandates that the default behavior is “disabled”, often leading to unintended “dangling” input streams.
  • P_DTD_OVERRIDE (default: null, value type DTDValidationSchema): property that may be set if specific DTD instance is to be used instead of what DOCTYPE declaration specifies (if anything).
    NOTE: reading a DTDValidationSchema is worth its own article, but basically the entry point is XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_DTD)
  • P_INTERN_NAMES (default: true): Whether element and attribute names (“local name” part) returned will be String.intern()'ed first or not. Usually doing so saves memory and helps speed, but occasionally it may be necessary to disable this feature if the number of distinct names is unbounded: for example, if names are randomly generated (like UUIDs)
  • P_INTERN_NS_URIS (default: true): similar to above, but applies to namespace URIs.
  • P_LAZY_PARSING (default: true): Controls whether parsing is “lazy” or “eager”: “eager” meaning that each event is completely parsed when XMLStreamReader.next() is called; “lazy” that only a small part is parsed at that point, and the rest is only parsed if and as needed. Benefits of lazy parsing include much faster skipping of unneeded content (esp. textual content, comments and processing instructions); a possible downside is that sometimes error reporting may occur later than expected (during actual content access or skipping, that is, when calling next() for the following event).
  • P_PRESERVE_LOCATION (default: true): Controls whether XMLStreamLocation information is included in XMLEvent instances or not. Disabling this feature reduces memory usage and improves processing speed modestly, but only when using “Event API” (XMLEventReader).
  • P_REPORT_CDATA (default: true): Whether XML CDATA sections are reported as CDATA Stax event (true) or as general CHARACTERS (false)
  • P_REPORT_PROLOG_WHITESPACE (default: false): When disabled (false), white-space outside XML root element is skipped and not reported; only possible COMMENTs and PROCESSING_INSTRUCTIONs are reported. But if enabled, additional SPACE events are reported — this is mostly (only) useful if trying to fully replicate document indentation outside of root element
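As a sketch of setting a couple of these properties individually through plain `setProperty`: the hypothetical helper below asks for CDATA sections to be reported as CHARACTERS and for the input to be closed together with the reader, then concatenates all text events. The property-name strings are assumptions (what I understand `P_REPORT_CDATA` and `P_AUTO_CLOSE_INPUT` to hold; prefer the constants when Stax2 is on the classpath), and each one is applied only if the factory reports it as supported. Because auto-closing cannot be relied on across Stax implementations, the stream is also closed explicitly with try-with-resources.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class InputPropertiesSketch {
    // Assumed string values of the XMLInputFactory2 constants of the same name.
    static final String P_REPORT_CDATA =
            "http://java.sun.com/xml/stream/properties/report-cdata-event";
    static final String P_AUTO_CLOSE_INPUT = "org.codehaus.stax2.closeInputSource";

    /** Returns the concatenated text content of the given XML document. */
    static String readAllText(String xml) {
        XMLInputFactory f = XMLInputFactory.newFactory();
        if (f.isPropertySupported(P_REPORT_CDATA)) {
            f.setProperty(P_REPORT_CDATA, Boolean.FALSE);
        }
        if (f.isPropertySupported(P_AUTO_CLOSE_INPUT)) {
            f.setProperty(P_AUTO_CLOSE_INPUT, Boolean.TRUE);
        }
        StringBuilder text = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            XMLStreamReader r = f.createXMLStreamReader(in);
            try {
                while (r.hasNext()) {
                    int event = r.next();
                    // CDATA is still handled in case the property was not supported.
                    if (event == XMLStreamConstants.CHARACTERS
                            || event == XMLStreamConstants.CDATA) {
                        text.append(r.getText());
                    }
                }
            } finally {
                r.close();
            }
        } catch (IOException | XMLStreamException e) {
            throw new IllegalStateException("Failed to read XML", e);
        }
        return text.toString();
    }
}
```

Checking for both event types keeps the helper correct whether or not the implementation honors the CDATA setting.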

XMLOutputFactory2 specifies the following Stax2 properties:

  • P_ATTR_VALUE_ESCAPER (default: null, value type EscapingWriterFactory): By default, attribute values are written using the default escaping rules, which apply minimal escaping. It is possible to fully customize escaping details, however. The value to assign has to be of type EscapingWriterFactory, which contains 2 methods for constructing the Writer used for output. Typically used to extend the set of characters that are to be escaped, although it may also be used for advanced purposes such as filtering or even replacing specific contents of attribute values: for example, it could be used to obfuscate certain types of ids (credit-card numbers, SSN).
  • P_TEXT_ESCAPER (default: null, value type EscapingWriterFactory): similar to P_ATTR_VALUE_ESCAPER but used for textual segments (“character data”, NOT including CDATA segments, as they do not allow escaping). Similarly used either for changing escaping details, or for more advanced filtering/modifying of textual content to output.
  • P_AUTO_CLOSE_OUTPUT (default: false): similar to P_AUTO_CLOSE_INPUT, determines whether underlying OutputStream or Writer is automatically closed when XMLStreamWriter is closed — default is false due to Stax 1.0 specification mandating this behavior.
  • P_AUTOMATIC_EMPTY_ELEMENTS (default: true): When a sequence of START_ELEMENT and END_ELEMENT is output — with possible attributes in-between, but no child elements or textual content, it is possible to output either so-called empty element (like <element />) or fully-written out pair (<element></element>). If set to true, empty element is written; if false, separate start/end tags are written.
  • P_AUTOMATIC_NS_PREFIX (default: "wstxns"): When using the “repairing” writer mode, in which namespace URIs are automatically bound, namespace prefixes are generated using this String as the beginning, followed by a sequence number to keep prefixes unique.
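To illustrate the kind of Writer an EscapingWriterFactory could hand back for P_ATTR_VALUE_ESCAPER, here is a minimal sketch of a digit-masking writer. It assumes that the custom Writer receives raw, unescaped attribute content and is therefore responsible for the usual escaping as well. The class name `DigitMaskingAttrWriter`, the `escape` helper and the masking rule are all hypothetical. Wiring it into an EscapingWriterFactory implementation (which would then be registered with `setProperty(XMLOutputFactory2.P_ATTR_VALUE_ESCAPER, factory)`) is left out so that the example needs only the JDK.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

final class DigitMaskingAttrWriter extends FilterWriter {
    DigitMaskingAttrWriter(Writer out) {
        super(out);
    }

    @Override
    public void write(int c) throws IOException {
        switch (c) {
            // Minimal escaping needed inside a double-quoted attribute value.
            case '&': out.write("&amp;"); break;
            case '<': out.write("&lt;"); break;
            case '"': out.write("&quot;"); break;
            default:
                // Mask ASCII digits; pass everything else through unchanged.
                out.write(c >= '0' && c <= '9' ? '*' : c);
        }
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(cbuf[i]);
        }
    }

    @Override
    public void write(String str, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(str.charAt(i));
        }
    }

    /** Convenience for trying the writer out on a single value. */
    static String escape(String rawAttributeValue) {
        StringWriter sink = new StringWriter();
        try (Writer w = new DigitMaskingAttrWriter(sink)) {
            w.write(rawAttributeValue);
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for a StringWriter
        }
        return sink.toString();
    }
}
```

Masking every digit is deliberately crude; a real filter for credit-card numbers or SSNs would need to recognize those patterns and would typically buffer input to do so.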

And last but not least, Woodstox-specific properties

Now that we have covered 2 out of 3 property sets, we are almost ready to have a look at the largest set of properties: ones specific (for now) to Woodstox itself: Configuring Woodstox III ‐ Woodstox‐Specific Properties.