Configuring Woodstox I ‐ Basic Stax Properties
Beyond the standard Stax and SAX settings, Woodstox offers a much wider variety of configuration options. With the Stax API, every setting is applied through the same standard factory methods, regardless of whether the setting itself is defined by Stax:
XMLInputFactory inputF = XMLInputFactory.newFactory();
inputF.setProperty(<property-to-set>, <value>);
XMLOutputFactory outputF = XMLOutputFactory.newFactory();
outputF.setProperty(<property-to-set>, <value>);
where `property-to-set` is a `String` with one of the pre-defined constant values. The value is often of type `Boolean`, but not always: it depends on the setting in question.
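As a concrete illustration, here is a minimal sketch that uses only standard JDK Stax constants. The class name `BasicPropertySketch` is my own placeholder, not part of any API. It sets one `Boolean` property on each factory and reads the value back with `getProperty()`.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;

// Hypothetical class name, used only for this illustration.
final class BasicPropertySketch {

    // Builds an input factory with text coalescing turned on.
    static XMLInputFactory coalescingInputFactory() {
        XMLInputFactory inputF = XMLInputFactory.newFactory();
        inputF.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        return inputF;
    }

    // Builds an output factory with namespace repairing turned on.
    static XMLOutputFactory repairingOutputFactory() {
        XMLOutputFactory outputF = XMLOutputFactory.newFactory();
        outputF.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
        return outputF;
    }

    public static void main(String[] args) {
        // Read the values back to confirm that the factories accepted them.
        System.out.println(coalescingInputFactory()
                .getProperty(XMLInputFactory.IS_COALESCING));
        System.out.println(repairingOutputFactory()
                .getProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES));
    }
}
```

The same two calls work for any property name, whichever of the groups it belongs to; only the constant and the value type change.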
The property-name constants fall into three groups:
- Standard Stax properties (see `XMLInputFactory` and `XMLOutputFactory` for constants): implemented by all compliant Stax implementations
- Stax2 extension API (`XMLInputFactory2` and `XMLOutputFactory2`): implemented by all Stax2-compliant parsers (currently this means Woodstox and Aalto)
- Woodstox-specific properties (`WstxInputProperties`, `WstxOutputProperties`): supported only by Woodstox itself

The following sections describe each group in more detail.
The standard properties are covered by the JDK Javadocs. Most are `Boolean`-valued; I only mention the type when it is something different.
`XMLInputFactory` defines a few settings; the most important are:
- `IS_COALESCING`: if enabled, the parser ensures that all adjacent text (“cdata”) segments are combined into a single `CHARACTERS` event. If disabled, text segments may be returned in an arbitrary number of events (of type `CHARACTERS` and `CDATA`), often split at places where entities are used
- `IS_NAMESPACE_AWARE`: whether namespace processing is enabled. If disabled, namespace binding does not occur, and the full element/attribute name is reported as the “local name” (for example, `xml:space` would have a local name of “xml:space”, and no namespace prefix or URI). If enabled, namespace declarations are handled and prefix/namespace binding is applied as expected
- `SUPPORT_DTD`: whether DTD subset (definition) processing is enabled. If enabled, DTD definitions are read (both internal and external), and parsed entities are expanded. If disabled, the internal DTD subset is skipped and the external subset is not read.
  NOTE: if disabled, no DTD validation occurs, regardless of other settings
- `IS_VALIDATING`: whether DTD validation is enabled (note: does not affect XML Schema, Relax NG, or other validation settings).
  NOTE: only takes effect if `SUPPORT_DTD` is also enabled
- `RESOLVER`: unlike the other options, NOT of type `Boolean` but `XMLResolver`. Allows overriding the reading of external DTD subsets (and parsed external entities defined there), for example to add caching, or to rewrite, replace or simply remove external DTD definitions. Often used for security purposes to prevent external reads altogether
- `IS_SUPPORTING_EXTERNAL_ENTITIES`: if DTD processing is enabled (see `SUPPORT_DTD`), external entities (references to resources outside of the XML document or the DTD subset itself) are recognized and processed. This setting controls whether they are also expanded. Disabling it is typically done for security reasons: if XML content comes from untrusted sources, enabling expansion is not a good idea.
  If disabled, entities are only reported as entity references; if enabled, entities are expanded as per the XML specification and reported as XML tokens.
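The security-related settings above are often applied together. The sketch below is one possible hardening setup for untrusted input, using only standard Stax API; the class and method names are hypothetical, and returning an empty stream from the resolver is my own choice for blocking external reads rather than anything the Stax specification prescribes.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, used only for this illustration.
final class HardenedInputSketch {

    static XMLInputFactory hardenedFactory() {
        XMLInputFactory f = XMLInputFactory.newFactory();
        // Skip DTD subsets entirely and never expand external entities.
        f.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        f.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        // Belt and braces: any external lookup that still happens gets empty content.
        XMLResolver blockAll = (publicId, systemId, baseUri, namespace) ->
                new ByteArrayInputStream(new byte[0]);
        f.setProperty(XMLInputFactory.RESOLVER, blockAll);
        return f;
    }

    // Returns the text content of the root element (which must contain only text).
    static String rootText(String xml) {
        try {
            XMLStreamReader r = hardenedFactory().createXMLStreamReader(new StringReader(xml));
            try {
                while (r.next() != XMLStreamConstants.START_ELEMENT) {
                    // skip anything in the prolog
                }
                return r.getElementText();
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText("<root>ok</root>"));
    }
}
```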
`XMLOutputFactory` has only one configuration setting:
- `IS_REPAIRING_NAMESPACES`: this would more properly be called “automatic namespaces”; enabling it removes the need to declare namespace bindings before use. If enabled, passing namespace prefixes is optional, and all namespace declarations are written automatically by `XMLStreamWriter`. If disabled, the caller must explicitly write namespace declarations. Note that in the latter case the output may not be namespace-compliant if namespace declarations (bindings) are missing or misplaced.
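To make the repairing mode more tangible, here is a small sketch using only standard Stax API. The class name and the `urn:example` namespace are made up for illustration. The writer is never told to emit a namespace declaration, yet the output should contain one.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical class name, used only for this illustration.
final class RepairingNamespacesSketch {

    static String writeRoot() {
        XMLOutputFactory f = XMLOutputFactory.newFactory();
        f.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = f.createXMLStreamWriter(out);
            // Note: no writeNamespace() call; the writer adds the declaration itself.
            w.writeStartElement("p", "root", "urn:example");
            w.writeCharacters("text");
            w.writeEndElement();
            w.flush();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write output", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeRoot());
    }
}
```

With the property left at its default (`false`), the same calls would produce an element that uses prefix `p` without ever declaring it.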
As mentioned earlier, the standard Stax way of configuring anything is through factories, using the `setProperty(name, value)` method. This applies to Stax2 as well.
But there is also another mechanism for applying “profiles”: groups of settings that set configuration defaults to optimize a specific aspect. These methods are named `configureFor[Goal]`, for example `configureForSpeed`.
`XMLInputFactory2` has the following profile-configuration methods:
- `configureForConvenience`: enables features that should simplify handling: enables coalescing, reports all text segments as `CHARACTERS` (and not `CDATA`), enables `P_PRESERVE_LOCATION`
- `configureForLowMemUsage`: tries to reduce the amount of memory retained during processing by disabling coalescing (which allows the parser to report smaller segments) and disabling `P_PRESERVE_LOCATION`
- `configureForRoundTripping`: tries to preserve event information as much as possible, so that direct writes would not alter physical aspects of the XML: disables coalescing, preserves the distinction between `CHARACTERS` and `CDATA`, disables automatic entity expansion (so entities may be written out)
- `configureForSpeed`: tries to minimize the performance overhead of options: disables coalescing, disables `P_PRESERVE_LOCATION`; enables `intern()`ing of both element/attribute names and namespace URIs
- `configureForXmlConformance`: enables features required to conform to the XML 1.x specification: namespaces, DTD processing
`XMLOutputFactory2` has the following profile-configuration methods:
- `configureForRobustness`: enables both validation and repairing options to try to ensure that output is valid, even if changes are needed (for example, in rare cases comment contents may need to be split if the caller tries to output a sequence of two hyphens; or, for CDATA, two `]` characters followed by `>`)
- `configureForXmlConformance`: enables all validation options to try to prevent any potential well-formedness problems (for example with namespace bindings), but not all repairing options
- `configureForSpeed`: optimizes for output performance: disables validation operations that require scanning over contents; in a way the opposite of the conformance/robustness profiles.
Use of a profile sets values for multiple properties (sometimes both plain Stax and Stax2 properties), but it is always possible to also set individual properties directly. Let’s have a look at which Stax2-extension properties exist and are supported by Woodstox. Note: most are `Boolean`-valued; I only mention the type if it is something other than `Boolean`.
`XMLInputFactory2` specifies the following Stax2 properties (along with the default values Woodstox uses):
- `P_AUTO_CLOSE_INPUT` (default: `false`): if enabled, `XMLStreamReader` will automatically close the underlying input source when the reader is closed; if disabled, it will not. The Stax 1.0 specification mandates that the default behavior is “disabled”, often leading to unintended “dangling” input streams.
- `P_DTD_OVERRIDE` (default: `null`, value type `DTDValidationSchema`): property that may be set if a specific DTD instance is to be used instead of what the `DOCTYPE` declaration specifies (if anything).
  NOTE: reading a `DTDValidationSchema` is worth its own article, but basically the entry point is `XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_DTD)`
- `P_INTERN_NAMES` (default: `true`): whether element and attribute names (the “local name” part) are `String.intern()`‘ed before being returned. Usually doing so saves memory and helps speed, but occasionally it may be necessary to disable this feature if the number of distinct names is unbounded: for example, if names are randomly generated (like UUIDs)
- `P_INTERN_NS_URIS` (default: `true`): similar to the above, but applies to namespace URIs.
- `P_LAZY_PARSING` (default: `true`): controls whether parsing is “lazy” or “eager”. “Eager” means that each event is completely parsed when `XMLStreamReader.next()` is called; “lazy” means that only a small part is parsed at that point, and the rest is parsed only if and as needed. Benefits of lazy parsing include much faster skipping of unneeded content (especially textual content, comments and processing instructions); a possible downside is that error reporting may sometimes occur later than expected (during actual content access or skipping, that is, when calling `next()` for the following event).
- `P_PRESERVE_LOCATION` (default: `true`): controls whether `XMLStreamLocation` information is included in `XMLEvent` instances. Disabling this feature reduces memory usage and improves processing speed modestly, but only when using the “Event API” (`XMLEventReader`).
- `P_REPORT_CDATA` (default: `true`): whether XML `CDATA` sections are reported as `CDATA` Stax events (`true`) or as general `CHARACTERS` (`false`)
- `P_REPORT_PROLOG_WHITESPACE` (default: `false`): when disabled (`false`), white-space outside the XML root element is skipped and not reported; only possible `COMMENT`s and `PROCESSING_INSTRUCTION`s are reported. If enabled, additional `SPACE` events are reported. This is mostly (only) useful when trying to fully replicate document indentation outside of the root element
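Since these properties are extensions, code that may also run against a plain Stax implementation can check for support before setting them. The following is a minimal sketch of that idea using the standard `isPropertySupported()` method. The class and helper names are hypothetical, and the string literal is what I assume `XMLInputFactory2.P_AUTO_CLOSE_INPUT` resolves to; when the Stax2 API is on the compile classpath, referencing the constant itself is the safer choice.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical class name, used only for this illustration.
final class OptionalPropertySketch {

    // Assumption: literal value of XMLInputFactory2.P_AUTO_CLOSE_INPUT.
    static final String AUTO_CLOSE_INPUT = "org.codehaus.stax2.closeInputSource";

    // Sets the property only if the factory recognizes it; reports whether it was applied.
    static boolean setIfSupported(XMLInputFactory factory, String name, Object value) {
        if (!factory.isPropertySupported(name)) {
            return false;
        }
        factory.setProperty(name, value);
        return true;
    }

    public static void main(String[] args) {
        XMLInputFactory f = XMLInputFactory.newFactory();
        // A standard property: every compliant implementation should accept it.
        System.out.println("coalescing applied: "
                + setIfSupported(f, XMLInputFactory.IS_COALESCING, Boolean.TRUE));
        // An extension property: applied only by Stax2-compliant implementations.
        System.out.println("auto-close applied: "
                + setIfSupported(f, AUTO_CLOSE_INPUT, Boolean.TRUE));
    }
}
```

Calling `setProperty()` directly with an unrecognized name would typically fail with an `IllegalArgumentException`, which is why the check comes first.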
`XMLOutputFactory2` specifies the following Stax2 properties:
- `P_ATTR_VALUE_ESCAPER` (default: `null`, value type `EscapingWriterFactory`): by default, the default escaping rules are used for attribute values, which means minimal escaping. It is possible to fully customize escaping details, however. The value to assign has to be of type `EscapingWriterFactory`, which contains 2 methods for constructing the `Writer` used for output. Typically used to extend the set of characters that are to be escaped, although it may also be used for advanced cases such as filtering or even replacing specific contents of attribute values: for example, it could be used to obfuscate certain types of ids (credit-card numbers, SSN).
- `P_TEXT_ESCAPER` (default: `null`, value type `EscapingWriterFactory`): similar to `P_ATTR_VALUE_ESCAPER` but used for textual segments (“character data”, NOT including `CDATA` segments as they do not allow escaping). Similarly used either for changing escaping details, or for more advanced filtering/modifying of textual content to output.
- `P_AUTO_CLOSE_OUTPUT` (default: `false`): similar to `P_AUTO_CLOSE_INPUT`, determines whether the underlying `OutputStream` or `Writer` is automatically closed when the `XMLStreamWriter` is closed. The default is `false` because the Stax 1.0 specification mandates this behavior.
- `P_AUTOMATIC_EMPTY_ELEMENTS` (default: `true`): when a sequence of `START_ELEMENT` and `END_ELEMENT` is output (with possible attributes in-between, but no child elements or textual content), it is possible to output either a so-called empty element (like `<element />`) or a fully written-out pair (`<element></element>`). If set to `true`, an empty element is written; if `false`, separate start/end tags are written.
- `P_AUTOMATIC_NS_PREFIX` (default: `"wstxns"`): when using the “repairing” writer mode, in which namespace URIs are automatically bound, namespace prefixes are generated using this String as the beginning, followed by a sequence number to keep prefixes unique.
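The effect of `P_AUTOMATIC_EMPTY_ELEMENTS` is easy to observe with a start/end pair that has nothing in between. The sketch below is hedged in two ways: the class name is hypothetical, and the string literal is what I assume `XMLOutputFactory2.P_AUTOMATIC_EMPTY_ELEMENTS` resolves to. The property is applied only if the factory reports support for it, so the code also runs on a plain Stax implementation.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical class name, used only for this illustration.
final class EmptyElementSketch {

    // Assumption: literal value of XMLOutputFactory2.P_AUTOMATIC_EMPTY_ELEMENTS.
    static final String AUTOMATIC_EMPTY_ELEMENTS = "org.codehaus.stax2.automaticEmptyElements";

    // Writes a start/end pair with no content and returns the serialized result.
    static String writeStartEndPair(boolean preferEmptyElement) {
        XMLOutputFactory f = XMLOutputFactory.newFactory();
        if (f.isPropertySupported(AUTOMATIC_EMPTY_ELEMENTS)) {
            f.setProperty(AUTOMATIC_EMPTY_ELEMENTS, Boolean.valueOf(preferEmptyElement));
        }
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = f.createXMLStreamWriter(out);
            w.writeStartElement("element");
            w.writeEndElement();
            w.flush();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write output", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeStartEndPair(true));
        System.out.println(writeStartEndPair(false));
    }
}
```

On an implementation that does not know the property, both calls simply produce that implementation's default form.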
Woodstox-specific property names are defined in 2 classes:
- `WstxInputProperties` (to use via `XMLInputFactory`)
- `WstxOutputProperties` (to use via `XMLOutputFactory`)

As with all properties, configuration is done using the methods `XMLInputFactory.setProperty()` and `XMLOutputFactory.setProperty()`.
The first set of input-side properties relates to the handling of DTDs. They are undefined by default, but can be set to custom handlers to change the default handling of DTD subsets and the expansion of entities defined within them.
- `P_DTD_RESOLVER` (of type `XMLResolver`, default: `null`): allows defining a handler to replace reading of the external DTD subset, in order to redirect reading and/or add specific caching.
  NOTE! Will override the standard `XMLInputFactory#RESOLVER` setting (for this purpose) if defined
- `P_ENTITY_RESOLVER` (of type `XMLResolver`, default: `null`): similar to `P_DTD_RESOLVER`, but overrides resolution of External Parsed Entities defined within a DTD subset (internal or external).
  NOTE! Will override the standard `XMLInputFactory#RESOLVER` setting (for this purpose) if defined
- `P_UNDECLARED_ENTITY_RESOLVER` (of type `XMLResolver`, default: `null`): similar to `P_ENTITY_RESOLVER`, but used in cases where an entity has not been defined: allows graceful handling of a situation that would otherwise result in an exception. A common implementation would simply provide either “empty” contents (to effectively remove the entity), or add a marker to indicate an error for further processing.
Another set of properties, added in Woodstox 4.2, allows specifying maximum limits for certain input constructs. These are typically used to protect against possible Denial-of-Service (DoS) attacks, wherein XML-based web services are attacked with specifically crafted documents that cause processing problems through excessive memory or CPU usage.
If one of the limits is exceeded during parsing, an `XMLStreamException` will be thrown (in future it might be nice to have a sub-type to allow catching this specific case; for now there is no separate exception type).
All settings have reasonable default values for normal usage (including some that are “unlimited”), but they may be made stricter (if specific attacks are observed or the system has lower resource allocation) or looser (if input documents can legitimately exceed one or more of the default limits).
- `P_MAX_ATTRIBUTE_SIZE` (default: 512,000 characters): specifies the maximum allowed length of attribute values, in characters
- `P_MAX_ATTRIBUTES_PER_ELEMENT` (default: 1000): specifies the maximum number of distinct attributes allowed for any single XML element.
- `P_MAX_CHARACTERS` (default: unlimited): specifies the maximum total length of the input XML document, in characters. The check is invoked regularly when reading input blocks and is not exact to the character: for example, if you define the limit as 5000 characters, a limit violation may be reported after 5400 characters (based on buffer boundaries), but never with LESS than the limit. So it guarantees that you will be able to process documents up to and including the limit.
  NOTE: this refers to raw input size and NOT input size after possible entity expansion (there is no limit for the latter at this point)
- `P_MAX_CHILDREN_PER_ELEMENT` (default: unlimited): similar to `P_MAX_ATTRIBUTES_PER_ELEMENT` but limits the number of child elements within any given element.
- `P_MAX_ELEMENT_COUNT` (default: unlimited): specifies the maximum total number of elements within a single XML document
- `P_MAX_ELEMENT_DEPTH` (default: 1000): specifies the maximum nesting level of elements: that is, the number of elements that may be nested at any given point. This is one of the settings that some users have had to increase for legitimate documents.
- `P_MAX_ENTITY_COUNT` (default: 100,000): specifies the maximum total number of entity expansions allowed per XML document (nested and non-nested)
- `P_MAX_ENTITY_DEPTH` (default: 500): specifies the maximum nesting level of entity expansion (distinct from the total number).
- `P_MAX_TEXT_LENGTH` (default: unlimited): specifies the maximum contiguous length of any character data segment (either a regular text segment or a `CDATA` section). Handling varies between coalescing mode (in which all adjacent textual segments are combined) and non-coalescing mode; in the latter case the limit is per-segment
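As a rough sketch of how such a limit might be applied and how a violation surfaces, the example below tightens the element depth limit and treats any `XMLStreamException` as a rejection. The class and method names are hypothetical, and the string literal is what I assume `WstxInputProperties.P_MAX_ELEMENT_DEPTH` resolves to. The limit is applied only when the factory reports support for it, so on a non-Woodstox parser the code still runs but enforces nothing.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, used only for this illustration.
final class InputLimitSketch {

    // Assumption: literal value of WstxInputProperties.P_MAX_ELEMENT_DEPTH.
    static final String MAX_ELEMENT_DEPTH = "com.ctc.wstx.maxElementDepth";

    // Counts elements in the document; returns -1 if parsing fails,
    // which includes the case where a configured limit is exceeded.
    static int countElements(String xml, int maxDepth) {
        XMLInputFactory f = XMLInputFactory.newFactory();
        if (f.isPropertySupported(MAX_ELEMENT_DEPTH)) {
            f.setProperty(MAX_ELEMENT_DEPTH, Integer.valueOf(maxDepth));
        }
        int count = 0;
        try {
            XMLStreamReader r = f.createXMLStreamReader(new StringReader(xml));
            try {
                while (r.hasNext()) {
                    if (r.next() == XMLStreamConstants.START_ELEMENT) {
                        count++;
                    }
                }
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            // No dedicated sub-type exists, so limit violations and other
            // parse errors cannot be told apart by exception type here.
            return -1;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countElements("<a><b/><c/></a>", 10));
    }
}
```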
And then there are many other input properties that may be configured:
- `P_BASE_URL` (default: `null`): optional context to use when resolving external DTD subsets and parsed entities, in cases where they use relative references. May be useful to define as a file path, or an external http URL.
- `P_CACHE_DTDS` (default: `true`): allows disabling DTD caching, which is enabled by default. The most likely use case is one where the caller has separate external caching of DTDs, or where the public/system id used as the key is not unique (and cache collisions would occur)
- `P_CACHE_DTDS_BY_PUBLIC_ID` (default: `true`): specifies which identifier is used for caching (if enabled as per `P_CACHE_DTDS`): if `true`, the Public id is used; if `false`, the System id
- `P_INPUT_BUFFER_LENGTH` (default: 4000): specifies the length of the internal read buffer, in characters.
- `P_INPUT_PARSING_MODE` (default: `WstxInputProperties.PARSING_MODE_DOCUMENT`): allows changing the operating mode of the parser for non-conforming XML content. The default mode requires input of exactly one well-formed XML document (which means one and only one root element). The alternatives are `PARSING_MODE_FRAGMENT` and `PARSING_MODE_DOCUMENTS`, both of which allow zero or more root elements; “documents” mode further allows inclusion of full documents with their separate xml and DOCTYPE declarations. “Fragments” mode is most often used to read a subset of a full document, whereas “Documents” mode is used to read a stream that contains individual well-formed documents.
- `P_MIN_TEXT_SEGMENT` (default: 64): when NOT using “coalescing” mode, this setting defines the smallest partial text segment (that is, part of one contiguous text segment) that may be reported. The intention is to reduce the likelihood of returning tiny segments while allowing the parser to avoid having to buffer longer segments completely.
  Setting this value to `Integer.MAX_VALUE` will effectively prevent splitting of individual segments without forcing coalescing of adjacent segments, so that is one common override
- `P_NORMALIZE_LFS` (default: `true`): the XML specification requires parsers to convert “alt linefeeds” (that is, `\r\n` and `\r`) into the canonical linefeed (`\n`), but disabling this property exposes the actual linefeeds without normalization
- `P_RETURN_NULL_FOR_DEFAULT_NAMESPACE` (default: `false`): whether the so-called unbound default namespace (the one that non-prefixed attributes have, and non-prefixed elements when there is no explicit binding for the default namespace) is reported as `null` (enabled) or as an empty String (disabled).
- `P_TREAT_CHAR_REFS_AS_ENTS` (default: `false`): normally character references (like `&`) are simply expanded and reported as part of character data; but if this property is set to `true` they will instead be reported as `ENTITY` tokens. This may occasionally be useful when trying to fully reproduce the input representation of an XML document, including the choice of escaping of special characters.
  NOTE: this only works for textual content. It cannot be supported for attribute values (as there are no separate tokens; attributes are accessible via the `START_ELEMENT` token only)
- `P_VALIDATE_TEXT_CHARS` (default: `true`): the XML specification requires verification that character segments only contain characters legal for the XML specification (version 1.0 or 1.1, depending on the xml declaration), and reporting an error (by throwing an exception) if an illegal character (such as most of the control codes in the 0x00–0x1F range) is encountered. By disabling this property it is possible to prevent this validation, either to allow inclusion of otherwise illegal characters (that is, processing of non-wellformed “xml” content), or to achieve a minor performance gain (validation adds some processing cost, though typically not enough to really matter)
  NOTE! As of now (March 2018) this feature is NOT YET implemented; see this issue for details.
- `P_XML10_ALLOW_ALL_ESCAPED_CHARS` (to be added in Woodstox 5.2): will allow decoding of those XML 1.1 character entities (control characters in the ASCII range of 0x00–0x1F) that are not otherwise valid in XML 1.0
On the output side, a large group of settings relates to (optional) verification of the well-formedness of content, along with some related settings that work around problems that would occur if output were written exactly as the calls imply (by writing it in a modified form instead).
- `P_OUTPUT_FIX_CONTENT` (default: `false`): setting that allows “fixing” of some cases where output as requested would fail, but where there is an alternative way. One example is outputting the sequence `]]>` in a `CDATA` segment: this is NOT legal output, so a call to `writeCData()` with an argument like “xxx ]]> “ would produce illegal output. But if the writer is allowed to fix things, it will instead split the characters into 2 separate `CDATA` sections. Similarly, `COMMENT`s cannot contain a double hyphen (`--`) and `PROCESSING_INSTRUCTION`s cannot contain `?>`. Splitting content into two tokens will produce well-formed (if not pretty) output in these cases.
- `P_OUTPUT_INVALID_CHAR_HANDLER` (default: undefined; type `InvalidCharHandler`): in case textual content would contain a character that is NOT legal in XML, a custom handler may be installed to convert the invalid `char` into a valid one (via the `convertInvalidChar()` method). Most often this would produce something like a simple space, or perhaps a question mark. Without a custom handler, the default behavior is to throw an `XMLStreamException` to indicate the problem.
- `P_OUTPUT_VALIDATE_ATTR` (default: `false`): if enabled, will verify that no duplicate attributes are output. This requires keeping track of the attributes output and adds some memory and processing overhead.
- `P_OUTPUT_VALIDATE_CONTENT` (default: `true`): if enabled, will verify well-formedness of textual content with respect to the problems listed under `P_OUTPUT_FIX_CONTENT` and report problems (as `XMLStreamException`s) if an attempt is made to produce such output (and output fixing is not enabled)
- `P_OUTPUT_VALIDATE_NAMES` (default: `false`): if enabled, will verify that names of elements and attributes only contain characters legal for XML names. Since there is no escape mechanism for names, inclusion of illegal characters can only be reported by throwing an `XMLStreamException`.
- `P_OUTPUT_VALIDATE_STRUCTURE` (default: `true`): if enabled, will verify structural well-formedness of output (that all start/end elements match; that there is only one root element). These checks do not impose measurable processing overhead since all bookkeeping (with respect to start/end elements) has to be done regardless of whether problems are reported or not.
  The main purpose for disabling this validation is the case where output is not meant to be a single document, but either a fragment or a sequence of documents.
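The `CDATA` splitting described under `P_OUTPUT_FIX_CONTENT` can be illustrated in plain Java. This is a conceptual sketch of the well-known splitting trick, not Woodstox's actual implementation, and the class and method names are made up. Each `]]>` in the content is broken up so that `]]` ends one section and `>` starts the next.

```java
// Hypothetical class name, used only for this illustration.
final class CDataSplitSketch {

    // Wraps content in CDATA markup, splitting sections wherever "]]>" occurs
    // so that the illegal sequence never appears inside a single section.
    static String toCDataSections(String content) {
        String safe = content.replace("]]>", "]]]]><![CDATA[>");
        return "<![CDATA[" + safe + "]]>";
    }

    public static void main(String[] args) {
        // prints <![CDATA[xxx ]]]]><![CDATA[> yyy]]>
        System.out.println(toCDataSections("xxx ]]> yyy"));
    }
}
```

A parser reading the result sees two adjacent `CDATA` sections, “xxx ]]” and “> yyy”, whose concatenation is the original text.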
- `P_ADD_SPACE_AFTER_EMPTY_ELEM` (default: `false`): setting that determines whether an empty element is output as `<elem />` (with space) or `<elem/>` (without).
- `P_AUTOMATIC_END_ELEMENTS` (default: `true`): setting that determines what happens when an `END_ELEMENT` is written immediately after a `START_ELEMENT` (excluding possible attribute writes in-between): if enabled, output will use “empty element” notation like `<elem/>`; otherwise start and end elements are written separately (`<elem></elem>`).
  NOTE: has no effect on an explicit call to `writeEmptyElement()`; only affects the case of a separate `writeStartElement()`/`writeEndElement()` pair
- `P_OUTPUT_CDATA_AS_TEXT` (default: `false`): property that may be enabled to “convert” calls to `writeCData()` so that they produce “regular” text segments. If disabled, such calls produce a `CDATA` segment.
- `P_OUTPUT_EMPTY_ELEMENT_HANDLER` (default: undefined, type `EmptyElementHandler`): alternative to `P_AUTOMATIC_EMPTY_ELEMENTS`; an optional handler that may be registered to fine-tune the decision on whether to output an empty element when it is possible to do so.
  The main use case for this was to allow following (X)HTML rules, wherein some tags do allow the empty form (like `<br/>`) and others do not. There is a default `HtmlEmptyElementHandler` implementation that may be of use for this purpose.
- `P_OUTPUT_ESCAPE_CR` (default: `true`): setting that determines whether output of a `\r` character will result in the matching XML entity (when enabled) or simply the character itself. The main distinction here is that embedded `\r` characters will be normalized during parsing (unless prevented by `P_NORMALIZE_LFS`, discussed earlier), whereas results of entity expansion will be exposed as-is (never normalized).
- `P_USE_DOUBLE_QUOTES_IN_XML_DECL` (default: `false`): if enabled, the xml declaration will use double quotes for version and encoding; if disabled, apostrophes (single quotes).
- `P_COPY_DEFAULT_ATTRS` (default: `false`): when using the `XMLStreamWriter2` method for copying events from input to output, whether to explicitly write out values of default attributes (values that come from the DTD definition and NOT the input document)
- `P_OUTPUT_UNDERLYING_STREAM` (read-only: can NOT be set, accessed via `getProperty()`): used to get the underlying `OutputStream` that is used for output (unless a `Writer` was passed; see the next property)
- `P_OUTPUT_UNDERLYING_WRITER` (read-only: can NOT be set, accessed via `getProperty()`): used to get the underlying `Writer` that is used for output (unless an `OutputStream` was passed; see the previous property)