September 1, 2014

Input Sanitization: Invalid XML Data, Validation

How to sanitize invalid XML 1.0 characters using a precompiled regex in Java, and how to fix validation failures caused by curl request escaping.

TL;DR

Strip invalid characters with a regex: XML 1.0 allows only a narrow set of Unicode code points. A precompiled Pattern matching everything outside those ranges lets you remove or detect offending characters before passing data to an XML parser.
Unescape curl payloads before validation: when a payload is passed to curl in single quotes, the shell leaves sequences such as \n or \uXXXX as literal text, so the server receives a backslash followed by letters rather than the character itself. Calling StringEscapeUtils.unescapeJava() on the raw payload decodes them, making the string comparable to what your tests produce in-process. Caveat: only do this when the sender is a known tool (e.g. a test harness) that produces Java-encoded strings. Do not apply it to arbitrary production payloads that may contain literal backslashes (e.g. Windows paths), where unescapeJava would corrupt the data.
Migrate from commons-lang3 to commons-text: StringEscapeUtils has been marked @Deprecated in commons-lang3 since version 3.6. The replacement is org.apache.commons.text.StringEscapeUtils (Apache Commons Text), which is API-compatible, so only the dependency and the import need to change.

What You'll Learn

The exact Unicode ranges that XML 1.0 considers valid, and which control characters are excluded
How to build a reusable Java method that identifies the first invalid character in a string
Why curl-submitted payloads behave differently from in-process test data, and how to fix it
How to migrate from commons-lang3's deprecated StringEscapeUtils to commons-text
How XML 1.1 relaxes (but does not remove) character restrictions compared to XML 1.0

The Problem

When you build a pipeline that accepts JSON, converts it to XML, and then validates or stores that XML, you run into a subtle compatibility problem: JSON is largely character-agnostic, while XML 1.0 is not.

The XML 1.0 specification defines a strict set of permitted characters. Anything outside that set (including most C0 control characters) makes the document not well-formed. A JSON payload that contains, say, a form feed (\f, U+000C) or a vertical tab (written \u000B in JSON, since neither JSON nor Java has a \v escape) will deserialise cleanly in Java but produce a document that no conformant XML parser will accept.

The situation gets worse when clients send data through curl. curl receives its arguments from the shell, which applies its own quoting and escaping rules first. A field value typed as hello\nworld on the command line may reach your endpoint as the two-character sequence backslash-n or as a real newline, depending on how it is quoted. If your in-process unit tests create strings in Java source, the compiler resolves escape sequences at compile time, so the test passes while the curl case fails.

Quick Answer

Compile a pattern once that matches any character outside the XML 1.0 legal ranges, then use it to scan (or clean) incoming strings:

// The final range is U+10000-U+10FFFF, written as surrogate pairs.
private static final Pattern XML10_INVALID = Pattern.compile(
    "[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF]"
);

For a quick strip: XML10_INVALID.matcher(input).replaceAll("").

For detection (to surface validation errors to callers): return the first match rather than silently deleting it.

Sanitizing Invalid XML Characters

The XML 1.0 Character Grammar

The XML 1.0 specification (§2.2) defines the production:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

In plain terms:

Code point(s)	Description
U+0009	Horizontal tab
U+000A	Line feed
U+000D	Carriage return
U+0020–U+D7FF	Latin, Greek, CJK and most BMP scripts
U+E000–U+FFFD	BMP private-use area (surrogate range U+D800–U+DFFF excluded)
U+10000–U+10FFFF	Supplementary planes

Everything else is forbidden: U+0000 through U+0008, U+000B (vertical tab), U+000C (form feed), U+000E through U+001F, the surrogate code points U+D800 through U+DFFF, and U+FFFE/U+FFFF.
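The Char production above can also be checked one code point at a time, without a regex. The following is a minimal sketch under that reading of the grammar; the class name Xml10Chars and both method names are hypothetical and do not come from any library.

```java
final class Xml10Chars {

    private Xml10Chars() {}

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the UTF-16 index of the first illegal code point, or -1 if there is none. */
    static int indexOfFirstInvalid(String s) {
        for (int i = 0; i < s.length(); ) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate is returned as-is and fails the check.
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfFirstInvalid("ok\ttext"));   // -1
        System.out.println(indexOfFirstInvalid("bad\fchar"));  // 3
    }
}
```

Iterating by code point rather than by char is what keeps supplementary-plane characters from being mistaken for stray surrogates.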

Precompiled Pattern

Compiling the pattern on every call is wasteful. Declare it as a class-level constant:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlSanitizer {

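    /**
     * Matches any character that is NOT legal in an XML 1.0 document.
     * Based on the XML 1.0 Char production: https://www.w3.org/TR/xml/#charsets
     * The final range is U+10000-U+10FFFF, written as surrogate pairs so that
     * supplementary-plane characters (emoji, for example) are not flagged.
     */
    private static final Pattern XML10_INVALID = Pattern.compile(
        "[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF]"
    );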

    /**
     * Returns the first invalid XML 1.0 character found in the input,
     * or null if the input is clean.
     */
    public static String firstInvalidXmlCharacter(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = XML10_INVALID.matcher(input);
        if (m.find()) {
            return m.group();
        }
        return null;
    }

    /**
     * Strips all invalid XML 1.0 characters from the input.
     */
    public static String stripInvalidXmlCharacters(String input) {
        if (input == null) {
            return null;
        }
        return XML10_INVALID.matcher(input).replaceAll(""); // removes every character the pattern flags
    }
}

Use firstInvalidXmlCharacter when you want to report a validation error to the caller with context. Use stripInvalidXmlCharacters when you own the data pipeline and silent cleanup is acceptable.
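When the goal is to report an error with context, the position and code point of the offending character are usually more useful to a caller than the raw character, which is often unprintable. Here is one possible sketch; the class XmlCharReport, the method describeFirstInvalid, and the message format are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlCharReport {

    // Negated class for the XML 1.0 Char production. The last range is
    // U+10000-U+10FFFF, written as surrogate pairs.
    private static final Pattern XML10_INVALID = Pattern.compile(
        "[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF]"
    );

    private XmlCharReport() {}

    /** Describes the first invalid character, or returns null if the input is clean. */
    static String describeFirstInvalid(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = XML10_INVALID.matcher(input);
        if (!m.find()) {
            return null;
        }
        int cp = input.codePointAt(m.start());
        return String.format("U+%04X at index %d", cp, m.start());
    }

    public static void main(String[] args) {
        System.out.println(describeFirstInvalid("bad\fchar"));          // U+000C at index 3
        System.out.println(describeFirstInvalid("smile \uD83D\uDE00")); // null
    }
}
```

The index is a UTF-16 offset into the Java string, so it may differ from a byte offset in the original request body.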

Unit Testing the Sanitizer

Always test with known bad inputs so regressions are caught immediately:

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

class XmlSanitizerTest {

    @Test
    void nullInputReturnsNull() {
        assertNull(XmlSanitizer.firstInvalidXmlCharacter(null));
    }

    @Test
    void validAsciiPassesThrough() {
        assertNull(XmlSanitizer.firstInvalidXmlCharacter("Hello, world!"));
    }

    @Test
    void tabNewlineCarriageReturnAreValid() {
        assertNull(XmlSanitizer.firstInvalidXmlCharacter("line1\nline2\r\n\ttabbed"));
    }

    @Test
    void formFeedIsInvalid() {
        // U+000C (form feed) is forbidden in XML 1.0
        assertNotNull(XmlSanitizer.firstInvalidXmlCharacter("bad\u000Cchar"));
    }

    @Test
    void stripRemovesInvalidCharacters() {
        String cleaned = XmlSanitizer.stripInvalidXmlCharacters("bad\u000Cchar");
        assertEquals("badchar", cleaned);
    }
}

Validate the cleaned string against a real XML parser (e.g. javax.xml.parsers.DocumentBuilder) in at least one integration test, not just against the regex, to catch any edge cases the pattern misses.
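A parser-backed check along these lines could look like the sketch below. It wraps the text in a single element and hands it to the JDK's DocumentBuilder; the class XmlParserCheck, the method parsesAsXml, and the wrapper element name v are hypothetical choices for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class XmlParserCheck {

    private XmlParserCheck() {}

    /** Returns true if the text, wrapped in one element, parses as an XML 1.0 document. */
    static boolean parsesAsXml(String text) {
        // Escape markup characters so that only character legality is being tested.
        String escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><v>" + escaped + "</v>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsXml("badchar"));    // true
        System.out.println(parsesAsXml("bad\fchar"));  // false
    }
}
```

In a JUnit integration test, the same helper can be called on the output of stripInvalidXmlCharacters and asserted to return true.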

Handling Curl Request Escaping

Why curl Behaves Differently

When you test a REST endpoint from Java (a main method or a JUnit test), string literals are compiled: "hello\nworld" becomes an eleven-character string with a real newline at index 5.

When the same endpoint is exercised via curl, the story depends on the shell and quoting style:

# Single quotes: the shell does NOT interpret \n, so the server receives a literal backslash-n
curl -X POST -d 'hello\nworld' http://localhost:8080/api/xml

# $'...' quoting (bash, zsh): the shell DOES expand \n to a real newline
curl -X POST -d $'hello\nworld' http://localhost:8080/api/xml

If your XML validation runs directly on the raw request body string, a literal \n (two characters: \ and n) is valid XML: it is just a backslash followed by the letter n. The same is true of a literal \u000C, which is six individually legal characters. But the intent was a newline or a form feed, and if a downstream step (a JSON parser, for instance) decodes the escape, the validation result no longer matches what the data actually becomes.
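The gap between the escaped text and the decoded character can be reproduced in plain Java. The sketch below compares both forms and checks each against the XML 1.0 ranges; EscapeLayers and isXml10Clean are hypothetical names used only for this illustration.

```java
final class EscapeLayers {

    private EscapeLayers() {}

    /** True if every code point in s is allowed by the XML 1.0 Char production. */
    static boolean isXml10Clean(String s) {
        return s.codePoints().allMatch(cp ->
            cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF));
    }

    public static void main(String[] args) {
        String literal = "hello\\nworld"; // what a single-quoted curl argument sends: backslash, then n
        String decoded = "hello\nworld";  // what a Java string literal compiles to: a real newline
        System.out.println(literal.length()); // 12
        System.out.println(decoded.length()); // 11

        String literalFormFeed = "bad\\fchar"; // backslash and f as plain text
        String decodedFormFeed = "bad\fchar";  // a real form feed (U+000C)
        System.out.println(isXml10Clean(literalFormFeed)); // true
        System.out.println(isXml10Clean(decodedFormFeed)); // false
    }
}
```

The escaped form passes the character check and the decoded form does not, which is why the order of decoding and validation matters.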

The practical fix, when the client is a known tool such as a test harness or internal script that deliberately sends Java-encoded strings, is to unescape the payload before validation using StringEscapeUtils.unescapeJava. For arbitrary production payloads (e.g. fields that may contain Windows file paths with literal backslashes), apply the regex directly to the raw string without unescaping, or the backslashes will be consumed and the data corrupted.

Using StringEscapeUtils: commons-lang3 vs commons-text

StringEscapeUtils was originally in commons-lang. It remained available in commons-lang3 but has been marked @Deprecated since version 3.6, so 3.6 and later flag its use at compile time, while earlier versions include the class without the annotation. The class now lives in the dedicated Apache Commons Text library, and that is where new code should import it from, whichever commons-lang3 version you are on.

Before (commons-lang3):

<!-- pom.xml: old location; StringEscapeUtils is @Deprecated from commons-lang3 3.6 onward -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.5</version>
</dependency>

import org.apache.commons.lang3.StringEscapeUtils;  // @Deprecated starting in 3.6

Current (commons-text):

<!-- pom.xml -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.12.0</version>
</dependency>

import org.apache.commons.text.StringEscapeUtils;   // current location

The method signatures are identical. Only the import changes.

Updated Validation Method

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.text.StringEscapeUtils;

public class XmlValidator {

    // The final range is U+10000-U+10FFFF, written as surrogate pairs.
    private static final Pattern XML10_INVALID = Pattern.compile(
        "[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF]"
    );

    /**
     * Validates that the payload contains no XML 1.0 invalid characters.
     * Unescapes Java escape sequences first so that curl-submitted payloads
     * are treated the same as in-process strings.
     *
     * <p><strong>Caveat:</strong> {@code unescapeJava} is appropriate only when
     * the client is known to send Java-escape-encoded strings — for example, a
     * test harness or an internal tool that deliberately encodes newlines as
     * {@code \n}. Do <em>not</em> apply it to arbitrary production payloads
     * containing literal backslashes (e.g. Windows file paths such as
     * {@code C:\Users\alice}): those backslashes will be consumed and the data
     * corrupted. For untrusted or mixed-origin payloads, validate the raw string
     * directly without unescaping.</p>
     *
     * @param rawPayload the raw string from the HTTP request body (assumed to be
     *                   Java-escape-encoded, e.g. from a curl test harness)
     * @return the first invalid character found, or null if the input is clean
     */
    public static String validateXmlPayload(String rawPayload) {
        if (rawPayload == null) {
            return null;
        }
        // Decode Java-style escape sequences that arrived as literal text
        // (for example from a single-quoted curl argument).
        // Only safe when the sender is known to produce Java-encoded strings;
        // see Javadoc caveat above before using this in production pipelines.
        String unescaped = StringEscapeUtils.unescapeJava(rawPayload);

        Matcher m = XML10_INVALID.matcher(unescaped);
        if (m.find()) {
            return m.group();
        }
        return null;
    }
}

Call validateXmlPayload at the point where you receive the HTTP request body, before any JSON-to-XML conversion takes place.

Frequently Asked Questions

Q: What characters are invalid in XML 1.0?

A: The XML 1.0 specification excludes all C0 control characters except the three whitespace characters it explicitly permits: tab (U+0009), line feed (U+000A), and carriage return (U+000D). That means null (U+0000), vertical tab (U+000B), form feed (U+000C), and the ranges U+0001–U+0008 and U+000E–U+001F are all forbidden, even as character references. At the high end, the surrogates (U+D800–U+DFFF), U+FFFE, and U+FFFF are also prohibited. DEL (U+007F) and the C1 control characters (U+0080–U+009F) fall inside the permitted U+0020–U+D7FF range, so they are legal in XML 1.0 text, although the specification discourages most of them.

Q: Does XML 1.1 have different rules?

A: Yes, but the practical benefit is narrow. XML 1.1 permits the C0 control characters U+0001–U+001F to appear in documents, but those that XML 1.0 forbids may only be written as character references (e.g. &#x0C; for form feed). It also moves DEL and the C1 range (U+007F–U+009F, except U+0085) from "allowed directly" to "must be escaped as character references." The critical restriction that U+0000 (NUL) is forbidden both directly and as a character reference is unchanged in XML 1.1, largely because NUL would break C-style string handling in many XML processing libraries. In practice, most systems stick with XML 1.0 because XML 1.1 support in parsers is inconsistent and the use cases for embedding control characters in XML are rare.
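For comparison with the XML 1.0 check, the XML 1.1 rules can be sketched as two predicates that follow the Char and RestrictedChar productions of the XML 1.1 specification. The class and method names here are hypothetical.

```java
final class Xml11Chars {

    private Xml11Chars() {}

    /** XML 1.1 Char: every code point except U+0000, the surrogates, U+FFFE and U+FFFF. */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** XML 1.1 RestrictedChar: legal only when written as a character reference. */
    static boolean isXml11Restricted(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || cp == 0xB || cp == 0xC
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    public static void main(String[] args) {
        System.out.println(isXml11Char(0xC));        // true
        System.out.println(isXml11Restricted(0xC));  // true
        System.out.println(isXml11Char(0x0));        // false
        System.out.println(isXml11Restricted(0x85)); // false
    }
}
```

A form feed is therefore representable in XML 1.1 only as a character reference, while NUL is not representable at all.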

Q: How do I test my sanitization logic?

A: Two layers of testing are important. First, unit-test the sanitizer with a table of known-bad code points: U+0000, U+000B, U+000C, U+001F, U+FFFE, U+FFFF, and an unpaired high surrogate such as U+D800. Assert that the sanitizer flags or strips each one and leaves valid characters untouched, including a supplementary-plane character such as U+1F600. Second, run an integration test that pipes sanitized output into a real XML parser (javax.xml.parsers.DocumentBuilderFactory or similar) and asserts that no SAXParseException is thrown. The regex validates the character set; the parser validates the document structure. Both checks are needed.

Key Takeaways

Precompile the pattern: The XML 1.0 invalid-character regex is constant; compile it once as a static final field to avoid per-call overhead and make the intent explicit.
Unescape before validating (when appropriate): curl payloads may contain Java-style escape sequences as literal text; calling StringEscapeUtils.unescapeJava() before your regex check ensures parity with in-process test data. This is safe only when the client is a known test harness or internal tool. For arbitrary production payloads with literal backslashes, skip the unescape step to avoid corrupting data.
Migrate to commons-text: org.apache.commons.lang3.StringEscapeUtils carries @Deprecated starting from version 3.6. Update your dependency to commons-text and change only the import. The API is unchanged.

What's Next?

Recommended Reading:

Invalid Characters in XML (Baeldung): thorough walkthrough with multiple Java approaches
Valid characters in XML (Wikipedia): concise reference table covering both XML 1.0 and 1.1

Action Items:

Search your codebase for usages of org.apache.commons.lang3.StringEscapeUtils and update the import to org.apache.commons.text.StringEscapeUtils, adding the commons-text dependency if it is not already present.
Add an integration test that passes a known-bad character (e.g. U+000C) through your full JSON-to-XML pipeline and asserts that the XML parser does not throw, confirming sanitization happens before the conversion step.

Resources & References

Written by the team at CloudCounsel — builders of Paywren and Mehfil.