TL;DR
- Strip invalid characters with a regex: XML 1.0 allows only a narrow set of Unicode code points. A precompiled Pattern matching everything outside those ranges lets you remove or detect offending characters before passing data to an XML parser.
- Unescape curl payloads before validation: payloads sent with curl often carry Java-style escape sequences (e.g. \n, \uXXXX) as literal text. Calling StringEscapeUtils.unescapeJava() on the raw payload normalises them, making the string comparable to what your tests produce in-process. Caveat: only do this when the sender is a known tool (e.g. a test harness) that produces Java-encoded strings. Do not apply it to arbitrary production payloads that may contain literal backslashes (e.g. Windows paths), where unescapeJava would corrupt the data.
- Migrate from commons-lang3 to commons-text: StringEscapeUtils has been deprecated in commons-lang3 since version 3.6 (the @Deprecated annotation appears in 3.6 and later). The replacement lives in org.apache.commons.text.StringEscapeUtils (Apache Commons Text), which is API-compatible, so only the import needs to change.
What You'll Learn
- The exact Unicode ranges that XML 1.0 considers valid, and which control characters are excluded
- How to build a reusable Java method that identifies the first invalid character in a string
- Why curl-submitted payloads behave differently from in-process test data, and how to fix it
- How to migrate from commons-lang3's deprecated StringEscapeUtils to commons-text
- How XML 1.1 relaxes (but does not remove) character restrictions compared to XML 1.0
The Problem
When you build a pipeline that accepts JSON, converts it to XML, and then validates or stores that XML, you run into a subtle compatibility problem: JSON is largely character-agnostic, while XML 1.0 is not.
The XML 1.0 specification defines a strict set of permitted characters. Anything outside that set (including most C0 control characters) makes the document structurally invalid. A JSON payload that contains, say, a form feed (\f, U+000C) or a vertical tab (U+000B, which has no short escape in JSON or Java and is written \u000B) will deserialise cleanly in Java but produce a document that no conformant XML parser will accept.
The situation gets worse when clients send data through curl. The shell processes the command-line arguments before curl sees them, which adds its own quoting and escaping layer. A field value typed as hello\nworld may arrive at your endpoint as the two-character sequence backslash-n rather than a real newline, or vice versa, depending on how the quoting is handled. In-process unit tests build their strings from Java source, where the compiler resolves escape sequences at compile time, so the test passes while the curl case fails.
Quick Answer
Compile a pattern once that matches any character outside the XML 1.0 legal ranges, then use it to scan (or clean) incoming strings:
// \x{10000}-\x{10FFFF} keeps supplementary characters (for example emoji) legal
private static final Pattern XML10_INVALID = Pattern.compile(
    "[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\\x{10000}-\\x{10FFFF}]"
);
For a quick strip: XML10_INVALID.matcher(input).replaceAll("").
For detection (to surface validation errors to callers): return the first match rather than silently deleting it.
Sanitizing Invalid XML Characters
The XML 1.0 Character Grammar
The XML 1.0 specification (§2.2) defines the production:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
In plain terms:
| Code point(s) | Description |
|---|---|
| U+0009 | Horizontal tab |
| U+000A | Line feed |
| U+000D | Carriage return |
| U+0020–U+D7FF | Latin, Greek, CJK and most BMP scripts |
| U+E000–U+FFFD | Private-use area and the remaining BMP blocks (surrogate range U+D800–U+DFFF excluded) |
| U+10000–U+10FFFF | Supplementary planes |
Everything else is forbidden: U+0000 through U+0008, U+000B (vertical tab), U+000C (form feed), U+000E through U+001F, the surrogate code points U+D800 through U+DFFF, and U+FFFE/U+FFFF.
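The Char production can also be checked code point by code point, which makes the grammar easy to compare against the table. The following is a minimal sketch rather than a definitive implementation; the class and method names (Xml10Chars, isXml10Char, indexOfFirstInvalid) are hypothetical.

```java
final class Xml10Chars {
    private Xml10Chars() {}

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the char index of the first illegal code point, or -1 if there is none. */
    static int indexOfFirstInvalid(String s) {
        for (int i = 0; i < s.length(); ) {
            // codePointAt joins valid surrogate pairs; a lone surrogate
            // comes back as its own value and is rejected above.
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Iterating by code point rather than by char is what keeps supplementary characters such as emoji valid while still rejecting unpaired surrogates.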
Precompiled Pattern
Compiling the pattern on every call is wasteful. Declare it as a class-level constant:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class XmlSanitizer {

    /**
     * Matches any character that is NOT legal in an XML 1.0 document.
     * Based on the XML 1.0 Char production: https://www.w3.org/TR/xml/#charsets
     * The \x{10000}-\x{10FFFF} range keeps supplementary characters (for
     * example emoji) legal; Java's regex engine matches them as whole code points.
     */
    private static final Pattern XML10_INVALID = Pattern.compile(
        "[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\\x{10000}-\\x{10FFFF}]"
    );

    /**
     * Returns the first invalid XML 1.0 character found in the input,
     * or null if the input is clean.
     */
    public static String firstInvalidXmlCharacter(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = XML10_INVALID.matcher(input);
        if (m.find()) {
            return m.group();
        }
        return null;
    }

    /**
     * Strips all invalid XML 1.0 characters from the input.
     */
    public static String stripInvalidXmlCharacters(String input) {
        if (input == null) {
            return null;
        }
        // Unpaired surrogates are removed; valid surrogate pairs are kept.
        return XML10_INVALID.matcher(input).replaceAll("");
    }
}
Use firstInvalidXmlCharacter when you want to report a validation error to the caller with context. Use stripInvalidXmlCharacters when you own the data pipeline and silent cleanup is acceptable.
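Reporting "with context" usually means telling the caller which code point was rejected and where. The sketch below is one way to do that, assuming a pattern that negates the full XML 1.0 Char production; the class name XmlCharReport, the method describeFirstInvalid, and the message format are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlCharReport {
    // Assumed pattern: everything outside the XML 1.0 Char production.
    private static final Pattern XML10_INVALID = Pattern.compile(
        "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    private XmlCharReport() {}

    /** Describes the first invalid character as "U+XXXX at index N", or returns empty if clean. */
    static Optional<String> describeFirstInvalid(String input) {
        if (input == null) {
            return Optional.empty();
        }
        Matcher m = XML10_INVALID.matcher(input);
        if (!m.find()) {
            return Optional.empty();
        }
        int codePoint = input.codePointAt(m.start());
        return Optional.of(String.format("U+%04X at index %d", codePoint, m.start()));
    }
}
```

A message in this form is safe to log or return in an error response, whereas echoing the raw control character back to the caller often is not.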
Unit Testing the Sanitizer
Always test with known bad inputs so regressions are caught immediately:
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;
class XmlSanitizerTest {

    @Test
    void nullInputReturnsNull() {
        assertNull(XmlSanitizer.firstInvalidXmlCharacter(null));
    }

    @Test
    void validAsciiPassesThrough() {
        assertNull(XmlSanitizer.firstInvalidXmlCharacter("Hello, world!"));
    }

    @Test
    void tabNewlineCarriageReturnAreValid() {
        assertNull(XmlSanitizer.firstInvalidXmlCharacter("line1\nline2\r\n\ttabbed"));
    }

    @Test
    void formFeedIsInvalid() {
        // U+000C (form feed) is forbidden in XML 1.0
        assertNotNull(XmlSanitizer.firstInvalidXmlCharacter("bad\u000Cchar"));
    }

    @Test
    void stripRemovesInvalidCharacters() {
        String cleaned = XmlSanitizer.stripInvalidXmlCharacters("bad\u000Cchar");
        assertEquals("badchar", cleaned);
    }
}
Validate the cleaned string against a real XML parser (e.g. javax.xml.parsers.DocumentBuilder) in at least one integration test, not just against the regex, to catch any edge cases the pattern misses.
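A parser-level check could look roughly like the following. This is a sketch in plain Java rather than JUnit so that it stands alone; the class name XmlParseCheck, the method parsesAsXml10, and the wrapping root element are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class XmlParseCheck {
    private XmlParseCheck() {}

    /** Wraps the text in a root element and reports whether the JDK parser accepts it. */
    static boolean parsesAsXml10(String text) {
        // Escape markup characters so that only illegal code points can cause a failure.
        String escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        String doc = "<?xml version=\"1.0\"?><root>" + escaped + "</root>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors but skips the default stderr logging.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(doc)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (Exception e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
    }
}
```

In an integration test you would assert that parsesAsXml10 returns true for sanitized output and false for the same input before sanitizing.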
Handling Curl Request Escaping
Why curl Behaves Differently
When you test a REST endpoint from Java (a main method or a JUnit test), string literals are compiled: "hello\nworld" becomes an eleven-character string with a real newline at index five.
When the same endpoint is exercised via curl, the story depends on the shell and quoting style:
# Single quotes — shell does NOT interpret \n; server receives literal backslash-n
curl -X POST -d 'hello\nworld' http://localhost:8080/api/xml
# $'...' syntax — shell DOES expand \n to a real newline
curl -X POST -d $'hello\nworld' http://localhost:8080/api/xml
If your XML validation runs directly on the raw request body string, a literal \n (two characters: \ and n) is valid XML: it is just a backslash followed by the letter n. But the intent was a newline, and downstream processing may re-interpret it, causing the validation logic to disagree with what the data actually becomes after parsing.
The practical fix, when the client is a known tool such as a test harness or internal script that deliberately sends Java-encoded strings, is to unescape the payload before validation using StringEscapeUtils.unescapeJava. For arbitrary production payloads (e.g. fields that may contain Windows file paths with literal backslashes), apply the regex directly to the raw string without unescaping, or the backslashes will be consumed and the data corrupted.
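To see why raw and unescaped validation can disagree, consider a payload that carries a Unicode escape as literal text. The sketch below does not use Commons Text: decodeUnicodeEscapes is a deliberately simplified, hypothetical stand-in that handles only four-hex-digit Unicode escapes, whereas unescapeJava also covers sequences such as \n and \t. The class name EscapeDemo is likewise an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeDemo {
    // Assumed pattern: everything outside the XML 1.0 Char production.
    private static final Pattern XML10_INVALID = Pattern.compile(
        "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    private EscapeDemo() {}

    static boolean hasInvalidXmlChar(String s) {
        return XML10_INVALID.matcher(s).find();
    }

    /** Simplified stand-in for unescapeJava: decodes only four-hex-digit Unicode escapes. */
    static String decodeUnicodeEscapes(String s) {
        Matcher m = UNICODE_ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against decoded '$' or backslash characters.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A raw payload containing the six literal characters backslash, u, 0, 0, 0, C passes the character check as-is, but fails it once decoded, because the decoded form contains a real form feed. That gap is what unescaping before validation closes, and it is also why the step is only appropriate when the sender is known to encode its data this way.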
Using StringEscapeUtils: commons-lang3 vs commons-text
StringEscapeUtils originally lived in commons-lang. It remained available in commons-lang3 but has been marked @Deprecated since version 3.6, so 3.6 and later flag it at compile time; earlier versions ship the class without the annotation. The class moved to the dedicated Apache Commons Text library, and you should migrate there whichever version of commons-lang3 you are on.
Deprecated (commons-lang3; @Deprecated annotation added in 3.6):
<!-- pom.xml — avoid; @Deprecated starting from commons-lang3 3.6 -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.5</version>
</dependency>
import org.apache.commons.lang3.StringEscapeUtils; // @Deprecated starting in 3.6
Current (commons-text):
<!-- pom.xml -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>1.12.0</version>
</dependency>
import org.apache.commons.text.StringEscapeUtils; // current location
The method signatures are identical. Only the import changes.
Updated Validation Method
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.text.StringEscapeUtils;
public class XmlValidator {

    // \x{10000}-\x{10FFFF} keeps supplementary characters (for example emoji) legal.
    private static final Pattern XML10_INVALID = Pattern.compile(
        "[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\\x{10000}-\\x{10FFFF}]"
    );

    /**
     * Validates that the payload contains no XML 1.0 invalid characters.
     * Unescapes Java escape sequences first so that curl-submitted payloads
     * are treated the same as in-process strings.
     *
     * <p><strong>Caveat:</strong> {@code unescapeJava} is appropriate only when
     * the client is known to send Java-escape-encoded strings, for example a
     * test harness or an internal tool that deliberately encodes newlines as
     * {@code \n}. Do <em>not</em> apply it to arbitrary production payloads
     * containing literal backslashes (e.g. Windows file paths such as
     * {@code C:\Users\alice}): those backslashes will be consumed and the data
     * corrupted. For untrusted or mixed-origin payloads, validate the raw string
     * directly without unescaping.</p>
     *
     * @param rawPayload the raw string from the HTTP request body (assumed to be
     *                   Java-escape-encoded, e.g. from a curl test harness)
     * @return the first invalid character found, or null if the input is clean
     */
    public static String validateXmlPayload(String rawPayload) {
        if (rawPayload == null) {
            return null;
        }
        // Decode Java-style escape sequences that arrived as literal text.
        // Only safe when the sender is known to produce Java-encoded strings;
        // see Javadoc caveat above before using this in production pipelines.
        String unescaped = StringEscapeUtils.unescapeJava(rawPayload);
        Matcher m = XML10_INVALID.matcher(unescaped);
        if (m.find()) {
            return m.group();
        }
        return null;
    }
}
Call validateXmlPayload at the point where you receive the HTTP request body, before any JSON-to-XML conversion takes place.
Frequently Asked Questions
Q: What characters are invalid in XML 1.0?
A: The XML 1.0 specification excludes all C0 control characters except for the three whitespace characters it explicitly permits: tab (U+0009), line feed (U+000A), and carriage return (U+000D). That means null (U+0000), vertical tab (U+000B), form feed (U+000C), and the ranges U+0001–U+0008 and U+000E–U+001F are all forbidden, even as character references. At the high end, the surrogates (U+D800–U+DFFF), U+FFFE, and U+FFFF are also prohibited. DEL (U+007F) and the C1 control characters (U+0080–U+009F) fall inside the permitted U+0020–U+D7FF range, so they are technically legal in XML 1.0 text, although the specification discourages most of them.
Q: Does XML 1.1 have different rules?
A: Yes, but the practical benefit is narrow. XML 1.1 permits the C0 control characters U+0001–U+001F to appear in documents, but apart from tab, line feed, and carriage return they must be written as character references (e.g. &#xC; for form feed). It also moves DEL and the C1 range (U+007F–U+009F, except U+0085, the next-line character) from "allowed directly" to "must be escaped as character references." The critical restriction that U+0000 (NUL) is forbidden both directly and as a character reference is unchanged in XML 1.1, because allowing NUL would break C/C++ string handling in countless XML processing libraries. In practice, most systems stick with XML 1.0 because XML 1.1 support in parsers is inconsistent and the use cases for embedding raw control characters in XML are rare.
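The XML 1.1 rules can be summarised as a three-way classification built from the Char and RestrictedChar productions in §2.2 of the XML 1.1 specification. The following is a minimal sketch; the class name Xml11Chars, the Status enum, and the classify method are hypothetical.

```java
final class Xml11Chars {
    /** How a code point may appear in an XML 1.1 document. */
    enum Status { FORBIDDEN, REFERENCE_ONLY, LITERAL_OK }

    private Xml11Chars() {}

    static Status classify(int cp) {
        // XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        boolean isChar = (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
        if (!isChar) {
            return Status.FORBIDDEN;
        }
        // XML 1.1 RestrictedChar: legal only when written as a character reference.
        boolean restricted = (cp >= 0x1 && cp <= 0x8)
            || cp == 0xB || cp == 0xC
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
        return restricted ? Status.REFERENCE_ONLY : Status.LITERAL_OK;
    }
}
```

Under this classification form feed (U+000C) is REFERENCE_ONLY, NUL (U+0000) stays FORBIDDEN, and U+0085 remains LITERAL_OK.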
Q: How do I test my sanitization logic?
A: Two layers of testing are important. First, unit-test the sanitizer with a table of known-bad code points: U+0000, U+000B, U+000C, U+001F, U+FFFE, U+FFFF, and an unpaired high surrogate such as U+D800. Assert that the sanitizer flags or strips each one and leaves valid characters, including supplementary characters such as emoji, untouched. Second, run an integration test that pipes sanitized output into a real XML parser (javax.xml.parsers.DocumentBuilderFactory or similar) and asserts that no SAXParseException is thrown. The regex validates the character set; the parser validates the document structure. Both checks are needed.
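The first layer can be sketched as a table-driven check. The version below uses plain Java instead of a JUnit parameterized test so that it stands alone; the class name KnownBadTable, its arrays, and its methods are hypothetical, and the pattern is assumed to negate the full XML 1.0 Char production.

```java
import java.util.regex.Pattern;

final class KnownBadTable {
    // Assumed pattern: everything outside the XML 1.0 Char production.
    private static final Pattern XML10_INVALID = Pattern.compile(
        "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    // Code points the sanitizer must flag; 0xD800 stands for an unpaired high surrogate.
    static final int[] KNOWN_BAD = {0x0, 0xB, 0xC, 0x1F, 0xFFFE, 0xFFFF, 0xD800};

    // Code points the sanitizer must leave alone; 0x1F600 is a supplementary character.
    static final int[] KNOWN_GOOD = {0x9, 0xA, 0xD, 0x20, 0x41, 0xFFFD, 0x1F600};

    private KnownBadTable() {}

    static boolean isFlagged(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return XML10_INVALID.matcher(s).find();
    }

    /** Throws AssertionError on the first table entry that is classified wrongly. */
    static void checkTable() {
        for (int cp : KNOWN_BAD) {
            if (!isFlagged(cp)) {
                throw new AssertionError(String.format("U+%04X should be flagged", cp));
            }
        }
        for (int cp : KNOWN_GOOD) {
            if (isFlagged(cp)) {
                throw new AssertionError(String.format("U+%04X should be accepted", cp));
            }
        }
    }
}
```

Keeping the good and bad code points in one table makes it cheap to add a new edge case when a parser rejects something the regex let through.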
Key Takeaways
- Precompile the pattern: The XML 1.0 invalid-character regex is constant; compile it once as a static final field to avoid per-call overhead and make the intent explicit.
- Unescape before validating (when appropriate): curl payloads may contain Java-style escape sequences as literal text; calling StringEscapeUtils.unescapeJava() before your regex check ensures parity with in-process test data. This is safe only when the client is a known test harness or internal tool. For arbitrary production payloads with literal backslashes, skip the unescape step to avoid corrupting data.
- Migrate to commons-text: org.apache.commons.lang3.StringEscapeUtils carries @Deprecated starting from version 3.6. Update your dependency to commons-text and change only the import. The API is unchanged.
What's Next?
Recommended Reading:
- Invalid Characters in XML (Baeldung): thorough walkthrough with multiple Java approaches
- Valid characters in XML (Wikipedia): concise reference table covering both XML 1.0 and 1.1
Action Items:
- Search your codebase for usages of org.apache.commons.lang3.StringEscapeUtils and update the import to org.apache.commons.text.StringEscapeUtils, adding the commons-text dependency if it is not already present.
- Add an integration test that passes a known-bad character (e.g. U+000C) through your full JSON-to-XML pipeline and asserts that the XML parser does not throw, confirming sanitization happens before the conversion step.