Those of us who usually read and write only English tend to remain blissfully unaware of the problems associated with character encoding. I recently helped one of my ThoughtWorker colleagues (Mike Williams) with a character encoding problem, and so I thought I would blog a little on the topic.
What Mike ran into was an XML document that included non-breaking spaces. The non-breaking space is a non-ASCII Latin-1 character, encoded as the single byte A0 (hexadecimal). The problem was that some system beyond our control was writing an XML file as Latin-1 characters, but the XML document was missing the XML character encoding declaration.
What many of us do not realize is that, without a character encoding declaration, XML parsers will assume a UTF-8 encoding (unless the document starts with a special "byte-order mark" for UTF-16). Normally this is not a problem, since UTF-8 maps the plain ASCII characters to themselves. However, all non-ASCII characters, including the extended Latin-1 characters such as the non-breaking space and accented Latin characters like é, are mapped to two or more bytes per character. What's more, no valid UTF-8 character encoding begins with the byte A0 (in UTF-8 it can appear only as a continuation byte), and so the XML parser "blew up" when it came across the A0.
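To make the failure concrete, here is a minimal Python sketch of the same situation. It is not the system Mike was dealing with: the `<greeting>` document and the `parses` helper are made up for illustration, and I am assuming Python's standard `xml.etree.ElementTree` parser behaves like the parser in question when no encoding is declared.

```python
import xml.etree.ElementTree as ET

NBSP = "\u00a0"  # the non-breaking space character

# The same character becomes different bytes in the two encodings.
latin1_bytes = NBSP.encode("latin-1")  # the single byte A0
utf8_bytes = NBSP.encode("utf-8")      # the two bytes C2 A0

# A hypothetical document written as Latin-1 with no encoding declaration.
body = "<greeting>hello\u00a0world</greeting>".encode("latin-1")


def parses(data: bytes) -> bool:
    """Return True if the bytes parse as well-formed XML (illustrative helper)."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# With no declaration the parser assumes UTF-8 and rejects the lone A0 byte.
print("undeclared parses:", parses(body))

# Prepending a declaration that names the real encoding lets it parse.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
print("declared parses:", parses(declared))
print("text round-trips:", ET.fromstring(declared).text == "hello\u00a0world")
```

If you cannot get the producing system to emit the declaration, prepending one yourself before parsing, as in the last few lines, is one possible workaround. It only helps if you are sure of the encoding the producer really used.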
For more information, some useful sites include UTF-16 - Wikipedia and ISO 8859 Alphabet Soup.
I also found the chapter on Internationalization in my old Java 1.1 edition of "Java in a Nutshell" a very good introduction to the topic. I have not looked, but I assume the later editions continue to be so.