David Kemp's Blog: June 2005

Those of us who usually only read and write English tend to remain blissfully unaware of the problems associated with character encoding. I recently helped one of my ThoughtWorks colleagues (Mike Williams) with a character encoding problem, and so I thought I would blog a little on the topic.

What Mike ran into was an XML document that included non-breaking spaces. The non-breaking space is the (non-ASCII) Latin-1 character with byte value A0 (hexadecimal). The problem was that some system beyond our control was writing the XML file as Latin-1 characters, but the document was missing the XML character encoding declaration.
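To make the situation concrete, here is a minimal Python sketch of the two versions of such a file: one roughly as the upstream system wrote it, and one with the declaration it should have carried. The element name `note` and its text are invented for illustration; the real document was of course different.

```python
# Hypothetical document body; the element name and text are made up.
body = "<note>10\u00a0km</note>"

# Roughly what the upstream system wrote: Latin-1 bytes, no declaration.
without_declaration = body.encode("latin-1")

# What it should have written: the same bytes, preceded by a declaration
# that names the encoding (ISO-8859-1 is the formal name for Latin-1).
with_declaration = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>' + without_declaration
)

# In Latin-1 the non-breaking space occupies a single byte, 0xA0.
print(without_declaration)  # b'<note>10\xa0km</note>'
```

Nothing in the undeclared bytes tells a reader which encoding was used, and that is the whole problem.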

What many of us do not realize is that, without a character encoding declaration, an XML parser will assume UTF-8 (unless the document starts with a special "byte-order mark" indicating UTF-16). Normally this is not a problem, since UTF-8 maps the plain ASCII characters to themselves. However, all non-ASCII characters, including the extended Latin-1 characters such as the non-breaking space and accented letters like é, are encoded as two or more bytes each. What's more, no valid UTF-8 byte sequence begins with the byte A0 (it can only appear as a continuation byte inside a multi-byte sequence), and so the XML parser "blew up" when it came across the A0.
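The failure is easy to reproduce. The sketch below uses Python's standard `xml.etree.ElementTree` parser rather than whatever parser Mike was using, and the `note` document and the `parses` helper are made up for the example, so treat it as an illustration of the behaviour rather than a record of the original bug.

```python
import xml.etree.ElementTree as ET

# Hypothetical document containing a non-breaking space, written as Latin-1.
latin1_doc = "<note>10\u00a0km</note>".encode("latin-1")


def parses(data: bytes) -> bool:
    """Return True if the XML parser accepts the given bytes."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# A lone 0xA0 is a UTF-8 continuation byte, so it cannot start a character.
try:
    latin1_doc.decode("utf-8")
except UnicodeDecodeError as err:
    print("decode failed:", err.reason)

# Without a declaration the parser assumes UTF-8 and rejects the document.
print(parses(latin1_doc))  # False

# With a declaration naming the encoding, the same bytes parse cleanly.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_doc
root = ET.fromstring(declared)
print(parses(declared))  # True
```

If the producing system cannot be changed, another option is to decode the bytes as Latin-1 yourself and hand the parser the resulting string, but that only works when you already know what encoding was really used.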

For more information, some useful sites include: UTF-16 - Wikipedia and ISO 8859 Alphabet Soup

I also found the chapter on Internationalization in my old Java 1.1 edition of "Java in a Nutshell" to be a very good introduction to the topic. I have not looked, but I assume the later editions continue to be so.


Saturday, June 18, 2005


Beware of assuming ASCII encodings
