Saturday, June 18, 2005

Beware of assuming ASCII encodings

Those of us who usually only read and write English tend to remain blissfully unaware of the problems associated with character encoding. I recently helped one of my ThoughtWorker colleagues (Mike Williams) with a character encoding problem, and so I thought I would blog a little on the topic.

What Mike ran into was an XML document that included non-breaking spaces. The non-breaking space is a non-ASCII character; in Latin-1 it is encoded as the single byte A0. The problem was that some system beyond our control was writing the XML file in Latin-1, but the document was missing the XML character encoding declaration.
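To make the byte-level picture concrete, here is a minimal Java sketch that prints the bytes of a short string containing a non-breaking space, once encoded as Latin-1 and once as UTF-8. The class name `NbspBytes` and the helper `hex` are hypothetical names chosen for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class NbspBytes {

    /** Encodes the text in the given charset and renders the bytes as space-separated hex. */
    static String hex(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            // Mask to 0..255 so bytes above 0x7F are not printed as negative values.
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "a\u00A0b"; // 'a', NO-BREAK SPACE (U+00A0), 'b'
        System.out.println(hex(text, StandardCharsets.ISO_8859_1)); // 61 A0 62
        System.out.println(hex(text, StandardCharsets.UTF_8));      // 61 C2 A0 62
    }
}
```

The same character is one byte (A0) in Latin-1 and two bytes (C2 A0) in UTF-8, so a reader that guesses the wrong encoding sees a different byte stream than the writer intended.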

What many of us do not realize is that, without a character encoding declaration, XML parsers will assume a UTF-8 encoding (unless the document starts with a special "byte-order mark" for UTF-16). Normally this is not a problem, since UTF-8 maps the plain ASCII characters to themselves. However, all non-ASCII characters, including the extended Latin-1 characters such as the non-breaking space and accented Latin characters like é, are mapped to two or more bytes per character. What's more, no valid UTF-8 byte sequence begins with the byte A0 (in UTF-8 it can only appear as a continuation byte), and so the XML parser "blew up" when it came across the A0.
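The failure can be reproduced with a small sketch, assuming the default DOM parser that ships with the JDK. The class name `MissingEncodingDemo`, its methods, and the sample element are hypothetical; the sketch feeds the parser the same Latin-1 bytes twice, first without and then with an encoding declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class, for illustration only.
final class MissingEncodingDemo {

    // Sample element whose text contains a non-breaking space (U+00A0).
    static final String BODY = "<name>Mike\u00A0Williams</name>";

    /** Returns the sample document as Latin-1 bytes, optionally with an encoding declaration. */
    static byte[] latin1Document(boolean withDeclaration) {
        String declaration = withDeclaration
                ? "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                : "";
        return (declaration + BODY).getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Returns true if the JDK's default DOM parser accepts the raw bytes. */
    static boolean parses(byte[] xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return true;
        } catch (Exception e) {
            // Malformed byte sequence or any other parse failure.
            return false;
        }
    }

    public static void main(String[] args) {
        // No declaration: the parser assumes UTF-8 and rejects the lone A0 byte.
        System.out.println(parses(latin1Document(false))); // expected: false
        // With the declaration, the same bytes are read as Latin-1.
        System.out.println(parses(latin1Document(true)));  // expected: true
    }
}
```

If the producing system cannot be changed, one alternative is to decode the stream yourself with an `InputStreamReader` set to ISO-8859-1 and hand the parser characters instead of bytes, so it never has to guess the encoding.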

For more information, some useful sites include "UTF-16 - Wikipedia" and "ISO 8859 Alphabet Soup".

I also found the chapter on Internationalization in my old Java 1.1 edition of "Java in a Nutshell" a very good introduction to the topic. I have not looked, but I assume the later editions continue to be so.

3 Comments:

Anonymous said...

There's actually quite a good article on character sets by Joel on Software. Have a look at this.

1:29 PM  
Anonymous said...

Hi :-) Great Blog. Check out my ascii convert ebcdic site. I think you'll like it ;)

3:09 AM  
Manish Chhetri said...

I have a separator.xml file (in UTF-8) which contains a field, space, with the value “ ”, which represents a non-breaking space. In my Java code, I concatenate this space with some text, let's say name.append(the value of the attribute goes in here).append(“some more data”…, and send it to the client. The problem is that in the first JSP it displays correctly as MyName space some more data. When I pass the text as a parameter, the special characters start to act weird; it is passed as “Benefit+Sample%C2%A0+258%C2%A0-%C2%A0James+Smith”.



Any suggestions?

11:16 PM  
