Tuesday, October 22, 2013

Linux vs. Windows Character Set Encoding Question

Something of a technical question that most readers will read and go, "Huh?"  That's okay.

There are multiple character set encodings available for transferring information.  In this case, the database is Informix; we are using the IBM JDBC driver to exchange information between the Java middleware and Informix.  We are using Tomcat 7.

The problem is that Microsoft Word uses the extended ASCII character set to represent emdash, endash, and the "smart quotes" and "smart apostrophes."  Using CP1252 encoding when Tomcat 7 is running on Windows, the extended ASCII character set values are correctly stored in BLOBs on Informix.  Using the same encoding on Linux, the extended ASCII character set values seem to be turned into something unrecognizable -- and what used to be individual characters comes out as several fairly random characters in the extended character set.

Questions:

1. Is this difference between Linux and Windows behavior because CP1252 encoding is not properly supported by Linux?

2. Is there an encoding character set that will allow the extended ASCII character set to be stored correctly in the Informix database?

13 comments:

J Greely said...

"what used to be individual characters comes out as several fairly random characters" sounds like the data is being stored in the DB as UTF-8.

-j

Sigivald said...

Using the same encoding on Linux, the extended ASCII character set values seem to be turned into something unrecognizable -- and what used to be individual characters comes out as several fairly random characters in the extended character set.

That smells like something's turning it into Unicode; UTF-8 encoding, for instance, uses two or more bytes for everything above 127.

(CP1252 is really just ISO-8859-1, which probably ought to be supported "on linux", but that doesn't mean the DB/Server/Middleware won't need some massaging.

Ideally the solution would be to ignore CodePage stuff entirely and store it all as Unicode text, which Word can handle just fine anymore, or ought to...)

I still think the real

Widget said...

1. Yes, to a first approximation.

2. Switch everything to UTF-8 and it should work on both Windows and Linux. Note that you may have to set both compile- and run-time options in several places.

(The expansion of chars you are seeing is almost certainly a bogus Cp1252 /UTF-8 conversion roundtrip. The linux side is probably in UTF-8 mode already, by default.)
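To make the suspected round trip concrete, here is a minimal Python sketch. It is not the poster's Java middleware, and the helper name `misread_utf8_as_cp1252` is made up for illustration. It assumes the text was written out as UTF-8 and then read back as CP1252, which is the mixup suspected in this thread.

```python
# Characters that Word's autocorrect commonly inserts.
EM_DASH = "\u2014"
LEFT_QUOTE = "\u201c"
APOSTROPHE = "\u2019"


def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode those bytes as CP1252."""
    # errors="replace" because a few byte values (0x9D, for example) are
    # unassigned in Python's cp1252 codec and would otherwise raise.
    return text.encode("utf-8").decode("cp1252", errors="replace")


for ch in (EM_DASH, LEFT_QUOTE, APOSTROPHE):
    cp1252_len = len(ch.encode("cp1252"))
    utf8_len = len(ch.encode("utf-8"))
    garbled = misread_utf8_as_cp1252(ch)
    print(f"U+{ord(ch):04X}: {cp1252_len} byte(s) in CP1252, "
          f"{utf8_len} in UTF-8, misread as {garbled!r}")
```

Each of these characters is a single byte in CP1252 but three bytes in UTF-8, so one character turning into three is what this kind of mixup would produce.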

TM Lutas said...

CP1252 should be supported on Linux. Can you be more specific about what you're using?

Anonymous said...

Seeing "several random characters", as others have said, is usually a Cp1252/UTF-8 mixup. Specifically, it's the usual symptom of text data being stored as UTF-8, but being interpreted as Cp1252.

I would be willing to bet real money that the "several fairly random characters" you're seeing generally start with â€ (a with circumflex accent, followed by the Euro sign). That's what curly quotes (single or double) turn into in that scenario. Accents turn into other sequences: http://www.i18nqa.com/debug/utf8-debug.html has a pretty complete set.

There is really no excuse for any software using Cp1252 in this day and age. Switch to a Unicode encoding (UTF-8 is the de facto Internet standard) and you should have fewer problems.
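The telltale prefix described above can be checked, and the damage reversed, with a short Python sketch. The names `MOJIBAKE_PREFIX` and `repair` are hypothetical, and the sketch assumes the garbling really was UTF-8 bytes decoded as CP1252 exactly once.

```python
# "a with circumflex" followed by the Euro sign.
MOJIBAKE_PREFIX = "\u00e2\u20ac"


def repair(garbled: str) -> str:
    """Undo a UTF-8-read-as-CP1252 mixup by reversing the two steps."""
    return garbled.encode("cp1252").decode("utf-8")


# Left and right single quotes, left double quote, en dash, em dash.
for original in "\u2018\u2019\u201c\u2013\u2014":
    garbled = original.encode("utf-8").decode("cp1252")
    assert garbled.startswith(MOJIBAKE_PREFIX)
    assert repair(garbled) == original
    print(f"{original!r} -> {garbled!r} -> {repair(garbled)!r}")
```

The right double quote (U+201D) is left out on purpose: its UTF-8 form ends in byte 0x9D, which Python's cp1252 codec treats as unassigned, so a strict decode raises instead of producing a character.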

Widget said...

CP1252 is not just iso-8859-1, especially for the chars you mentioned like em-dash and curly quotes (which have binary codes in the middle of iso-8859-X's extended control-character range).

Switch all your stuff to UTF-8 explicitly and you will be pleased.
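The difference between the two encodings is easy to see in a few lines of Python (a sketch for illustration; the variable names are invented). The same three bytes decode to Word's punctuation under CP1252 and to control characters under ISO-8859-1.

```python
import unicodedata

# CP1252 byte values for the left/right curly double quotes and the em dash.
WORD_BYTES = bytes([0x93, 0x94, 0x97])

as_cp1252 = WORD_BYTES.decode("cp1252")
as_latin1 = WORD_BYTES.decode("iso-8859-1")

# Printable punctuation under CP1252.
print([unicodedata.name(c) for c in as_cp1252])
# Category "Cc" marks control characters under ISO-8859-1.
print([unicodedata.category(c) for c in as_latin1])

# Outside 0x80-0x9F the two encodings agree byte for byte.
same_elsewhere = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("iso-8859-1")
    for b in list(range(0x00, 0x80)) + list(range(0xA0, 0x100))
)
print(same_elsewhere)
```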

Clayton said...

Thanks. Yes, the Euro is in there -- amusingly so, because I explained to the younger engineer with whom I am trying to unravel this mess that we should look for a general solution, because before you know it, our customers will be pasting Euro symbols into these offender reports. And there they were! (Although not for that reason.)

Switching to UTF-8 is my preference, but we were having some problems debugging the middleware on a Linux box until today.

Sigivald said...

Widget is quite correct - since the text encoding I use for work has long since been all standardized to UTF-8, I had to bone up on what CP1252 was, and totally misread the difference between it and ISO 8859-1.

(It's a superset, that differs in the "high-ascii" range; I misread it as differing in the low control-character range...)

Clayton said...

It appears that because the Informix database was originally created (and populated) with data in CP1252 encoding, the only way to redo it in UTF-8 involves unloading it, recreating it, and reloading it. This is not practical at this point.

It was several ugly hacks, but we beat the middleware into handling the differences in a consistent manner.
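For reference, the byte-level part of a CP1252-to-UTF-8 conversion is small; the unload, recreate, and reload work around it is what makes the migration expensive. This Python sketch covers only that re-encoding step. It does not touch Informix, and `cp1252_to_utf8` is a hypothetical helper, not something from the middleware discussed here.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode CP1252 bytes as UTF-8.

    Raises UnicodeDecodeError if the input contains one of the byte
    values that Python's cp1252 codec leaves unassigned.
    """
    return raw.decode("cp1252").encode("utf-8")


# Sample legacy bytes: e-acute (0xE9), em dash (0x97), Euro sign (0x80).
legacy = b"caf\xe9 \x97 \x80 5"
converted = cp1252_to_utf8(legacy)
print(len(legacy), len(converted))
print(converted.decode("utf-8"))
```

The converted data is longer than the original because each non-ASCII character grows from one byte to two or three, so column and BLOB sizes may need checking in any real migration.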

Anonymous said...

It appears that because the Informix database was originally created (and populated) with data in CP1252 encoding ...

It appears that somebody (whoever set up the database) didn't read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

Clayton said...

The critters who set up the database and wrote most of this software I would not trust to buy bread at the market without assistance. This is an absolute mess.

hga said...

These critters would include ones involved in a case where you gave expert testimony?

Clayton said...

I did not give expert testimony. I was just one of many people interviewed for a report by the Office of Professional Standards.