AdmServ: ISO-8859-1 vs. UTF-8 (updated)

Posted by: mstauber Category: Development

This is how the "massive" BlueOnyx update currently in the pipe came about: a small problem that needed a pretty big solution.

The Problem:

The BlueOnyx GUI had a problem where the textbox for the AutoResponder message shows up empty, even if there is an autoresponder message defined.

This only happens if the text entered into that particular field contains umlauts or any "special" characters not defined within the ISO-8859-1 charset.

For kicks and giggles I edited /etc/admserv/conf/httpd.conf and changed ...

AddDefaultCharset ISO-8859-1

... to ...

AddDefaultCharset UTF-8

... and all the textboxes started to take umlauts, Japanese Kanji, Chinese, Hebrew and whatever else I tried.

So this is a nice little fix for a small problem?

Wrong!

It breaks the German and Danish translations. Their locale files have headers which set the charset of the locales to ISO-8859-1. And when a PHP script that uses UTF-8 tries to use the ISO-8859-1 locales, they display garbled.
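To make the garbling concrete, here is a minimal Python sketch of a charset mismatch in both directions. The sample string is made up for illustration and is not taken from a BlueOnyx locale file.

```python
# Illustration only: the sample string is hypothetical, not from a real locale.
text = "Größe ändern"

# Bytes as an ISO-8859-1 locale file would store them.
latin1_bytes = text.encode("iso-8859-1")

# A consumer that assumes UTF-8 cannot decode those bytes cleanly;
# each umlaut byte ends up as the replacement character U+FFFD.
seen_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# The reverse mismatch: UTF-8 bytes read as ISO-8859-1 turn every
# umlaut into two unrelated characters.
seen_as_latin1 = text.encode("utf-8").decode("iso-8859-1")

print(seen_as_utf8)
print(seen_as_latin1)
```

Either way the text on screen no longer matches what the translator wrote, which is the kind of breakage described above.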

Well, another easy fix: Change the charset in the locale *.po files from ISO-8859-1 to UTF-8. Lots of search and replace, but doable. OK, done.
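A rough sketch of what such a conversion involves, in Python. It assumes that the file contents have to be re-encoded along with the header change (the post itself only mentions the header); the function name `po_to_utf8` and the sample catalog are hypothetical.

```python
import re


def po_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 .po content as UTF-8 and update its charset header."""
    text = raw.decode("iso-8859-1")
    # Rewrite only the charset declared on the Content-Type header line.
    text = re.sub(
        r"(Content-Type:[^\n]*?charset=)ISO-8859-1",
        r"\g<1>UTF-8",
        text,
        flags=re.IGNORECASE,
    )
    return text.encode("utf-8")


# Hypothetical miniature catalog, stored as ISO-8859-1 bytes.
SAMPLE_PO = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=ISO-8859-1\\n"\n'
    '\n'
    'msgid "size"\n'
    'msgstr "Größe"\n'
).encode("iso-8859-1")

converted = po_to_utf8(SAMPLE_PO)
print(converted.decode("utf-8"))
```

Changing only the header would leave the umlaut bytes in ISO-8859-1, which a UTF-8 reader would reject or garble, so header and payload have to change together.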

Net result:

All BlueOnyx modules need to be rebuilt and released through YUM again.

Which makes this the biggest update that was ever in the release pipe: it amounts to more than 260 updated RPMs. \o/

OK, I could just release the updated German and Danish locales. But then we would have RPMs from the same module with different version numbers. Which is very untidy.

As of now I haven't released it yet. Hell, just the building of the RPMs took an hour on my 5108R build box. And there are still some small problems with a few broken locales here and there.

But I think there will be no alternative to eventually pushing this update out.

Is there an alternative?

I tried other workarounds. In libPhp/uifc/FormFieldBuilder.php (the function that builds the GUI form fields) I tried to detect strings and arrays that contain non-ISO-8859-1 data and convert them to the right charset. This had to be a two-way transaction: read the input, do a conversion if needed and display it in UTF-8. And when you write it off to CODB, you have to make sure that the stuff is also stored in the right format (UTF-8 and not ISO-8859-1 anymore). It gets even more complicated if the text we store in CODB contains multiple languages, like a vacation message in English, French, Japanese and German. How would you encode that? The only choice would be (you guessed it!) UTF-8, which supports it all.
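For illustration, the detect-and-convert idea looks roughly like this in Python. This is not the code from FormFieldBuilder.php, just a sketch of the general heuristic; the function name `to_utf8_text` is made up.

```python
def to_utf8_text(raw: bytes) -> str:
    """Decode bytes as UTF-8 if they are valid UTF-8, else as ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails,
        # but it also cannot tell us whether the guess was right.
        return raw.decode("iso-8859-1")


# Old data stored as ISO-8859-1 and new data stored as UTF-8
# both come out as the same text.
print(to_utf8_text("Grüße".encode("iso-8859-1")))
print(to_utf8_text("Grüße".encode("utf-8")))
print(to_utf8_text("休暇".encode("utf-8")))
```

The weak spot is that this is only a guess: some ISO-8859-1 byte sequences are also valid UTF-8, so a heuristic like this can silently pick the wrong charset. Storing everything as UTF-8 removes the guessing.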

Hell, even PHP's iconv(), htmlspecialchars() and mb_*_encoding() functions leave a lot to be desired. They all have problems with some charsets, do certain things only halfway and are generally a pain in the gluteus maximus.

Short answer: No.

Japanese Locales:

This is another pain in the gluteus maximus. When we progressed from 5102R to 5106R we already had to switch our German locales from 'de' to 'de_DE' and the Danish locales from 'da' to 'da_DK'. And English from 'en' to 'en_US'.

We kept Japanese at 'ja', as it was always handled a bit out of the ordinary anyway. This has historical reasons, because back in the Cobalt days (and later on with more work from Hisao) external methods were used to display the Japanese locales with their own charset set through Apache and PHP.

On 5106R we could run the AdmServ with ISO-8859-1 and all locales but Japanese would use it. Japanese on the other hand was then using 'ja_JP.euc' (or more specifically: EUC-JP), although the LANG variable was set to 'ja'. But on 5107R/5108R this broke again. Or at least it started to toggle between Japanese, English or undefined on page reloads. On certain boxes. Which wasn't nice either way.

So during this massive update I also tackled this issue:

All Japanese locales got copied to 'ja_JP' as well. The new 'ja_JP' locales will be forced in as the base-alpine module will now require these locales. It doesn't really need them, but due to the defined dependencies we trick YUM into installing them automatically.

The locale 'ja' is being retired and 'ja_JP' is now the new default Japanese language on BlueOnyx.

However, this leaves one remaining problem with Japanese: As mentioned above, the GUI uses EUC-JP as its character set when it is set to Japanese. But the way this is done in Sausalito is a hack bordering on the extreme, and it gets utterly problematic when it comes to passing on (and parsing) submitted form data. The built-in conversion mechanism may handle anything within the EUC-JP character set just fine. But try to save a vacation message in the Japanese GUI that contains umlauts or accented characters. The conversion mechanism will "wreck" these characters and save them in escaped HTML format. Which then - again - leads to display problems in the GUI and blank vacation message fields, although a vacation message has been entered. Likewise, such special characters (hell - even Kanji characters!) are dropped from "hidden" form fields when you switch between tabbed GUI page elements.
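The "escaped HTML format" effect can be reproduced in a few lines of Python. This is not Sausalito's conversion code, only a sketch of what happens when text is forced into EUC-JP and anything outside that character set is replaced by numeric HTML entities. The helper name is hypothetical, and the example uses Hebrew letters as the out-of-set characters, since which accented Latin letters are representable depends on the EUC-JP implementation.

```python
def encode_for_eucjp_form(text: str) -> bytes:
    """Hypothetical helper: force text into EUC-JP, escaping what does not fit."""
    # Characters outside EUC-JP become numeric HTML entities like &#1513;
    return text.encode("euc_jp", errors="xmlcharrefreplace")


# Pure Japanese text survives the round trip unchanged.
kanji_only = encode_for_eucjp_form("休暇中です")
print(kanji_only.decode("euc_jp"))

# Mixed text comes back with entities in place of the foreign letters.
mixed = encode_for_eucjp_form("休暇 \u05e9\u05dc\u05d5\u05dd")
print(mixed.decode("euc_jp"))
```

Once such entities are stored, every reader of that data has to know to unescape them, which is where the display problems start.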

Right now our Japanese support works - more or less. But it is horrible and needs a very large rewrite from the ground up. Which I will save for another day (or decade - if you really ask me).

Odds and sods:

Yes, this update is massive. There were also a lot of bits and pieces that had to be changed around, as we're now taking UIFC (the HTML rendering engine of Sausalito) off into uncharted waters by switching to UTF-8. It speaks for the Cobalt Networks engineers that the engine seems to handle this well, and whatever was broken after switching to UTF-8 was probably less an oversight of theirs than the result of modifications made later by others (including me).

There were some problems with broken umlauts in buttons, labels and headers, but they were trivial to fix. Another problem was a weird timeout of AdmServ when trying to show an old vacation message or user comment block that contained data in the old ISO-8859-1 format. But that's only natural, as these functions ran preg_match() over a string that now contained non-ASCII data.

Finally, the mailer for the autoresponder had to be switched from encoding emails in ISO-8859-1 to encoding them in UTF-8 (for all but Japanese). That was trivial as well, as the Perl-MIME-Lite module that we use for the mailings supports it just fine. But for mailing Japanese autoresponder messages I had to use another dirty hack to decode the vacation message into something that can be sent via a text-only email in the EUC-JP character set. This still generates an email with a somewhat malformed subject that doesn't really conform to RFC standards. It works, but only barely so for Japanese.
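For comparison, this is roughly what a UTF-8 encoded autoreply looks like when built with Python's standard email package. It is an analogue for illustration, not the Perl code BlueOnyx uses; the function name, addresses and message text are made up.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_autoreply(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Hypothetical helper: build a plain-text autoreply with a UTF-8 body."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    # Non-ASCII subjects are written as RFC 2047 encoded words on serialisation.
    msg["Subject"] = subject
    msg.set_content(body, charset="utf-8")
    return msg


body = "Ich bin bis Montag nicht im Büro.\n"
reply = build_autoreply(
    "user@example.com",
    "sender@example.org",
    "Abwesenheit: Grüße aus dem Urlaub",
    body,
)

# Serialise to the bytes that would go over the wire, then parse them back.
wire = reply.as_bytes()
parsed = message_from_bytes(wire, policy=policy.default)
print(wire.decode("utf-8"))
```

The subject line is the part that goes wrong most easily: headers have to stay ASCII on the wire, so any non-ASCII subject needs the encoded-word form, which is what a malformed subject is missing.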

But yes: Chances are that this massive update will cause some ill side effects when it is installed. It shouldn't, but it cannot entirely be ruled out either. So I'll sit on it for another day until I can get more testing done.

In closing:

Using the character set ISO-8859-1 for AdmServ was a bad design decision to begin with. UTF-8 should have been used from day one of BlueOnyx. So it is better to fix this now (although late) than much later down the road.


Mar 20, 2012 Category: Development Posted by: mstauber