Thursday, February 2, 2012

Handling Codepage and ASCII/EBCDIC issues

We Are Not All The Same

Dealing with character sets is never easy.  We all wish the industry would just pick one codepage and go with it, but the realities of multi-language and multi-platform support make that an impossible dream.

Even if you know you will never deal with an EBCDIC-based platform you can't escape the need for multi-language support.  The fact remains that not all ASCII codepages are the same.

It's easy to write code that ignores this problem.  It's hard to retrofit a solution into code that absolutely ignored the existence of the problem.

I'll Worry About That Later
Suppose you write your complex application, totally ignoring the codepage issues.  How bad can it be to fix it later when a customer demands NLS support, or you need to port it to another platform?  After all, that's an issue for the porting engineers isn't it?

Remember we are all in this together.  The goal is to produce a product as quickly as possible so the company makes money and pays our salary.  Unless you want your porting team to take up the head count budget, it pays you and the company to plan ahead.

Difficulty In Retrofitting
You've spent months writing a complex application with multiple I/O points.  The application displays web pages and sends socket requests to the four corners of the world.  It talks to databases and reads files and extracts information from all manner of data sources.  You've completely ignored the codepage issue and now you want to deal with it.  Where do you start?

Chances are the web server is UTF-8.  Everything else is platform-dependent.

You read data from the database: is the database on the mainframe? on AS/400? on Windows?  What codepage did you use to ask for the data?  Did you even specify what codepage you wanted it in, or just take the default at the database?

You read data from a file: is the file a PDS on the mainframe with naming convention issues that look nothing like any other platform?  Is it UNIX-based, where capitalization is an issue and the slashes go in one particular direction?  Is it Windows, where the capitalization might not be an issue, spaces are allowed to be embedded, and the slashes go the other direction?

You want to display data on a web page or end-user application.  Where did the data come from?  What codepage was it produced with, and what codepage do you want to display it with?

Thinking about this after the fact is painful and error-prone.  Every hole you patch shows six other holes, and further testing shows more.  If you manage to ship the code, how do you explain to the user that the error message '   ?>    /    _>' really means "Invalid Input"?

Going on a Wumpus Hunt
You've shipped the code, or testing has shown that somewhere, buried in the code there's a message that isn't getting translated properly.  Either it was never translated or, even worse, it's been translated twice.  I usually don't mind the never-translated.  On some systems you can see the hex values for the message and do the translation by hand and hunt down the message.  But doubly translated?  It's incomprehensible gobbledy-gook.  You can't grep for '  ?>  /  _>'.   How do you hunt down a message when you don't even know what the content was?

Unintentional Lack Of Portability
Suppose your code was written with some level of portability.  You have spots in your code where translation takes place.  Most things are working, but every time you add a test case or make minor revisions, you could be introducing unintentional lack of portability.

Suppose you are reading a line from a file and you want to check for end-of-line.  You add something like this:

if (myString[i] == '\x0A')

or you want to check for a blank and you accidentally slip in:

if (myString[i] == '0x20')

Or you want to make sure the characters you are about to display are readable and you do this:

if (myString[i] >= '0x30' &&
    myString[i] <= '0x7E')

All of these examples are going to fail a portability test, as they tie your code to a specific codepage.

Any one attempting to port this code won't find this unless they look through every line of code that you have written.

Magic Numbers and Phrases
Does your code look like this?

rc = 100;              /* assume all is well */
strcpy(status, "All is well");
rc = doFunction(parm1, parm2, "MakePumpkins", 121);
if (rc != 100)
{
    printf("error %d");
}

This code is full of magic numbers and phrases.  This makes porting things difficult and maintenance of the code costly.

Suppose you decide that the "good" error code should no longer be 100.  If you've hard-coded it every place, then every location needs to be updated to the new "good" error code.

The function call has magic phrases and magic numbers.  Finding and updating these phrases is next to impossible.

Magic phrases add "another layer of impossible" when you have codepage issues.  That magic phrase might have to be in a different codepage, depending on when and where this code is invoked and handled.  In a system with both ASCII and EBCDIC modules, this function call might cross the boundary between an EBCDIC front-end module and an ASCII compiled internal function.

Save yourself a nightmare and put all these magic numbers into a header file as #defines or constants.  You will find maintenance much easier with only one spot to update the values, and no grepping around to find them all.  Your porting team will be able to focus on what values might cause issues and find ways to handle them, while being able to grep for the specific usages of the values.

Configuration Information
In your test libraries you likely have tons of these magic phrases and numbers.  If you are using this test library on multiple platforms, it's quite likely that these values will not be the same for all environments.  Putting them in a header file is nice but there's a better solution.  Externalize them in a file and write a simple application to suck the values into the test programs.

If you go the header route, then you have to compile the code, test it, change the value to something appropriate for the platform, recompile, retest.  Continue until every value in the header file is checked against the test cases.  Then, for every platform you support, add an #ifdef until the file is unreadable. Instead of wasting time with recompiling, putting these values into an external file means all you have to do is update the file and rerun, with no intervening compilation steps.

Steps For More Portable Coding
What can you do to produce code that is more portable and protects itself from codepage issues?

1. Stop, Think, then Code
Don't spend months agonizing over all the places things can go wrong,  but before every coding project give some thought about where a portability issue might come into place.  It's much easier to structure the code to isolation the porting problems when you first write it than afterwards.

Regardless of the rush you are in to finish the code, a few minutes thinking and planning ahead of time will always make the code better.

2. Watch for Magic Phrases, Magic Numbers and Hex Values
Isolate these stumbling blocks into header files or external files.  While coding both the product and the test cases, make these things obvious and easy to find.

3. Comments
Every module, function, and structure should have a comment block describing its usage already.  While you are writing or adding to these comment blocks, include a line or two about what codepage it should be dealing with.  Functions that interface with outside data will most likely be in the external codepage.  Internal functions will most likely be in the product's codepage.  It may seem obvious to you while you are writing the function, but it won't be obvious three years later or to any other programmer.

4. Conversion Functions
Even if you don't need the codepage conversions now while you are writing the bulk of your code, you know where they will be needed.  Add a dummy function that doesn't do anything yet.  doConvertExternalToInternalCodePage() and doConvertInternalToExternalCodePage() will make it clear where the conversions are supposed to take place.  These functions will likely have different implementations for different platforms, but now it will be easy to port the code and spend the company's man-hours writing new products instead of porting.



No comments:

Post a Comment