Supportable Cross Platform Programming

Sunday, March 25, 2012

How do you know you have well written code?

If you’re lucky, in your career as a software programmer you will be called upon to write a whole new product—something you get to design from the ground up, with a chance to decide the variable names, class organization,file names, etc.

When you start it will be clear where you want to go and what you want to do. It's all shiny new code and your hopes will be high for the perfect software system.

Chances are, the first couple of times you do this you'll end up with the proverbial Gordian Knot. All your ideas for a clear design will get bogged down in implementation and plans that you made will end up in ashes.

It takes practice to write a good software system. You don't get good at it by maintaining software: creating something from nothing takes an effort and experience.

It's easy to recognize when you are working on good maintainable code. It's easy to work on. Easy to figure out where the fix should go. When you fix something it stays fixed and doesn't cause other problems.

These same things should be your goals for a new software system.

Clarity

Good, maintainable code has a clarity of thought that is so easy to understand that you forget someone had to work at it. While maintaining it, you don't have to think about why the designer did things a certain way because it feels obvious, it feels like there is no other way to do it.

Bad code has you scratching your head and asking "Why?" continuously as you look through the code.

That doesn't mean the good code won't have areas that need to be clarified. It does mean that every line of code and every subroutine doesn't require extensive documentation to justify its existence. If your code has been broken into subroutines artificially or without regards to any cohesive logic, then it’s hard to follow why it was broken along those lines. Since the goal is to write code that is easy to follow, creating artificial and illogical divisions makes the code difficult to maintain.

I advocate a top-down approach. I like to make my top-level modules and functions as clean and clear as possible. I keep pushing the implementation details down into subroutines until clarity is achieved. It isn't done in one throw: you don't know everything you need to know when you first start writing the code. You take a stab at it, then come back and comb through it, smoothing the bumps and jaggies until you get to an equitable arrangement.

One of my favorite software engineers always said to write the system, then throw it ALL away and start over from scratch. Everything you learned from the first try can be applied to the new system. By throwing the first try away completely, you don't tie yourself to fuzzy implementations. It's tempting to reach back and grab some stuff from the first implementation but it works best if you fight that impulse. If you start grabbing code from the original implementation then you start borrowing the structure and fuzziness that was created before you totally understood your problem. The more code you borrow, the more like the original implementation the code becomes, defeating the point of rewriting the code from scratch.

Lack of Problems

It's easy to recognize problem areas in the code. Those are the areas that you keep coming back to over and over to apply fixes. Every time you fix something, two more things pop up.This is a sign of code that desperately needs to be rewritten.

One technique is to “black box” that code. Pretend you know nothing about the inner works. Analyze the old code, isolate the inputs and the outputs, then rewrite it from the ground up using none of the old code. If your analysis was right you are likely to have fixed the trouble spots.

If the entire product is like this, then it's time to rewrite the product. It doesn't have to be done all at once; iteratively works too. Isolate an area, figure out the inputs and outputs, and rewrite from the ground up. What you learned by maintaining that code will help you in the rewrite. First you have to understand what the code was doing, why it was doing it, and why the approach that was made initially didn't work.

Help out the future programmers

It's fun to write throwaway code. We do this in college many times. Quickly write an application, ignore the error handling and any edge conditions. Stuff it in the computer and turn it in, never to look at it again.

This doesn't work in the real world. The real world is full of edge conditions and most of your real world code is error handling.

What you write today you might be maintaining 5 years, 10 years from now. If you make it hard for yourself by ignoring error conditions and edge conditions you will loathe yourself years from now. Consider what a stranger attempting to maintain that same code will think of you: after a few rounds of maintenance your IQ will get lower and lower as they struggle to handle the stuff you blithely ignored when you started.

Think about that when you write new code. Handle ALL the error conditions. Have a structure that can deal with those errors. Don't assume that an error condition is unthinkable. Even if most of them are, the few that are thinkable will be the ones the customer finds first and every time. Ponder ALL your edge conditions. Edge conditions take the most amount of effort to handle—that last byte in the file, a message that is one byte too long. Figure out how to handle those things in the beginning and save the future programmers.

Give it to someone else

At some point you’ll have to test whether the code is easy for you to maintain because you wrote it, or because you wrote it well.

The only way to do that is to have someone else work on it. It takes some personal security and bravery to do this, and you will have to reign in your ego. Taking criticism from outside parties on your blood, sweat, and tears is hard. But it’s definitely worth it.

Accept the feedback because that’s what you are looking for. If a sequence isn’t easy for them to understand, no matter how blindingly obvious it is to you, then you have failed to communicate your purpose and architecture. No one knows what goes on inside your head. No one can ever know that. You need to come to an agreement between what you see and what people outside your head can understand. You need to “speak” in such a way that people with many different viewpoints understand you.

Once you receive that feedback, act on it. Change the structure so it fits with a more worldly view. Internalize that view until you can see it. Sometimes the feedback doesn’t work directly with the architecture, but you should be able to marry the two until the code makes sense to everyone.

Reap the rewards

I know I've written a good software system when I get the Snowball Effect.The Snowball Effect happens when you've finished writing the system and now you are enhancing it and maintaining it. If it was well designed then the design makes the new stuff easy to add. Full new features go in with minimal effort, sometimes in less than a day. A bug reported by a customer gets fixed in minutes instead of weeks. If you keep to your good coding practices, this gets faster and faster until the work you do seems like magic.

As you enhance and maintain the system, look for places that are causing problem, either from bad original design or because the evolution of the product strayed from the original design. Don't be afraid to rewrite to help those areas fit in better.

Aim for your own personal Snowball Effect.

Thursday, February 2, 2012

Commenting For the Future

What, me comment?

You are in the middle of a large coding project. You've got the entire design in your head, and you are banging out lines and lines of code. You've done your best to name your functions and your variables clearly. It all seems so obvious, why would you need to add comments too?

The simple answer is: "What you know now is not what you'll know later.". Another simple answer is: "You aren't the only one maintaining the code.". No matter how obvious something seems to you now, it isn't going to be obvious to the next person who comes along, even if that person is you!

Comments have a definite purpose. They are a road map for future usage. If done properly, comments can help you restructure the code, and keep your functionality tidy.

What To Comment
Modules: Every source code module needs a paragraph describing the content of the module. This allows people to find the proper module to look for a function, and directs them to the proper place to put a new function.

The comment should describe the general usage of the module, and what types of functionality should be found in this module. If applicable, it can give a general idea of where in an architecture the module belongs.

Functions: Every function in a source module needs a paragraph describing when to call the function, what the operands are, what input and output are expected. This is also a good place to describe interactions with other parts of the code. If this function can't be called until some part of the architecture is initialized, then put that in the paragraph. If this function relies on code sequences elsewhere in the product, add that too. No one should have to read the code for an entire product to figure out what is happening.

Code: Code should have comments when there is a reason to have comments. If the code is tricky to understand, or it has some non-obvious code interactions, write that down—don't make other people figure it out the hard way.

If you've had to break up logic, then comment it so the flow is obvious. You can also use code comments to describe what conditions would lead to the code, and remind you of issues that exist in the code.

You should definitely leave comments behind about suspicious code sequences, code that you recognize is error-prone or overly delicate. Remind yourself of code that needs rewritting or moved. You can't always take care of all problems immediately when you are looking at code, but if you leave a note you can always get back to it.

Avoid stupidly obvious comments like:

rc = 0; /* no error yet */
myVar = value; /* copy because we need this later for <some reason> */

That's just clutter. You want your comments to be meaningful and worthwhile to read.

Comments are necessary. There isn't a programming language in existence that makes the "why and how" of code obvious enough to avoid comments.

Don't fix the bug...yet!

Why fixing that bug should be delayed

You've just finished coding a project and the code is being tested. You find an erroneous error message, or an error is flagged that you didn't expect. You whip into Bug Fixing mode, hunt down the cause of the issue, and slap in some code to fix that problem. Tadah! All is well again!

But you've missed an opportunity: unless you have diligently forced your code down every error path, you've got code paths that you have never explored. Before you fix that bug, you have a unique opportunity to make sure that code path does exactly what you want it do.

The fact is, bugs are in your code—there is no getting around them. Unexpected system interactions, unexpected user input, and sun spots all affect the code in ways you didn't imagine when you wrote it. Given that you can't fix all the bugs, what you want is a reliable way to handle them. You want an error reporting system that tells you exactly what went wrong, where it went wrong, and gives you a clear idea how to fix the problem without having to recreate the issue or spend time in a lengthy exploratory debugging session.

So, don't fix the bug right away. Keep in mind how you want to fix the bug, and give yourself a bit of time to explore the error .

Is the text of the error message clear?
You have the error code and/or error message. Take a moment to look at that message: is it clear to the user what action they should perform? "Unexpected Error" conveys no information. Rid yourselves of useless error messages. Every time you find one of them, take the opportunity to convert it into an error message that tells you where the failure is and what the likely causes of the failure are.

Make sure the spelling is correct, that the message identifier is correct and documented, and that the syntax of the message is proper. Error messages that are phrased "Unexpected Error, unknown, failure of" might have made sense when you wrote them but really, they just garble your intent. Fix the content, then language, then the spelling.

Is that the error message you want?
You've caught the code in an unexpected position. Is that the message you want to write when it gets there? You might have a message that says "Unexpected Error, this should never happen", or "Unexpected Error call Bob at (800)123-4567", but this isn't what you really want to say when the error happens. Now that you know what causes the failure and understand the code so that you can project what else might cause execution to reach this spot, make an error message that says what happened, why, and how to fix it.

Is that where you want the error to be detected?
You now understand the problem and you know what had to happen to get here. Think about it for a second and see if maybe you couldn't have caught this issue earlier. Is there a better way to head off this issue? Was this the result of a subtly bad user input? Could you screen the user input better and flag this earlier?

If you do redirect the error handling, make sure that works too. If you've got an error stack (you do have one, don't you?), make sure all the messages in the stack make sense, are written correctly, and convey information too.

Leave some comments behind.
Oh, and comment what you've found. You spent all that time tracking down the error and what happens when. Leave yourself a note about how things flow so you don't have to rethink all this stuff again.

Then fix the problem
You've got the error handling worked out: the code is following the correct paths and detecting the issue as early as is feasible, don't forget to fix the problem.

It gets better
Diligently applying this method will make your code better and more readable. It will make your error messages more useful. It has one added bonus for code that has loads of bad error messages: .it concentrates your efforts on errors that actually happen.

Handling Codepage and ASCII/EBCDIC issues

We Are Not All The Same

Dealing with character sets is never easy. We all wish the industry would just pick one codepage and go with it, but the realities of multi-language and multi-platform support make that an impossible dream.

Even if you know you will never deal with an EBCDIC-based platform you can't escape the need for multi-language support. The fact remains that not all ASCII codepages are the same.

It's easy to write code that ignores this problem. It's hard to retrofit a solution into code that absolutely ignored the existence of the problem.

I'll Worry About That Later
Suppose you write your complex application, totally ignoring the codepage issues. How bad can it be to fix it later when a customer demands NLS support, or you need to port it to another platform? After all, that's an issue for the porting engineers isn't it?

Remember we are all in this together. The goal is to produce a product as quickly as possible so the company makes money and pays our salary. Unless you want your porting team to take up the head count budget, it pays you and the company to plan ahead.

Difficulty In Retrofitting
You've spent months writing a complex application with multiple I/O points. The application displays web pages and sends socket requests to the four corners of the world. It talks to databases and reads files and extracts information from all manner of data sources. You've completely ignored the codepage issue and now you want to deal with it. Where do you start?

Chances are the web server is UTF-8. Everything else is platform-dependent.

You read data from the database: is the database on the mainframe? on AS/400? on Windows? What codepage did you use to ask for the data? Did you even specify what codepage you wanted it in, or just take the default at the database?

You read data from a file: is the file a PDS on the mainframe with naming convention issues that look nothing like any other platform? Is it UNIX-based, where capitalization is an issue and the slashes go in one particular direction? Is it Windows, where the capitalization might not be an issue, spaces are allowed to be embedded, and the slashes go the other direction?

You want to display data on a web page or end-user application. Where did the data come from? What codepage was it produced with, and what codepage do you want to display it with?

Thinking about this after the fact is painful and error-prone. Every hole you patch shows six other holes, and further testing shows more. If you manage to ship the code, how do you explain to the user that the error message ' ?> / _>' really means "Invalid Input"?

Going on a Wumpus Hunt
You've shipped the code, or testing has shown that somewhere, buried in the code there's a message that isn't getting translated properly. Either it was never translated or, even worse, it's been translated twice. I usually don't mind the never-translated. On some systems you can see the hex values for the message and do the translation by hand and hunt down the message. But doubly translated? It's incomprehensible gobbledy-gook. You can't grep for ' ?> / _>'. How do you hunt down a message when you don't even know what the content was?

Unintentional Lack Of Portability
Suppose your code was written with some level of portability. You have spots in your code where translation takes place. Most things are working, but every time you add a test case or make minor revisions, you could be introducing unintentional lack of portability.

Suppose you are reading a line from a file and you want to check for end-of-line. You add something like this:

if (myString[i] == '\x0A')

or you want to check for a blank and you accidentally slip in:

if (myString[i] == '0x20')

Or you want to make sure the characters you are about to display are readable and you do this:

if (myString[i] >= '0x30' &&
myString[i] <= '0x7E')

All of these examples are going to fail a portability test, as they tie your code to a specific codepage.

Any one attempting to port this code won't find this unless they look through every line of code that you have written.

Magic Numbers and Phrases
Does your code look like this?

rc = 100; /* assume all is well */
strcpy(status, "All is well");
rc = doFunction(parm1, parm2, "MakePumpkins", 121);
if (rc != 100)
{
printf("error %d");
}

This code is full of magic numbers and phrases. This makes porting things difficult and maintenance of the code costly.

Suppose you decide that the "good" error code should no longer be 100. If you've hard-coded it every place, then every location needs to be updated to the new "good" error code.

The function call has magic phrases and magic numbers. Finding and updating these phrases is next to impossible.

Magic phrases add "another layer of impossible" when you have codepage issues. That magic phrase might have to be in a different codepage, depending on when and where this code is invoked and handled. In a system with both ASCII and EBCDIC modules, this function call might cross the boundary between an EBCDIC front-end module and an ASCII compiled internal function.

Save yourself a nightmare and put all these magic numbers into a header file as #defines or constants. You will find maintenance much easier with only one spot to update the values, and no grepping around to find them all. Your porting team will be able to focus on what values might cause issues and find ways to handle them, while being able to grep for the specific usages of the values.

Configuration Information
In your test libraries you likely have tons of these magic phrases and numbers. If you are using this test library on multiple platforms, it's quite likely that these values will not be the same for all environments. Putting them in a header file is nice but there's a better solution. Externalize them in a file and write a simple application to suck the values into the test programs.

If you go the header route, then you have to compile the code, test it, change the value to something appropriate for the platform, recompile, retest. Continue until every value in the header file is checked against the test cases. Then, for every platform you support, add an #ifdef until the file is unreadable. Instead of wasting time with recompiling, putting these values into an external file means all you have to do is update the file and rerun, with no intervening compilation steps.

Steps For More Portable Coding
What can you do to produce code that is more portable and protects itself from codepage issues?

1. Stop, Think, then Code
Don't spend months agonizing over all the places things can go wrong, but before every coding project give some thought about where a portability issue might come into place. It's much easier to structure the code to isolation the porting problems when you first write it than afterwards.

Regardless of the rush you are in to finish the code, a few minutes thinking and planning ahead of time will always make the code better.

2. Watch for Magic Phrases, Magic Numbers and Hex Values
Isolate these stumbling blocks into header files or external files. While coding both the product and the test cases, make these things obvious and easy to find.

3. Comments
Every module, function, and structure should have a comment block describing its usage already. While you are writing or adding to these comment blocks, include a line or two about what codepage it should be dealing with. Functions that interface with outside data will most likely be in the external codepage. Internal functions will most likely be in the product's codepage. It may seem obvious to you while you are writing the function, but it won't be obvious three years later or to any other programmer.

4. Conversion Functions
Even if you don't need the codepage conversions now while you are writing the bulk of your code, you know where they will be needed. Add a dummy function that doesn't do anything yet. doConvertExternalToInternalCodePage() and doConvertInternalToExternalCodePage() will make it clear where the conversions are supposed to take place. These functions will likely have different implementations for different platforms, but now it will be easy to port the code and spend the company's man-hours writing new products instead of porting.

Thursday, November 3, 2011

Dropping Error Codes

Making Life Easier
The customer and I have one major common goal: I never want to debug a critical issue that only happens at a customer site. While I love solving problems and pursuing issues, doing that while a customer is desperate for a solution is not my idea of fun.

There is no way to guarantee there won't be problems. No matter how well you write your code or how good your test bed is, some unexpected situation will crop up, usually three of them at the same time. At some point you will do a massive face palm and ask, "Why didn't this fail earlier?". It's the nature of the job. Accept it and move on.

Given that there will always be problems in the field, our goal is to solve those problems as fast as possible. A customer who raises an issue and has it solved quickly and professionally is a happy customer, possibly happier than a customer who never had a problem and is unsure of your support abilities.

A major impediment to problem solving is dropped error codes. Dropped error codes typically happens when a system function is invoked and the error condition is not checked, but you can find them on your own internal function calls as well. We've all done it on fclose(), and how many of us carefully check every single malloc() for a failure? Code that is running on a PC with gigabytes of memory today may be running on a smart card with 4K of memory tomorrow.

The problem with a dropped error code is that the problem shows up downstream as an inexplicable failure.

Open a file at the start of the program, then only use it after four days of execution. The resulting error message is "unable to update database" or some other indirect failure.
An fclose() failure could result in a later fopen() failure.
Depending on how obscure your error reporting is you may wind up with "Unexpected Failure Abort/Retry/Fail" or the dreaded Blue Screen of Death.

If you've ever dealt with this sort of fall-out, you know that the support times on these types of errors are extended and brutally painful. In some cases they aren't reproducible in-house. I've seen companies pay to have the customer's computer sent in-house to debug a problem. This is not professional software support.

Another method of dealing with this issue it to guess, send out a patch, iterate until the customer gives up (or your support personnel quit).

A more obscure problem is porting failures. When we write code for a particular platform we can carefully check all the expected error codes from a function. Sometimes we hit a function that has no known failures. Is it ok to not check for a failure? No. Not if that code will ever be ported to another platform.

A classic example of this is any POSIX function on z/OS, like socket() or pthread_mutex_init(). All the error codes on Windows are fairly innocuous; there are few parameters to these functions and it's highly unlikely that the function can fail. On z/OS it's quite possible that it will fail because the POSIX environment is not enabled. You can't spend your life reading every platform's Reference Guide to find all the possible failures, but you can catch all failures and decide how to deal with them. If that function call fails, you aren't going to be doing any useful processing anyway. How much can it hurt to check the error code?

Benefits
What would you rather do with your time? Would you rather spend your time pursuing weird fallout bugs because of bad error handling? While I gain great satisfaction in problem solving there are problems that are fun to solve and problems that make me beat my head against a desk. Problems that feel like incarnations of Stupid when you find them. For the want of a nail...

Handle your errors properly and your customers will be happier, your need for customer support will decrease, and your departments will be far more productive.

How To Handle An Error - Method 1
Once you've caught the error what do you do?

My favorite method involves some infrastructure. I prefer single use error numbers, but I can live with error number reuse if the error messages have replaceable parameters. This sort of thing doesn't happen by magic. Before you write new code you really should ponder this. It's far easier to have an error table with fixed messages but far less useful. I find that the error table method leads to reusing the same messages for multiple disparate situations. At that point all your error messages might as well be "Something Bad Happened".

Ideally you want an error message that reads something like "Malloc() for 24 bytes failed at line 245 in mysub.c" with a unique error number associated. True, the customer is not going to know what to do with that error. But they can search your knowledge-base, and when they do call your technical support, you will know exactly what happened when the first problem occurred.

One of the things the IBM Mainframe division does really well is their concept of FFST (First Failure Support Technology). IBM's concept is that you gather all the information you need when the failure occurs. You don't wait until the issue can be recreated on a computer in-house before you can start to work the problem. One failure, in many cases, is a failure too many. IBM was driven to this concept by very large mainframe customers to whom down time on critical software meant thousands of customers dead in the water, or in some cases a company that goes under in 24 hours.

On a more selfish note, if you get all the information you need to diagnose an issue when it happens the first time, you will save yourself weeks of attempting to recreate, questioning your customer's environment, and fiddling with the code to force failures. A well-written error message will allow you solve many issues in one try. Just think of the hours you'll have to write new code when you free up all those maintenance hours!

How to Handle an Error - Method 2
This method also may involve some infrastructure. It's infrastructure that you will never regret. It has all sorts of uses and every good piece of software has it:

Logging.

If your OS doesn't support a Process Log Facility write one. Give yourself someplace to raise issues and log failures. Your application user interface may be restrained by the I/O of a soda machine but that doesn't mean you can't find someplace to keep track of what is going on.

Even if you are already doing Method 1 - Add a logging facility. Do it now! You will never regret this.

Right now you may be looking at your code and saying "I wish I knew the control flow through here"; after you add logging you will say, "If only I had an additional Log message in here". The first is solvable with a debugger (if one is available, and never at a customer site). The second is a problem that is solvable over time. If you add a logging facility that is easy to use, even if you only use it for one message right now, you will find yourself using it more and more. You have no idea what information is getting lost until you have a place to put it.

Bonus Points
You support error messages with replaceable parameters and process level logging. What else can you do?

Here's a thought: document all your error codes. Even if your first step is to just make a list of all the error codes and associated messages, it's a start. Your customers will thank you, that is, if they haven't demanded this document already.

Once you have a list of all the errors, analyze why the errors happen and give your staff and your customers suggestions for why the errors can happen. If you are reading this blog, chances are you have a sophisticated enough program that one engineer doesn't know every line of code. This document will make fixing the error easier when it occurs.

The best time to add a message to this document is when you write the code to produce the error. At that time you know exactly what conditions cause the problem, what is happening at the time of the failure and what probable steps will solve the issue.

In a Nutshell
Handle your error codes. Save time and heartache. Make your code enterprise stable. Give yourself time to write new code. Stop bashing your head into the desk.

Introducing My Blog

Welcome to my Blog.

Since this is my first post I'll start by introducing myself and the purpose of this Blog.

I am a 25-year (and counting) veteran of Software Engineering. I learned programming on a TRS-80 Model III. My heart belongs in personal computer land. In college I was an avowed language junkie and learned every programming language I could get my hands on. I still have a soft spot for LISP, and APL will always tickle my fancy. I used to think in Pascal but over lack of use that has morphed into C and is moving to C++. When I graduated college there were very few personal computing jobs available. At that time I was swept into the dominant market of mainframe programming. After a bit of a culture shock I learned the ins and outs of writing and supporting a mainframe COBOL debugger written in MVS assembler. As my career progressed and the personal computing market took hold I have had the delight of working in cross-platform mode. Most of my employment has been in companies whose main software is on the PC but need a mainframe component, port or some level of mainframe support. In those companies I have produced code that runs in either or both environments (including Windows and various *NIX brands). I have had some awesome mentors and in many cases have had to support my own code for longer than five years.

I have seen loads of software. I have taken over projects written by other programmers who are no longer with the company. I have supported my own code through loads of production releases. Some of the projects I have taken over went smoothly, despite no knowledge of the product. The code was written in a way that learning it with no help, was easy. Everything was just where it belonged or had a logical pointer to figure it out.

Other code was hopeless spaghetti. Even with intimate knowledge of other software that performed the same task and help from the originator, the code was difficult to follow and touchy to maintain. I have experienced the same types of things in my own code. There were places where I didn't quite know what I wanted to do and did the quick-and-dirty solution, which resulted in maintenance nightmares until I grew smart enough to rewrite my own mess. There were places where I took the time to design things properly, left it to work on another project for multiple years, and picked up where I left off with no problems at all.

Which brings me to the point of this Blog.

In this Blog I will approach one major impediment to software support in every post. These are issues that tripped up myself and the companies I worked for-things that are easy to brush off as a new programmer, but wind up more important than the expediency of the "just-ship-it" mentality.

I hope you enjoy my Blog. I welcome constructive discussion.

Thanks!
GoodCoder