Supportable Cross Platform Programming: Dropping Error Codes

Making Life Easier
The customer and I have one major common goal: I never want to debug a critical issue that only happens at a customer site. While I love solving problems and pursuing issues, doing that while a customer is desperate for a solution is not my idea of fun.

There is no way to guarantee there won't be problems. No matter how well you write your code or how good your test bed is, some unexpected situation will crop up, usually three of them at the same time. At some point you will do a massive face palm and ask, "Why didn't this fail earlier?" It's the nature of the job. Accept it and move on.

Given that there will always be problems in the field, our goal is to solve those problems as fast as possible. A customer who raises an issue and has it solved quickly and professionally is a happy customer, possibly happier than a customer who never had a problem and is unsure of your support abilities.

A major impediment to problem solving is dropped error codes. Dropped error codes typically happen when a system function is invoked and the error condition is not checked, but you can find them on your own internal function calls as well. We've all done it on fclose(), and how many of us carefully check every single malloc() for a failure? Code that is running on a PC with gigabytes of memory today may be running on a smart card with 4K of memory tomorrow.

The problem with a dropped error code is that the problem shows up downstream as an inexplicable failure.

Open a file at the start of the program without checking the result, then only use it after four days of execution. The resulting error message is "unable to update database" or some other indirect failure.
An fclose() failure could result in a later fopen() failure.
Depending on how obscure your error reporting is you may wind up with "Unexpected Failure Abort/Retry/Fail" or the dreaded Blue Screen of Death.
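To make that concrete, here is a minimal sketch of what checking those calls can look like. The helper names save_record() and checked_alloc() are made up for illustration, and printing strerror(errno) assumes a POSIX-style platform where fopen() and fclose() set errno on failure (ISO C alone does not promise that).

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: write a short record and report every failure,
 * including the fclose() that is so often ignored.
 * Returns 0 on success, -1 on failure. */
int save_record(const char *path, const char *text)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        fprintf(stderr, "fopen(\"%s\") failed: %s\n", path, strerror(errno));
        return -1;
    }

    if (fputs(text, fp) == EOF) {
        fprintf(stderr, "fputs() to \"%s\" failed: %s\n", path, strerror(errno));
        fclose(fp);   /* already failing; the first error is the one we report */
        return -1;
    }

    /* Buffered data may only be written out here, so a full disk can
     * show up as an fclose() failure rather than an fputs() failure. */
    if (fclose(fp) == EOF) {
        fprintf(stderr, "fclose(\"%s\") failed: %s\n", path, strerror(errno));
        return -1;
    }
    return 0;
}

/* Hypothetical helper: malloc() that says how much it could not get.
 * The caller still has to handle the NULL return. */
void *checked_alloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "malloc() for %zu bytes failed\n", size);
    }
    return p;
}
```

The point is not these particular wrappers. It is that the failure is reported where it happens, with the file name or size attached, instead of four days later as "unable to update database".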

If you've ever dealt with this sort of fall-out, you know that the support times on these types of errors are extended and brutally painful. In some cases they aren't reproducible in-house. I've seen companies pay to have the customer's computer sent in-house to debug a problem. This is not professional software support.

Another method of dealing with this issue is to guess, send out a patch, and iterate until the customer gives up (or your support personnel quit).

A more obscure problem is porting failures. When we write code for a particular platform we can carefully check all the expected error codes from a function. Sometimes we hit a function that has no known failures. Is it ok to not check for a failure? No. Not if that code will ever be ported to another platform.

A classic example of this is any POSIX function on z/OS, like socket() or pthread_mutex_init(). All the error codes on Windows are fairly innocuous; there are few parameters to these functions and it's highly unlikely that the function can fail. On z/OS it's quite possible that it will fail because the POSIX environment is not enabled. You can't spend your life reading every platform's Reference Guide to find all the possible failures, but you can catch all failures and decide how to deal with them. If that function call fails, you aren't going to be doing any useful processing anyway. How much can it hurt to check the error code?
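As a rough sketch of how cheap that check is, here is pthread_mutex_init() wrapped so that a failure is never dropped. The wrapper name init_lock() is invented for this example. Note that the pthread functions return the error number directly rather than setting errno.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper: initialise a mutex with default attributes and
 * report any failure, even on platforms where failure "never happens".
 * Returns 0 on success, the pthread error number otherwise. */
int init_lock(pthread_mutex_t *lock)
{
    int rc = pthread_mutex_init(lock, NULL);
    if (rc != 0) {
        fprintf(stderr, "pthread_mutex_init() failed: rc=%d (%s)\n",
                rc, strerror(rc));
    }
    return rc;
}
```

The caller decides what to do with a non-zero return. On the platform you developed on, that branch may never run, but after a port it is the difference between a clear message and a mystery.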

Benefits
What would you rather do with your time? Would you rather spend your time pursuing weird fallout bugs because of bad error handling? While I gain great satisfaction in problem solving there are problems that are fun to solve and problems that make me beat my head against a desk. Problems that feel like incarnations of Stupid when you find them. For the want of a nail...

Handle your errors properly and your customers will be happier, your need for customer support will decrease, and your departments will be far more productive.

How to Handle an Error - Method 1
Once you've caught the error what do you do?

My favorite method involves some infrastructure. I prefer single use error numbers, but I can live with error number reuse if the error messages have replaceable parameters. This sort of thing doesn't happen by magic. Before you write new code you really should ponder this. It's far easier to have an error table with fixed messages but far less useful. I find that the error table method leads to reusing the same messages for multiple disparate situations. At that point all your error messages might as well be "Something Bad Happened".

Ideally you want an error message that reads something like "Malloc() for 24 bytes failed at line 245 in mysub.c" with a unique error number associated. True, the customer is not going to know what to do with that error. But they can search your knowledge-base, and when they do call your technical support, you will know exactly what happened when the first problem occurred.
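One way to get messages like that is a small formatter that takes a unique error number plus a printf-style message with replaceable parameters, and stamps on the file and line. This is only a sketch: format_error(), REPORT_ERROR, the "E1042" numbering style, and the message layout are all assumptions made for illustration.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical formatter: builds
 *   "E<number>: <message with parameters> at line <N> in <file>"
 * Returns the message length, or -1 if the buffer is too small. */
int format_error(char *buf, size_t len, int err_no,
                 const char *file, int line, const char *fmt, ...)
{
    int used = snprintf(buf, len, "E%04d: ", err_no);
    if (used < 0 || (size_t)used >= len) {
        return -1;
    }

    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(buf + used, len - (size_t)used, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= len - (size_t)used) {
        return -1;
    }
    used += n;

    n = snprintf(buf + used, len - (size_t)used,
                 " at line %d in %s", line, file);
    if (n < 0 || (size_t)n >= len - (size_t)used) {
        return -1;
    }
    return used + n;
}

/* Convenience macro (C99 variadic macro): fills in the call site. */
#define REPORT_ERROR(buf, len, err_no, ...) \
    format_error((buf), (len), (err_no), __FILE__, __LINE__, __VA_ARGS__)
```

A call such as REPORT_ERROR(msg, sizeof msg, 1042, "malloc() for %zu bytes failed", size) then records the size, the source file, and the line at the moment of the first failure.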

One of the things the IBM Mainframe division does really well is their concept of FFST (First Failure Support Technology). IBM's concept is that you gather all the information you need when the failure occurs. You don't wait until the issue can be recreated on a computer in-house before you can start to work the problem. One failure, in many cases, is a failure too many. IBM was driven to this concept by very large mainframe customers to whom down time on critical software meant thousands of customers dead in the water, or in some cases a company that goes under in 24 hours.

On a more selfish note, if you get all the information you need to diagnose an issue when it happens the first time, you will save yourself weeks of attempting to recreate, questioning your customer's environment, and fiddling with the code to force failures. A well-written error message will allow you to solve many issues in one try. Just think of the hours you'll have to write new code when you free up all those maintenance hours!

How to Handle an Error - Method 2
This method also may involve some infrastructure. It's infrastructure that you will never regret. It has all sorts of uses and every good piece of software has it:

Logging.

If your OS doesn't support a Process Log Facility, write one. Give yourself someplace to raise issues and log failures. Your application user interface may be restrained by the I/O of a soda machine but that doesn't mean you can't find someplace to keep track of what is going on.

Even if you are already doing Method 1, add a logging facility. Do it now! You will never regret this.

Right now you may be looking at your code and saying "I wish I knew the control flow through here"; after you add logging you will say, "If only I had an additional Log message in here". The first is solvable with a debugger (if one is available, and never at a customer site). The second is a problem that is solvable over time. If you add a logging facility that is easy to use, even if you only use it for one message right now, you will find yourself using it more and more. You have no idea what information is getting lost until you have a place to put it.
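"Easy to use" can be as small as one function and one macro. The following is a bare-bones sketch, not a full logging facility: the names log_set_stream(), log_write(), LOG, and the LOGLVL_ levels are invented here, and it ignores threads, log rotation, and level filtering.

```c
#include <stdarg.h>
#include <stdio.h>
#include <time.h>

typedef enum { LOGLVL_INFO, LOGLVL_WARN, LOGLVL_ERROR } log_level;

static FILE *log_stream = NULL;   /* NULL means "log to stderr" */

/* Point the log at an already opened stream (a file, a pipe, ...). */
void log_set_stream(FILE *fp)
{
    log_stream = fp;
}

/* Write one timestamped line. Returns the number of characters
 * written, or -1 if the log itself could not be written. */
int log_write(log_level level, const char *file, int line,
              const char *fmt, ...)
{
    static const char *names[] = { "INFO", "WARN", "ERROR" };
    FILE *out = log_stream ? log_stream : stderr;

    char stamp[32] = "unknown-time";
    time_t now = time(NULL);
    struct tm *tm_now = localtime(&now);
    if (tm_now != NULL) {
        strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", tm_now);
    }

    int head = fprintf(out, "%s %-5s %s:%d ", stamp, names[level], file, line);
    if (head < 0) {
        return -1;
    }

    va_list ap;
    va_start(ap, fmt);
    int body = vfprintf(out, fmt, ap);
    va_end(ap);
    if (body < 0) {
        return -1;
    }

    /* Flush on every message so the last line survives a crash. */
    if (fputc('\n', out) == EOF || fflush(out) == EOF) {
        return -1;
    }
    return head + body + 1;
}

#define LOG(level, ...) log_write((level), __FILE__, __LINE__, __VA_ARGS__)
```

With that in place, adding the log message you wish you had is one line: LOG(LOGLVL_ERROR, "fopen(\"%s\") failed", path).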

Bonus Points
You support error messages with replaceable parameters and process level logging. What else can you do?

Here's a thought: document all your error codes. Even if your first step is to just make a list of all the error codes and associated messages, it's a start. Your customers will thank you, that is, if they haven't demanded this document already.

Once you have a list of all the errors, analyze why the errors happen and give your staff and your customers suggestions for why the errors can happen. If you are reading this blog, chances are you have a sophisticated enough program that one engineer doesn't know every line of code. This document will make fixing the error easier when it occurs.

The best time to add a message to this document is when you write the code to produce the error. At that time you know exactly what conditions cause the problem, what is happening at the time of the failure and what probable steps will solve the issue.
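One way to keep that document next to the code is a single table that holds the number, the message, the probable cause, and the suggested action, so the list can be generated rather than maintained by hand. The sketch below uses the C "X-macro" idiom. Every entry, number, and name in it is a made-up example, not a real catalog.

```c
#include <stdio.h>

/* Hypothetical catalog: one line per error, written when the code that
 * raises the error is written. Columns: name, number, message, cause,
 * suggested action. */
#define ERROR_CATALOG(X) \
    X(ERR_ALLOC_FAILED, 1001, "malloc() for %zu bytes failed", \
      "The process ran out of memory.", \
      "Check memory limits for the process and retry.") \
    X(ERR_LOG_OPEN, 1002, "fopen(\"%s\") failed", \
      "The file is missing or not accessible.", \
      "Check that the path exists and the permissions allow access.")

/* The same table expanded as an enum of error numbers... */
typedef enum {
#define X(name, num, msg, cause, action) name = num,
    ERROR_CATALOG(X)
#undef X
} error_code;

/* ...and as documentation records. */
typedef struct {
    int number;
    const char *message;
    const char *cause;
    const char *action;
} error_doc;

static const error_doc error_docs[] = {
#define X(name, num, msg, cause, action) { num, msg, cause, action },
    ERROR_CATALOG(X)
#undef X
};

/* Look up the documentation for an error number; NULL if unknown. */
const error_doc *find_error_doc(int number)
{
    size_t count = sizeof error_docs / sizeof error_docs[0];
    for (size_t i = 0; i < count; i++) {
        if (error_docs[i].number == number) {
            return &error_docs[i];
        }
    }
    return NULL;
}

/* Dump the whole catalog, e.g. to produce the customer-facing list. */
void print_error_docs(FILE *out)
{
    size_t count = sizeof error_docs / sizeof error_docs[0];
    for (size_t i = 0; i < count; i++) {
        fprintf(out, "%d: %s\n  Cause: %s\n  Action: %s\n",
                error_docs[i].number, error_docs[i].message,
                error_docs[i].cause, error_docs[i].action);
    }
}
```

Because the enum and the documentation come from the same table, a new error cannot be added to the code without at least a first draft of its cause and action.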

In a Nutshell
Handle your error codes. Save time and heartache. Make your code enterprise stable. Give yourself time to write new code. Stop bashing your head into the desk.

Thursday, November 3, 2011