Thursday, November 3, 2011

Dropping Error Codes

Making Life Easier 
The customer and I have one major common goal: I never want to debug a critical issue that only happens at a customer site. While I love solving problems and pursuing issues, doing that while a customer is desperate for a solution is not my idea of fun.

There is no way to guarantee there won't be problems. No matter how well you write your code or how good your test bed is, some unexpected situation will crop up, usually three of them at the same time.  At some point you will do a massive face palm and ask, "Why didn't this fail earlier?" It's the nature of the job. Accept it and move on.

Given that there will always be problems in the field, our goal is to solve those problems as fast as possible. A customer who raises an issue and has it solved quickly and professionally is a happy customer,  possibly happier than a customer who never had a problem and is unsure of your support abilities.

A major impediment to problem solving is dropped error codes.  Dropped error codes typically happen when a system function is invoked and the error condition is not checked, but you can find them on your own internal function calls as well. We've all done it on fclose(), and how many of us carefully check every single malloc() for a failure? Code that is running on a PC with gigabytes of memory today may be running on a smart card with 4K of memory tomorrow.
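As a minimal sketch of what checking those two calls can look like, here are two small wrappers. The names checked_malloc() and checked_fclose() are hypothetical helpers made up for this illustration; the point is only that the failure is reported where it happens instead of surfacing somewhere downstream.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate memory and report a failure at the
   point where it occurs. Returns NULL on failure, like malloc(). */
void *checked_malloc(size_t size, const char *file, int line)
{
    void *p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "malloc() for %zu bytes failed at line %d in %s\n",
                size, line, file);
    }
    return p;
}

/* Hypothetical helper: close a stream and report the error instead of
   dropping it. Returns 0 on success, -1 on failure. */
int checked_fclose(FILE *fp, const char *name)
{
    if (fclose(fp) != 0) {
        int saved = errno;  /* keep errno before any other call can change it */
        fprintf(stderr, "fclose() on %s failed: %s\n", name, strerror(saved));
        return -1;
    }
    return 0;
}
```

A caller would write something like checked_malloc(24, __FILE__, __LINE__) and still has to decide what to do with a NULL result; the wrapper only guarantees that the failure is not silent.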

The problem with a dropped error code is that the problem shows up downstream as an inexplicable failure.

  • Open a file at the start of the program, then only use it after four days of execution. The resulting error message is "unable to update database" or some other indirect failure.  
  • An fclose() failure could result in a later fopen() failure. 
  • Depending on how obscure your error reporting is you may wind up with "Unexpected Failure Abort/Retry/Fail" or the dreaded Blue Screen of Death.


If you've ever dealt with this sort of fall-out, you know that the support times on these types of errors are extended and brutally painful.  In some cases they aren't reproducible in-house. I've seen companies pay to have the customer's computer sent in-house to debug a problem. This is not professional software support.

Another method of dealing with this issue is to guess, send out a patch, and iterate until the customer gives up (or your support personnel quit).

A more obscure problem is porting failures. When we write code for a particular platform we can carefully check all the expected error codes from a function. Sometimes we hit a function that has no known failures. Is it ok to not check for a failure? No. Not if that code will ever be ported to another platform.

A classic example of this is any POSIX function on z/OS, like socket() or pthread_mutex_init().  All the error codes on Windows are fairly innocuous; there are few parameters to these functions and it's highly unlikely that the function can fail.  On z/OS it's quite possible that it will fail because the POSIX environment is not enabled.  You can't spend your life reading every platform's Reference Guide to find all the possible failures, but you can catch all failures and decide how to deal with them.  If that function call fails, you aren't going to be doing any useful processing anyway.  How much can it hurt to check the error code?
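Here is a rough sketch of catching every failure from socket() without knowing in advance which ones a given platform can produce. It assumes a POSIX system with <sys/socket.h>; the wrapper name open_socket() is hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical wrapper: returns the socket descriptor, or -1 after
   reporting whatever error the platform gave us. */
int open_socket(int domain, int type, int protocol)
{
    int fd = socket(domain, type, protocol);
    if (fd < 0) {
        int saved = errno;  /* keep errno before any other call can change it */
        /* We may not know every reason this platform can fail,
           but we still record the one that actually occurred. */
        fprintf(stderr, "socket(%d, %d, %d) failed: errno %d (%s)\n",
                domain, type, protocol, saved, strerror(saved));
        return -1;
    }
    return fd;
}
```

The check costs one comparison on the success path. On a platform where the call "cannot fail" it is never triggered; on a platform where it can, you get the error number on the first failure.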

Benefits
What would you rather do with your time?  Would you rather spend your time pursuing weird fallout bugs because of bad error handling? While I gain great satisfaction in problem solving, there are problems that are fun to solve and problems that make me beat my head against a desk. Problems that feel like incarnations of Stupid when you find them. For the want of a nail...

Handle your errors properly and your customers will be happier, your need for customer support will decrease, and your departments will be far more productive.

How to Handle an Error - Method 1
Once you've caught the error what do you do?

My favorite method involves some infrastructure. I prefer single use error numbers, but I can live with error number reuse if the error messages have replaceable parameters. This sort of thing doesn't happen by magic. Before you write new code you really should ponder this.  It's far easier to have an error table with fixed messages but far less useful. I find that the error table method leads to reusing the same messages for multiple disparate situations. At that point all your error messages might as well be "Something Bad Happened".

Ideally you want an error message that reads something like "malloc() for 24 bytes failed at line 245 in mysub.c" with a unique error number associated. True, the customer is not going to know what to do with that error. But they can search your knowledge base, and when they do call your technical support, you will know exactly what happened when the first problem occurred.
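One possible shape for that infrastructure is sketched below. Everything named here (format_error(), REPORT_ERROR, ERR_MALLOC_MYSUB, the "E1001" prefix) is an assumption made for illustration, not an existing API. It combines a single-use error number with a message whose parameters are filled in at the point of failure.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical error numbers: each one is used at exactly one call site. */
enum { ERR_MALLOC_MYSUB = 1001 };

/* Build "E<number>: <message> at line <line> in <file>" in buf.
   fmt and the trailing arguments are the replaceable parameters.
   Returns the value of the final snprintf() call. */
int format_error(char *buf, size_t size, int number,
                 const char *file, int line, const char *fmt, ...)
{
    char msg[256];
    va_list args;

    va_start(args, fmt);
    vsnprintf(msg, sizeof msg, fmt, args);
    va_end(args);

    return snprintf(buf, size, "E%04d: %s at line %d in %s",
                    number, msg, line, file);
}

/* Convenience macro that captures the file and line of the failure. */
#define REPORT_ERROR(buf, size, number, ...) \
    format_error((buf), (size), (number), __FILE__, __LINE__, __VA_ARGS__)
```

A call site would then look like REPORT_ERROR(buf, sizeof buf, ERR_MALLOC_MYSUB, "malloc() for %zu bytes failed", n), and the resulting text can go to the user, the log, or both.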


One of the things the IBM Mainframe division does really well is their concept of FFST (First Failure Support Technology). IBM's concept is that you gather all the information you need when the failure occurs. You don't wait until the issue can be recreated on a computer in-house before you can start to work the problem. One failure, in many cases, is a failure too many. IBM was driven to this concept by very large mainframe customers to whom down time on critical software meant thousands of customers dead in the water, or in some cases a company that goes under in 24 hours.

On a more selfish note, if you get all the information you need to diagnose an issue when it happens the first time, you will save yourself weeks of attempting to recreate, questioning your customer's environment, and fiddling with the code to force failures. A well-written error message will allow you to solve many issues in one try. Just think of the hours you'll have to write new code when you free up all those maintenance hours!

How to Handle an Error - Method 2
This method also may involve some infrastructure. It's infrastructure that you will never regret.  It has all sorts of uses and every good piece of software has it:

Logging.

If your OS doesn't support a Process Log Facility, write one. Give yourself someplace to raise issues and log failures. Your application user interface may be restrained by the I/O of a soda machine but that doesn't mean you can't find someplace to keep track of what is going on.

Even if you are already doing Method 1, add a logging facility. Do it now! You will never regret this.

Right now you may be looking at your code and saying "I wish I knew the control flow through here"; after you add logging you will say, "If only I had an additional Log message in here". The first is solvable with a debugger (if one is available, and never at a customer site). The second is a problem that is solvable over time. If you add a logging facility that is easy to use, even if you only use it for one message right now, you will find yourself using it more and more.  You have no idea what information is getting lost until you have a place to put it.
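A logging facility does not have to be elaborate to be useful. The sketch below is one minimal way to start, assuming the log is a plain text file that is opened in append mode for every message; the function name log_message() and the timestamp format are my own choices for this example.

```c
#include <stdarg.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical process log: appends one timestamped line per call.
   Returns 0 on success, -1 if the message could not be written. */
int log_message(const char *path, const char *fmt, ...)
{
    FILE *fp = fopen(path, "a");
    if (fp == NULL)
        return -1;

    char stamp[32] = "unknown-time";
    time_t now = time(NULL);
    struct tm *tm = localtime(&now);
    if (tm != NULL)
        strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", tm);

    va_list args;
    va_start(args, fmt);
    int ok = fprintf(fp, "%s ", stamp) >= 0;
    ok = vfprintf(fp, fmt, args) >= 0 && ok;
    ok = fputc('\n', fp) != EOF && ok;
    va_end(args);

    /* Even the logger checks fclose(): buffered data is flushed here. */
    if (fclose(fp) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Opening and closing the file on every call is slow, but it keeps the sketch short and means a crash loses at most the message being written. A real facility would likely add severity levels and keep the file open.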

Bonus Points
You support error messages with replaceable parameters and process level logging. What else can you do?

Here's a thought:  document all your error codes. Even if your first step is to just make a list of all the error codes and associated messages, it's a start. Your customers will thank you, that is, if they haven't demanded this document already.

Once you have a list of all the errors, analyze why the errors happen and give your staff and your customers suggestions for why the errors can happen. If you are reading this blog, chances are you have a sophisticated enough program that one engineer doesn't know every line of code. This document will make fixing the error easier when it occurs.

The best time to add a message to this document is when you write the code to produce the error. At that time you know exactly what conditions cause the problem, what is happening at the time of the failure and what probable steps will solve the issue.
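One way to make that habit stick is to keep the documentation in the same table as the error itself, so a new error cannot be added without its cause and suggested action. The following is only a sketch: the struct layout, the two sample entries, and the name find_error_doc() are all invented for illustration.

```c
#include <stddef.h>

/* Hypothetical catalog entry: the message plus the documentation
   written at the same time as the code that raises the error. */
struct error_doc {
    int number;
    const char *message;  /* message template with replaceable parameters */
    const char *cause;    /* why this error can happen */
    const char *action;   /* probable steps to resolve it */
};

/* Sample entries only; a real catalog would list every error number. */
static const struct error_doc error_docs[] = {
    { 1001, "malloc() for %d bytes failed",
      "The process could not obtain more memory.",
      "Reduce the workload or raise the memory limit, then retry." },
    { 1002, "fclose() on %s failed",
      "Buffered data could not be written, for example because the disk is full.",
      "Free disk space and rerun the operation." },
};

/* Look up the documentation for an error number; NULL if undocumented. */
const struct error_doc *find_error_doc(int number)
{
    for (size_t i = 0; i < sizeof error_docs / sizeof error_docs[0]; i++) {
        if (error_docs[i].number == number)
            return &error_docs[i];
    }
    return NULL;
}
```

The same table can be dumped to produce the customer-facing list of error codes, which keeps the document from drifting away from the code.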

In a Nutshell
Handle your error codes. Save time and heartache. Make your code enterprise stable. Give yourself time to write new code.  Stop bashing your head into the desk.

Introducing My Blog

Welcome to my Blog.

Since this is my first post I'll start by introducing myself and the purpose of this Blog.

I am a 25-year (and counting) veteran of Software Engineering. I learned programming on a TRS-80 Model III. My heart belongs in personal computer land. In college I was an avowed language junkie and learned every programming language I could get my hands on. I still have a soft spot for LISP, and APL will always tickle my fancy. I used to think in Pascal but through lack of use that has morphed into C and is moving to C++. When I graduated college there were very few personal computing jobs available. At that time I was swept into the dominant market of mainframe programming. After a bit of a culture shock I learned the ins and outs of writing and supporting a mainframe COBOL debugger written in MVS assembler. As my career progressed and the personal computing market took hold I have had the delight of working in cross-platform mode. Most of my employment has been in companies whose main software is on the PC but need a mainframe component, port, or some level of mainframe support. In those companies I have produced code that runs in either or both environments (including Windows and various *NIX brands). I have had some awesome mentors and in many cases have had to support my own code for longer than five years.

I have seen loads of software. I have taken over projects written by other programmers who are no longer with the company. I have supported my own code through loads of production releases. Some of the projects I have taken over went smoothly, despite no knowledge of the product.  The code was written in a way that made learning it, with no help, easy. Everything was just where it belonged or had a logical pointer to figure it out.

Other code was hopeless spaghetti. Even with intimate knowledge of other software that performed the same task and help from the originator, the code was difficult to follow and touchy to maintain. I have experienced the same types of things in my own code. There were places where I didn't quite know what I wanted to do and did the quick-and-dirty solution, which resulted in maintenance nightmares until I grew smart enough to rewrite my own mess. There were places where I took the time to design things properly, left it to work on another project for multiple years, and picked up where I left off with no problems at all.

Which brings me to the point of this Blog.

In this Blog I will approach one major impediment to software support in every post.  These are issues that tripped up both me and the companies I worked for: things that are easy to brush off as a new programmer, but wind up more important than the expediency of the "just-ship-it" mentality.

I hope you enjoy my Blog. I welcome constructive discussion.

Thanks!
GoodCoder