New Hanover County

Information Technology

Problem Resolution Guidelines


Overview

This document discusses one of the most interesting and difficult parts of the programmer's job, and sometimes one of the most frustrating - dealing with problems. In most installations, the programmer is the problem solver of last resort - the expert to whom all others turn when a problem becomes too difficult or obscure for them to solve themselves. The mystique and reputation of programmers rest above all on their ability to deal with these situations - so cultivate this skill!

Effective problem resolution depends on a number of factors:

·        Adopting a systematic and thorough approach to dealing with problems

·        Understanding where to look for diagnostic information and how to obtain it

·        Knowing where to look to find the correct interpretation of diagnostic information

·        Knowing when and how to turn to others for assistance

This document cannot tell you how to solve every individual problem you might encounter - every problem is different (if only in the way it is presented by the person experiencing it) and even the well-defined problems occupy many volumes of message and diagnostic manuals. But it will attempt to cover:

·        An effective approach to problems

·        How to obtain and use dumps - perhaps the most mystifying of all the systems programmer's secret weapons!

·        What other diagnostic information and tools are available to you and when to use them

·        How to get help from vendors


An effective approach

To be consistently effective in handling problems, you must adopt a systematic approach. The fundamental steps of such an approach can be quite clearly defined - they are illustrated in Figure 18.1, and examined in detail below.

Step 1 - Identify the Problem

One of the easiest mistakes to make in problem diagnosis is to assume that you know what the problem is. The next easiest is to assume that the person reporting the problem knows what it is! In practice, the vast majority of problems reported to programmers are simple user errors, even where there is an intelligently-staffed help desk filtering out the more obvious problems. Often, the person experiencing the problem has made a basic procedural or data entry error, but has assumed the problem is really something more obscure. On the other hand, you must avoid assuming that every problem reported to you is a user error - some of them will inevitably be something more serious.

Furthermore, the initial problem description you receive is likely to be highly partial - depending not only on the level of the user's familiarity with the system but also on what assumptions they have already made about the problem's cause, what their attitude is to you and the IT department in general, what side of the bed they got out of this morning, etc., etc.

The only way to start your diagnosis on a sure footing is to:

·        Speak directly to the person experiencing the problem.

·        Find out exactly what they were trying to do - ideally get a copy of any JCL, source code, etc., involved - including details of when the problem occurred, which workstation they were using if it is an online problem, and any other potentially relevant information.

·        Find out whether they have done exactly the same thing before and what happened then.

·        Find out what they have changed since the last time they attempted to do it. Do this (and pursue it mercilessly!) even if they do claim to have done "exactly" the same thing before. Nine times out of ten they have changed something which to them seems trivial but is actually the cause of the problem.

·        Obtain hard evidence of the symptoms of the problem - if it includes a message or a return code, ensure you have a copy of the log or output dataset showing it. If it is an online problem or there is no hard evidence for any other reason, get the user to reproduce the problem, or show you how to reproduce it, so you can see the symptoms for yourself - and ideally take screen prints.

·        If the user is unable to reproduce the problem (a) be skeptical - perhaps they've just realized and corrected what they did wrong and are ashamed to admit it, but also (b) be conscious that some problems really are intermittent and may be very difficult to reproduce. You can generally cover both possibilities by telling them you can't do anything unless they can reproduce the problem or produce some other hard evidence of it, and encourage them to take comprehensive notes and call you immediately if the problem recurs.
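The checklist above can be sketched as a small intake record that reports which questions are still unanswered. This is only an illustration: the `ProblemReport` class, its field names, and the `missing_items` helper are hypothetical and not part of any existing tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProblemReport:
    """Hypothetical intake record; every field name here is illustrative."""
    reporter: str = ""            # the person actually experiencing the problem
    attempted_action: str = ""    # exactly what they were trying to do
    occurred_at: str = ""         # when it happened (and which workstation, if online)
    worked_before: str = ""       # whether they have done the same thing before, and the outcome
    changes_since: str = ""       # what has changed since the last attempt
    evidence: List[str] = field(default_factory=list)  # logs, output datasets, screen prints
    reproducible: bool = False    # True once you have seen the symptoms yourself


def missing_items(report: ProblemReport) -> List[str]:
    """Return the intake questions that still have no answer."""
    gaps = []
    if not report.reporter:
        gaps.append("speak directly to the person experiencing the problem")
    if not report.attempted_action:
        gaps.append("find out exactly what they were trying to do")
    if not report.occurred_at:
        gaps.append("find out when and where the problem occurred")
    if not report.worked_before:
        gaps.append("find out whether they have done the same thing before")
    if not report.changes_since:
        gaps.append("find out what has changed since the last attempt")
    if not report.evidence and not report.reproducible:
        gaps.append("obtain hard evidence or have the problem reproduced")
    return gaps


# Example: only two of the questions have been answered so far.
report = ProblemReport(reporter="payroll clerk",
                       attempted_action="submitted month-end job")
for item in missing_items(report):
    print("TODO:", item)
```

An empty list from `missing_items` suggests the diagnosis can start on a sure footing; anything else is a reason to go back to the user.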

Step 2 - Document the Problem

There are several good reasons for documenting each problem as you deal with it. For example:

·        If you need to call in external support, such as the IBM Support Center, they are going to treat you in much the same way as you have treated the user who reported the problem to you - they are going to ask for a very precise definition of the problem, and if they cannot solve it at once they are going to ask for hard evidence.

·        If anyone else is called in to deal with a related problem later, it may be possible for them to save a vast amount of time if they have full information as to what you have done already. Indeed, you yourself may need this written record if the problem recurs in a few months' time or if it becomes so complex that it's hard to remember everything that has happened.

·        You may need the full story carefully documented if there is a management post-mortem or if you need to argue the case for some serious changes in order to prevent recurrence of the problem.

In practice, the process of collecting and documenting the evidence should already have begun as part of Step 1 and continues until the problem has been solved. The components of that process are:

·        Collect and keep all relevant job output, dumps, copies of the system log at the time, screen prints, etc.

·        Note all the information obtained from the user

·        Log everything you learn about the problem as you learn it, along with supporting evidence, dates, times, and names of contacts

·        Log all contact with software suppliers concerning the problem, including dates of phone calls, who you spoke to, and what their response was

·        Log any actions taken to solve the problem, including testing done to prove whether they worked or not

·        Log the eventual resolution
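As an illustration of the components listed above, the following is a minimal sketch of a problem record kept as a timestamped log. It assumes nothing about any real problem management system: the `ProblemRecord` and `LogEntry` classes and the entry kinds are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class LogEntry:
    timestamp: datetime
    kind: str          # illustrative kinds: "evidence", "user-info", "finding",
                       # "vendor-contact", "action", "resolution"
    note: str
    contact: str = ""  # who you spoke to, if anyone


@dataclass
class ProblemRecord:
    title: str
    entries: List[LogEntry] = field(default_factory=list)

    def log(self, kind: str, note: str, contact: str = "",
            when: Optional[datetime] = None) -> LogEntry:
        """Append an entry, stamped with the current time unless one is given."""
        entry = LogEntry(when or datetime.now(), kind, note, contact)
        self.entries.append(entry)
        return entry

    def resolution(self) -> Optional[str]:
        """Return the most recent resolution note, or None if still open."""
        for entry in reversed(self.entries):
            if entry.kind == "resolution":
                return entry.note
        return None


# Example: the contents of these entries are invented for illustration.
record = ProblemRecord("Month-end job fails")
record.log("evidence", "Kept job output and system log extract")
record.log("vendor-contact", "Asked about known problems", contact="support desk")
record.log("action", "Applied fix on test system; job ran clean")
record.log("resolution", "Fix promoted to production")
print(record.resolution())
```

Keeping every entry in one chronological list means the same record can serve the vendor, a colleague picking up the problem later, and any management review.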


Step 3 - Take Immediate Recovery Actions, if Required

If the problem has interrupted a service, you will generally want to restore that service before analyzing the problem further. When that service is your whole system or a major subsystem, this will be an urgent requirement. Most problems that bring down major subsystems are not likely to recur immediately when you restart the subsystem, so it is usually possible to restart the system and restore the service to the user before conducting detailed analysis of the reason for the failure. If it seems likely from the start that the problem is due to a recent change, you may choose to back out the change when restoring the service, to minimize the danger of the problem recurring.

Remember, though, that some diagnostic information might be destroyed as a result of restoring the service. The obvious example is the contents of processor storage when you re-IPL. In cases like this you must ensure that any available diagnostic information is secured before you restore the service - in this case by taking a system dump. The delay in restoring the service may be irritating and even expensive for your business users, but without this step you may be unable to diagnose the problem that brought the system down, and so unable to prevent further costly disruptions.

In some cases the original problem will prevent you from restoring the service - a hard I/O error on a critical system dataset, for example. If the situation is so bad that you do not have a usable system to work with, you will have to invoke emergency recovery procedures. If it is not, you will still find management attention becoming heavily focused on your attempts to find an early resolution. Such attention can be helpful, as long as it does not lead to constant interruptions to your work on the problem, and you should encourage your managers to establish constructive (as opposed to interfering) ways to manage major problems. A five-minute meeting every hour, for example, can help to maintain your perspective on the issues as well as your management's, and could be used to provide you with any extra resources you need. A two-minute interruption every five minutes, however, is likely to prevent the problem from ever being solved. It's best to get this universally understood before you experience this kind of problem.

Step 4 - Analyze the Problem

This is the difficult bit! Be realistic about your capabilities - always try to understand the problem and at least have a quick look at the relevant manuals, but if it then becomes obvious that the problem is beyond your capabilities, don't hesitate to move on to the next step - call for help. You are a far more effective programmer when you pass a problem on to someone else who can fix it after ten minutes than when you grapple uselessly with it for hours and still end up calling for help. If you do call for help, though, always try to learn how to deal with it better next time, by asking the person who does fix it to explain the problem and how they diagnosed it.

As you gain experience, you will deal with more and more of the problems yourself. Usually the evidence you have collected will point to several fairly obvious lines of inquiry - looking up messages in manuals, considering what has changed recently that could have affected the relevant area, reviewing reference manuals which tell you how to do what the user experiencing the problem was attempting, and following up any ideas as to possible causes which spring to mind. The exact action will depend very much on the type of problem and the tools available to you. Familiarity with these tools is an enormous asset when it comes to problem diagnosis, so you should aim to develop your expertise with them before you need them in earnest.

Here we will confine ourselves to a few more general points. Perhaps the most important of these is to take a step back from the problem if it becomes clear that you are not going to solve it quickly. Be conscious that there are usually several different angles from which you can attack the problem - try listing them and spending a few seconds evaluating which is most likely to lead to a diagnosis of the problem. Consider whether you could involve someone else in following up one approach while you look at another. Once you have selected an approach, follow it for a reasonable period of time, but stop occasionally, take that step back, and re-evaluate whether this is the most useful angle of attack. Effective problem resolution is a loop between generating ideas as to possible causes, evaluating them to decide which ones are worth following up, and then investigating the most promising.
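The "list the angles and evaluate them" advice can be sketched as a simple ranking exercise. The scoring rule used here (estimated likelihood divided by estimated effort) is an assumption made for this example rather than anything these guidelines prescribe, and the `LineOfInquiry` class and the numbers are hypothetical rough guesses of the kind you would jot down in a few seconds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineOfInquiry:
    description: str
    likelihood: float    # rough guess, 0.0 to 1.0, that this leads to a diagnosis
    effort_hours: float  # rough guess at the time needed to follow it up


def rank(lines: List[LineOfInquiry]) -> List[LineOfInquiry]:
    """Order lines of inquiry by likelihood per hour of effort (assumed heuristic)."""
    def score(line: LineOfInquiry) -> float:
        # Guard against a zero or negative effort estimate.
        return line.likelihood / max(line.effort_hours, 0.1)
    return sorted(lines, key=score, reverse=True)


# Example estimates, invented for illustration.
angles = [
    LineOfInquiry("read the dump", 0.5, 4.0),
    LineOfInquiry("look up the message in the manual", 0.6, 0.5),
    LineOfInquiry("review recent changes to the affected area", 0.3, 2.0),
]
for line in rank(angles):
    print(line.description)
```

Re-running the ranking with revised guesses each time you step back is one way to make the re-evaluation explicit; the lower-ranked angles are also natural candidates to hand to a colleague.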

If you are sufficiently lucky and/or intelligent, and the problem is one that you have the tools to resolve, this process will produce a plausible explanation of the cause of the problem. If so, you will then move on to step 6 - resolving the problem. If not, however, you will soon need to take step 5 - call for help. Indeed, you may decide to call for help in parallel with your own problem analysis, and this is often an intelligent and effective course of action.

Step 5 - Call for Help

As we have seen, it is often necessary or useful to call for help. This might be because you do not understand the problem, because your analysis leads you to suspect someone else is responsible for causing the problem so they are the right person to diagnose it, or simply because you are not making clear progress and you feel it would be useful to get someone else working on it in parallel. There is also the old saying that "two heads are better than one", and one common experience is that when you start to explain a problem to someone else you realize what it was that you missed when you were looking at it on your own!

There are generally three groups of people you might turn to for help:

·        Colleagues - not only more experienced programmers but also your peers and even your juniors can often generate new ideas to feed into the analysis process or follow up lines of inquiry for you.

·        Software suppliers - whenever it seems possible that a problem is related to a malfunction in a third-party software product, you should contact the supplier sooner rather than later and ask if there are any known problems with symptoms similar to those you are experiencing. When other lines of inquiry fail to produce progress, or when the problem is very obviously in one of their software products, most software suppliers are quite willing to analyze dumps and other diagnostic information for you. However, they can do nothing unless you are able to provide them with the relevant diagnostic information.

·        Third-party sources of systems programming expertise - consultancy companies, training companies, and software suppliers will generally jump at the opportunity of sending in an expert (for a fee) to help you with the diagnosis. If it seems very obvious that the problem lies with a particular piece of software, you might even be able to persuade the software supplier to do this for nothing. Frustrated managers are fond of suggesting this course if it seems to be taking a long time to solve a problem. It is only a good idea if the person to be brought in really is an expert in the problem area, and even then it is likely to lead to a minimum of several hours being wasted while they get on site, pick your brains as to what has happened, and retrace some of your steps. On the whole you should resist this option if you still have promising avenues to explore, and welcome it if you don't.

Step 6 - Implement the Problem Resolution

Once step 4 or step 5 has produced an explanation of why the problem occurred, the steps to be taken to resolve it are usually fairly obvious. If a user error was to blame, they will simply have to correct it and carry on. If a software implementation error was the cause, you will have to determine the correct resolution yourself and apply it. If a third-party software product was at fault, the supplier will usually provide a fix (though you may have to wait a while for them to develop the fix if it is a new problem) or a "workaround" - a way of avoiding rather than solving the problem.

There is a temptation when fixing a problem to "slap on" the fix or the resolution and get it into production right away. Unless the service to your users is very badly affected, this should be resisted. Problem fixes and resolutions are no more likely to be error-free than any other change. They need to be tested before they are put into production (a dedicated test machine, or a virtual machine or logical partition for programmers comes in handy at this point). Not only is it possible that the fix will fail to solve the original problem; it is also quite possible that it will introduce new problems, which could be more serious than the original one. A small but significant proportion of even vendor fixes turn out to be "in error", so you should aim to test not only that the fix solves the problem but that the system still works after you have applied it. Of course, it is usually impractical to exhaustively test all system functions after applying every fix to your test system, but you should at least prove that the system comes up OK and that the functions with a logical relationship to the component being fixed still work.
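The minimum testing described above (the fix solves the problem, the system comes up, and related functions still work) can be sketched as a small check runner. This is an outline only: the check names are placeholders, the lambdas stand in for whatever real tests apply to your test system, and `run_checks` and `safe_to_promote` are hypothetical helpers.

```python
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each named check; a check that raises counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def safe_to_promote(results: Dict[str, bool]) -> bool:
    """The fix goes forward only if at least one check ran and all passed."""
    return bool(results) and all(results.values())


# Placeholder checks: replace the lambdas with real tests for your own system.
checks = {
    "original problem no longer occurs": lambda: True,
    "system comes up": lambda: True,
    "related functions still work": lambda: True,
}
results = run_checks(checks)
print(results, safe_to_promote(results))
```

Treating an exception as a failure, and an empty set of checks as "not safe", keeps the default on the cautious side.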

Step 7 - Close the Problem

Having tested and implemented your problem resolution, you should check that the original problem has indeed been resolved to the user's satisfaction. And make sure they know you fixed it - it's astounding how often users are left to believe their problem "just went away" - or worse still, carry on with a workaround for months or years after the original problem was fixed.

At this stage you should also check that the problem record has been fully written up. If you have an automated problem management system, you will probably have to formally "close" the problem and obtain a management sign-off that the problem has been satisfactorily resolved.

Sadly, there will be a few problems that remain unresolved even in the best-run shops. One of the skills of problem management is to know when to give up. If you have been given insufficient information to diagnose the problem, if it is impossible to reproduce it, if you have somehow lost or forgotten to obtain full diagnostics, and if your relevant software suppliers have no record of anything similar, you will never resolve the problem. Close it off anyway, but make sure that procedures are put in place so that next time you will have the information you need.