Tuesday, 28 December 2010

Well has the test failed or hasn’t it?

When should you classify a test as Failed?  This sounds such a simple question and you may think the answer is obvious; however there are some factors that mean a well thought out approach can have significant benefits to the test manager.

Introduction

Generally one of the states used in test reporting is Failed.  A common assumption, one that is generally sound, is that failed tests mean you have problems.  Given typical practice a less well founded extension of this goes that failed tests indicate the system has problems doing the things that those tests were testing. Years of attempting to understand what is really going on inside projects show that this is the point at which the complexity of the real world overwhelms the abstract model of tests and test failures.

Think about the simple abstract model.  Tests have actions and, hopefully, expected outcomes.  If the action is done correctly and the outcome does not match the expectation then the test has Failed. Simple or what?  This model is applied on a regular basis all over the world so what is the issue?  Issues come in many forms and will be illustrated here using three examples.

Example One - Environmental Problem

Our system allows external users to submit transactions through a web-portal.  There is a change to the way these submissions are to be presented to internal users on the backend system.  If the submission has an attachment this is flagged to the user.  One type of transaction has three modes; two tests are passed and the third is failed.  Over a number of days a common understanding across both the test and development team builds up that the change works for two of the three modes and does not work for the third.  Only when we dig into the detail to decide whether to release with the issue or not do we discover that transactions for the third mode fail to submit at the portal.  No on had managed to get this transaction in; the handling of it in the backend had not been tried.

The real problem was a test environment configuration issue that derailed this test. The test was marked as Failed and the story began to develop that the third mode did not work.  This test had not Failed it was blocked and unable to progress and discharge its purpose.

Example Two - Incorrect Search Results

To test that billing accurately consolidates associated accounts these associations have to be created and then the accounts billed. To associate accounts one account is selected as the master and then a search facility is used to obtain the list of accounts that can be associated; selections are then made from the list.  After this billing can be tested.  When the search is done it returns the wrong accounts and association attempts fail.  Has the test failed?

If the test is classified as failed this tends to (well should) indicate that when you bill associated accounts then the bill is wrong.  So marking tests like this as failed sends the wrong message.  The test can't be completed and a fault has been observed and can't be ignored, but this fault is not to do with the thing being tested.

Example Three - Missing Input Box

A test navigates through a sequence of common HCI areas.  On one page it is observed that one of the expected input boxes is missing.  This doesn't bother us as the test doesn't use it.  Everything works well for the test.  Has it Passed?

The most meaningful outcome for the test is that it Passed; but then that leaves the defect that was observed floating around so shouldn't it be marked as failed to ensure it is re-tested?

An Alternative model of Failure.

Those were just three examples. There are many similar variations; so what rules should be used to decide whether to claim Failure?  Generally a test should have a purpose and should include explicit checks that assess whether the thing tested by that purpose has or has not worked correctly.  An expected result after an action may be such a check; alternatively a check may require more complex collection and analysis of data.  Checks should relate to the purpose of the test.  Only if a check is found to be false should the test be marked as Failed.  If all the checks are ok then the test is not Failed even if it reveals a defect.

The role of Expected Results

So are all expected results checks?  Often there are expected results at every step; from logging in through navigation to finally leaving the system.  Given this the position is a very very strong no.  Many expected results in tests serve a utility purpose.  They verify some step has been done as required; they often say little about the thing the test is actually needed to prove.  If you don't get the expected result then it means there is a problem some where; a problem with the test, with the way it is executed or with the system; however it does not necessarily mean that there is a problem with the thing being tested. Only when there is a definite problem with that should the test claim to be a Failure.

Orphaned Defects

That leaves defects that are triggered when running tests but that don't mean the test has Failed.  We could end up with no tests Failed, perhaps even all Passed, and a stack of defects; this is counter intuitive so what is going on?  Actually the discipline of refusing to fail tests unless an explicit check fails provides very useful feedback.The statistical discrepancy can indicate:

(a) That the tests do not have adequate checks; they are revealing errors in the thing being tested that can be seen but nothing in the test itself says check for that.  Time to improve the test and then mark it as Failed. Improving the test is required to make the defect detection delivered by the tests consistent; we should only depend on explicitly defined error detection.

(b) That we are finding errors in things that are not being tested as no test is failing as a result of the defect.  For control purposes add tests that do Fail because of the defects.  Also is this indicating a major hole in regression or testing of the changes?  If so is action required?

(c) That there are environmental problems disrupting test activities.

Conclusion

Adopting an approach that governs, actually restricts, when a test can be marked as Failed to circumstances where an explicit check has shown an issue provides more precise status on the system and improved feedback on the quality of the testing.  Furthermore this reduces the discrepancy between the picture painted by test results and the actual state of the release and the management time required to resolve this.

Wednesday, 15 December 2010

Maintaining Focus

If you want testing to be effective and want it to be manageable in the wider sense of the word (understood by others, amenable to peer and expert review and controllable) then everything has to be focussed.  Each constituent part of the effort needs a clear purpose and this has to extend down to quite a fine grained level.   Macros level building blocks such as Functional Test, Performance Test and Deployment Test don’t do it.  What is required is to break the work into a set of well defined heterogeneous testing tasks each one focussing on certain risks.

This approach originated when myself and a guy called Stuart Gent were working through the challenge of shaping and scoping the testing for a major telecommunications system programme.  We had a team of twelve analysts simply trying to understand what need to be tested.  We had already divided the work into twelve workstreams but recognised we needed something more.  We also had the experience of not using an adequate analysis approach on preceding releases of the system. These were far smaller and less complex than this one but we had learnt the dangers of inadequate independent analysis, of tending to follow the development centric requirements, specifications and designs, of testing what was highlighted by these and of missing less obvious but vitally important aspects.

Out of this challenge the concept of Focus Area based test management emerged.  The name isn’t ideal but it services it purposes.    The fundamental approach is that test activity should be divided up into a number of packages each being a Focus Area.  Each has a tight well defined remit.  There can be quite a few Focus Areas on large projects we are not talking about single digits; inventories exceeding a hundred, possibly approaching two, have been known.

A key thing is that a focus area is coherent and people can understand what it aims to cover and what it does not cover.  This enables far clearer assessment of whether a group of tests is adequate; because the focus is clear it is a tractable intellectual challenge to judge whether the tests do the job; divide and conquer.  Looking from the other end of the telescope how well are the overall risks of the system covered? If you have one thousand test cases with no way of telling what they really do, other than reading them, then you haven’t got a chance of finding the gaps.  If you have forty three well defined Focus Areas around which the tests are structured then you are in a much better shape.

What makes up a Focus Area definition?  This is something that flexes and depends on how formal you want to be but there are some basic things that should always be present:
(a)     The aspects of the system’s behaviour to be covered.
(b)     Distinct from this the conditions and scenarios that behaviour is being exercised under.
(c)     The sorts of malfunctions in this behaviour that we are trying to make sure aren’t there or at least that we need to catch before they get into the wild.
(d)     Any particular threats to be exercised.
(e)     Whether we are after hard faults or ones that don’t always manifest themselves even when the things we are doing to try and make a fault happen appear the same.

Look at how this works.  If you don’t apply a Focus Area approach and ask a team to create tests for some system then what is it that you are actually doing?  Well putting this situation into our basic Focus Area form you are saying:

“(a) Test all aspects of the system’s behaviour. (b) Do this under arbitrary conditions and usage scenarios.  (c) Whilst you are at it look for anything that could possibly go wrong.  (d) We aren’t telling you what particular things have a high probability of breaking it. (e) We are not highlighting whether things that may manifest themselves as reliability issues need to be caught.”

That is a lot of ground to cover both in area and types of terrain.  Thinking will be difficult as there are lots of different concerns all mixed in together.  Our experience is that you will tend to get homogenous testing using a small number of patterns that focuses on primary behaviour.  Much of the terrain will not get tackled; particularly the stuff that is harder to traverses.  Also, as discussed above, it is very difficult to review a set of tests covering such wide concerns and when you do you will probably find gaps all over the place.

Alternatively perhaps experienced people should define a number of Focus Areas to shape the work.  An example high level brief for a focus area might be:

“(a) Test the generation of keep the customer informed messages sent to the customer during order handling. (b) Test this for straightforward orders and for orders that the customer amends or cancels (don’t cover internal order fulfilment situations as they are covered elsewhere).  (c) Testing should check for the occurrence of the message and the accuracy of the dynamic content of the message.  Testing should check for spurious messages.  Static content and presentation need not be checked.  The latency of the message issue mechanism is outside the scope of this package. (d) Particular concerns are orders for multiple products and orders where the customer has amended contact rules after placing the order.  The impact of load on operation is outside the scope of this package.  (e) It is accepted that this package should provide reliable detection of consistent failures and will not be implemented to detect issues that manifest themselves as reliability failures.”

A definition likes this helps to focus the mind of the test designers; it should help to shape the pattern of testing so as to most effectively cover the ground.  It should ensure there are fewer gaps around its target and it should make reviewing more effective.  The overall set of well thought out focus areas allows the Test Architect to shape the overall coverage delivered by the testing exercise.

Personally I would never consider even reviewing a set of tests without first having my Focus Areas to hand.

Friday, 3 December 2010

The return of an old friend.

I have just encountered an old friend of mine; one that I see most places I go.  My friend is that recurring defect - the different date format bug.  In its most common and insidious form it is a mix of DD/MM/YYYY and MM/DD/YYYY representations of dates as strings.  Date format clashes of any sort cause defects but this is the worst ones because for many cases it appears to work waiting to create problems in future or corrupting data that passes through it.

How come by appearing to work for certain days it manages to slip through the net?  Dates presented in the DD/MM/YYYY format up to the 12th of the month will happily get converted into meaningful, though incorrect, dates by something that is looking for MM/DD/YYY.  So the 11th of October 2010 starts of in the first format as 11/10/2010 and then gets analysed by something looking for the MM/DD/YYYY and is interpreted as the 10th of November 2010.  If this is simply validation then the data entered is let through and no one is the wiser; but wait until the 13th.  However if the outcome of the incorrect interpretation of the date is stored in this form then we get the wrong date passed on for further processing.

Generally the presence of the issue can only be revealed when values of the day in the month part of the date that are greater than twelve are used.  For example the 13th of October 2010 in the first format is 13/10/2010.  If you look at it as being in the form of MM/DD/YYYY then we have MM=13 which is obviously, at least to the human brain, invalid.  I caveat the last point because though in many cases presenting this date will trigger some behaviour that reveals the fault it cannot always be guaranteed that this will be the case.

Why this post? It is because seeing the same problem again today has reminded me that this problem is like the common cold; it is all around us and is not going to go away.  Despite all the progress in software engineering technology none of it seems to tackle this type of issue.  Perhaps it is deemed to be too unimportant to worry about and deal with. After all once found it is an 'easy fix'. Actually it may be quick to change but the change often has the potential for massive downstream ramifications.  So perhaps not tackling this is a mistake; I would say so given the many developer hours I have watched being burnt on figuring out what is going on and the million pound per week project I saw extended by weeks through a myriad of issues of this sort.

What can testers do to help in this area?  Well they can start by remembering to test every date value and every date input control with dates that have their day part greater than twelve.  Keep a short list of key dates to use and make certain their use is comprehensive.  Thirteen may turn out to be your lucky number.