Tuesday, 28 December 2010

Well has the test failed or hasn’t it?

When should you classify a test as Failed?  This sounds such a simple question and you may think the answer is obvious; however there are some factors that mean a well thought out approach can have significant benefits to the test manager.

Introduction

Generally one of the states used in test reporting is Failed.  A common assumption, one that is generally sound, is that failed tests mean you have problems.  Given typical practice a less well founded extension of this goes that failed tests indicate the system has problems doing the things that those tests were testing. Years of attempting to understand what is really going on inside projects show that this is the point at which the complexity of the real world overwhelms the abstract model of tests and test failures.

Think about the simple abstract model.  Tests have actions and, hopefully, expected outcomes.  If the action is done correctly and the outcome does not match the expectation then the test has Failed. Simple or what?  This model is applied on a regular basis all over the world so what is the issue?  Issues come in many forms and will be illustrated here using three examples.

Example One - Environmental Problem

Our system allows external users to submit transactions through a web-portal.  There is a change to the way these submissions are to be presented to internal users on the backend system.  If the submission has an attachment this is flagged to the user.  One type of transaction has three modes; two tests are passed and the third is failed.  Over a number of days a common understanding across both the test and development teams builds up that the change works for two of the three modes and does not work for the third.  Only when we dig into the detail to decide whether to release with the issue or not do we discover that transactions for the third mode fail to submit at the portal.  No one had managed to get this transaction in; the handling of it in the backend had not been tried.

The real problem was a test environment configuration issue that derailed this test. The test was marked as Failed and the story began to develop that the third mode did not work.  This test had not Failed; it was blocked and unable to progress and discharge its purpose.

Example Two - Incorrect Search Results

To test that billing accurately consolidates associated accounts these associations have to be created and then the accounts billed. To associate accounts one account is selected as the master and then a search facility is used to obtain the list of accounts that can be associated; selections are then made from the list.  After this billing can be tested.  When the search is done it returns the wrong accounts and association attempts fail.  Has the test failed?

If the test is classified as failed this tends to (well should) indicate that when you bill associated accounts then the bill is wrong.  So marking tests like this as failed sends the wrong message.  The test can't be completed and a fault has been observed and can't be ignored, but this fault is not to do with the thing being tested.

Example Three - Missing Input Box

A test navigates through a sequence of common HCI areas.  On one page it is observed that one of the expected input boxes is missing.  This doesn't bother us as the test doesn't use it.  Everything works well for the test.  Has it Passed?

The most meaningful outcome for the test is that it Passed; but then that leaves the defect that was observed floating around so shouldn't it be marked as failed to ensure it is re-tested?

An Alternative model of Failure.

Those were just three examples. There are many similar variations; so what rules should be used to decide whether to claim Failure?  Generally a test should have a purpose and should include explicit checks that assess whether the thing tested by that purpose has or has not worked correctly.  An expected result after an action may be such a check; alternatively a check may require more complex collection and analysis of data.  Checks should relate to the purpose of the test.  Only if a check is found to be false should the test be marked as Failed.  If all the checks are ok then the test is not Failed even if it reveals a defect.

The role of Expected Results

So are all expected results checks?  Often there are expected results at every step; from logging in through navigation to finally leaving the system.  Given this the position is a very very strong no.  Many expected results in tests serve a utility purpose.  They verify some step has been done as required; they often say little about the thing the test is actually needed to prove.  If you don't get the expected result then it means there is a problem some where; a problem with the test, with the way it is executed or with the system; however it does not necessarily mean that there is a problem with the thing being tested. Only when there is a definite problem with that should the test claim to be a Failure.
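The rule can be sketched in a few lines of Python.  This is only an illustration of the idea, not a description of any particular test tool; the names Outcome, StepResult and classify, and the extra Blocked state, are assumptions made for the sketch.  Each step records whether it is a check tied to the purpose of the test or a utility expected result, and whether it was ok, not ok or never reached.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Outcome(Enum):
    PASSED = "Passed"
    FAILED = "Failed"
    BLOCKED = "Blocked"


@dataclass
class StepResult:
    description: str
    ok: Optional[bool]      # True, False, or None when the step was never reached
    is_check: bool = False  # True only for checks tied to the purpose of the test


def classify(steps: List[StepResult]) -> Tuple[Outcome, List[str]]:
    """Fail only on an explicit check; report other problems as observations."""
    checks = [s for s in steps if s.is_check]
    # Utility expected results that were not met: problems to raise as defects,
    # but not evidence about the thing being tested.
    observations = [s.description for s in steps if not s.is_check and s.ok is False]
    if any(s.ok is False for s in checks):
        return Outcome.FAILED, observations
    if any(s.ok is None for s in checks):
        return Outcome.BLOCKED, observations
    return Outcome.PASSED, observations


# Example One: the transaction never got past the portal.
example_one = [
    StepResult("third mode transaction submitted at portal", ok=False),
    StepResult("attachment flagged to internal user", ok=None, is_check=True),
]

# Example Three: a missing input box that the test does not depend on.
example_three = [
    StepResult("all expected input boxes present", ok=False),
    StepResult("purpose of the test achieved", ok=True, is_check=True),
]

print(classify(example_one))
print(classify(example_three))
```

Example One comes out as Blocked and Example Three as Passed, each with an observation that still needs to be raised as a defect.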

Orphaned Defects

That leaves defects that are triggered when running tests but that don't mean the test has Failed.  We could end up with no tests Failed, perhaps even all Passed, and a stack of defects; this is counter intuitive so what is going on?  Actually the discipline of refusing to fail tests unless an explicit check fails provides very useful feedback.  The statistical discrepancy can indicate:

(a) That the tests do not have adequate checks; they are revealing errors in the thing being tested that can be seen but nothing in the test itself says check for that.  Time to improve the test and then mark it as Failed. Improving the test is required to make the defect detection delivered by the tests consistent; we should only depend on explicitly defined error detection.

(b) That we are finding errors in things that are not being tested as no test is failing as a result of the defect.  For control purposes add tests that do Fail because of the defects.  Also is this indicating a major hole in regression or testing of the changes?  If so is action required?

(c) That there are environmental problems disrupting test activities.

Conclusion

Adopting an approach that governs, actually restricts, when a test can be marked as Failed to circumstances where an explicit check has shown an issue provides more precise status on the system and improved feedback on the quality of the testing.  Furthermore this reduces the discrepancy between the picture painted by test results and the actual state of the release and the management time required to resolve this.

Wednesday, 15 December 2010

Maintaining Focus

If you want testing to be effective and want it to be manageable in the wider sense of the word (understood by others, amenable to peer and expert review and controllable) then everything has to be focussed.  Each constituent part of the effort needs a clear purpose and this has to extend down to quite a fine grained level.  Macro level building blocks such as Functional Test, Performance Test and Deployment Test don’t do it.  What is required is to break the work into a set of well defined heterogeneous testing tasks each one focussing on certain risks.

This approach originated when a guy called Stuart Gent and I were working through the challenge of shaping and scoping the testing for a major telecommunications system programme.  We had a team of twelve analysts simply trying to understand what needed to be tested.  We had already divided the work into twelve workstreams but recognised we needed something more.  We also had the experience of not using an adequate analysis approach on preceding releases of the system. These were far smaller and less complex than this one but we had learnt the dangers of inadequate independent analysis, of tending to follow the development centric requirements, specifications and designs, of testing what was highlighted by these and of missing less obvious but vitally important aspects.

Out of this challenge the concept of Focus Area based test management emerged.  The name isn’t ideal but it serves its purpose.  The fundamental approach is that test activity should be divided up into a number of packages each being a Focus Area.  Each has a tight well defined remit.  There can be quite a few Focus Areas on large projects; we are not talking about single digits.  Inventories exceeding a hundred, possibly approaching two hundred, have been known.

A key thing is that a focus area is coherent and people can understand what it aims to cover and what it does not cover.  This enables far clearer assessment of whether a group of tests is adequate; because the focus is clear it is a tractable intellectual challenge to judge whether the tests do the job; divide and conquer.  Looking from the other end of the telescope how well are the overall risks of the system covered? If you have one thousand test cases with no way of telling what they really do, other than reading them, then you haven’t got a chance of finding the gaps.  If you have forty three well defined Focus Areas around which the tests are structured then you are in a much better shape.

What makes up a Focus Area definition?  This is something that flexes and depends on how formal you want to be but there are some basic things that should always be present:
(a)     The aspects of the system’s behaviour to be covered.
(b)     Distinct from this the conditions and scenarios that behaviour is being exercised under.
(c)     The sorts of malfunctions in this behaviour that we are trying to make sure aren’t there or at least that we need to catch before they get into the wild.
(d)     Any particular threats to be exercised.
(e)     Whether we are after hard faults or ones that don’t always manifest themselves even when the things we are doing to try and make a fault happen appear the same.

Look at how this works.  If you don’t apply a Focus Area approach and ask a team to create tests for some system then what is it that you are actually doing?  Well putting this situation into our basic Focus Area form you are saying:

“(a) Test all aspects of the system’s behaviour. (b) Do this under arbitrary conditions and usage scenarios.  (c) Whilst you are at it look for anything that could possibly go wrong.  (d) We aren’t telling you what particular things have a high probability of breaking it. (e) We are not highlighting whether things that may manifest themselves as reliability issues need to be caught.”

That is a lot of ground to cover both in area and types of terrain.  Thinking will be difficult as there are lots of different concerns all mixed in together.  Our experience is that you will tend to get homogeneous testing using a small number of patterns that focuses on primary behaviour.  Much of the terrain will not get tackled; particularly the stuff that is harder to traverse.  Also, as discussed above, it is very difficult to review a set of tests covering such wide concerns and when you do you will probably find gaps all over the place.

Alternatively perhaps experienced people should define a number of Focus Areas to shape the work.  An example high level brief for a focus area might be:

“(a) Test the generation of keep the customer informed messages sent to the customer during order handling. (b) Test this for straightforward orders and for orders that the customer amends or cancels (don’t cover internal order fulfilment situations as they are covered elsewhere).  (c) Testing should check for the occurrence of the message and the accuracy of the dynamic content of the message.  Testing should check for spurious messages.  Static content and presentation need not be checked.  The latency of the message issue mechanism is outside the scope of this package. (d) Particular concerns are orders for multiple products and orders where the customer has amended contact rules after placing the order.  The impact of load on operation is outside the scope of this package.  (e) It is accepted that this package should provide reliable detection of consistent failures and will not be implemented to detect issues that manifest themselves as reliability failures.”

A definition like this helps to focus the mind of the test designers; it should help to shape the pattern of testing so as to most effectively cover the ground.  It should ensure there are fewer gaps around its target and it should make reviewing more effective.  The overall set of well thought out focus areas allows the Test Architect to shape the overall coverage delivered by the testing exercise.
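For anyone who keeps their Focus Area inventory in a tool rather than a document, the five parts (a) to (e) map naturally onto a simple record.  The sketch below is one possible shape, written in Python; the class name FocusArea, its field names and the missing_parts helper are all assumptions made for illustration and not a prescribed format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class FocusArea:
    name: str
    behaviour: str                                          # (a) behaviour to be covered
    conditions: List[str] = field(default_factory=list)     # (b) conditions and scenarios
    malfunctions: List[str] = field(default_factory=list)   # (c) malfunctions to look for
    threats: List[str] = field(default_factory=list)        # (d) particular threats
    fault_kinds: List[str] = field(default_factory=list)    # (e) hard and/or intermittent faults
    out_of_scope: List[str] = field(default_factory=list)

    def missing_parts(self) -> List[str]:
        """Names of the parts (a) to (e) that have been left empty."""
        parts = ("behaviour", "conditions", "malfunctions", "threats", "fault_kinds")
        return [p for p in parts if not getattr(self, p)]


# The keep the customer informed brief, condensed.
kci = FocusArea(
    name="Keep the customer informed messages",
    behaviour="generation of messages sent to the customer during order handling",
    conditions=["straightforward orders", "orders the customer amends or cancels"],
    malfunctions=["missing message", "inaccurate dynamic content", "spurious message"],
    threats=["orders for multiple products", "contact rules amended after ordering"],
    fault_kinds=["consistent failures"],
    out_of_scope=["static content", "presentation", "latency", "impact of load"],
)

# What you get when a team is simply asked to create tests for the system.
unfocused = FocusArea(name="Test the system", behaviour="all aspects")

print(kci.missing_parts())
print(unfocused.missing_parts())
```

A record like this also makes it easy to spot briefs that have left one of the parts unstated.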

Personally I would never consider even reviewing a set of tests without first having my Focus Areas to hand.

Friday, 3 December 2010

The return of an old friend.

I have just encountered an old friend of mine; one that I see most places I go.  My friend is that recurring defect - the different date format bug.  In its most common and insidious form it is a mix of DD/MM/YYYY and MM/DD/YYYY representations of dates as strings.  Date format clashes of any sort cause defects but this is the worst one because in many cases it appears to work, waiting to create problems in future or corrupting data that passes through it.

How come by appearing to work for certain days it manages to slip through the net?  Dates presented in the DD/MM/YYYY format up to the 12th of the month will happily get converted into meaningful, though incorrect, dates by something that is looking for MM/DD/YYYY.  So the 11th of October 2010 starts off in the first format as 11/10/2010 and then gets analysed by something looking for MM/DD/YYYY and is interpreted as the 10th of November 2010.  If this is simply validation then the data entered is let through and no one is the wiser; but wait until the 13th.  However if the outcome of the incorrect interpretation of the date is stored in this form then we get the wrong date passed on for further processing.

Generally the presence of the issue can only be revealed when values of the day in the month part of the date that are greater than twelve are used.  For example the 13th of October 2010 in the first format is 13/10/2010.  If you look at it as being in the form of MM/DD/YYYY then we have MM=13 which is obviously, at least to the human brain, invalid.  I caveat the last point because though in many cases presenting this date will trigger some behaviour that reveals the fault it cannot always be guaranteed that this will be the case.
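The effect is easy to reproduce.  The snippet below is a minimal Python illustration using the standard datetime module; the helper name parse_as_month_first is made up for the example and stands in for whatever component is expecting MM/DD/YYYY.

```python
from datetime import date, datetime


def parse_as_month_first(text: str) -> date:
    """Parse a string the way a component expecting MM/DD/YYYY would."""
    return datetime.strptime(text, "%m/%d/%Y").date()


# 11 October 2010 entered as DD/MM/YYYY is silently misread.
misread = parse_as_month_first("11/10/2010")
print(misread.isoformat())  # 2010-11-10, the 10th of November

# 13 October 2010 cannot be misread as a valid date, so the clash shows itself.
try:
    parse_as_month_first("13/10/2010")
    revealed = False
except ValueError:
    revealed = True
print(revealed)  # True
```

Here the strict parser rejects a month of 13, which is the helpful case; a more lenient parser or hand written string handling may not be so obliging.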

Why this post? It is because seeing the same problem again today has reminded me that this problem is like the common cold; it is all around us and is not going to go away.  Despite all the progress in software engineering technology none of it seems to tackle this type of issue.  Perhaps it is deemed to be too unimportant to worry about and deal with. After all once found it is an 'easy fix'. Actually it may be quick to change but the change often has the potential for massive downstream ramifications.  So perhaps not tackling this is a mistake; I would say so given the many developer hours I have watched being burnt on figuring out what is going on and the million pound per week project I saw extended by weeks through a myriad of issues of this sort.

What can testers do to help in this area?  Well they can start by remembering to test every date value and every date input control with dates that have their day part greater than twelve.  Keep a short list of key dates to use and make certain their use is comprehensive.  Thirteen may turn out to be your lucky number.

Friday, 26 November 2010

Integration; the puzzle at the heart of the project.

We have recently started working with a new client on changes to their testing and delivery practice. The aim is to increase the throughput of development and at the same time accelerate delivery and maintain quality.  This has been running for a few weeks now and enough time has elapsed for us to start hearing stories about previous projects and what went well and what was problematic.

Today we had a planning session for a project that involves the connection and interoperation of two systems.  In this session it became clear that their experiences of this type of endeavour were very similar to ones we have seen elsewhere.  Connecting systems is always more complex than expected, there is always lots of stuff that is not adequately prepared, lots of things that go wrong and it always takes far longer than anyone thought.

On the plus side it was reassuring to hear their head of development recounting similar experiences and holding a similar position to my own on what has to be done next time if there is to be any chance of avoiding the same fate.  There was a common understanding of the need for someone being accountable for getting things to work.  There was similar alignment over the need to use virtual teams, the importance of preparation, the risk from environmental problems and the need for hands on technical capability and determination.

It was some years ago that we identified integration as one of the number one issues affecting projects both large and small.  A distinguishing aspect of our thinking is the major distinction we make between the act of getting it working and the act of testing whether it is working.  We always try and get clients to think of the discipline of Integration (see Integration Papers) as something that stands apart from testing; even from testing called Integration Testing.

Saturday, 20 November 2010

Testing is easy; isn’t it?

I heard a comment recently; it went something along the lines of “if they can’t deliver testing to us then they won’t be able to do anything”.  Was I surprised to hear this coming from a senior test manager?  Well actually no; I wasn’t surprised.  It illustrates that even people with many years in senior testing posts can fail to understand what first class testing is, how different it is from run of the mill work and how complex and difficult it is to do first class testing well and at speed. This was not the first time I have come across this view and I doubt it will be the last.

Perhaps one day there will be a more general recognition of the downside of viewing testing as something that can always be done on the cheap and as one of the easiest things to give to the lowest bidder.  Until that day it seems it will always be testing that is the first target for cost cutting.  However I think I may have a very long wait for any change of attitude.  After all if senior test managers hold the view that testing is far easier to do than development then what chance is there of a change in the wider development space; never mind in the views of finance and procurement teams.

Tuesday, 16 November 2010

To release or not to release, that is the question.

Here are two interesting propositions.  Number one; test managers should focus on getting as quickly as possible to a state where it is obvious that further testing offers little benefit compared with finding out how the system survives in the wild.  Number two; it is easier to make the decision to release a system when delaying the release to permit further testing is not likely to put you in any better position than you are already in.   The interplay of these two propositions is discussed below.

For a number of years I was part of the programme leadership team that governed the development and release of a very large and very critical telecoms OSS system.  This system was so large, so complex and so important that release decisions were never simple. We would spend a lot of time converging on a good deployment position; one that realised the maximum benefits from the release whilst containing the risks. 

As you might expect sometimes making a decision was hard; things were not clear and it could go either way.  These decisions often involved long debates based on uncertain information. We found that ways of thinking evolved that made decisions easier.  One of the most powerful tools that evolved was a very simple question – “If we delay the release another two weeks and carry on testing then will we be in any better position to make a decision?”.

When the answer to that question was “no” we knew it was time to take a deep breath, go for it and deal with any consequences that arose (and we became quite effective at dealing with those occasions when there were consequences you would not want to experience).  This question worked well in that environment because the cost of not deploying was high; it was a high intensity delivery environment with a heavy emphasis on deploying and moving onto the next release.  That said the question is a tool that can be used in many environments.

Returning to test managers and to their aims.  If a key part of the decision to release a system is a question of the form “Can any more testing be of benefit?” then test managers should plan to get to a position where the answer would be “No” as soon as possible and to manage execution to achieve this answer as soon as possible.  In doing this they accelerate delivery of the system.  The sooner the answer can be “more testing is a waste of time” the sooner the benefits of the system will be seen.

Epilogue

Just to be clear.  It is very easy to get the answer “more testing is a waste of time” if the testing is simplistic and ineffective or, worse, is simplistic and ineffective and is executed ineffectively.  This approach is not recommended.  Rather do well thought out highly effective testing and do it quickly.  You and your colleagues on the development side should hold similar opinions as to when the optimum point has been reached.  If there is a caveat that goes something like “but we would spend more time testing if the testing were better” then there is some need for improvement.

Saturday, 13 November 2010

Performance by request.

After doing a fair bit of performance testing and troubleshooting we have seen the effects of performance only receiving attention at the end of the project. We encounter teams making herculean efforts to wring acceptable performance out of systems; we encounter systems that do not reach and never will reach acceptable levels; we encounter cancellations.

Few organisations spend much time and effort worrying about performance at the start of a project. Many spend an awful lot of time and money at the end dealing with the consequences. This pattern is not limited to naive first offenders; there are major organisations, ones that most people would expect to have sophisticated performance risk controls, that fall foul of this problem. It would be safe to say that, in general, the software industry doesn’t do performance engineering; it does performance mend and make do.


What makes this madness is that simple techniques can make things a lot better; there is no need to turn to rocket science. These techniques may not be up to delivering the performance certainty required by an air traffic control system but they can certainly reduce risk for your average web application. Some thought and a little effort can provide a major reduction in performance risk. The first trick is to ask for what you want.

Ask and you might receive.

This may sound obvious but if it is so obvious then why is it not done? The people who need the system have to ask the people supplying the system to deliver a certain level of performance. Once that has been done you can look them in the eyes and let them try and provide evidence that will convince you that this will be achieved. This is founded on the adage "if you don’t ask then you don’t get". When you think about it if you don’t ask for something then what is the chance you will get it?

For this to work well two things are necessary. Firstly the people doing the asking have to understand what they need and have to express it in an organised way. Secondly they have to be sensible and avoid asking for the impossible; if you do you won’t get it and you won’t be taken seriously so you may end up with something worse than you could have had.

How to describe what you need.

Has anyone seen a performance requirement of the form "all response times must be less than 3 seconds"? How much difference do you think that makes to the way developers approach the implementation of individual features? Not a jot; it has no real influence on the end game whatsoever. How can this be done better? Three techniques provide the right framework.

(1) Recognise that the amount of time a user can wait for a response without it becoming a usability or throughput issue depends upon what the user is doing and what they are waiting for. Reflect these different needs as separate performance requirements with different and appropriate targets for each. Differentiation of types of responses is essential.

(2) Accept that, generally, real systems go slower when busy. With no one on it may be lightning fast; on a normal day it may be quick; during the busiest period of the year it will almost inevitably be slower. Think about the different loads it will be used under and set distinct targets for each one. The limits may be close or it may be that at your busiest time you relax them; whichever it is, it is good to be explicit about it.

This discipline avoids there being a covert interpretation that your targets are for ‘normal’ load conditions and an unstated assumption that in more extreme periods slower responses are acceptable. Also it can point architectural design in the right direction. Trade offs become possible; particularly when some aspects must remain constant under all conditions whilst some can slow down under heavier loads.

(3) Don’t use a simple limit; this can have strange side effects. You might pick the number that reflects the speed you want in the vast majority of cases but specify it as a maximum. Its origins are likely to mean that it is too challenging to achieve in all cases; if this is glaringly obvious the requirement is discredited. Alternatively you might pick the worst acceptable duration; now you have not constrained the middle ground; suppose they all come in around this limit. Targets should be percentile distributions; not single upper limits nor single percentile limits.

In summary identify things or classes of things with different response requirements, have distinct targets for different periods and use percentile distribution profiles to define each target.
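As a rough sketch of what such a target might look like when written down and checked, here is a small Python example.  Everything in it is an assumption made for illustration: the response type, the load periods, the percentile figures and the function names are invented, and the nearest-rank percentile used is just one common definition.

```python
from typing import Dict, List, Tuple

# Hypothetical targets: (response type, load period) -> {percentile: limit in seconds}
TARGETS: Dict[Tuple[str, str], Dict[int, float]] = {
    ("account search", "normal"): {50: 1.0, 90: 2.0, 99: 5.0},
    ("account search", "peak"): {50: 1.5, 90: 3.0, 99: 8.0},
}


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of response times."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def meets_profile(samples: List[float], profile: Dict[int, float]) -> bool:
    """True when every percentile in the profile is within its limit."""
    return all(percentile(samples, pct) <= limit for pct, limit in profile.items())


profile = TARGETS[("account search", "normal")]

# Mostly quick with one slow outlier: a single 3 second maximum would reject this.
mostly_quick = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.5, 1.9, 4.0]

# Everything just under 3 seconds: a single 3 second maximum would accept this.
all_near_limit = [2.9] * 10

print(meets_profile(mostly_quick, profile))    # True
print(meets_profile(all_near_limit, profile))  # False
```

The two sample sets show why a single upper limit sends the wrong signal in both directions; the profile accepts the first and rejects the second.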

Remaining realistic.

The second trick is to ask for things that you stand a chance of getting. Base your requirements on what the sort of technology you are mandating or are willing to pay for is able to deliver. Web technology has its strengths but does not deliver 95% of interface updates for zone selection in under 0.5 seconds under any circumstances. The targets set have to be achievable or they will be ignored.

Reflect on what is plausible given the technology and the environment it is used in. What does this mean if you have activities that must complete in a time that is unrealistic? It means you have to step back and reassess the concept. Redesign the interaction and the task structure to reduce the time criticality. Alternatively ask have we chosen the right technology?

Targets have to be achievable; unachievable ones will either be ignored or will consume lots of resources and attention and then fail. Where the tasking and interaction design mandates targets that cannot be met you need to redesign or reassess the technology options.

Concluding

One of the biggest mistakes possible is to fail to put enough thought and care into specifying performance requirements. If you don’t ask or if your request is nonsense then you risk getting something far removed from what you need. When you do decide to ask properly you have to really understand your need, define it in the right structure and ensure that what you ask for is possible. Once this framework is in place developers have something to work to and you have a firm basis for performance assurance activities.

Sunday, 7 November 2010

Testing the discipline that lives in Flatland

Flatland: A Romance of Many Dimensions is a novella set in a world whose citizens are only aware of two dimensions; the third one is a secret.  After many years of observing the way that organisations approach software testing I have an ever strengthening belief that testing is hindered by a failure to recognise dimensions along which layered approaches should be used.  Testing is a discipline where anonymous uniform interchangeable tests exist and managers think in two dimensions these being effort and schedule.  These Flatland style limitations lead to testing that is both ineffective and inefficient.

So after that philosophical introduction what am I really getting at?  There are a number of things about the way testing is generally approached, resourced and executed that lack a layered approach (layering denoting a dimension) and that suffer as a result.  In this post I will describe the main ones that are repeatedly found in organisations we work with.  Later I hope to make time to explore each in more detail.  The four recurring themes are:
  1. People. There are testers and well there are testers; that is it.  Compare this with enterprise level development organisations where we see architects, lead end-to-end designers, platform architects, platform designers, lead developers and developers.  This is not necessarily anything to do with the line or task management structures; this is people with different levels of skill and experience who are matched to the different challenges to be faced when delivering the work.  Compare again testing where organisations generally think in terms of a flat interchangeable population of testers.  A source of problems or not; what do you think?
  2. Single step test set creation.  At one point there is nothing other than a need to have some tests, usually to have them ready very quickly, then there are several hundred test cases often described as a sequence of activities to be executed.  Any idea how we got from A to B; any idea whether B is anywhere near the right place never mind whether it is optimal; any chance of figuring it out retrospectively? No; not a chance.  It's like starting off with a high level wish for a system and coding like mad for two weeks and expecting to get something of value (actually come to think of it isn't there something called Agile...).  Seriously an effective test set is a complex optimised construct; complex constructs generally do not get to be coherent and optimised without a layered process of decomposition and design.  In most places test set design lacks any layered systematic approach and has no transparency; it depends on the ability and the on the day performance of the individual tester. Then once it is done it is done; you can't review and inspect quality into something that is not in the right place to start off with.
  3. Tiers of testing. Many places and projects have separate testing activities; for example system testing, end-to-end testing, customer experience testing, business testing and acceptance testing. How often is the theoretical distinction clear; how often does the reality match the theory?  Take a look and in many cases you will see that the tests are all similar in style and coverage. There is a tendency to converge on testing that the system does what it says it does and to do this in the areas and ways that are easy to test.  This can lead to a drive to merge the testing into one homogeneous mass to save time and cost; given that the tests had already become indistinguishable it is a drive that is hard to resist.  Distinct tiered testing has a high value but the lack of clear recognition of what makes the tiers different is the start of the road to failure.
  4. The focus of tests.  When you see a test can you tell what sort of errors it is trying to find?  Is it designed to find reliability problems, to ensure user anomalies are handled, to ensure a user always knows what is going on or to check that a sale is reflected correctly in the accounting system?  A different focus requires a different type of test.  Yet generally there are just tests and more tests.  No concept of a specific focus for a particular group of tests, little concept of  different types of test to serve different purposes.  Testers lack clear guidance on what the tests they are designing need to do and so produce generic tests that deliver generic test results.
These four themes demonstrate a common lack of sophistication in the way that testing is approached.  A view of testing as a set of uniform activities to be exercised by standardised people in a single step process is the downfall of many testing activities.  It is a Flatland approach and testing practices need to invade and spread out along these other dimensions for testing to become more effective and valued.  Hopefully I will be able to provide some ideas on how to escape from Flatland at a later date.