Properties of good assessment


The heande page on this topic now links here. The old heande version can be accessed here (password required). The old discussions are shown on the discussion page.

The Properties of Good Assessment framework was originally published by Tuomisto and Pohjola (2007). This page describes an updated version of the framework, as applied in evaluating the Assessment of the health impacts of H1N1 vaccination. Another recent example of applying the framework is the evaluation of the Biofuel assessments, in which the framework was slightly simplified and merged with the dimensions of openness framework.

The framework consists of three categories, which are broken down into nine properties that jointly constitute the performance of a model or assessment. The framework is designed to be applicable for evaluating both quantitative and qualitative information. The evaluation itself can also be made either quantitatively or qualitatively, depending on the specific evaluation methods that are applied. The framework does not prescribe any particular evaluation methods, as long as their application is in line with how the properties in the framework are defined. The framework is scalable: it can be applied to a whole model or assessment, to one of its parts (e.g. a sub-model, node, variable or parameter), or to a set of models, assessments, or their parts. For managing the interaction between modelling or assessment and the use of their outputs, the framework is probably most intuitive when applied at the level of a single model or assessment.

The categories and properties of the framework are presented in Table 1 and discussed in more detail below. In the table, the description column gives a general explanation of the meaning of each property. The question column then makes the description more concrete by providing example questions that could be asked when evaluating a model or assessment in terms of that property. For clarity, the example questions are formulated at the level of evaluating a whole assessment consisting of only one assessment question and one corresponding answer (along with its reasoning), unless otherwise indicated.

Table 1. Properties of good assessment.

| Category | Property | Description | Question |
|---|---|---|---|
| Quality of content | Informativeness | Specificity of information, e.g. tightness of spread for a distribution. | How many possible worlds does the answer rule out? How few possible interpretations are there for the answer? |
| Quality of content | Calibration | Exactness or correctness of information. In practice often assessed in comparison to some other estimate or a gold standard. | How close is the answer to reality or the real value? |
| Quality of content | Coherence | Correspondence between questions and answers. Also between sets of questions and answers. | How completely does the answer address the assessment question? Is everything addressed? Is something unnecessary? |
| Applicability | Relevance | Correspondence between output and its intended use. | How well does the information provided by the assessment serve the needs of the users? Is the assessment question good? |
| Applicability | Availability | Accessibility of the output to users in terms of e.g. time, location, extent of information, extent of users. | Is the information provided by the assessment available when, where and to whom it is needed? |
| Applicability | Usability | Potential of the information in the output to trigger understanding in its user(s) about what it describes. | Can the users perceive and internalize the information provided by the assessment? Does the users' understanding of the assessed issue increase? |
| Applicability | Acceptability | Potential of the output being accepted by its users. Fundamentally a matter of its making and delivery, not its information content. | Is the assessment result (output), and the way it is obtained and delivered for use, perceived as acceptable by the users? |
| Efficiency | Intra-assessment efficiency | Resource expenditure of producing the assessment output. | How much effort is spent in the making of an assessment? |
| Efficiency | Inter-assessment efficiency | Resource expenditure of producing assessment outputs in a series of assessments. | If another (somewhat similar) assessment was made, how much (less) effort would be needed? |
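
For recording evaluations in a structured form, the categories and properties of Table 1 could be encoded as a small data structure. The following Python sketch is only an illustration: the names `FRAMEWORK` and `blank_evaluation`, and the example judgement text, are hypothetical and not part of the published framework.

```python
# Categories and properties as listed in Table 1.
FRAMEWORK = {
    "Quality of content": ["Informativeness", "Calibration", "Coherence"],
    "Applicability": ["Relevance", "Availability", "Usability", "Acceptability"],
    "Efficiency": ["Intra-assessment efficiency", "Inter-assessment efficiency"],
}


def blank_evaluation(target):
    """Return an empty evaluation form for one target (hypothetical helper).

    The target can be a whole assessment, one of its parts (e.g. a variable),
    or a set of assessments. Each property gets a slot that can hold either
    a quantitative or a qualitative judgement.
    """
    return {
        "target": target,
        "judgements": {
            (category, prop): None
            for category, props in FRAMEWORK.items()
            for prop in props
        },
    }


form = blank_evaluation("Example assessment")
# Placeholder judgement text, for illustration only.
form["judgements"][("Applicability", "Availability")] = "qualitative note goes here"
print(len(form["judgements"]))  # number of properties in the form
```

Because each slot accepts any value, the same form can hold numeric scores or free-text judgements, which mirrors the point that the framework does not prescribe an evaluation method.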

Quality of content

As the name implies, the properties in the first category, quality of content, address characteristics of the information content of the assessment output. These properties characterize performance in relation to the general purpose of modelling and assessment: describing reality.

Informativeness and calibration are tightly interlinked properties, and it makes most sense to consider them together as describing the truthlikeness (cf. Niiniluoto, 1997) of the answers provided by an assessment. Together they form a pair that is quite similar to e.g. accuracy and precision of quantitative information (see e.g. accuracy and precision in Wikipedia), but with a somewhat different and more flexible interpretation, particularly for non-quantitative information. The basic challenge with calibration is that in most cases it is not possible to know what the absolute truth is. In practice, calibration often needs to be evaluated against e.g. gold standards or estimates obtained by other means, or indirectly by evaluating the calibration of the source of information. Clarification and examples of informativeness and calibration in expert elicitation can be found e.g. in Cooke (1991) and Tuomisto et al. (2008).
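
For quantitative interval estimates, the interplay of the two properties can be illustrated with a deliberately simplified Python sketch. It is not the scoring method of Cooke (1991) or of any other cited source. The function name `interval_scores`, the use of hit rate as a calibration proxy and of mean interval width as an (inverse) informativeness proxy, and all the numbers are assumptions made for illustration.

```python
def interval_scores(intervals, reference_values):
    """Score interval estimates against reference values.

    intervals: list of (low, high) tuples, one per estimated quantity.
    reference_values: list of reference ("gold standard") values.
    Returns (hit_rate, mean_width): hit_rate is a crude calibration proxy,
    mean_width a crude inverse proxy of informativeness (narrower intervals
    rule out more possible values).
    """
    if len(intervals) != len(reference_values) or not intervals:
        raise ValueError("need equally long, non-empty inputs")
    hits = sum(low <= ref <= high
               for (low, high), ref in zip(intervals, reference_values))
    widths = [high - low for low, high in intervals]
    return hits / len(intervals), sum(widths) / len(widths)


# Made-up numbers for illustration only.
reference = [10.0, 20.0, 30.0, 40.0]
narrow = [(9.0, 11.0), (21.0, 23.0), (29.0, 31.0), (44.0, 46.0)]
wide = [(0.0, 50.0)] * 4

print(interval_scores(narrow, reference))  # (0.5, 2.0)
print(interval_scores(wide, reference))    # (1.0, 50.0)
```

In this toy case the wide intervals always contain the reference value but say very little, whereas the narrow ones are specific but miss half the time, which is why neither property is meaningful when considered alone.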

Coherence considers the match between the questions asked and answers provided in an assessment; how completely are the questions answered? It necessitates explication of the assessment questions (and sub-questions). It does not, however, consider the goodness of the questions.
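
When the assessment questions and sub-questions have been made explicit, a first rough check of coherence is a comparison of what was asked with what was answered. The Python sketch below assumes that questions carry identifiers; the function `coherence_check` and the identifiers are hypothetical, and a real evaluation would also have to judge how fully each individual answer addresses its question.

```python
def coherence_check(questions, answers):
    """Compare the questions asked with the questions answered.

    questions: iterable of identifiers of the questions posed.
    answers: dict mapping question identifier -> answer text.
    """
    asked = set(questions)
    answered = {q for q, text in answers.items() if text}
    return {
        # Share of the posed questions that received an answer.
        "coverage": len(asked & answered) / len(asked) if asked else 0.0,
        # Questions posed but not answered ("Is everything addressed?").
        "unanswered": sorted(asked - answered),
        # Answers to questions nobody asked ("Is something unnecessary?").
        "unnecessary": sorted(answered - asked),
    }


result = coherence_check(
    ["Q1", "Q2", "Q3"],
    {"Q1": "answer text", "Q3": "answer text", "Q9": "answer text"},
)
print(result["unanswered"], result["unnecessary"])  # ['Q2'] ['Q9']
```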


Applicability

The properties in the second category, applicability, consider the assessment output, i.e. the information product delivered for use. Because they cover not only the information content and structural features of the output but also its delivery to use, these properties extend to aspects of both making and using the output. The applicability properties characterize performance in relation to the instrumental purposes of serving practical needs. This requires that the purposes are identified and made explicit.

Fundamentally, applicability is about how well the output triggers the intended cognitive processes among its users, leading to increased understanding and to decisions and actions guided by that understanding. The potential for achieving this is considered to be a function of whether the questions addressed are the right ones in relation to the needs (relevance), how well the information produced by modelling and assessment reaches its targets (availability), to what extent the receivers can make use of the information (usability), and whether it is accepted or rejected by the users (acceptability). We say potential, because the ultimate overall applicability results from multiple factors, many of which, e.g. the cognitive capacities of the users and many situational factors, can be considered to be beyond the influence of modellers and assessors. Consequently, whereas the first two applicability properties, relevance and availability, are explicitly, although not necessarily easily, measurable, the last two, usability and acceptability, are trickier, as they vary significantly from one user and one situation to another. The issues of availability are also addressed in the dimensions of openness, a framework for designing and managing effective processes for participatory assessment and policy making, in terms of scope of participation, access to information, timing of openness, scope of contribution and impact of contribution (Pohjola and Tuomisto, 2011).


Efficiency

The third category, efficiency, takes a relatively simple and straightforward approach to characterizing the process of modelling and assessment. It consists of two measures of resource expenditure. Intra-assessment efficiency considers the resource expenditure for a given output in one assessment. Inter-assessment efficiency considers the change in efficiency, or a corresponding change in resource expenditure, for a given output in a series of assessments. Whereas the first measure is probably intuitive and easy to grasp, the latter may require some explanation.

The idea behind inter-assessment efficiency is that, given the output produced in an assessment and the corresponding expenditure of resources, a related assessment can be expected to produce comparable output with less resource expenditure, or better output with the same resource expenditure. This can take place e.g. through the learning of modellers and assessors, but particularly through the development, dissemination and sharing of re-usable assessment modules, sub-models etc. (cf. Haas and Jaeger, 2005; Harmsze, 2000). This avoids unnecessary duplicate work and allows effort to be focused on the most important or complicated aspects of modelling or assessment exercises.
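
The two efficiency measures can be sketched numerically, assuming that both the output and the effort of an assessment can be expressed as positive numbers. The framework itself does not fix such units, so the output score, the effort unit (here person-months), the function names and the figures in this Python sketch are all hypothetical.

```python
def intra_efficiency(output, effort):
    """Output obtained per unit of effort within one assessment."""
    if output <= 0 or effort <= 0:
        raise ValueError("output and effort must be positive")
    return output / effort


def inter_efficiency_change(series):
    """Relative change in efficiency between consecutive assessments.

    series: list of (output, effort) pairs in chronological order.
    A positive value means the later assessment was more efficient.
    """
    effs = [intra_efficiency(output, effort) for output, effort in series]
    return [(later - earlier) / earlier
            for earlier, later in zip(effs, effs[1:])]


# Made-up series: comparable output, effort falling as modules are re-used.
series = [(1.0, 10.0), (1.0, 8.0), (1.0, 5.0)]
print(inter_efficiency_change(series))
```

With these illustrative figures, efficiency rises by roughly a quarter from the first assessment to the second and by roughly 60 percent from the second to the third.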


The first two categories, quality of content and applicability, characterize the potential of an assessment to deliver the intended outcomes by both describing reality and serving practical needs. The third category characterizes the efficiency with which this potential is produced. The overall performance, constituted as an aggregate of all properties, can be called effectiveness. Following Hokkanen and Kojo (2003), effectiveness is defined as the likelihood of an assessment process achieving the desired results and the goals set for it. The Properties of Good Assessment framework can be considered an operationalization of this definition: the effectiveness of a model or assessment is a function of the quality of content and applicability of its output and the efficiency of its making and delivery. It should be remembered, however, that this measure of effectiveness characterizes the likelihood of delivering the outcomes, not their actual realization. Nevertheless, the framework is a major step towards bridging modelling and assessment outputs with their intended outcomes. The information provided by evaluations according to the framework serves the needs of designing and managing effective modelling and assessment endeavours. In retrospective follow-up evaluations of model and assessment effectiveness, this information needs to be complemented with direct outcome evaluations, e.g. as proposed by Matthews et al. (2011).


References

Cooke, R.M., 1991. Experts in Uncertainty: Opinion and Subjective Probability in Science. Oxford University Press, New York.

Haas, A., Jaeger, C., 2005. Agents, Bayes, and Climatic Risks – a modular modelling approach. Advances in Geosciences 4, 3–7.

Harmsze, F.A.P., 2000. A modular structure for scientific articles in an electronic environment. A Doctor's Thesis, University of Amsterdam. Available:

Hokkanen, P., Kojo, M., 2003. How environmental impact assessment influences decision-making [in Finnish]. Ympäristöministeriö, Helsinki.

Matthews, K.B., Rivington, M., Blackstock, K.L., McCrum, G., Buchan, K., Miller, D.G., 2011. Raising the bar? - The challenges of evaluating the outcomes of environmental modelling and software. Environmental Modelling & Software 26 (3), 247-257.

Niiniluoto, I., 1997. Reference invariance and truthlikeness. Philosophy of Science 64, 546-554.

Pohjola, M.V., Tuomisto, J.T., 2011. Openness in participation, assessment, and policy making upon issues of environment and environmental health: a review of literature and recent project results. Environmental Health 10, 58. doi:10.1186/1476-069X-10-58

Tuomisto, J.T., Pohjola, M.V., 2007. Open Risk Assessment - A new way of providing information for decision-making. Publications of the National Public Health Institute B18/2007. KTL - National Public Health Institute, Kuopio.

Tuomisto, J.T., Wilson, A., Evans, J.S., Tainio, M., 2008. Uncertainty in mortality response to airborne fine particulate matter: Combining European air pollution experts. Reliability Engineering & System Safety 93, 732-744.
