Evaluating the performance of information retrieval systems using test collections
Paul Clough
Information School, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S10 2TN
Mark Sanderson
School of Computer Science and Information Technology, RMIT University, GPO Box 2476, Melbourne 3001, Victoria, Australia
Introduction to information retrieval evaluation
To evaluate means to "ascertain the value or amount of something or to appraise it" (Kiewitt 1979: 3) and involves identifying suitable success criteria that can be measured in some way. Evaluation is important for designing, developing and maintaining effective Information Retrieval (or search) systems as it enables the success of an information retrieval system to be quantified and measured. This can involve evaluating characteristics of the information retrieval system itself, such as its retrieval effectiveness, or assessing consumers' acceptance or satisfaction with the system (Taube 1953). How to conduct information retrieval system evaluation has been an active area of research for the past 50 years and the subject of much discussion and debate (Saracevic 1995, Robertson 2008, Harman 2011). This is due, in part, to the need to incorporate users and user interaction into evaluation studies and to the relationship between results of laboratory-based vs. operational tests (Robertson and Hancock-Beaulieu 1992).
Traditionally in information retrieval there has been a strong focus on measuring system effectiveness: the ability of an information retrieval system to discriminate between documents that are relevant or not relevant for a given user query. This focus on the system has, in part, been influenced by the focus of the information retrieval community on the development of retrieval algorithms, together with the organization of large information retrieval evaluation events, such as the Text REtrieval Conference (or TREC) in the USA. Such events have also focused on measuring system effectiveness in a controlled experimental setting (Robertson 2008, Voorhees and Harman 2005). Efficiency of an information retrieval system has also been assessed, e.g. measuring how long the system takes to return results or memory/disk space required to store the index. Measuring the effectiveness and efficiency of an information retrieval system has commonly been conducted in a laboratory setting, with little involvement of end users and focused on assessing the performance of the underlying search algorithms; therefore, this is commonly referred to as system-oriented evaluation.
Robertson and Hancock-Beaulieu (1992) suggest that the scope of 'system' in information retrieval has slowly broadened to include more elements of the retrieval context, such as the user or the user's environment, which must be included in the evaluation of information retrieval systems. Similar remarks have been made by Saracevic (1995) and Ingwersen and Järvelin (2005). Therefore, instead of focusing on just the system (i.e., its inputs and outputs), a more user-oriented approach can be taken. This may take into account the user, the user's context and situation, and their interactions with an information retrieval system, perhaps in a real-life operational environment (Borlund 2009). This includes, for example, assessing the usability of the search interface or measuring aspects of the user's information searching behaviour (e.g., a user's satisfaction with the search results or the number of items viewed/saved). Taking into account the wider context of search is part of interactive information retrieval evaluation (Kelly 2009).
In practice it is common to utilize various evaluation approaches throughout the development of an information retrieval system, from using test collections to develop, contrast and optimize search algorithms; to conducting laboratory-based user experiments for improving design of the user interface; to evaluation carried out in situ as the information retrieval system is used in a practical setting. Järvelin (2011) contrasts system- and user-oriented approaches to information retrieval evaluation in more controlled and artificial settings with the evaluation of information retrieval systems in operational real-life settings, where it is possible to employ both user- and system-oriented approaches. The rest of this chapter focuses on system-oriented approaches to information retrieval evaluation using test collections, also referred to as the Cranfield approach.
Evaluation of information retrieval systems using test collections
For decades the primary approach to evaluation has been system-oriented, focusing on assessing how well a system can find documents of interest given a specification of the user's information need. One of the most used methodologies for conducting experiments that can be repeated and conducted in a controlled laboratory-based setting is test collection-based evaluation (Robertson 2008, Sanderson 2010, Järvelin 2011, Harman 2011). This approach to evaluation has its origins in experiments conducted at Cranfield library in the UK, which ran between 1958 and 1966, and is often referred to as the Cranfield approach or methodology (Cleverdon 1991). The aim of the Cranfield experiments was to create "a laboratory type situation where, freed as far as possible from the combination of operational variables, the performance of index languages could be considered in isolation" (Cleverdon 1967). This approach defined the need for a common document collection, query tasks and ground truths to evaluate different indexing strategies under controlled conditions abstracted away from the operational environment.
The Cranfield approach to information retrieval evaluation uses test collections: re-usable and standardised resources that can be used to evaluate information retrieval systems with respect to system effectiveness. The main components of an information retrieval test collection are the document collection, topics, and relevance assessments. These, together with evaluation measures, simulate the users of a search system in an operational setting and enable the effectiveness of an information retrieval system to be quantified. Evaluating information retrieval systems in this manner enables the comparison of different search algorithms and the effects of altering algorithm parameters to be systematically observed and quantified. Although proposed in the 1960s, this approach was popularised through the NIST-funded TREC series of large-scale evaluation campaigns that began in 1992, and subsequently by other large-scale evaluation campaigns (e.g., CLEF, NTCIR and FIRE), which have stimulated significant developments in information retrieval over the past 20 years (Voorhees and Harman 2005).
The most common way of using the Cranfield approach is to compare various retrieval strategies or systems, which is referred to as comparative evaluation. In this case the focus is on the relative performance between systems, rather than absolute scores of system effectiveness. To evaluate using the Cranfield approach typically requires these stages: (1) select different retrieval strategies or systems to compare; (2) use these to produce ranked lists of documents (often called runs) for each query (often called topics); (3) compute the effectiveness of each strategy for every query in the test collection as a function of relevant documents retrieved; (4) average the scores over all queries to compute overall effectiveness of the strategy or system; (5) use the scores to rank the strategies/systems relative to each other. In addition, statistical tests may be used to determine whether the differences between effectiveness scores for strategies/systems and their rankings are significant. This is necessary if one wants to determine the 'best' approach. In the TREC-style version of the Cranfield approach, there is a further stage required prior to (2) above, whereby the runs for each query are used to create a pool of documents that are judged for relevance, often by domain experts. This produces a list of relevant documents (often called qrels) for each query that is required in computing system effectiveness with relevance-based measures (e.g., precision and recall).
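To make these stages concrete, the following sketch (in Python) walks through them for two hypothetical systems, assuming TREC-style run and qrels files; the file names are placeholders, and the simple per-topic measure used here (precision over the top ten results) merely stands in for the evaluation measures discussed later in this chapter.

```python
# A minimal sketch of comparative evaluation over a test collection, assuming
# a qrels file with lines "topic iter docid relevance" and one run file per
# system with lines "topic Q0 docid rank score tag" (TREC-style formats).
from collections import defaultdict

def load_qrels(path):
    """Map each topic to the set of document ids judged relevant."""
    relevant = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, _, docid, rel = line.split()
            if int(rel) > 0:
                relevant[topic].add(docid)
    return relevant

def load_run(path):
    """Stage 2: map each topic to the ranked list of document ids in a run."""
    ranked = defaultdict(list)
    with open(path) as f:
        for line in f:
            topic, _, docid, rank, _, _ = line.split()
            ranked[topic].append((int(rank), docid))
    return {t: [d for _, d in sorted(docs)] for t, docs in ranked.items()}

def score_run(run, qrels, cutoff=10):
    """Stages 3 and 4: score every topic, then average over all topics."""
    per_topic = []
    for topic, relevant in qrels.items():
        top = run.get(topic, [])[:cutoff]
        per_topic.append(sum(1 for d in top if d in relevant) / cutoff)
    return sum(per_topic) / len(per_topic)

if __name__ == "__main__":
    qrels = load_qrels("qrels.txt")                          # placeholder names
    runs = {"systemA": "runA.txt", "systemB": "runB.txt"}    # placeholder names
    scores = {name: score_run(load_run(path), qrels) for name, path in runs.items()}
    # Stage 5: rank the strategies/systems relative to each other.
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name}\t{score:.4f}")
```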
Test collection-based evaluation is highly popular as a method for developing retrieval strategies. Benchmarks can be used by multiple researchers to evaluate in a standardised manner and with the same experimental set-up, thereby enabling the comparison of results. In addition, user-oriented evaluation, although highly beneficial, is costly and complex and often difficult to replicate. It is this stability and standardization that makes the test collection so attractive. However, there are a number of limitations to test collection-based evaluation due to its abstraction from reality (Ingwersen and Järvelin 2005, pp. 6-9). Test collection experiments make a number of assumptions: that the relevance of documents is independent of each other; that all documents are equally important; that the user's information need remains static; that a single set of judgments for a query is representative of the user population; that the lists of relevant documents for each query are exhaustive (Voorhees 2002).
Building test collections
A test collection usually consists of a document collection, a set of topics that describe a user's information need and a set of relevance judgments indicating which documents in the collection are relevant to each topic. When constructing a test collection there are typically a number of practical issues that must be addressed (Sanderson and Braschler 2009). By modifying the components of a test collection and the evaluation measures used, different retrieval problems and domains can be simulated. The original and most common problem modelled is ad hoc retrieval: the situation in which an information retrieval system is presented with a previously unseen query. However, test collection-based evaluations have also been carried out on tasks including question answering, information filtering, text summarization, topic detection and tracking, and image and video retrieval.
Document collection
IR systems index documents that are retrieved in response to users' queries. A test collection must contain a static set of documents that should reflect the kinds of documents likely to be found in the operational setting or domain. This might involve digital library collections or sets of Web pages; texts or multimedia items (e.g., images and videos). The notion of a static document collection is important as it ensures that results can be reproduced upon re-use of the test collection. Specific questions that might be considered when gathering documents include:
- How many items should be gathered? In the case of smaller digital library collections, all items could be included in the test collection. However, in situations involving larger collections or constantly changing collections, e.g. the Web, then a sample of documents must be collected.
- What items should be sampled to create the document collection? Items in the test collection should represent as faithfully as possible those found in an operational setting. If items must be selected for a document collection then it may be appropriate to select a random sample or a specific subset of documents.
- What about copyright constraints? If the document collection is to be shared with others then an important aspect of which documents to include may involve considering copyright issues.
Topics
Information retrieval systems are evaluated for how well they answer users' search requests. In the case of ad hoc retrieval, the test collection must contain a set of statements that describe typical users' information needs. These might be expressed as queries that are submitted to an information retrieval system, questions or longer written descriptions. For example, TREC uses the notion of a topic, which typically consists of three fields: title, description and narrative. The title field represents a typical set of keywords a user might issue for a given topic. The description field provides a longer statement, normally a sentence, of the information need. The narrative field describes in more detail a user's information need and what they are attempting to find. This often includes a description of what constitutes relevant (and non-relevant) documents and may be used by people judging relevance (if not the person who generated the topic). A short sketch of reading topics in this format is given after the list below. Practical issues that often arise during creation of topics include:
- How should a suitable set of topics be generated? Ideally realistic queries should be developed for the test collection. These could be derived, for example, from analysing the query logs of operational systems, from the results of conducting user studies with end users of existing systems, or based on the input of domain experts.
- How many topics are required for obtaining reliable evaluation results? In practice, a minimum of 50 topics should be included in the test collection to ensure reliable results. However, results have also shown that making fewer relevance judgments over greater numbers of queries leads to more reliable evaluation (Carterette et al. 2008).
- Do the topics represent a diverse enough set of information needs? Often queries on a range of topics/themes should be selected with varying characteristics to test information retrieval systems under a range of settings. For example, one might include shorter and longer queries, queries for specific entities (e.g., people or places) versus more subject-based queries. Research has shown that some queries are more useful in evaluation than others (Mizzaro and Robertson 2007).
- How should the topics be expressed? The topics might be expressed as a set of keywords and/or verbal descriptions. For other forms of media this may include non-textual descriptions. For example, for image retrieval the topics might include example images that can be used to initiate retrieval. It should be noted, however, that realistic queries in practice tend to be short and ambiguous.
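As an illustration of the topic format described above, the sketch below reads topics distributed in the SGML-like layout used by many TREC ad hoc tracks (num, title, desc and narr fields). The file name is a placeholder and real topic files vary slightly between tracks, so this should be treated as a starting point rather than a general-purpose parser.

```python
# A minimal sketch of parsing TREC-style topic files into dictionaries with
# the fields num, title, desc and narr; the file name is hypothetical.
import re

FIELD = re.compile(
    r"<num>\s*(?:Number:)?\s*(?P<num>\S+)\s*"
    r"<title>\s*(?P<title>.*?)\s*"
    r"<desc>\s*(?:Description:)?\s*(?P<desc>.*?)\s*"
    r"<narr>\s*(?:Narrative:)?\s*(?P<narr>.*?)\s*</top>",
    re.DOTALL,
)

def load_topics(path):
    """Return a list of topics, each a dict with num, title, desc and narr."""
    with open(path) as f:
        text = f.read()
    return [m.groupdict() for m in FIELD.finditer(text)]

if __name__ == "__main__":
    for topic in load_topics("topics.txt"):   # hypothetical file name
        print(topic["num"], "-", topic["title"])
```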
Relevance assessments
For each topic in the test collection, a set of relevance judgments must be created indicating which documents in the collection are relevant to each topic. The notion of relevance used in the Cranfield approach is commonly interpreted as topical relevance: whether a document contains information on the same topic as the query. In addition, relevance is assumed to be consistent across assessors and static across judgments. However, this is a narrow view of relevance, which has been shown to be subjective, situational and multi-dimensional (Schamber 1994). Some have speculated that this variability would affect the accuracy of the measurement of retrieval effectiveness. A series of experiments were conducted to test this hypothesis (Cleverdon 1970, Voorhees 1998) with results showing that despite there being marked differences in the documents that different assessors judged as relevant or non-relevant, the differences did not substantially affect the relative ordering of information retrieval systems being measured using the different assessments.
Relevance judgments can be binary (relevant or not relevant) or use graded relevance judgments, e.g. highly relevant, partially relevant or non-relevant. The use of binary versus graded relevance judgments is important as it has implications for which evaluation measures can be used to evaluate the information retrieval systems. Commonly the assessors will be the people who originally created the topics, but this is not always the case. For example, relevance assessments might be gathered in a distributed way using multiple assessors and crowdsourcing. It must, however, be recognised that domain expertise can affect the quality of relevance assessments obtained (Bailey et al. 2008, Kinney et al. 2008).
There are various ways of gathering the relevance assessments. For example, in TREC the common approach is to gather the top n results from the different information retrieval systems under test for each topic and aggregate results into a single list of results for judging (called pooling). This assumes that the result lists of different information retrieval systems are diverse and therefore will bring relevant documents into the pool. The relevance assessors then go through the pool and make relevance judgments on each document, which can then be used to compute system effectiveness. Documents which are not judged are often categorised as not relevant. An issue with pooling is the completeness of relevance assessments. Ideally for each topic one should find all relevant documents in the document collection; however, pooling may only find a subset. Approaches to help overcome this include adding the results of manual searches to the pool of documents for assessment, or supplementing the sets of relevance judgments with additional relevant documents discovered during further manual searching. Generating complete sets of relevance judgments helps to ensure that when evaluating future systems, improvements in results can be detected. Generating relevance assessments is often highly time-consuming and labor intensive. This often leads to a bottleneck in the creation of test collections. Various techniques have been proposed to make the process of relevance assessment more efficient. A simple sketch of how a pool might be formed from system runs is given after the list below. Practical issues that often arise during relevance assessments include:
- Who should do the assessments? Ideally the topic generator will also perform relevance judgments. Bailey et al. (2008) compared the relevance assessments of judges who were subject experts and those who were not. They found that different assessments resulted, which had an effect on the measurement of system effectiveness (although the ranking remained the same).
- How many assessments should be made? There is evidence to suggest that assessing fewer documents for more topics (shallow and wide) is better than making exhaustive judgments for fewer topics (narrow and deep) (Carterette et al. 2008).
- What are the assessors expected to do? The relevance schemes must be decided and assessors provided with clear instructions on how to make relevance assessments. Kinney et al. (2008) showed that relevance assessments were more reliable when assessors were given detailed descriptions of what the user issuing a query was seeking.
- What about finding missing relevant documents? If a new system or configuration finds relevant documents that have not been identified previously, its measured performance will be underestimated (especially for ad hoc retrieval and using pooling techniques). Either the original set of judgments should be made as complete as possible using techniques to supplement the pools, or testers should use measures of system effectiveness that can deal with incomplete judgments (Aslam et al. 2006), or testers should generate new judgments for results obtained from new systems.
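As referred to above, the following sketch illustrates how a judging pool might be formed by taking the union of the top-ranked documents from a set of TREC-style run files; the pool depth of 100 and the file names are assumptions made purely for illustration.

```python
# A minimal sketch of pooling, assuming each participating system has produced
# a TREC-style run file with lines "topic Q0 docid rank score tag".
from collections import defaultdict

POOL_DEPTH = 100  # top-n documents contributed by each run (an assumption)

def top_n(run_path, n=POOL_DEPTH):
    """Map each topic to the n highest-ranked document ids in the run."""
    ranked = defaultdict(list)
    with open(run_path) as f:
        for line in f:
            topic, _, docid, rank, _, _ = line.split()
            ranked[topic].append((int(rank), docid))
    return {t: [d for _, d in sorted(docs)[:n]] for t, docs in ranked.items()}

def build_pool(run_paths):
    """Union of each run's top-n documents per topic, ready for assessment."""
    pool = defaultdict(set)
    for path in run_paths:
        for topic, docs in top_n(path).items():
            pool[topic].update(docs)
    return pool

if __name__ == "__main__":
    pool = build_pool(["runA.txt", "runB.txt", "runC.txt"])  # hypothetical runs
    for topic, docs in sorted(pool.items()):
        print(f"topic {topic}: {len(docs)} documents to judge")
```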
Assessing system effectiveness: evaluation measures
Evaluation measures provide a way of quantifying retrieval effectiveness (Manning et al. 2008, Croft et al. 2009). Together, the test collection and evaluation measure provide a simulation of the user of an information retrieval system. For example, in the case of ad hoc retrieval the user is modelled as submitting a single query and being presented with a ranked list of results. One assumes that the user then starts at the top of the ranked list and works their way down examining each document in turn for relevance. This, of course, is an estimation of how users behave; in practice they are often far less predictable. There are also further complications that must be considered. For example, research has shown that users are more likely to select documents higher up in the ranking (rank bias); measures typically assume that no connection exists between retrieved documents (independence assumption); and particularly in the case of Web search, one must decide how to deal with duplicate documents: should they be counted as relevant or ignored as they have been previously seen?
Set-based measures
Two simple measures developed early on were precision and recall. These are set-based measures: documents in the ranking are treated as unique and the ordering of results is ignored. Precision measures the fraction of retrieved documents that are relevant; recall measures the fraction of relevant documents that are retrieved. Precision and recall hold an approximate inverse relationship: higher precision is often coupled with lower recall. However, this is not always the case as it has been shown that precision is affected by the retrieval of non-relevant documents; recall is not. Compared to other evaluation measures, precision is simple to compute because one only considers the set of retrieved documents (as long as relevance can be judged). However, to compute recall requires comparing the set of retrieved documents with the entire collection, which is impossible in many cases (e.g., for Web search). In this situation, techniques such as pooling are used.
Often preference is given to either precision or recall. For example, in Web search the focus is typically on obtaining high precision by finding as many relevant documents as possible in the top n results. However, there are certain domains, such as patent search, where the focus is on finding all relevant documents through an exhaustive search. Alternative recall-oriented measures can then be employed (Magdy and Jones 2010). Scores for precision and recall are often combined into a single measure to allow the comparison of information retrieval systems. Example measures include the E and F measures (van Rijsbergen 1979).
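The toy example below illustrates set-based precision and recall, together with the balanced F measure (the harmonic mean of the two); the document identifiers and judgments are made up purely to show the calculations.

```python
# A toy example of set-based precision, recall and the balanced F measure
# (the harmonic mean of the two); document identifiers are made up.

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of relevant documents that are retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def f_measure(p, r):
    """Balanced F measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

retrieved = {"d1", "d2", "d3", "d4", "d5"}
relevant = {"d2", "d5", "d9", "d12"}
p, r = precision(retrieved, relevant), recall(retrieved, relevant)
print(f"P={p:.2f} R={r:.2f} F={f_measure(p, r):.2f}")  # P=0.40 R=0.50 F=0.44
```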
Rank-based measures
More commonly used measures are based on evaluating ranked retrieval results, where importance is placed not only on obtaining the maximum number of relevant documents, but also on returning relevant documents higher in the ranked list. A common way to evaluate ranked outputs is to compute precision at various levels of recall (e.g., 0.0, 0.1, 0.2, ... 1.0), or at the rank positions of all the relevant documents with the scores averaged (referred to as average precision). This can be computed across multiple queries by taking the arithmetic mean of average precision values for individual topics. This single-figure measure of precision across relevant documents and multiple queries is referred to as mean average precision (or MAP). A further common measure is precision at a fixed rank position, for example Precision at rank 10 (P10 or P@10). Because the number of relevant documents can influence the P@10 score, an alternative measure called R-precision can be used: precision is measured at the rank position Rq, the total number of relevant documents for query q.
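The following toy example sketches the rank-based measures just described (precision at a cutoff, R-precision and average precision) for a single made-up ranked list; mean average precision is then simply the arithmetic mean of the per-topic average precision values.

```python
# A toy illustration of rank-based measures; the ranked list and relevant set
# are made up for the example.

def precision_at_k(ranking, relevant, k):
    """Precision over the top k ranked documents (e.g., P@10)."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(ranking, relevant, len(relevant))

def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(per_topic):
    """Arithmetic mean of average precision over a set of topics (MAP)."""
    return sum(average_precision(r, rel) for r, rel in per_topic) / len(per_topic)

ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d7", "d4", "d8"}
print(f"P@5    = {precision_at_k(ranking, relevant, 5):.2f}")  # 0.40
print(f"R-prec = {r_precision(ranking, relevant):.2f}")        # 0.33
print(f"AP     = {average_precision(ranking, relevant):.2f}")  # 0.30
```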
More recently, measures based on non-binary (or graded) relevance judgments have been utilised, such as discounted cumulative gain (Järvelin and Kekäläinen 2002). In such measures, each document is given a score indicating relevance (e.g., relevant=2; partially-relevant=1; non-relevant=0). Discounted cumulative gain computes a value for the number of relevant documents retrieved that includes a discount function to progressively reduce the importance of relevant documents found further down the ranked results list. This simulates the assumption that users prefer relevant documents higher in the ranked list. The measure also makes the assumption that highly relevant documents are more useful than partially relevant documents, which in turn are more useful than non-relevant documents. The score can be normalised to provide a value in the range 0 to 1, known as normalised DCG (nDCG). The measure can be averaged across multiple topics similar to computing mean average precision, and it has also been extended to compute the value of retrieved results across multiple queries in a session, referred to as normalised session discounted cumulative gain or nsDCG (Järvelin et al. 2008).
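A sketch of the calculation is given below using one common formulation of the discount (the gain at rank i divided by log2(i + 1)); the graded judgments are made up, and other discount functions and log bases appear in the literature.

```python
# A sketch of discounted cumulative gain and its normalised form (nDCG) for a
# single made-up ranked list, using graded judgments: relevant=2,
# partially relevant=1, non-relevant=0. The log2(i + 1) discount is one
# common formulation; others are used in practice.
from math import log2

def dcg(gains):
    """Sum of gains, each discounted by the log of its rank."""
    return sum(g / log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG normalised by the DCG of an ideal (best possible) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

gains = [2, 0, 1, 2, 0]          # graded relevance of the ranked results
print(f"DCG  = {dcg(gains):.3f}")
print(f"nDCG = {ndcg(gains):.3f}")
```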
Other measures
Additional measures have been developed to evaluate different information retrieval problems. For example, to measure the success of search tasks where just one relevant document is required (known-item search), measures such as mean reciprocal rank (MRR) can be used. Another problem requiring bespoke evaluation measures is the assessment of the variability or diversity of results. In these cases one might want to evaluate the different aspects (or sub-topics) of a query and therefore measures such as S-recall (Zhai et al. 2003) were created. This measure promotes systems that return different relevant documents rather than all from the same (or similar) topic or aspect.
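The sketch below illustrates two of these measures on made-up data: reciprocal rank (averaged over topics to give MRR) and a simple form of subtopic recall, in which the top k documents are credited with the distinct query aspects they cover. The rankings and subtopic labels are invented for the example.

```python
# Sketches of mean reciprocal rank (known-item search) and a simple subtopic
# recall (diversity); the rankings and subtopic labels are made up.

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1 / i
    return 0.0

def mean_reciprocal_rank(per_topic):
    """Average of reciprocal ranks over a set of topics."""
    return sum(reciprocal_rank(r, rel) for r, rel in per_topic) / len(per_topic)

def s_recall(ranking, doc_subtopics, all_subtopics, k):
    """Fraction of a query's subtopics covered by the top k documents."""
    covered = set()
    for doc in ranking[:k]:
        covered.update(doc_subtopics.get(doc, set()))
    return len(covered) / len(all_subtopics)

print(mean_reciprocal_rank([(["d2", "d5"], {"d5"}), (["d1"], {"d1"})]))  # (0.5 + 1) / 2
subtopics = {"d1": {"history"}, "d2": {"economy"}, "d3": {"history", "culture"}}
print(s_recall(["d1", "d3", "d2"], subtopics, {"history", "economy", "culture"}, k=2))
```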
In practice it is important to select an evaluation measure that is suitable for the given task; for example, if the problem is known-item search then the mean reciprocal rank would be appropriate; for an ad hoc search task then mean average precision or averaged normalised discounted cumulative gain would be applicable.
Comparing results
It is common to find in practice that retrieved results are compared against each other (comparative evaluation) with the runs with the highest scores deemed as the 'best'. In this case an absolute score of retrieval performance is of less importance than scores relative to each other. One might compare sets of results produced by running the same system multiple times with varying parameter settings, or taking single runs from multiple systems and comparing them to determine the best search system. Typically evaluation measures are computed across multiple queries (or topics) and averaged to produce a final score. When comparing systems, significance testing should be used to determine whether one ranking is actually better than another rather than due to chance. Commonly used tests are Kendall's Tau and the Wilcoxon signed-rank test (Sanderson 2010: 63-73).
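As a simple illustration, the sketch below applies the Wilcoxon signed-rank test to made-up per-topic average precision scores for two systems, using SciPy; the scores, the choice of measure and the 0.05 threshold are assumptions for the example rather than recommendations.

```python
# A minimal sketch of a paired significance test over per-topic scores,
# assuming average precision has already been computed for the same topics
# under two systems; the scores below are made up and SciPy is assumed
# to be available.
from scipy.stats import wilcoxon

system_a = [0.42, 0.10, 0.55, 0.31, 0.60, 0.25, 0.48, 0.12, 0.39, 0.51]
system_b = [0.35, 0.14, 0.49, 0.30, 0.41, 0.22, 0.45, 0.09, 0.33, 0.47]

# Wilcoxon signed-rank test on the paired per-topic differences.
statistic, p_value = wilcoxon(system_a, system_b)
print(f"W = {statistic}, p = {p_value:.3f}")
if p_value < 0.05:
    print("The difference between the systems is statistically significant.")
else:
    print("The difference could plausibly be due to chance.")
```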
Current and future research directions
Laboratory-based information retrieval evaluation using test collections is a popular, and useful, approach to evaluation. However, information retrieval researchers have recognised the need to update the original Cranfield approach to allow the evaluation of new information retrieval systems that deal with varying search problems and domains (Kamps et al. 2009). Research is ongoing to tackle a range of issues in information retrieval evaluation using test collections and we discuss three examples in which we have been personally involved: gathering relevance assessments efficiently; comparing system effectiveness and user utility; and evaluating information retrieval systems over sessions. Further avenues of research include the use of simulations (Azzopardi et al. 2010), and the development of new information retrieval evaluation measures (Yilmaz et al. 2010, Smucker and Clarke 2012).
Gathering relevance assessments efficiently
One area receiving recent attention by the information retrieval research community is how to create test collections efficiently. Given the bottleneck in gathering relevance judgments, 'low-cost evaluation' techniques have been proposed. These include approaches based on focusing assessor effort on runs from particular systems or topics that are likely to contain more relevant documents (Zobel 1998), sampling documents from the pool (Aslam et al. 2006), supplementing pools with relevant documents found by manually searching the document collection with an information retrieval system, known as interactive search and judge or ISJ (Cormack et al. 1998), and simulating queries and relevance assessments based on users' queries and clicks in search logs (Zhang and Kamps 2010).
One particular approach that has received interest recently has been the use of crowdsourcing: the act of taking a job traditionally performed by a designated person and outsourcing to an undefined, generally large, group of people in the form of an open call. Amazon Mechanical Turk (AMT) is one such example of a crowdsourcing platform. This system has around 200,000 workers from many countries that perform human intelligence tasks. Recent research has demonstrated that crowdsourcing is feasible for gathering relevance assessments (Alonso and Mizzaro 2009, Kazai 2011, Carvalho et al. 2011). However, previous studies have also demonstrated that domain expertise can have an impact on the quality and reliability of relevance judgments (Bailey et al. 2008, Kinney et al. 2008) and is particularly problematic when using crowdsourcing in specialised domains (Clough et al. 2012).
System effectiveness versus user utility
The effectiveness of information retrieval systems is typically measured based on the number of relevant items found. The test collection and measure aim to simulate users in an operational setting with the assumption that if System A were to score higher than System B on a test collection then users would prefer or be more satisfied with System A over System B. This is important because if this were not the case then test collection-based evaluation would be limited: improvements in system effectiveness in the laboratory would not necessarily translate to improvements in an operational setting and thereby benefit the user.
Several studies have been undertaken to investigate the link between system effectiveness and user preference or satisfaction. Results from these studies have differed widely: some have shown that an increase in system effectiveness did not have detectable gains for the end user in practice (Hersh et al. 2000); others have demonstrated that when comparing two systems with large relative differences in system effectiveness and user preferences for which system output they prefer, results were strongly correlated (Al-Maskari et al. 2008, Sanderson et al. 2010). Ingwersen and Järvelin (2005) assert that the real issue in information retrieval system design is not whether precision and recall goes up, but rather whether the system helps users perform search tasks more effectively. There are still many unanswered questions about the link between system- and user-oriented measures of search effectiveness and satisfaction.
Evaluation across user sessions
Test collection-based evaluation to date has focused on evaluating information retrieval systems in a single query-response manner. However, in practice users typically reformulate their queries in response to results from the information retrieval system or as their information need alters during a search episode (Bates 1989; Jansen et al. 2009). This requires re-thinking information retrieval evaluation to compute system success over multiple query-response interactions (Järvelin 2009, Keskustalo et al. 2009, Kanoulas et al. 2011). For example, traditional evaluation measures, such as precision and recall, are not valid for multiple queries; therefore proposals have been made to extend existing measures. For example, Järvelin et al. (2008) extend the normalised discounted cumulative gain (nDCG) measure to one that considers multi-query sessions, called normalised session discounted cumulative gain (nsDCG); Kanoulas et al. (2011) generalize traditional evaluation measures such as average precision to multi-query session evaluation. A further aspect that must be considered is whether or not to include duplicate documents when evaluating multiple query-responses.
An example of a large-scale evaluation effort to evaluate information retrieval systems over multi-query sessions is the TREC Session Track, which has run since 2010. The TREC organisers have provided the resources and evaluation measures to assess whether systems with prior knowledge of a user's search behaviour can provide more effective search results for subsequent queries (Kanoulas et al. 2012). The test collection for session evaluation consists of a document collection, topics, and relevance assessments. However, the test collection for sessions also contains information from real user sessions, such as retrieved results, the items clicked on by users and the length of time spent viewing results and selected items. Results have broadly shown that information retrieval results can be improved for a given query with users' interaction data and that the more data used the higher the performance obtained.
Summary
Like conducting any information retrieval experiment, the use of a test collection-based approach must be planned and decisions made with respect to the experimental design (Tague-Sutcliffe 1996). The goals of the evaluation must be defined; a suitable test collection must be selected from those already in existence, or must be created specifically for the retrieval problem being addressed; different information retrieval systems or techniques must be developed or chosen for testing and comparing; evaluation measures and statistical tests must be selected for evaluating the performance of the information retrieval system or for comparing whether one version of the system is better than another. These decisions are important, as they will affect the quality of the benchmark and impact on the accuracy and usefulness of results.
Over the years, test collection-based evaluation has been highly influential in shaping the core components of modern information retrieval systems, including Web search. This has been particularly visible in the context of TREC and related studies, which have not only provided the necessary benchmarks to compare different approaches to retrieval, but also provided a community and forum in which information retrieval evaluation can be discussed and promoted. The Cranfield approach to information retrieval evaluation has remained popular for more than fifty years although it is important to remember the original purpose of test collection experimentation for information retrieval: to develop and optimize algorithms for locating and ranking a set of documents on the same topic as a given query. Indexing, matching and ranking documents with respect to queries are the basic requirements of an information retrieval algorithm, which can be tested in a laboratory using test collections. There are widely recognised weaknesses to the Cranfield approach that will provide the research community with many avenues for further research in coming years.
About the authors
Paul Clough is a Senior Lecturer in the Information School at the University of Sheffield. He received a B.Eng. (hons) degree in Computer Science from the University of York in 1998 and a Ph.D. from the University of Sheffield in 2002. His research interests mainly revolve around developing technologies to assist people with accessing and managing information. Paul can be contacted at p.d.clough@sheffield.ac.uk
Mark Sanderson is a Professor in the School of Computer Science and Information Technology at RMIT University in Melbourne, Australia. He received his Bachelor's and Doctorate degrees in Computer Science at the University of Glasgow, Scotland. He can be contacted at mark.sanderson@rmit.edu.au