Evaluation of retrieval performance is a crucial problem in content-based image retrieval (CBIR). Determining which documents are relevant and which are non-relevant for a given query is one of the most important and time-consuming tasks. With real users, judging a large number of documents takes a long time.
The working definition of relevance is: “If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant”. Only binary judgments (“relevant” or “non-relevant”) are made, and a document is judged relevant if any piece of it is relevant, regardless of how small that piece is in relation to the rest of the document.
There is also a need to standardize evaluation measures, since several measures are slight variations of the same definition. This makes it very hard to compare the performance of systems objectively. To overcome this problem, a set of standard performance measures and a standard image database are needed. After all, the ultimate aim is to measure the usefulness of a system for a user. An overview of existing performance evaluation measures in CBIR follows:
User comparison:
User comparison is an interactive method. The users judge the success of the query directly after the query. It is hard to get a large number of such user comparisons, as they are time-consuming.
Before-after comparison: This is the easiest test method. Users are given two or more different results and are asked to choose the one that is preferred or found to be most relevant to the query. This method needs a base system or, at least, another system for comparison.
Single-valued measures:
Rank of the best match: This method measures whether the “most relevant” image is in the first 50 or in the first 500 images retrieved. Here, 50 represents the number of images returned on the screen, and 500 is an estimate of the maximum number of images a user might look at when browsing.
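As a minimal sketch of this measure, assume the system returns a ranked list of image identifiers and that one image has been designated as the best match. The function name `best_match_within` and the identifiers are hypothetical, chosen only for illustration:

```python
def best_match_within(ranked_ids, best_id, cutoffs=(50, 500)):
    """Report, for each cutoff, whether best_id appears in the top `cutoff` results."""
    try:
        rank = ranked_ids.index(best_id) + 1  # 1-based rank of the best match
    except ValueError:
        rank = None  # the best match was never retrieved
    return {c: rank is not None and rank <= c for c in cutoffs}


# Illustrative data: 1000 retrieved images, best match at rank 120.
ranked = [f"img{i}" for i in range(1, 1001)]
result = best_match_within(ranked, "img120")
print(result)
```

In this made-up example the best match misses the first screen of 50 images but falls within the 500 images a user might browse.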
Average rank of relevant images:
This measure can give a good indication of system performance, although it clearly contains less information than a precision-recall graph. It is vulnerable to outliers, since a single relevant image with a very high rank number (that is, one retrieved very late) adversely affects it. A simpler and more robust measure is the rank of the relevant image, which is very useful for CBIR.
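The sensitivity to outliers can be illustrated with a short sketch. It assumes ranks are 1-based positions in the result list and that only relevant images that were actually retrieved are counted; the function name `average_rank` and the data are hypothetical:

```python
def average_rank(ranked_ids, relevant_ids):
    """Mean 1-based rank of the relevant images that appear in the result list."""
    relevant = set(relevant_ids)
    ranks = [pos for pos, img in enumerate(ranked_ids, start=1) if img in relevant]
    if not ranks:
        raise ValueError("no relevant image was retrieved")
    return sum(ranks) / len(ranks)


ranked = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]

# Relevant images at ranks 1, 2, 3.
tight = average_rank(ranked, ["a", "b", "c"])

# Same as above, except one relevant image sits at rank 10.
with_outlier = average_rank(ranked, ["a", "b", "j"])

print(tight, with_outlier)
```

Moving a single relevant image from rank 3 to rank 10 more than doubles the average, from 2.0 to roughly 4.33, even though the other two results are unchanged.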
Precision and Recall:
These are standard measures in image retrieval, and together they give a good indication of system performance. Either value alone contains insufficient information.
$$\text{Precision} = \frac{\text{No. of relevant documents retrieved}}{\text{Total no. of documents retrieved}}$$
$$\text{Recall} = \frac{\text{No. of relevant documents retrieved}}{\text{Total no. of relevant documents in collection}}$$
Recall can always be made equal to one simply by retrieving all images. Thus precision and recall should either be used together, or the number of images retrieved should be specified. Precision and recall are often averaged, but it is important to know the basis on which this is done.
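The two definitions, and the effect of retrieving everything, can be sketched as follows. This assumes a ranked result list cut off after `k` images; the function name `precision_recall_at_k` and the toy data are illustrative assumptions, not part of any particular system:

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall computed over the first k retrieved images."""
    relevant = set(relevant_ids)
    retrieved = ranked_ids[:k]
    hits = sum(1 for img in retrieved if img in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy collection of 10 images, 3 of which are relevant.
ranked = ["a", "x", "b", "y", "z", "c", "w", "v", "u", "t"]
relevant = ["a", "b", "c"]

p3, r3 = precision_recall_at_k(ranked, relevant, k=3)     # first 3 images only
p10, r10 = precision_recall_at_k(ranked, relevant, k=10)  # whole collection

print(p3, r3)
print(p10, r10)
```

Retrieving the whole collection drives recall to 1.0 while precision falls to 0.3, which is why a recall figure is only meaningful alongside precision or a stated cutoff.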
Target Testing:
The target testing approach differs significantly from the other performance measures. Each user is given a target image, and the number of images the user needs to examine before finding it is recorded. Starting with random images, the user marks images as either relevant or non-relevant.
Error Rate:
It is in fact the complement of a single precision value (error rate = 1 - precision at the same cutoff), so it is important to know where the value is measured.
$$\text{Error rate} = \frac{\text{No. of non-relevant images retrieved}}{\text{Total no. of images retrieved}}$$
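A brief sketch shows why the point of measurement matters. It assumes the error rate is taken over the first `k` images of a ranked list; the function name `error_rate_at_k` and the data are hypothetical:

```python
def error_rate_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the first k retrieved images that are non-relevant."""
    relevant = set(relevant_ids)
    retrieved = ranked_ids[:k]
    if not retrieved:
        raise ValueError("no images retrieved")
    non_relevant = sum(1 for img in retrieved if img not in relevant)
    return non_relevant / len(retrieved)


ranked = ["a", "x", "b", "y"]
relevant = ["a", "b"]

err_at_1 = error_rate_at_k(ranked, relevant, k=1)  # only "a" retrieved
err_at_4 = error_rate_at_k(ranked, relevant, k=4)  # all four retrieved

print(err_at_1, err_at_4)
```

The same query yields an error rate of 0.0 at a cutoff of 1 and 0.5 at a cutoff of 4, so a reported value is hard to interpret without its cutoff.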
Retrieval Efficiency:
If the number of images retrieved is lower than or equal to the number of relevant images, this value is precision, otherwise it is the recall of a query. This definition can be misleading since it mixes two standard measures.
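The switch between the two measures can be made explicit in a short sketch, assuming a ranked list truncated to a chosen number of retrieved images. The function name `retrieval_efficiency` and the data are illustrative assumptions:

```python
def retrieval_efficiency(ranked_ids, relevant_ids, n_retrieved):
    """Precision if at most as many images are retrieved as are relevant, else recall."""
    relevant = set(relevant_ids)
    retrieved = ranked_ids[:n_retrieved]
    if not retrieved or not relevant:
        raise ValueError("need at least one retrieved and one relevant image")
    hits = sum(1 for img in retrieved if img in relevant)
    if len(retrieved) <= len(relevant):
        return hits / len(retrieved)  # behaves as precision
    return hits / len(relevant)       # behaves as recall


ranked = ["a", "x", "b", "y", "c", "z"]
relevant = ["a", "b", "c"]

eff_2 = retrieval_efficiency(ranked, relevant, n_retrieved=2)  # 2 <= 3: precision
eff_6 = retrieval_efficiency(ranked, relevant, n_retrieved=6)  # 6 > 3: recall

print(eff_2, eff_6)
```

Both branches amount to dividing the number of relevant images retrieved by the smaller of the two set sizes, which is how one number ends up standing for two different measures.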
Correct and Incorrect Detection:
In this method the numbers of correct and incorrect classifications are counted. When divided by the number of retrieved images, these counts are equivalent to precision and error rate, respectively.