As expected, such systems are inherently noisy and generate imperfect output. Sometimes they miss tuples that appear in the documents and sometimes they generate spurious tuples. One of the important questions is how to evaluate such a system objectively and with the minimum amount of effort.
A common evaluation strategy is to use precision-recall curves to show how the system behaves under different settings. The precision of the output is defined as the number of correct tuples in the output over the total number of generated tuples; recall is defined as the number of correct tuples in the output over the total number of correct tuples that can be extracted from the documents.
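To make the definitions concrete, here is a minimal sketch that computes both metrics over sets of tuples; the tuples and the gold standard below are invented for the example.

```python
# A minimal sketch of precision and recall over extracted tuples;
# the extracted tuples and the gold standard are made up.

extracted = {                         # what the system produced
    ("Microsoft", "Redmond"),
    ("Google", "Mountain View"),
    ("Yankees", "New York"),          # spurious
    ("Mets", "Queens"),               # spurious
}
gold = {                              # every tuple actually in the documents
    ("Microsoft", "Redmond"),
    ("Google", "Mountain View"),
    ("IBM", "Armonk"),                # missed by the system
}

correct = extracted & gold
precision = len(correct) / len(extracted)    # correct / all generated
recall = len(correct) / len(gold)            # correct / all extractable

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.50, recall = 0.67
```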
Unfortunately, precision is problematic due to its dependence on the input class distribution, as the following example illustrates:
- Example: Consider an extraction system E that generates a table of companies and their headquarters locations, Headquarters(Company, Location), from news articles in the "Business" and "Sports" sections of The New York Times. The "Business" documents contain many tuples for the target relation, while the "Sports" documents do not contain any. The information extraction system works well, but occasionally extracts spurious tuples from some documents, independently of their topic. If the test set contains a large number of "Sports" documents, then the extraction system will also generate a large number of incorrect tuples from these "bad" documents, bringing down the precision of the output. In fact, the more "Sports" documents in the test set, the worse the reported precision, even though the underlying extraction system remains the same. Notice, though, that the recall is not affected by the document distribution in the test set and remains constant, independently of the number of "Sports" documents.
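Here is a back-of-the-envelope simulation of the scenario above; the per-document rates are invented purely to illustrate the argument.

```python
# Hypothetical rates, chosen only for illustration: each "Business"
# document contains 10 extractable tuples, of which the system finds 8;
# every document ("Business" or "Sports") also yields 1 spurious tuple.

def precision_recall(n_business, n_sports):
    correct = 8 * n_business
    spurious = 1 * (n_business + n_sports)
    extractable = 10 * n_business
    precision = correct / (correct + spurious)
    recall = correct / extractable
    return precision, recall

for n_sports in (0, 100, 1000):
    p, r = precision_recall(n_business=100, n_sports=n_sports)
    print(f"Sports docs = {n_sports:4d}: precision = {p:.2f}, recall = {r:.2f}")

# Sports docs =    0: precision = 0.89, recall = 0.80
# Sports docs =  100: precision = 0.80, recall = 0.80
# Sports docs = 1000: precision = 0.42, recall = 0.80
```

Adding "Sports" documents drags precision down while recall stays put, exactly the effect described above.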
The fact that precision depends on the distribution of good and bad documents in the test set is well known in machine learning, from the task of classifier evaluation. To evaluate classifiers, it is preferable to use ROC curves, which are independent of the class distribution in the test set. ROC curves summarize graphically the tradeoffs between the different types of errors. When characterizing a binary decision process with ROC curves, we plot the true positive rate (the fraction of positive examples correctly classified as positive, i.e., recall) as the ordinate, and the false positive rate (the fraction of negative examples incorrectly classified as positive) as the abscissa.
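For a standard binary classifier, where true negatives are well defined, one point on the ROC curve is computed as in the sketch below; the confusion-matrix counts are arbitrary, and varying the classifier's decision threshold traces out the full curve.

```python
# One ROC point from the confusion matrix of a binary classifier.
# The counts are arbitrary, for illustration only.
tp, fn = 80, 20    # positive examples: correctly / incorrectly classified
fp, tn = 30, 870   # negative examples: incorrectly / correctly classified

tpr = tp / (tp + fn)   # true positive rate (recall), the ordinate
fpr = fp / (fp + tn)   # false positive rate, the abscissa

print(f"ROC point: (FPR = {fpr:.3f}, TPR = {tpr:.3f})")
# ROC point: (FPR = 0.033, TPR = 0.800)
```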
The standard application of ROC curves for information extraction is unfortunately problematic, for two reasons.
First reason: We typically do not know what a "true negative" is. Unlike document classification, a "bad tuple" does not exist a priori in a document; it exists only because the extraction system can extract it.
- Solution 1: One way to overcome this problem is to measure the number of all bad tuples that can be extracted from a document using all possible settings and all available extraction systems. Then, we can use this number as the normalizing factor to define the false positive rate. This solution works when dealing with a static set of extraction systems. Alas, the definition of the false positive rate becomes unstable if we later introduce another system (or another setting) that generates previously unseen noisy tuples; this changes the number of all bad tuples, which serves as the normalizing constant, and forces a recomputation of all false positive rates. (A sketch contrasting this option with the next one appears after this list.)
- Solution 2: Another way to avoid this problem is to use an un-normalized x-axis (abscissa). Instead of the false positive rate, we can plot the average number of bad tuples generated per document. In this case, the new curve is called a "Free Response Operating Characteristic" (FROC) curve. Such techniques are widely used in radiology to evaluate the performance of systems that detect nodules in MRI and CAT scans. (A nodule is a small aggregation of cells, indicative of a disease.) A problem with this approach is the lack of a "probabilistic" interpretation of the x-axis; the probabilistic interpretation can be convenient when analyzing or integrating the extraction system as part of a bigger system, rather than simply measuring its performance in a vacuum.
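Here is a rough sketch of the two x-axis options: a false positive rate normalized by the pooled set of bad tuples (Solution 1), and the FROC-style average number of bad tuples per document (Solution 2). The gold tuples, the two systems, and their per-document outputs are all fabricated for the example.

```python
# Sketch of the two x-axis choices; all data below are fabricated.

gold = {("Microsoft", "Redmond"), ("Google", "Mountain View")}

# Per-document output of each (hypothetical) extraction system.
outputs = {
    "system_A": [{("Microsoft", "Redmond"), ("Yankees", "New York")},
                 {("Google", "Mountain View")},
                 {("Yankees", "New York")}],
    "system_B": [{("Microsoft", "Redmond")},
                 {("Google", "Mountain View"), ("Knicks", "Manhattan")},
                 {("Mets", "Queens")}],
}

def bad_tuples(doc_outputs):
    """Spurious tuples produced for each document."""
    return [{t for t in out if t not in gold} for out in doc_outputs]

# Solution 1: pool every bad tuple that any available system/setting can
# produce, and use the pool size to normalize the false positive rate.
# Adding a new system later can enlarge the pool and force recomputation.
bad_pool = set().union(*(b for docs in outputs.values() for b in bad_tuples(docs)))

def false_positive_rate(doc_outputs):
    produced = set().union(*bad_tuples(doc_outputs))
    return len(produced) / len(bad_pool)

# Solution 2 (FROC-style): skip the normalization and report the average
# number of bad tuples generated per document.  Note that a spurious tuple
# repeated across documents counts once toward the pooled rate above, but
# each time toward the per-document average here.
def avg_bad_per_document(doc_outputs):
    per_doc = bad_tuples(doc_outputs)
    return sum(len(b) for b in per_doc) / len(per_doc)

for name, docs in outputs.items():
    print(f"{name}: FPR = {false_positive_rate(docs):.2f}, "
          f"avg bad/doc = {avg_bad_per_document(docs):.2f}")
# system_A: FPR = 0.33, avg bad/doc = 0.67
# system_B: FPR = 0.67, avg bad/doc = 0.67
```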
Second reason: To compute the true positive rate (i.e., recall), we need to know the total number of correct tuples that can be extracted from the documents, which in principle requires reading and annotating every document in the test set.
- Solution 1: We can play the same trick as above, to avoid the problem of reading/annotating the documents. We process each document multiple times, using all possible settings and all possible extraction systems. The union of the extracted tuples can then be validated to identify the set of all correct tuples. As in the case of true negatives, though, the definition becomes unstable if we have a dynamic set of extraction systems that can identify more good tuples at some point in the future, forcing a recalculation of the recall metrics for all systems. (See the sketch after this list.)
- Solution 2: We can also use an un-normalized y-axis. For instance, we can have as the ordinate (y-axis) the average number of good tuples extracted per document. (I have not seen an analogue of the FROC curves that leaves the true positive rate unnormalized, though.) The downside is that by leaving recall unnormalized, the values now depend on the distribution of good and bad documents in the input: the more bad documents with no good tuples in the test set, the lower the unnormalized value will be. Therefore, this definition seems to go against the spirit of ROC curves.
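In the same spirit, here is a sketch of the two ordinate choices: recall computed against a pooled, manually validated union of extracted tuples (Solution 1), and the un-normalized average number of good tuples per document (Solution 2), which shifts as tuple-free documents are added. Everything below is invented for illustration.

```python
# Sketch of the two y-axis choices; the validated tuple set and the
# documents are fabricated.

# Solution 1: the "all correct tuples" set is the validated union of
# everything any available system extracted; it can grow later if a new
# system finds previously unseen good tuples, invalidating old recall values.
validated_good = {("Microsoft", "Redmond"),
                  ("Google", "Mountain View"),
                  ("IBM", "Armonk")}

def pooled_recall(doc_outputs):
    found = set().union(*doc_outputs) & validated_good
    return len(found) / len(validated_good)

# Solution 2: un-normalized ordinate -- the average number of good tuples
# per document.  Unlike recall, it drops as tuple-free documents are added.
def avg_good_per_document(doc_outputs):
    per_doc = [len(out & validated_good) for out in doc_outputs]
    return sum(per_doc) / len(per_doc)

business_docs = [{("Microsoft", "Redmond")}, {("Google", "Mountain View")}]
sports_docs = [set(), set(), set(), set()]       # no extractable tuples

print(f"pooled recall: {pooled_recall(business_docs):.2f} -> "
      f"{pooled_recall(business_docs + sports_docs):.2f}")          # 0.67 -> 0.67
print(f"avg good/doc:  {avg_good_per_document(business_docs):.2f} -> "
      f"{avg_good_per_document(business_docs + sports_docs):.2f}")  # 1.00 -> 0.33
```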