1 Introduction
Information Retrieval (IR) and XML query systems are headed for significant unification. Starting with simple fielded search [2], IR engines are getting more sophisticated at handling more complex annotation tags inlined with source text [25]. Meanwhile, XML engines are adding text search features [13,5,30,14,1]. The Unstructured Information Management Architecture (UIMA, http://www.alphaworks.ibm.com/tech/uima) is moving to standardize annotation and indexing modules and define a query paradigm based on XML fragments.
Ontologies (e.g. http://www.ontoweb.org/), linguistic knowledge bases [26] and processors [8], semantic taggers [15,10], and named entity recognizers are the major sources of annotations added to a corpus. Consider the noun is-a (hypernym) hierarchy in WordNet, and suppose every noun (phrase) in the corpus has been annotated with all its hypernyms. E.g., if the token Einstein appears at some position, we recognize that a special case of types Physicist and person (among others) appears at this position.
Suitable positional indexes of such annotations, integrated with text, are key enablers of the Semantic Web vision, because they let us ask powerful queries that combine syntactic match with type constraints.
Such queries can be used to guide contextual information extraction and aggregation where scanning the whole corpus (e.g. the Web) would be prohibitive. E.g., we may be interested in extracting numeric tokens adjacent to hours and close to the words laptop and battery. We may even use instances of specified types to activate instances of other types, e.g., we may want to instantiate type person near type money amount and words UN and oil.
Apart from soft aggregation, we can also use such queries as initial approximations to natural language questions. E.g., to answer the question "Who discovered the theory of relativity", we can seek instances of person in close proximity to stems discover, theory and relativity, using a suitable function to weight and aggregate the evidence from occurrences of the three keywords. Recent NLP work [21] has enabled mapping questions to target entity types with high accuracy. Here we consider the task of retrieving answer candidates using those types.
In this paper we will work with the is-a relation, but our system works with any partial order on entities, such as is-part-of, is-contained-in, etc. While small type systems may suffice for specific applications, open-domain query systems need very large type systems to be useful. Bootstrapping large type systems and semantic relations from the Web [10,28,11,6] is an active research area, but efficient index and query support are needed to fully realize their potential.
1.1 Our contributions
We identify a broad class of proximity queries that combine text and annotation. Despite its simplicity, our framework is surprisingly general and useful for a number of text mining applications. We describe a system we are building around Lucene [2] and UIMA (http://www.alphaworks.ibm.com/tech/uima) that enables efficient annotation indexing and search. We intend to make the system publicly available. We identify two key technical problems that must be solved: learning proximity scoring functions and workload-adaptive index optimization.
Scoring functions are fairly standardized in traditional IR but largely secret in Web search. Early XML query semantics were set- and path-oriented; later, scoring schemes were adopted from IR and Pagerank [13,5,30,16,14,1], but these are not yet grounded in evaluation as extensive as standard IR. Therefore, we must learn good proximity scoring functions automatically before we can consider index and query optimization.
We design a simple and efficient learning algorithm to fit a proximity-based scoring function to logs of queries and known answers. When it is plugged into the query processor, we get high accuracy: recall at rank 300 increases from 0.8 (standard IR) to 0.9 and typically, a correct answer token appears at rank 2-4. Surprisingly, we find the best-fit scoring function to be non-monotonic wrt to the distance between selectors and candidate tokens.
Large and deep type hierarchies are essential to support open-domain search. Consequently, the index space required for the type annotations becomes very large compared to the standard inverted index, which can be compressed significantly [31]. We devise new algorithms that exploit the skew in the distribution of answer types (atypes) in query logs (i.e., historical workload statistics) and cut down the additional index storage requirement by 85-90%, while sacrificing query processing speed by a modest factor between 1 and 3. Our indexes occupy about the same total space as the corpus in gzipped form and an optimized IR index.
We present large-scale experiments with two 5 GB corpora having a million documents each, some 400 million distinct token positions, an atype hierarchy with 80,000 types (18,000 non-leaf types), and several hundred real-life queries from the TREC (http://trec.nist.gov/data/qa.html) competition. We use TREC data mainly because of the "truthed" collection of questions and answers, but our conclusions appear general. We present data on both the quality of answers and the performance of our system.
1.2 Related work
IR engines can support basic proximity clauses with user-supplied windows. The popular IR package Lucene [2] provides basic fielded search with typically a limited number of fields, and does not natively support proximity clauses across fields. INDRI [25] has a rich and powerful query language that can support hard proximity clauses across tagged fields, but a hard proximity window must be specified by the user; also, we do not know of any workload-sensitive index optimization in INDRI.
The Bindings Engine [4] (BE) is a critical breakthrough toward semantic pattern searches, and is specifically optimized for text, but it depends strongly on immediate juxtaposition of ground constants with patterns to instantiate, as in "cruel ProperNoun said" or "cities such as NounPhraseList". BE is reported as using a handful of types (parts of speech and named entities like person, place, time) which still leads to an index 10 times larger than an IR index. However, thanks to the more restricted query class, BE uses more sequential disk access than our system.
The database and XML communities are steadily adding IR support [20]. COMPASS [14] and XXL [30] are notable systems integrating text and structure search, but to our knowledge do not learn textual proximity. COMPASS supports "concept" searches of the form "concept=value" where the concept may have soft aliases, but not "concept near value" in our sense.
Index space is a significant concern in adding IR support to XML systems. Florescu et al. [12] survey work in the database and XML communities and propose a text extension to XML-QL that allow element-level word matches. They report an index that is 12 times larger than the uncompressed XML source, and do not consider proximity-based scoring.
In principle, we can turn our corpus into a giant XML graph with one token per node and try to apply activation-based query systems such as ObjectRank [3], XRank [16] or TeXQuery [1]. However, given an average English token is 4-5 bytes, and compresses down to 0.5-1 byte in the IR index, even a 32-bit ID per such node would be prohibitive. Another problem is that (as we shall see) good scoring functions in the linear token space do not necessarily decay monotonically with distance.
Elegant data structures have been proposed to deal with "hard" proximity constraints [23,27], but they do not support learnable proximity scores, type containment or self-tuning indexes that exploit the query distribution. Earlier IR systems have exploited query skew in index-pruning [9] and caching [29] to achieve impressive reduction of IR index sizes and query times, but do not consider queries involving a type hierarchy.
Figure 1: System architecture. In this paper we are concerned with the blocks highlighted with dotted lines: learning a scoring function, and workload-driven index optimization.
2 System architecture
Figure 1 shows our system architecture. The system input consists of a text corpus, an atype taxonomy with associated annotators, and pairs of sample queries with truthed response tokens embedded in their document contexts. From the query-response pairs, we learn a scoring function that rewards and combines proximity between candidate tokens and matched selectors. This module is described in Section 3. The scoring function is plugged into the query processor. We also optimize our indexes using the query logs; this is described in Section 4.
2.1 Atype taxonomy, corpus and annotators
The atype taxonomy is a DAG where nodes are atypes and edges represent the is-a relation. A number of sources can provide atype taxonomy information. Here we use WordNet [26], but we are also integrating data from http://en.wikipedia.org/ to increase recall. Is-a instances can also be bootstrapped from the Web effectively [11]. In this paper, we will refer to nodes using WordNet synset notation, e.g., Einstein#n#2 is the second noun sense of Einstein, meaning genius.
The corpus is a set of documents. Each document is a sequence of tokens. Tokens can be compound, such as New_York. An annotator module (Figure 1) connects some tokens to nodes in the atype taxonomy. E.g. the string token Einstein might be connected to both senses Einstein#n#1 (the specific Physicist) Einstein#n#2 (genius). Now, through is-a connections in WordNet, we know that Einstein is-a person#n#1. Disambiguation can be integrated into the annotator module, but is an extensive research area in NLP [24] and is outside our current scope.
2.2 Query and candidate responses
A query consists of two parts. The first part is an atype from the taxonomy. The second part is a set of indexable predicates on token strings. The simplest predicate is string or stem equality, but we can also use a fixed library of regular expressions, also called surface patterns. (Multiple atypes can be allowed easily, but we will not discuss this here.)
To find the distance between Hamburg and Munich, we might use the atype linear_measure#n#1 and literal matches for Hamburg and Munich. The system and experiments in this paper used only equality predicates on stems; in such cases, we call string literals selectors. The atype may also be specified by a surface pattern, in this example, say, by [:digit:]+, a sequence of digits.
Any token in the corpus that is connected to a descendant of linear_measure#n#1 is a candidate answer token. We admit a candidate only if at least one selector appears within W (typically, 50) tokens of the candidate. Many IR systems use similar pruning policies.
The score of a candidate depends on the selectors that appear nearby, as described in Section 3. For some applications, we simply need to report k candidate tokens in order of decreasing score. In other applications such as question answering, a short context around the candidate is submitted to a more sophisticated NLP module.
2.3 Indexes
The query processor opens a cursor on the posting list for each atype and selector, and does a BE-style [4] multiway merge while pushing candidate tokens with scores into a scoring heap. In Section 4 we will build smaller indexes where most atypes will be "missing"-in that case, the query atype must be generalized to a coarser atype, and tentative result tokens checked using a forward index and a reachability index (described later in this section).
2.3.1 The stem and full atype indexes
As shown in Figure 1, in the first pass we build, apart from the forward and reachability indexes, two ordinary inverted indexes [31]. One maps from stems to posting lists. Each posting is a record containing a document ID where the stem appears, the number of times it appears in the document, and a list of token offsets where it appears. For English corpora, the size of the positional index is typically between 15-40% of size of the original (uncompressed) corpus, depending on the index compression techniques used. An IR index provides efficient sequential access through a posting list, so that multiple postings can be merged efficiently.
While indexing a stem, we also look for any atype attached to the token, and then follow all is-a links in the atype DAG up to the roots. For example, from the token Einstein in some document d, we can reach synsets physicist#n#1, intellectual#n#1, scientist#n#1, person#n#1, organism#n#1, living_thing#n#1, object#n#1, causal_agent#n#1, entity#n#1. We index all these atypes as if they all occurred at the same token offset in d as Einstein, in a second inverted index called the (full) atype index, i.e., in the full atype index, the posting list of each of the above atypes will include d.
A realistic and useful atype taxonomy can be quite deep-many prominent noun classes in WordNet are 6-12 links down from the noun roots. Consequently, as Figure 2 shows, the full atype index is almost the size of the uncompressed original corpus, over thrice the size of the gzipped corpus, and almost five times the size of the stem index. This would be unacceptable in most search applications, definitely large-scale services that need to cache substantial parts of the index in RAM. Luckily the query workload is very skewed, so we can choose only a subset of atypes to index; this is the topic of Section 4.
2.3.2 Search using stem and full atype indexes
The posting list for a term (an atype or a selector) helps us scan through (docid, tokenOffset) pairs in increasing lexicographic order. We use a bounded max-heap (of size k where k is the number of results sought) to keep track of the highest k scores. The memory required is bounded by the sum of the size of the score heap and the space required for holding all occurrences of candidates and selectors within a single document. We scan the postings lock-step, completely processing one document and inserting all its candidate locations into the score heap (evicting low-score locations if needed) and then move on to the next document.
| Corpus/index |
Size (GB) |
| Original corpus |
5.72 |
| Gzipped corpus |
1.33 |
| Stem index |
0.91 |
| Full atype index |
4.30 |
| Reachability index |
0.005 |
| Forward index |
1.16 |
Figure 2: Relative sizes of the corpus and various indexes for TREC 2000.
2.3.3 Reachability index
In Section 4 we will also need a reachability index: given two atypes a1 and a2, or an atype a1 and a token w, it can quickly (in O(1) time) tell us if a2 or w is-a a1 i.e. if a1 is an ancestor or generalization of a2 or w in the atype taxonomy. Because a type hierarchy is "largely" a tree, a simple Dewey coding [16] of the atype taxonomy leads to a reachability index small enough (5 MB from 80,000 atypes of which 18,000 are non-leaf, see Figure 2) to fit easily in RAM. Given a token, we can verify if it is a candidate, but the reachability index does not give us a direct index into candidate positions in the corpus. Obtaining compact reachability indices for arbitrary graphs is more complicated [7].
2.3.4 Forward index
Another useful index is the forward index which basically stores the original corpus as close to the gzipped size as possible, while allowing fast (docid,tokenOffset) queries, returning the actual token (which can now be tested using the reachability index). Most search systems need to keep around a forward index to generate query-specific snippets along with the top responses. We will use the reachability and forward indexes in conjunction in Section 4.
The forward index1 is built in two passes over the corpus. In the first pass, we count the frequency of each token in the corpus. Then we sort tokens by frequency and assign them variable-length integer IDs; frequent tokens get shorter IDs. In the second pass, we rescan the corpus and store packed bit-vectors for each document that can be unpacked and mapped back to tokens or token IDs.
Observe in Figure 2 that the forward index is even smaller than the gzipped corpus. Also, the forward index plus the reachability index add up to only about 25% of the full atype index.
Access to the forward index does require one random disk access per probe (a disk block generally holds several complete documents), but, as we shall see in Section 4, these accesses happen only for tokens that are top contenders for answering the query.
2.4 Experimental setup and evaluation
We used the TREC collection because it comes with a set of questions that can be mapped to precise atypes [21], and "truthed" answer passages. We used the TIPSTER and AQUAINT corpora adding up to over two million documents, which led to over 400 million term positions. Approximately 500 TREC questions were annotated with an atype.
Query time is measured as usual using real time and CPU time as appropriate. Our experiments were run on otherwise-unloaded P3/Xeon computers with 2-4GB RAM and Ultra320 SCSI disk. Index size is measured in bytes on disk.
The quality of a ranking of candidate tokens is measured in two standard ways. First, we measure the "recall-at-k": if our system returns k top-scoring tokens, in what fraction of questions do we nail an answer token (in the correct document and context specified by TREC)? Second, we measure the "mean reciprocal rank" or MRR used in the TREC community: if the first answer to query q is at rank rq, award a score of 1/rq, and average over the query set Q to get the MRR 1/|Q|åq Î Q(1/rq).
3 Learning to score candidates
Our goal here is to learn how to score and rank candidate tokens. The score of a candidate token (i.e., one that is a subtype or instance of the query atype) depends on three sets of quantities: statistical properties of matched selectors in the vicinity, the distance between those selectors and the candidate, and the manner in which contributions from different selectors are aggregated at the candidate token.
* Selector energy Each matched selector s has an associated positive number called its energy, denoted energy(s). A common notion of energy is the inverse document frequency or IDF standard in IR: the number N of documents in the corpus divided by the number Ns of documents containing the selector token s. This is a linear form of IDF, the logarithmic form log(1+N/Ns) is more commonly used. We use the term selector energy in place of the more familiar term weight to highlight that our scoring function has a spreading-activation form.
* Gap and Decay The gap between a candidate token w and a matched selector s, called gap(w,s), is one plus the number of intervening tokens. We consider each selector as spreading activation or energy to the candidate. We define the energy transmitted by the selector to the candidate to be energy(s) decay(gap(w,s)), where decay(g) is some function of the gap.
In many graph-based scoring systems such as ObjectRank [3], XRank [16] or TeXQuery [1] it is common to exploit a monotone decreasing decay(g)=dg, where 0 < d < 1 is a magic decay factor, for fast query execution. However, as we shall see, this form may not perfectly match data behavior.
Figure 3: Selector activation example. Stems television and invent are selectors, person#n#1 is the atype, John Baird is an atype candidate.
* Aggregation A selector s can appear multiple times near a candidate, we call this set {si}. If a is the candidate, our generic scoring function looks like
| |
|
|
| |
数据挖掘论坛 Å s
|
|
Æ i
|
|
energy
|
(si) |
decay
|
( |
gap
|
(si,a)), |
|
|
(1) |
|
where Æ aggregates over multiple occurrences of s and Å aggregates over different selectors. Sum or max can be used for Æ and Å, although sum behaves poorly as Æ because even a low-IDF selector can make the score less reliable by appearing often near a candidate.
Our query processor does not require any assumptions on the aggregation function, but we can think of other execution strategies that can take advantage of Æ = max and Å = S.
3.1 Data collection and representation
A few thousand TREC questions are available annotated with atypes [21]. We collected questions for which the answer tokens prescribed by TREC included at least one instance or subtype of the atype of the question. We report results on 261 usable questions from TREC 2000. For each question, we need positive (answer) and negative (candidate but not answer) tokens, and, to learn their distinction well, we should collect negative tokens that are "closest" to the positive ones, i.e., strongly activated by selectors.
To achieve this, we used the full atype index to locate all candidate tokens, and made a generous estimate of the activation from (the nearest occurrence of) each selector. This generous estimate used the log IDF as energy and no decay, i.e., energy was accrued unattenuated at the candidate position. For each query, we retained all positive answer tokens and the 300 negative tokens with top scores. Overall, we finished with 169662 positive and negative contexts. 5-fold cross-validation (i.e. 80% training, 20% testing in each fold) was used.
The next job was to turn contexts into feature vectors. Recall that there must be at least one selector match within W tokens of the candidate a. We set up this window with 2W+1 tokens centered at a, and retain only one instance of each selector, the one closest to a2.
As a first-cut, we can represent a context by a W-dimensional vector f, where fj is the energy sourced at position ±j wrt the candidate. If we are not sure which energy(·) function to use (log or linear form), we can always stick both values in a 2W-dimensional feature vector. Clearly, we can extend to more than two forms of the energy function, and can use separate parameters for the left and right context. In experiments, the log form by itself with symmetric decay gave better results, so, for simplicity, we will assume we have W-dimensional feature vectors.
3.2 Learning decay with hinge loss
For every gap value 1 £ j £ W, suppose the decay is bj. Then the score of a feature vector f is åj=1W fjbj, or b·f. If f+ (f-) is a feature vector for a positive (negative) context, b should ensure that b·f+ > b·f-, or b·(f+ - f-) > 0. From the answer contexts collected as above, we construct pairs (f+i,f-i) of positive and negative answer contexts, indexed by i, so that the constraints on b look like 数据挖掘论坛
Herbrich et al. [17] and Joachims [19] suggested elegant max-margin solutions to ordering problems that we generically call RankSVM:
| |
min b,s ³ 0
|
|
1
2
|
b¢b + C |
å i
|
si s.t. for all i, b ·xi + si ³ 1 |
|
(3) |
where xi is shorthand for the difference vector f+i-f-i and no offset parameter b0 is needed. As with support vector classifiers, C is a tuned parameter that trades off the model complexity ||b|| against violations of the ordering requirements.
From our application perspective, 169662 passage contexts and millions of difference vectors x is just a beginning; the data is still sparse and any additional data helps accuracy. However, RankSVM executed millions of iterations with hundreds of millions of kernel evaluations, and, for some folds, failed to terminate in a day on a 3GHz CPU.
3.3 Learning decay with exponential loss
In equation (3), xi is assigned penalty Cmax{0,1-b·xi}, which can be bounded above by Cexp(-b·xi), giving us the unconstrained optimization that we call RankExp: 数据挖掘研究院
| |
min b
|
|
1
2
|
b¢b + C |
å 数据挖掘研究院 i
|
exp(-b·xi) |
|
(4) |
which may be potentially less accurate than a hinge-loss formulation, but allows us to use simpler optimizers such as L-BFGS [22]. This lets RankExp scale much better with increasing training set size, as shown in Figure 4. On identical data sets, for C Î [0.01,0.3] the fraction of orderings satisfied by RankSVM and RankExp, as well as the MRRs were typically within 3% of each other, while RankExp took 14-40 iterations or 10-20 minutes to train and RankSVM took between 2 and 24 hours.
Figure 4: Relative CPU times needed by RankSVM and RankExp as a function of the number of ordering constraints.
3.4 Typical parameters and roughness
Figure 5 shows a typical b vector, where bj gives the relative importance of a selector match at gap j (bjs can be shifted or scaled without changing the ranking). We did not expect the clearly non-monotonic behavior near j=0, and only in hindsight found that this is a property of language (perhaps already appreciated by linguists): selectors are often named entities, and are often connected to the answer token via prepositions and articles that creates a gap. This goes against conventional wisdom that spreading activation should monotonically decay with distance. We believe that a more detailed study of proximity functions in IR, taking linguistic effects into account, is overdue.
Figure 5: bj shows a noisy unimodal pattern.
In spite of the prominent non-monotonicity near j=1¼5, there is much noise at larger gap. To bias the learner to greater smoothness, we can pin a phantom bW+1=0, and penalize deviation of bj from bj+1: 数据挖掘研究院
| |
min b
|
|
W å j=1
|
(bj - bj+1)2 + C |
数据挖掘论坛 å i
|
exp(-b·xi), |
|
(5) |
where C is set by cross-validation. Figure 5 shows the smooth b, which also improves cross-validation accuracy slightly. It also gives us confidence in the non-monotonic nature of the best decay function.
3.5 Accuracy of learnt score function
In a standard IR system [31], the score of a snippet would be decided by a vector space model using selectors alone. We gave the standard score the additional benefit of considering only those snippets centered at an atype candidate, and considering each matched selector only once (use only IDF and not TF). Even so, a basic IR scoring approach was significantly worse than the result of plugging in the scoring function learnt by RankExp, as shown in Figure 6. Both recall and MRR (see Section 2.4) over held-out test data improve substantially.
| b from |
Train |
Test |
R300 |
MRR |
| IR-IDF |
- |
2000 |
211 |
0.16 |
| RankExp |
1999 |
2000 |
231 |
0.27 |
| RankExp |
2000 |
2000 |
235 |
0.31 |
| RankExp |
2001 |
2000 |
235 |
0.29 |
Figure 6: End-to-end accuracy using RankExp b is significantly better than IR-style ranking. Train and test years are from 1999, 2000, 2001. R300 is recall at k=300 out of 261 test questions. C=0.1, C=1 and C=10 gave almost identical results.
4 Workload-tuned atype index
As we saw in Figure 2, for a 5.72 GB corpus, the stem index occupied only 0.91 GB while the full atype index took 4.30 GB. In this section we explore how to exploit the skew in the distribution of query atypes to achieve a graceful trade-off between index space and query performance.
4.1 Characterizing a skewed workload
Figure 7 shows a sample from TREC query logs. There is much skew, but, as is generally appreciated in the Web search community, the distribution is heavy-tailed and the skew may be somewhat misleading.
|
|
|