|
Direct Assessment 数据挖掘论坛
The most primitive method for assessing the performance of a stemmer is to examine its behaviour when applied to samples of words - especially words which have already been arranged into ′conflation groups′. This way, specific errors (e.g., failing to merge "maintained" with "maintenance", or wrongly merging "experiment" with "experience") can be identified, and the rules adjusted accordingly. This approach is of very limited utility on its own, but can be used to complement other methods, such as the error-counting approach outlined later.
数据挖掘研究院
Information Retrieval 数据挖掘工具
The most obvious method for comparing the usefulness of Stemmers for the field of IR is by their impact on IR performance, using a testing system and a ‘test collection’ of documents, queries and relevance judgments. This involves substituting different Stemmers to see which gives the best results in terms of performance metrics such as Precision and Recall. However, there are problems with using such a technique for deciding which stemmer to use in an IR system, since the results are frequently indecisive. Thus, the ‘best’ Stemmer may be different for different databases and different searches. 数据挖掘研究院
Error Counting 数据挖掘交友
It is possible to evaluate stemming by counting the numbers of two kinds of errors that occur during stemming, namely; 数据挖掘研究院
- Under-Stemming
This refers to words that should be grouped together by stemming, but aren′t. This causes a single concept to be spread over various different stems, which will tend to decrease the Recall in an IR search.
- Over-Stemming
This refers to words that shouldn’t be grouped together by stemming, but are. This causes the meanings of the stems to be diluted, which will effect Precision of IR. Stemming Errors for a fuller description of all this. Stemming Errors).
数据挖掘实验室
Using a sample file of grouped words, these errors are then counted. This evaluation method was described by Chris Paice in a paper entitled ′Method for Evaluation of Stemming Algorithms Based on Error Counting′ JASIS 47(8), August 1996, 632-649. This method returns a value for an Under-Stemming (or Conflation) index; 数据挖掘实验室
UI = Under-Stemming Index CI = Conflation Index: proportion of equivalent word pairs which were successfully grouped to the same stem.
UI= 1 - CI
Also a value for an Over-Stemming (or Distinctness) index;
OI = Over-Stemming index DI = Distinctness Index: proportion of non-equivalent word pairs which remained distinct after stemming. 数据挖掘论坛
OI= 1 - DI 数据挖掘工具
See
The purpose of this error-counting approach is that, although it is advantageous to have the index of terms compressed, this is only useful up to a point. This is because, as conflation becomes ′heavier′, the merging of distinct concepts becomes increasingly frequent. At this point, small increases in Recall are gained at the expense of a major loss of Precision. 数据挖掘论坛
One question mark over this approach concerns the validity of the grouped file against which the errors are assessed. In Paice (1996), these grouped files were constructed by human judgment, during scrutiny of sample word lists. 数据挖掘工具
Stemmer Strength
A ′weak′ or ′light′ stemmer is one which only merges a few of the most highly related words together - for example, just singular and plural forms, or inflective variants of verbs ("react", "reacts", "reacting", "reacted"). A ′strong′ or ′heavy′ stemmer, on the other hand, merges a much wider variety of forms (… "reaction", "reactions", "reactive", "reactivity", "reactant" etc.). In IR searching, light stemming seems to be most effective; heavy stemming increases the chances of ambiguity and confusion (for instance, merging "author" with "authority" is generally not helpful). In other situations - e.g., for displaying terms in a user interface - a heavier stemmer may be quite useful.
Whether the impact of a stemmer is positive depends not only on its strength, but also on whether performs its task accurately - stemmer strength metrics do not represent actual stemming accuracy (for further discussion of which, see 数据挖掘工具
Frakes & Fox (in press) have listed the following ways to measure stemmer strength: 数据挖掘研究院
数据挖掘工具
- Number of words per conflation class
This is the average size of the groups of words coverted to a particular stem (regardless of whether they are all correct). Thus, if the words "engineer", "engineering" and "engineered", and no others, were all stemmed to the stem "engineer", then the size of that conflation class would be 3. If the conflation of 1,000 different words resulted in 250 distinct stems, then the mean number of words per conflation class would be 4.
This metric is obviously dependent on the number of words processed, but for a word collection of given size, a higher value indicates a heavier stemmer. The value is easily calculated as follows: 数据挖掘交友
WC = Mean number of words per conflation class 数据挖掘工具
N = Number of unique words before Stemming
S = Number of unique stems after Stemming
- Index Compression
数据挖掘研究院
The Index Compression Factor represents the extent to which a collection of unique words is reduced (compressed) by stemming, the idea being that the heavier the Stemmer, the greater the Index Compression Factor. This can be calculated by;
IC = Index Compression Factor 数据挖掘论坛
N = Number of unique words before Stemming 数据挖掘研究院
S = Number of unique stems after Stemming 数据挖掘工具
- The Word Change Factor
数据挖掘交友
- Number of Characters Removed
数据挖掘交友
- Hamming Distance
Chris O′Neill of the Lancaster University Computing Department during the course of a student project produced under the supervision of Chris Paice during 2001.
The Hamming Distance takes two strings of equal length and counts the number of corresponding positions where the characters are different. If the strings are of different lengths, we can use the Modified Hamming Distance, MHD. Thus, suppose the string lengths are P and Q, where P<Q, we use the formula
MHD = HD(1,P) + (Q-P) 数据挖掘研究院
where HD(1,P) is the Hamming Distance for the first P characters of both strings. Applying this to a stemmer, suppose that the word "parties" is converted to "party". In this case, P=5 and Q=7, so that HD(1,P) = 1 by comparing "parti" with "party", and (Q-P) = 2, giving MHD = 3. Clearly, we can compute the average MHD value for every word in the original sample. 数据挖掘工具
数据挖掘工具
Inter-Stemmer Similarity
数据挖掘研究院
It is possible to compare two separate stemming algorithms by comparing the output they produce. This provides a measure of the similarity (or conversely, the distance) between the two algorithms. The approach is to take a set of words and apply both algorithms in turn, thus producing two output lists. Corresponding stems in the two output lists are then compared to give a measure of similarity between the stemmers. This could be useful, for example, for deciding whether two separate implementations of the same algorithm really do the same thing (for example, it can be used for comparing the various available implementations of Porter′s algorithm). 数据挖掘交友
A simple way to measure inter-stemmer similarity is by simply counting the proportion of the stem-pairs which are identical. A more sensitive measure, however, would be one based on the average Modified Hamming Distance between the pairs of stems. This will give a measure of inter-stemmer distance, so (as suggested by Frakes & Fox) it may be more useful to use the inverse of the average MHD as a similarity value.
As with stemmer strength, inter-stemmer similarity is not directly related to accuracy. Thus, you could have two stemmers which are very dissimilar and yet which are virtually identical in their ability to conflate related words. To see this, suppose you take a particular stemmer, and modify it so that it systematically recodes the last 3 letters of each stem (e.g., "A" to "B", "B" to "C", "C" to "D", etc.). The modified stemmer would group a collection of words in exactly the same way as the original, but the similarity between the two would be quite low. A different kind of metric would be needed to notice the similarity in this case! 数据挖掘研究院
数据挖掘论坛
A New Similarity Metric
The following metric was developed by 数据挖掘交友
This approach starts by considering the average Hamming distance SSM between the corresponding stems in two lists, where:
A and B are the stemmers being compared, 数据挖掘实验室
HD is the Modified Hamming distance between two strings, and 数据挖掘工具
N is the number of words in the word list. 数据挖掘研究院
However, this approach leaves the following problems: 数据挖掘研究院
- It would be better if the result could be normalised to give a value between 1 and 0,
- The above metric fails to take into consideration the overall lengths of the two strings. Thus, removing ‘s’ from the word ‘reds’ to give ‘red’, is a removal of 25% of the word, and would have a Hamming distance of 1. Removing the suffix ‘s’ from the word ‘methylenedioxymethamphetamins’ to give ‘methylenedioxymethamphetamin’ is a removal of 3.4% of the word, yet this also gives a modified Hamming distance of 1.
A new metric was therefore developed which, though still based on the Hamming distance, also takes into account the lengths of the words being compared. It uses the idea that there is a maximum and minimum modified Hamming distance that can obtain between two strings. The minimum distance is zero, when both strings are identical, whereas the maximum distance MD is the length of the longer string. The relative distance RD is obtained from the modified hamming distance MHD thus: 数据挖掘工具
Therefore, by calculating the sum of the relative distances for all pairs of stems, then dividing this by the number of words, an average distance metric is obtained. This was then further developed by subtracting the value from one, then multiplying it by 100 to give a percentage, so that 0% represents no similarity and 100% total similarity. Thus, the new stemmer similarity metric is given by the formula:
where 数据挖掘研究院
U & V are the Stemmers being compared, 数据挖掘工具
N = Number of words in the sample 数据挖掘实验室
MHD = Modified Hamming Distance 数据挖掘实验室
MD = Maximum Distance
数据挖掘研究院
数据挖掘工具
There are various metrics which use the principle that strong stemmers remove more characters from words than weaker stemmers. One way is to compute the average number of letters removed when a stemmer is applied to a text collection. Thus, suppose that the nine words "react", "reacts", "reacting", "reacted", "reaction", "reactions", "reactive", "reactivity" and "reactivities" are all reduced to "react"; the numbers of characters removed are then 0, 1, 3, 2, 3, 4, 3, 5 and 7, respectively. This gives a mean removal rate of 28/9 = 3.11.
Obtaining such data for a large collection of words enables us to compute not only the mean but also the mode and the median of the resulting distribution, in case these seem more useful. 数据挖掘工具
The meaning of ′number of characters removed′ is not clear-cut for those stemmers which replace endings rather than just removing them. Thus, a stemmer may replace the endings "-ies" and "-ied" by "-y". We could count this as 3 (number of letters removed), or 2 (reduction in length), or even 4 (letters removed + letters added). Alternatively, we could use a metric such as the Hamming Distance, or Modified Hamming Distance, as discussed next. 数据挖掘论坛
This is a simply the proportion of the words in a sample that have been changed in any way by the stemming process, the idea being that the larger the number of words that are altered, the greater the strength of the Stemmer.
数据挖掘论坛
|