Stemming Performance

 

Direct Assessment 数据挖掘论坛

The most primitive method for assessing the performance of a stemmer is to examine its behaviour when applied to samples of words - especially words which have already been arranged into ′conflation groups′. This way, specific errors (e.g., failing to merge "maintained" with "maintenance", or wrongly merging "experiment" with "experience") can be identified, and the rules adjusted accordingly. This approach is of very limited utility on its own, but can be used to complement other methods, such as the error-counting approach outlined later.

  数据挖掘研究院

Information Retrieval 数据挖掘工具

The most obvious method for comparing the usefulness of Stemmers for the field of IR is by their impact on IR performance, using a testing system and a ‘test collection’ of documents, queries and relevance judgments. This involves substituting different Stemmers to see which gives the best results in terms of performance metrics such as Precision and Recall. However, there are problems with using such a technique for deciding which stemmer to use in an IR system, since the results are frequently indecisive. Thus, the ‘best’ Stemmer may be different for different databases and different searches. 数据挖掘研究院


Error Counting 数据挖掘交友

It is possible to evaluate stemming by counting the numbers of two kinds of errors that occur during stemming, namely; 数据挖掘研究院

  • Under-Stemming
    This refers to words that should be grouped together by stemming, but aren′t. This causes a single concept to be spread over various different stems, which will tend to decrease the Recall in an IR search.
  • Over-Stemming
    This refers to words that shouldn’t be grouped together by stemming, but are. This causes the meanings of the stems to be diluted, which will effect Precision of IR.
    Stemming Errors for a fuller description of all this. Stemming Errors).

    Mean number of words per conflation class 数据挖掘实验室

Using a sample file of grouped words, these errors are then counted. This evaluation method was described by Chris Paice in a paper entitled ′Method for Evaluation of Stemming Algorithms Based on Error Counting′ JASIS 47(8), August 1996, 632-649. This method returns a value for an Under-Stemming (or Conflation) index; 数据挖掘实验室

UI = Under-Stemming Index
CI = Conflation Index: proportion of equivalent word pairs which were successfully grouped to the same stem.

UI= 1 - CI

Also a value for an Over-Stemming (or Distinctness) index;

OI = Over-Stemming index
DI = Distinctness Index: proportion of non-equivalent word pairs which remained distinct after stemming. 数据挖掘论坛

OI= 1 - DI 数据挖掘工具

See

The purpose of this error-counting approach is that, although it is advantageous to have the index of terms compressed, this is only useful up to a point. This is because, as conflation becomes ′heavier′, the merging of distinct concepts becomes increasingly frequent. At this point, small increases in Recall are gained at the expense of a major loss of Precision. 数据挖掘论坛

One question mark over this approach concerns the validity of the grouped file against which the errors are assessed. In Paice (1996), these grouped files were constructed by human judgment, during scrutiny of sample word lists. 数据挖掘工具


Stemmer Strength

A ′weak′ or ′light′ stemmer is one which only merges a few of the most highly related words together - for example, just singular and plural forms, or inflective variants of verbs ("react", "reacts", "reacting", "reacted"). A ′strong′ or ′heavy′ stemmer, on the other hand, merges a much wider variety of forms (… "reaction", "reactions", "reactive", "reactivity", "reactant" etc.). In IR searching, light stemming seems to be most effective; heavy stemming increases the chances of ambiguity and confusion (for instance, merging "author" with "authority" is generally not helpful). In other situations - e.g., for displaying terms in a user interface - a heavier stemmer may be quite useful.

Whether the impact of a stemmer is positive depends not only on its strength, but also on whether performs its task accurately - stemmer strength metrics do not represent actual stemming accuracy (for further discussion of which, see 数据挖掘工具

Frakes & Fox (in press) have listed the following ways to measure stemmer strength: 数据挖掘研究院

  数据挖掘工具

  • Number of words per conflation class

  • This is the average size of the groups of words coverted to a particular stem (regardless of whether they are all correct). Thus, if the words "engineer", "engineering" and "engineered", and no others, were all stemmed to the stem "engineer", then the size of that conflation class would be 3. If the conflation of 1,000 different words resulted in 250 distinct stems, then the mean number of words per conflation class would be 4.

This metric is obviously dependent on the number of words processed, but for a word collection of given size, a higher value indicates a heavier stemmer. The value is easily calculated as follows:

数据挖掘交友

WC = Mean number of words per conflation class 数据挖掘工具

N = Number of unique words before Stemming

S = Number of unique stems after Stemming

  • Index Compression
  • Index Compression Factor 数据挖掘研究院


    The Index Compression Factor represents the extent to which a collection of unique words is reduced (compressed) by stemming, the idea being that the heavier the Stemmer, the greater the Index Compression Factor. This can be calculated by;

IC = Index Compression Factor 数据挖掘论坛

N = Number of unique words before Stemming 数据挖掘研究院

S = Number of unique stems after Stemming

数据挖掘工具

  • The Word Change Factor
  •   数据挖掘交友

    • Number of Characters Removed
    •  

      数据挖掘交友

      • Hamming Distance
      • Chris O′Neill of the Lancaster University Computing Department during the course of a student project produced under the supervision of Chris Paice during 2001.

        Stemmer Similarity Metric


        The Hamming Distance takes two strings of equal length and counts the number of corresponding positions where the characters are different. If the strings are of different lengths, we can use the Modified Hamming Distance, MHD. Thus, suppose the string lengths are P and Q, where P<Q, we use the formula

      MHD = HD(1,P) + (Q-P) 数据挖掘研究院

      where HD(1,P) is the Hamming Distance for the first P characters of both strings. Applying this to a stemmer, suppose that the word "parties" is converted to "party". In this case, P=5 and Q=7, so that HD(1,P) = 1 by comparing "parti" with "party", and (Q-P) = 2, giving MHD = 3. Clearly, we can compute the average MHD value for every word in the original sample.

      数据挖掘工具

        数据挖掘工具


      Inter-Stemmer Similarity

      数据挖掘研究院

      It is possible to compare two separate stemming algorithms by comparing the output they produce. This provides a measure of the similarity (or conversely, the distance) between the two algorithms. The approach is to take a set of words and apply both algorithms in turn, thus producing two output lists. Corresponding stems in the two output lists are then compared to give a measure of similarity between the stemmers. This could be useful, for example, for deciding whether two separate implementations of the same algorithm really do the same thing (for example, it can be used for comparing the various available implementations of Porter′s algorithm).

      数据挖掘交友

      A simple way to measure inter-stemmer similarity is by simply counting the proportion of the stem-pairs which are identical. A more sensitive measure, however, would be one based on the average Modified Hamming Distance between the pairs of stems. This will give a measure of inter-stemmer distance, so (as suggested by Frakes & Fox) it may be more useful to use the inverse of the average MHD as a similarity value.

      As with stemmer strength, inter-stemmer similarity is not directly related to accuracy. Thus, you could have two stemmers which are very dissimilar and yet which are virtually identical in their ability to conflate related words. To see this, suppose you take a particular stemmer, and modify it so that it systematically recodes the last 3 letters of each stem (e.g., "A" to "B", "B" to "C", "C" to "D", etc.). The modified stemmer would group a collection of words in exactly the same way as the original, but the similarity between the two would be quite low. A different kind of metric would be needed to notice the similarity in this case!

      数据挖掘研究院

        数据挖掘论坛

      A New Similarity Metric

      The following metric was developed by 数据挖掘交友

      This approach starts by considering the average Hamming distance SSM between the corresponding stems in two lists, where:

      A and B are the stemmers being compared,

      数据挖掘实验室

      HD is the Modified Hamming distance between two strings, and 数据挖掘工具

      N is the number of words in the word list. 数据挖掘研究院

      However, this approach leaves the following problems:

      数据挖掘研究院

      • It would be better if the result could be normalised to give a value between 1 and 0,
      • The above metric fails to take into consideration the overall lengths of the two strings. Thus, removing ‘s’ from the word ‘reds’ to give ‘red’, is a removal of 25% of the word, and would have a Hamming distance of 1. Removing the suffix ‘s’ from the word ‘methylenedioxymethamphetamins’ to give ‘methylenedioxymethamphetamin’ is a removal of 3.4% of the word, yet this also gives a modified Hamming distance of 1.

      A new metric was therefore developed which, though still based on the Hamming distance, also takes into account the lengths of the words being compared. It uses the idea that there is a maximum and minimum modified Hamming distance that can obtain between two strings. The minimum distance is zero, when both strings are identical, whereas the maximum distance MD is the length of the longer string. The relative distance RD is obtained from the modified hamming distance MHD thus:

      数据挖掘工具

      Therefore, by calculating the sum of the relative distances for all pairs of stems, then dividing this by the number of words, an average distance metric is obtained. This was then further developed by subtracting the value from one, then multiplying it by 100 to give a percentage, so that 0% represents no similarity and 100% total similarity. Thus, the new stemmer similarity metric is given by the formula:

      where 数据挖掘研究院

      U & V are the Stemmers being compared, 数据挖掘工具

      N = Number of words in the sample 数据挖掘实验室

      MHD = Modified Hamming Distance 数据挖掘实验室

      MD = Maximum Distance

      New Similarity Metric

      数据挖掘研究院

      Relative Distance

      数据挖掘工具


      There are various metrics which use the principle that strong stemmers remove more characters from words than weaker stemmers. One way is to compute the average number of letters removed when a stemmer is applied to a text collection. Thus, suppose that the nine words "react", "reacts", "reacting", "reacted", "reaction", "reactions", "reactive", "reactivity" and "reactivities" are all reduced to "react"; the numbers of characters removed are then 0, 1, 3, 2, 3, 4, 3, 5 and 7, respectively. This gives a mean removal rate of 28/9 = 3.11.

    Obtaining such data for a large collection of words enables us to compute not only the mean but also the mode and the median of the resulting distribution, in case these seem more useful.

    数据挖掘工具

    The meaning of ′number of characters removed′ is not clear-cut for those stemmers which replace endings rather than just removing them. Thus, a stemmer may replace the endings "-ies" and "-ied" by "-y". We could count this as 3 (number of letters removed), or 2 (reduction in length), or even 4 (letters removed + letters added). Alternatively, we could use a metric such as the Hamming Distance, or Modified Hamming Distance, as discussed next.

    数据挖掘论坛


    This is a simply the proportion of the words in a sample that have been changed in any way by the stemming process, the idea being that the larger the number of words that are altered, the greater the strength of the Stemmer.

数据挖掘论坛

[数据挖掘专家] [数据挖掘研究院] [数据挖掘论坛] [数据挖掘实验室]
上一篇:Stemming Errors
下一篇:Stemming - Bibliography
最新评论共有 0 位网友发表了评论 , 查看所有评论
发表评论( 不能超过250字,需审核,请自觉遵守互联网相关政策法规。 )
匿名?
数据挖掘网站导航 数据挖掘论坛导航
  • 数据挖掘工具
  • 数据挖掘论坛
  • DataCruncher - Cognos
  • MineSet - MathSoft
  • Intelligent Miner - GainSmarts
  • Sqlserver - SAS - Clementine
  • CART - Weka - WizSoft
  • NeuroShell - ModelQuest
  • data mining tools - Darwin
  • 数据挖掘交友
  • 数据挖掘博客
  • 数据挖掘工具
  • 数据挖掘资源
  • 数据挖掘技术算法
  • 数据挖掘相关期刊、会议
  • 研究院联盟合作专区
  • 数据挖掘基础与相关技术
  • 数据挖掘厂商与就业
  • 数据挖掘研究者乐园
  • 知名厂商数据挖掘工具资料
  • 国内数据挖掘实验室
  • Foreign Data Mining Lab
  • 热点关注
  • 信息检索的核心支撑技术
  • 信息检索研究人员推荐读物
  • 清华信息检索在TREC评测中再创佳绩
  • 如何实现中文文献的自动聚合分类
  • Resources for Text, Speech and Language
  • 基于WordNet的文本分类技术研究和实现
  • 字符串匹配的KMP算法
  • 中创软件Infor中间件助力税收信息化
  • Boyer Moore 算法
  • 中文信息处理——纵览与建议
  • 论坛最新话题
  • Foundations of Statistical Natural Langu
  • Game Theory meet Data Mining: A Recent P
  • System Building: How does it help or hin
  • 数据挖掘与Clementine培训
  • 新手报到
  • 求 SASEM 客户流失预测分析
  • 数据挖掘工程师/搜索研究院—北京——无线
  • 数据挖掘入门介绍(如何着手数据挖掘)
  • Information Overload Survey Results
  • The INEX 2005 Workshop on Element Retrie
  • 相关资讯
  • 信息检索权威资料收集
  • Artificial Intelligence as Smart as Huma
  • 2nd CFP: Social Linking Track at Hyperte
  • 如何实现中文文献的自动聚合分类
  • 信息检索的核心支撑技术
  • Efficient Similarity Search over Vector
  • MARS: A Matching and Ranking System for
  • 信息检索研究人员推荐读物
  • Resources for Text, Speech and Language
  • Information Wants to be Found
  • 数据挖掘实验室资料
  • 数据挖掘博客地址
  • 数据挖掘实验室网站地址
  • Prepare for Medicare audits by using dat
  • 注册成为SAS用户与爱好者俱乐部会员
  • 水南梅
  • 明日烟
  • 新人报道
  • 下载
  • 厦门服务器托管,450元/月—0592-5177319 高
  • 买空间送域名--0592-5177319 高静