The Lovins stemming algorithm

The first ever published stemming algorithm was: Lovins JB (1968) Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11: 22-31. Julie Beth Lovins’ paper was remarkable for the early date at which it was done, and for its seminal influence on later work in this area.

The design of the algorithm was much influenced by the technical vocabulary with which Lovins found herself working (subject term keywords attached to documents in the materials science and engineering field). The subject term list may also have been slightly limiting in that certain common endings are not represented (ements and ents for example, corresponding to the singular forms ement and ent), and also in that the algorithm′s treatment of short words, or words with short stems, can be rather destructive.

The Lovins algorithm is noticeably bigger than the Porter algorithm, because of its very extensive endings list. But in one way that is used to advantage: it is faster. It has effectively traded space for time, and with its large suffix set it needs just two major steps to remove a suffix, compared with the eight of the Porter algorithm.

The Lovins stemmer has 294 endings, 29 conditions and 35 transformation rules. Each ending is associated with one of the conditions. In the first step the longest ending is found which satisfies its associated condition, and is removed. In the second step the 35 rules are applied to transform the ending. The second step is done whether or not an ending is removed in the first step.

For example, nationally has the ending ationally, with associated condition, B, ‘minimum stem length = 3’. Since removing ationally would leave a stem of length 1 this is rejected. But it also has ending ionally with associated condition A. Condition A is ‘no restriction on stem length’, so ionally is removed, leaving nat. 数据挖掘工具

The transformation rules handle features like letter undoubling (sitting -> sitt -> sit), irregular plurals (matrix and matrices), and English morphological oddities ultimately caused by the behaviour of Latin verbs of the second conjugation (assume / assumption, commit / commission etc). Although they are described as being applied in turn, they can be broken into two stages, rule 1 being done in stage 1, and either zero or one of rules 2 to 35 being done in stage 2.

Here is the list of endings as given in Appendix A of Lovins’ paper. They are grouped by length, from 11 characters down to 1. Each ending is followed by its condition code.



 

Appendix A. The list of endings

 
    .11.
    alistically B   arizability A   izationally B
 
    .10.
    antialness A   arisations A   arizations A   entialness A
 
    .09.
    allically C   antaneous A   antiality A   arisation A
    arization A   ationally B   ativeness A   eableness E
    entations A   entiality A   entialize A   entiation A
    ionalness A   istically A   itousness A   izability A
    izational A
 
    .08.
    ableness A   arizable A   entation A   entially A
    eousness A   ibleness A   icalness A   ionalism A
    ionality A   ionalize A   iousness A   izations A
    lessness A
 
    .07.
    ability A   aically A   alistic B   alities A
    ariness E   aristic A   arizing A   ateness A
    atingly A   ational B   atively A   ativism A
    elihood E   encible A   entally A   entials A
    entiate A   entness A   fulness A   ibility A
    icalism A   icalist A   icality A   icalize A
    ication G   icianry A   ination A   ingness A
    ionally A   isation A   ishness A   istical A
    iteness A   iveness A   ivistic A   ivities A
    ization F   izement A   oidally A   ousness A
 
    .06.
    aceous A   acious B   action G   alness A
    ancial A   ancies A   ancing B   ariser A
    arized A   arizer A   atable A   ations B
    atives A   eature Z   efully A   encies A
    encing A   ential A   enting C   entist A
    eously A   ialist A   iality A   ialize A
    ically A   icance A   icians A   icists A
    ifully A   ionals A   ionate D   ioning A
    ionist A   iously A   istics A   izable E
    lessly A   nesses A   oidism A
 
    .05.
    acies A   acity A   aging B   aical A
    alist A   alism B   ality A   alize A
    allic BB   anced B   ances B   antic C
    arial A   aries A   arily A   arity B
    arize A   aroid A   ately A   ating I
    ation B   ative A   ators A   atory A
    ature E   early Y   ehood A   eless A
    elity A   ement A   enced A   ences A
    eness E   ening E   ental A   ented C
    ently A   fully A   ially A   icant A
    ician A   icide A   icism A   icist A
    icity A   idine I   iedly A   ihood A
    inate A   iness A   ingly B   inism J
    inity CC   ional A   ioned A   ished A
    istic A   ities A   itous A   ively A
    ivity A   izers F   izing F   oidal A
    oides A   otide A   ously A
 
    .04.
    able A   ably A   ages B   ally B
    ance B   ancy B   ants B   aric A
    arly K   ated I   ates A   atic B
    ator A   ealy Y   edly E   eful A
    eity A   ence A   ency A   ened E
    enly E   eous A   hood A   ials A
    ians A   ible A   ibly A   ical A
    ides L   iers A   iful A   ines M
    ings N   ions B   ious A   isms B
    ists A   itic H   ized F   izer F
    less A   lily A   ness A   ogen A
    ward A   wise A   ying B   yish A
 
    .03.
    acy A   age B   aic A   als BB
    ant B   ars O   ary F   ata A
    ate A   eal Y   ear Y   ely E
    ene E   ent C   ery E   ese A
    ful A   ial A   ian A   ics A
    ide L   ied A   ier A   ies P
    ily A   ine M   ing N   ion Q
    ish C   ism B   ist A   ite AA
    ity A   ium A   ive A   ize F
    oid A   one R   ous A
 
    .02.
    ae A   al BB   ar X   as B
    ed E   en F   es E   ia A
    ic A   is A   ly B   on S
    or T   um U   us V   yl R
    s′ A   ′s A
 
    .01.
    a A   e A   i A   o A
    s W   y B




Here are the 29 conditions, called A to Z, AA, BB and CC (* stands for any letter):



 

Appendix B. Codes for context-sensitive rules associated with certain endings

    A   No restrictions on stem
    B   Minimum stem length = 3
    C   Minimum stem length = 4
    D   Minimum stem length = 5
    E   Do not remove ending after e
    F   Minimum stem length = 3 and do not remove ending after e
    G   Minimum stem length = 3 and remove ending only after f
    H   Remove ending only after t or ll
    I   Do not remove ending after o or e
    J   Do not remove ending after a or e
    K   Minimum stem length = 3 and remove ending only after l, i or u*e
    L   Do not remove ending after u, x or s, unless s follows o
    M   Do not remove ending after a, c, e or m
    N   Minimum stem length = 4 after s**, elsewhere = 3
    O   Remove ending only after l or i
    P   Do not remove ending after c
    Q   Minimum stem length = 3 and do not remove ending after l or n
    R   Remove ending only after n or r
    S   Remove ending only after dr or t, unless t follows t
    T   Remove ending only after s or t, unless t follows o
    U   Remove ending only after l, m, n or r
    V   Remove ending only after c
    W   Do not remove ending after s or u
    X   Remove ending only after l, i or u*e
    Y   Remove ending only after in
    Z   Do not remove ending after f
    AA   Remove ending only after d, f, ph, th, l, er, or, es or t
    BB   Minimum stem length = 3 and do not remove ending after met or ryst
    CC   Remove ending only after l
数据挖掘实验室

There is an implicit assumption in each condition, A included, that the minimum stem length is 2.

Finally, here are the 35 transformation rules.
 


 

Appendix C. Transformation rules used in recoding stem terminations

    1         remove one of double b, d, g, l, m, n, p, r, s, t
    2   iev   ->   ief
    3   uct   ->   uc
    4   umpt   ->   um
    5   rpt   ->   rb
    6   urs   ->   ur
    7   istr   ->   ister
    7a   metr   ->   meter
    8   olv   ->   olut
    9   ul   ->   l except following a, o, i
    10   bex   ->   bic
    11   dex   ->   dic
    12   pex   ->   pic
    13   tex   ->   tic
    14   ax   ->   ac
    15   ex   ->   ec
    16   ix   ->   ic
    17   lux   ->   luc
    18   uad   ->   uas
    19   vad   ->   vas
    20   cid   ->   cis
    21   lid   ->   lis
    22   erid   ->   eris
    23   pand   ->   pans
    24   end   ->   ens except following s
    25   ond   ->   ons
    26   lud   ->   lus
    27   rud   ->   rus
    28   her   ->   hes except following p, t
    29   mit   ->   mis
    30   ent   ->   ens except following m
    31   ert   ->   ers
    32   et   ->   es except following n
    33   yt   ->   ys
    34   yz   ->   ys
(Rule 30 as given here corrects a typographical error in the published paper of 1968.)

数据挖掘论坛



The following examples show the intentions behind these rules.

    1         rubb[ing] -> rub, embedd[ed] -> embed etc
    2   believ[e] -> belief
    3   induct[ion] -> induc[e]
    4   consumpt[ion] -> consum[e]
    5   absorpt[ion] -> absorb
    6   recurs[ive] -> recur
    7   administr[ate] -> administ[er]
    7a   parametr[ic] -> paramet[er]
    8   dissolv[ed] -> dissolut[ion]
    9   angul[ar] -> angl[e]
    10   vibex -> vibic[es]
    11   index -> indic[es]
    12   apex -> apic[es]
    13   cortex -> cortic[al]
    14   anthrax -> anthrac[ite]
    15   ?
    16   matrix
[数据挖掘专家] [数据挖掘研究院] [数据挖掘论坛] [数据挖掘实验室]
上一篇:The Krovetz Stemmer
下一篇:What is Stemming?
最新评论共有 0 位网友发表了评论 , 查看所有评论
发表评论( 不能超过250字,需审核,请自觉遵守互联网相关政策法规。 )
匿名?
数据挖掘网站导航 数据挖掘论坛导航
  • 数据挖掘工具
  • 数据挖掘论坛
  • DataCruncher - Cognos
  • MineSet - MathSoft
  • Intelligent Miner - GainSmarts
  • Sqlserver - SAS - Clementine
  • CART - Weka - WizSoft
  • NeuroShell - ModelQuest
  • data mining tools - Darwin
  • 数据挖掘交友
  • 数据挖掘博客
  • 数据挖掘工具
  • 数据挖掘资源
  • 数据挖掘技术算法
  • 数据挖掘相关期刊、会议
  • 研究院联盟合作专区
  • 数据挖掘基础与相关技术
  • 数据挖掘厂商与就业
  • 数据挖掘研究者乐园
  • 知名厂商数据挖掘工具资料
  • 国内数据挖掘实验室
  • Foreign Data Mining Lab
  • 热点关注
  • 信息检索的核心支撑技术
  • 信息检索研究人员推荐读物
  • 清华信息检索在TREC评测中再创佳绩
  • 如何实现中文文献的自动聚合分类
  • Resources for Text, Speech and Language
  • 基于WordNet的文本分类技术研究和实现
  • 字符串匹配的KMP算法
  • 中创软件Infor中间件助力税收信息化
  • Boyer Moore 算法
  • 中文信息处理——纵览与建议
  • 论坛最新话题
  • Foundations of Statistical Natural Langu
  • Game Theory meet Data Mining: A Recent P
  • System Building: How does it help or hin
  • 数据挖掘与Clementine培训
  • 新手报到
  • 求 SASEM 客户流失预测分析
  • 数据挖掘工程师/搜索研究院—北京——无线
  • 数据挖掘入门介绍(如何着手数据挖掘)
  • Information Overload Survey Results
  • The INEX 2005 Workshop on Element Retrie
  • 相关资讯
  • 信息检索权威资料收集
  • Artificial Intelligence as Smart as Huma
  • 2nd CFP: Social Linking Track at Hyperte
  • 如何实现中文文献的自动聚合分类
  • 信息检索的核心支撑技术
  • Efficient Similarity Search over Vector
  • MARS: A Matching and Ranking System for
  • 信息检索研究人员推荐读物
  • Resources for Text, Speech and Language
  • Information Wants to be Found
  • 数据挖掘实验室资料
  • 数据挖掘博客地址
  • 数据挖掘实验室网站地址
  • Prepare for Medicare audits by using dat
  • 注册成为SAS用户与爱好者俱乐部会员
  • 水南梅
  • 明日烟
  • 新人报道
  • 下载
  • 厦门服务器托管,450元/月—0592-5177319 高
  • 买空间送域名--0592-5177319 高静