What is Stemming?

 

Information Retrieval 数据挖掘论坛

Information Retrieval (IR) is essentially a matter of deciding which documents in a collection should be retrieved to satisfy a user′s need for information. The user′s information need is represented by a query or profile, and contains one or more search terms, plus perhaps some additional information such importance weights. Hence, the retrieval decision is made by comparing the terms of the query with the index terms (important words or phrases) appearing in the document itself. The decision may be binary (retrieve/reject), or it may involve estimating the degree of relevance that the document has to the query.

Unfortunately, the words that appear in documents and in queries often have many morphological variants. Thus, pairs of terms such as "computing" and "computation" will not be recognised as equivalent without some form of natural language processing (NLP). 数据挖掘实验室

Stemming

数据挖掘工具

In most cases, morphological variants of words have similar semantic interpretations and can be considered as equivalent for the purpose of IR applications. For this reason, a number of so-called stemming Algorithms, or stemmers, have been developed, which attempt to reduce a word to its stem or root form. Thus, the key terms of a query or document are represented by stems rather than by the original words. This not only means that different variants of a term can be conflated to a single representative form – it also reduces the dictionary size, that is, the number of distinct terms needed for representing a set of documents. A smaller dictionary size results in a saving of storage space and processing time.

数据挖掘实验室

For IR purposes, it doesn′t usually matter whether the stems generated are genuine words or not – thus, "computation" might be stemmed to "comput" – provided that (a) different words with the same ′base meaning′ are conflated to the same form, and (b) words with distinct meanings are kept separate. An algorithm which attempts to convert a word to its linguistically correct root ("compute" in this case) is sometimes called a lemmatiser.

数据挖掘交友

Examples of products using stemming algorithms would be search engines such as Lycos and Google, and also thesauruses and other products using NLP for the purpose of IR. Stemmers and lemmatizers also have applications more widely within the field of Computational Linguistics.

数据挖掘论坛

[数据挖掘工作交流] [数据挖掘研究院] [数据挖掘论坛] [数据挖掘实验室]
上一篇:The Lovins stemming algorithm
下一篇:Stemming Errors
最新评论共有 0 位网友发表了评论 , 查看所有评论
发表评论( 不能超过250字,需审核,请自觉遵守互联网相关政策法规。 )
匿名?
数据挖掘网站导航 数据挖掘论坛导航
  • 数据挖掘工具
  • 数据挖掘论坛
  • DataCruncher - Cognos
  • MineSet - MathSoft
  • Intelligent Miner - GainSmarts
  • Sqlserver - SAS - Clementine
  • CART - Weka - WizSoft
  • NeuroShell - ModelQuest
  • data mining tools - Darwin
  • 数据挖掘交友
  • 数据挖掘博客
  • 数据挖掘工具
  • 数据挖掘资源
  • 数据挖掘技术算法
  • 数据挖掘相关期刊、会议
  • 研究院联盟合作专区
  • 数据挖掘基础与相关技术
  • 数据挖掘厂商与就业
  • 数据挖掘研究者乐园
  • 知名厂商数据挖掘工具资料
  • 国内数据挖掘实验室
  • Foreign Data Mining Lab
  • 热点关注
  • 信息检索的核心支撑技术
  • 信息检索研究人员推荐读物
  • 清华信息检索在TREC评测中再创佳绩
  • 如何实现中文文献的自动聚合分类
  • Resources for Text, Speech and Language
  • 基于WordNet的文本分类技术研究和实现
  • 字符串匹配的KMP算法
  • 中创软件Infor中间件助力税收信息化
  • Boyer Moore 算法
  • 中文信息处理——纵览与建议
  • 论坛最新话题
  • 正规省级、国家级别期刊征集论文稿件
  • 寻data mining cookbook 一书的配套光盘
  • 网博垂直搜索引擎完全开源版
  • 电脑也会成为火灾元凶 操作不当也会有危险
  • 网络暴力间接逼死崔真实 韩国拟立法实名上
  • 网络最流行的歌曲单良《那一场雪》推荐给大
  • 快国庆了大家怎么安排
  • 08年“铁观音秋茶”安溪铁观音,茶叶批发网
  • 快国庆了大家怎么安排
  • 世界最大规模“网格计算”网络启动
  • 相关资讯
  • 信息检索权威资料收集
  • Artificial Intelligence as Smart as Huma
  • 2nd CFP: Social Linking Track at Hyperte
  • 如何实现中文文献的自动聚合分类
  • 信息检索的核心支撑技术
  • Efficient Similarity Search over Vector
  • MARS: A Matching and Ranking System for
  • 信息检索研究人员推荐读物
  • Resources for Text, Speech and Language
  • Information Wants to be Found
  • 数据挖掘实验室资料
  • 注册成为SAS用户与爱好者俱乐部会员
  • 水南梅
  • 明日烟
  • 新人报道
  • 下载
  • 厦门服务器托管,450元/月—0592-5177319 高
  • 买空间送域名--0592-5177319 高静
  • mit ocw 数据挖掘相关课程连接
  • Introduction to Data Mining
  • Data Mining & Business Intelligence