Special Section -Enhanced Text Retrieval Using Natural Langu

It makes common sense for linguistic processing to be used in the task of text retrieval, given that users' queries are linguistic expressions, and the relevant documents that the system is attempting to retrieve are also linguistic objects. While this may seem obvious today, it has not always been the case.

In the early days of information retrieval (IR) research, approaches to IR remained mainly statistical. This state of affairs was particularly true after funding support for machine translation (MT) research was all but abandoned due to the ALPAC (Automatic Language Processing Advisory Committee of the National Academy of Sciences-National Research Council) Report of 1966. This report said that MT was beyond then-available computational capabilities and recommended that it not be funded. Some low-level linguistic techniques, such as stemming, were introduced and spread widely. However, most efforts to include more complex techniques, such as natural language processing (NLP), were scoffed at. The same situation continued to hold true in the 1970s and 1980s. Those who attempted to demonstrate that NLP had enhanced capabilities to offer IR had an uphill struggle, given the predominant focus on successful statistical approaches by the leaders of the field.

However, by 1993 and 1994, when Dave Lewis and I presented tutorials on the use of NLP for IR at the annual conferences of the Association for Computational Linguistics and ACM-SIGIR (Special Interest Group in Information Retrieval), attendance was exceptionally high. Equally high were both the skepticism about and the optimism for NLP's ability to improve the effectiveness of real IR applications. Still, the large attendance augured well for the future. While we acknowledged the difficulties others had pointed out as endemic to NLP, the field had advanced sufficiently that the difficulties earlier researchers had deemed insurmountable now seemed more tractable to those in attendance. Inclusion of a broader range of NLP techniques has gradually increased since that time, but it is really only in very recent years that demonstrated successes in the use of NLP have given the beleaguered NLP paradigm a chance at inclusion in large scale IR systems.

We will look later at the circumstances that can be considered responsible for this change after a brief overview of natural language processing.

Definition of NLP

Natural language processing is a range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of particular tasks or applications. The levels of linguistic analysis are:
  • Phonological: interpretation of speech sounds within and across words
  • Morphological: componential analysis of words, including prefixes, suffixes and roots
  • Lexical: word level analysis including lexical meaning and part of speech analysis
  • Syntactic: analysis of words in a sentence in order to uncover the grammatical structure of the sentence
  • Semantic: determining the possible meanings of a sentence, including disambiguation of words in context
  • Discourse: interpreting structure and meaning conveyed by texts larger than a sentence
  • Pragmatic: understanding the purposeful use of language in situations, particularly those aspects of language which require world knowledge
The above levels of linguistic processing reflect an increasing size of unit of analysis as well as increasing complexity and difficulty as we move from top to bottom. The larger the unit of analysis becomes (i.e., from morpheme to word to sentence to paragraph to full document), the less precise the language phenomena and the greater the free choice and variability. This decrease in precision results in fewer discernible rules and more reliance on less predictable regularities as one moves from the lowest to the highest levels. Additionally, higher levels presume reliance on the lower levels of language understanding, and the theories used to explain the data move more into the areas of cognitive psychology and artificial intelligence. As a result, the lower levels of language processing have been more thoroughly investigated and incorporated into IR systems. I am aware of only one system that includes all levels of language analysis.

Use of NLP in IR

The central task in NLP for IR is the translation of potentially ambiguous natural language queries and documents into unambiguous internal representations on which matching and retrieval can take place. In fact, the ideal IR system is one in which users can express their information needs naturally and with all requisite detail - exactly as they would state them to a research librarian. The system should then "understand" the underlying meaning of the query in all its complexity and subtlety. Likewise, a full NLP IR system will represent the contents of documents - no matter the nature of the document - at all the same levels of understanding, thereby permitting full-fledged conceptual matching of queries and documents.

Those who employ NLP in IR applications may elect to use one or more of the multiple levels of language processing and may elect to apply these levels of language processing to just the queries or to both the queries and documents. Unfortunately for the public's understanding of NLP, some systems that call themselves NLP systems do, in fact, use only a few levels of NLP and use them only on the queries. However, as users have become more sophisticated in their understanding of what is meant by NLP, their expectations are that the documents will likewise be processed and that the language processing will be more complex than just stemming.

Having now understood the levels of language - all of which convey meaning - let's take a look at how these levels of NLP can be utilized in an IR system. Some of these uses will be less familiar than others, even to those who know IR, most likely because they are less frequently incorporated in IR systems. The reasons why some levels are not implemented include

  • unfamiliarity of many in IR with how to incorporate the higher levels of language understanding;
  • lack of empirical results which tease out the individual contributions of each of the levels of processing; and
  • concern over the complexity of the processing required by some of these higher levels.

Examples

Starting with the lowest unit, the phonological level comes into play in speech recognition systems which accept spoken queries or even provide spoken documents. For these applications, the phonological level is an obvious requirement, but for other applications in IR, this level has obviously not come into play.

The morphological level is the level of linguistic processing most commonly incorporated in IR systems and has the longest history of inclusion. Stemming of terms in documents and queries so that morphological variants between query and document will match has a long history in both experimental and commercial systems. And while there have been differing empirical results on the impact of stemming in English, most current IR systems support stemming to avoid the potential for obviously missed relevant documents. For example, if the plural forms of nouns in documents are not stemmed, these documents will not match the singular form of the term of interest in a query, or vice versa. It should be noted that for other languages that have a richer morphology, the attention to morphological processing offers a much more obvious and larger pay-off for IR than it does for English.
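The singular/plural example can be made concrete with a small sketch. The suffix rules, the sample documents and the function names (`naive_stem`, `build_index`, `search`) are all assumptions made for illustration; they are far cruder than the stemmers used in real IR systems and are not taken from any particular one.

```python
import re
from collections import defaultdict

# Hypothetical sample collection, used only for illustration.
DOCS = {
    1: "Libraries index documents",
    2: "The library stores a document",
    3: "Speech recognition accepts spoken queries",
}

def naive_stem(term: str) -> str:
    """Strip a few common English plural suffixes (illustrative rules only)."""
    t = term.lower()
    if t.endswith("ies") and len(t) > 4:
        return t[:-3] + "y"          # libraries -> library
    if t.endswith("sses"):
        return t[:-2]                # classes -> class
    if t.endswith("s") and not t.endswith("ss") and len(t) > 3:
        return t[:-1]                # documents -> document
    return t

def build_index(docs):
    """Map each stemmed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[naive_stem(token)].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every stemmed query term."""
    stems = [naive_stem(t) for t in re.findall(r"[a-z]+", query.lower())]
    if not stems:
        return set()
    result = set(index.get(stems[0], set()))
    for stem in stems[1:]:
        result &= index.get(stem, set())
    return result

index = build_index(DOCS)
print(search(index, "library documents"))
```

Because both the documents and the query pass through the same stemmer, the query "library documents" reaches document 1 (which contains only the plural "Libraries") and document 2 (which contains only the singular "document").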

The lexical level of linguistic processing can be used in IR either for part-of-speech tagging or for the utilization of lexicons from which the detailed features of individual terms can be accessed. The lexical level of language is evidenced in the knowledge contained in thesauri and other similar resources, which were originally manual consultation tools for both indexers and searchers. They were and are utilized to ensure that a common vocabulary is used in selecting appropriate indexing or searching terms/phrases. These lexicons, which provide both syntagmatic and paradigmatic relations of terms, can be used in IR systems for automated or semi-automated assistance in building queries. Recognition, tagging and indexing of specific lexical features of interest (e.g., proper nouns) also reflect the use of lexical information.

The syntactic level of linguistic processing utilizes the part-of-speech tagging output from the lexical level and can assign phrase and clause brackets. This semi-parsed text can then be used to drive the selection of better indexing entries because phrases can be automatically recognized and used to represent the documents' contents rather than single-word indexing which frequently introduces ambiguity into the representation and resultant retrieval. Similarly, syntactically identified phrases extracted from the query can provide better searching keys for matching against similarly bracketed documents.
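As a rough illustration of phrase bracketing driven by part-of-speech tags, the sketch below takes text that is assumed to be already tagged and collects runs of adjectives and nouns that end in a noun. The tag names (`ADJ`, `NOUN`, `VERB`), the sample sentence and the function `noun_phrases` are hypothetical; a real system would use a proper tagger and a much richer phrase grammar.

```python
def noun_phrases(tagged):
    """Collect multi-word runs of ADJ/NOUN tokens that end in a NOUN."""
    phrases, run = [], []
    # A sentinel token closes the final run.
    for word, tag in list(tagged) + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((word, tag))
            continue
        # The run has ended: drop trailing adjectives, keep multi-word phrases.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if len(run) >= 2:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases

# Hypothetical tagger output for one sentence.
TAGGED = [
    ("natural", "ADJ"), ("language", "NOUN"), ("processing", "NOUN"),
    ("improves", "VERB"),
    ("text", "NOUN"), ("retrieval", "NOUN"),
]

print(noun_phrases(TAGGED))
```

The extracted phrases can then serve as index entries or query keys in place of their component words, which on their own (e.g., "natural", "text") are far more ambiguous.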

Use of the semantic level of language in IR includes interpretation of the meaning of sentences as the unit of understanding, as opposed to processing at the individual word or phrase level. This level of processing can include the semantic disambiguation of words with multiple senses, the identification of predicate argument relations in sentences or the expansion of the query by addition of all synonymous equivalents of the query terms. Term expansions can be obtained from lexical sources such as WordNet or IR-style thesauri, but the challenge here is to add just those terms which are expansions of the particular sense of the word intended in the query. Another usage of semantic processing is the production of semantic vectors to represent both queries and documents, but this also requires that the appropriate sense of each term has been determined and the appropriate semantic category selected for inclusion in the semantic vector.
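The challenge of adding only the synonyms of the intended sense can be sketched with a toy sense inventory. Everything here is an assumption for illustration: the `SENSES` table is hand-made (it is not WordNet data), and the sense is chosen by a simple overlap between each sense's cue words and the other query terms, which is only one of many possible disambiguation heuristics.

```python
# Hypothetical sense inventory: each sense lists cue words and synonyms.
SENSES = {
    "bank": [
        {"cues": {"money", "loan", "account"},
         "synonyms": ["financial institution"]},
        {"cues": {"river", "water", "erosion"},
         "synonyms": ["riverbank", "shore"]},
    ],
}

def expand_query(query, senses=SENSES):
    """Append synonyms only for the sense supported by the query context."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        candidates = senses.get(term)
        if not candidates:
            continue
        context = set(terms) - {term}
        # Pick the sense whose cue words overlap most with the other terms.
        best = max(candidates, key=lambda s: len(s["cues"] & context))
        if best["cues"] & context:
            expanded.extend(best["synonyms"])
    return expanded

print(expand_query("river bank erosion"))
print(expand_query("bank"))
```

With context, "river bank erosion" gains only the riverbank synonyms. The bare query "bank" is left unexpanded, since adding the synonyms of every sense would pull in documents about the wrong meaning.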

The discourse level of language processing goes beyond the sentence to understand and represent meaning and therefore can utilize the structure and organizing principles implicitly used by writers of documents and queries. Such processing would take into account the predictable script-like structure of communications which are oft-repeated by a community which actually relies on this structure to convey meaning above and beyond that conveyed in individual words or sentences. In IR, the discourse level structure can be utilized to understand what the specific role of a piece of information is in a document, for example - is this a conclusion, is this an opinion, is this a prediction or is this a fact?

Additionally, the recognition and resolution of anaphora (abbreviated subsequent reference to a concept introduced earlier in the text, e.g., pronouns) would result in an improved representation of both documents and queries. The representation would be improved because anaphora resolution would enable the implicit presence of concepts to be more completely accounted for at the lexical level and an integrated representation of the contents of a query or document to be produced at both the semantic and discourse levels.

The pragmatic level of language, which is concerned with how the external world impacts the meaning of communications, would come into play primarily at the query processing and understanding level. In the same way that good reference librarians can elicit from users the purpose to which they plan to put the information they are seeking, IR systems need to understand the user and his/her needs in the context of their history and their goals. Gricean maxims and other principles of communication can be incorporated in the user interface of IR systems to facilitate the "conversation" between the user and the IR system.

Commercial Use of NLP in IR

Some linguistic enhancements have finally reached commercially available search engines, but there is no standard use of terminology to describe their processes. As a result, what the various search engines are actually doing is often far from clear. Most use of linguistics is rudimentary. The engines expect minimal one-word or two-word queries and are optimized for them, rather than for sentences, which would enable the user to fully present their information need. In general, the linguistic enhancements one now sees include
  • Automatic truncation, mainly to enrich a query with both the plural and singular forms of a noun. Some search engines are more clever and can add or subtract other forms of a word, mainly suffixes.
  • Automatic identification of proper nouns. This feature is becoming more common. The mechanism for doing this identification is simple recognition of upper case, rather than anything more linguistically motivated.
  • Phrase identification. Some search engines have added phrase bracketers. For the most part, it appears that phrase identification is based mainly on word proximity.
  • Concept identification. While this is highly touted on some search engines, it appears that Web search engines rely on statistical word co-occurrence to identify concepts, rather than on true semantic or pragmatic understanding of concepts.
In general, linguistic approaches have begun to filter into the Web search engine world, but they do not appear to have hit the traditional online systems, whose most common form of NLP continues to be automatic truncation. While some services permit the user to enter a query without using the idiosyncratic formats previously required, query processing still appears to be dependent at most on simple morphological and lexical levels of processing. Additionally, most systems which state that they use NLP appear to perform linguistic processing on just the queries. The documents in the database have not been processed with any level of linguistic analysis. However, current attention to linguistics among IR vendors seems to indicate that they will be incorporating more NLP in the future.
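To show how shallow two of these enhancements are, here is a minimal sketch of singular/plural query enrichment and of proper noun spotting by upper case alone. The function names and the sample sentence are hypothetical, and the code is a guess at the general style of heuristic described above, not a reconstruction of any particular engine.

```python
import re

def add_number_variants(term):
    """Return the term plus a guessed singular or plural form."""
    t = term.lower()
    if t.endswith("s") and len(t) > 3:
        return {t, t[:-1]}       # documents -> document
    return {t, t + "s"}          # engine -> engines

def capitalized_runs(text):
    """Treat every run of capitalized words as a candidate proper noun."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)

print(add_number_variants("documents"))
print(capitalized_runs("The query mentioned New York and Syracuse University"))
```

The upper-case heuristic finds "New York" and "Syracuse University" but also returns the sentence-initial "The", and the suffix rule would turn "query" into "querys". Both errors reflect the absence of any real morphological or lexical analysis.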

Those who are interested in learning about a full NLP-based IR system that does incorporate all the levels of language processing described above may want to check out the DR-LINK (Document Retrieval using LINguistic Knowledge) System at www.textwise.com or www.mnis.net.

This system was developed to demonstrate the powerful capabilities which NLP has to offer IR.
