Text Analytics Basics

I created a workshop, Text Analytics for Dummies, for presentation before the start of this year’s Text Analytics Summit. Many folks who attend the summit are new to text analytics. The summit sponsors and I figured they could use a solid grounding in the technology and typical applications to help them understand the sometimes-intense summit content. Our figuring was right on: I expected 20 workshop attendees, but we had over 35. It occurs to me that the same conditions apply to readers of my Business Intelligence Network (BeyeNETWORK.com) text analytics channel. I’ve touched on technology underpinnings in previous articles, but I have never covered them comprehensively, hence this month's article, Text Analytics Basics. This article – the first of two parts – should be especially useful as background for my recently published Business Intelligence Network research report, Voice of the Customer: Text Analytics for the Responsive Enterprise, which is featured on BeyeRESEARCH.com.

I’ve posted my class slides on the web; they may be of some use even though many do not carry explanatory text. All the same, the overall text-analytics story should come through clearly. That story starts with placing the technology in terms of what people do with electronic documents:

  1. Publish, manage and archive.

  2. Index and search.

  3. Categorize and classify according to metadata and contents.

  4. Extract information.

For textual documents, text analytics enhances #2 and enables #3 and #4. Text analytics enriches indexing and search by discerning the concepts and relationships behind search terms and document content, which provide relevance-boosting context. That is, text analytics enables search engines to provide more accurate results (as measured by both precision and recall, to be defined later), along with improved results ranking and presentation. Text analytics – text data mining, actually – provides the technology behind clustering, categorizing and classifying documents and their contents, supporting both interactive exploration of text-sourced information and automated document processing. And information extraction (IE) – pulling important entities, concepts, relationships, facts and opinions from text – is the key to including text-sourced data in business intelligence (BI) and predictive-analytics applications.
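As a rough illustration of the two search-accuracy measures mentioned above, here is a minimal Python sketch using their standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The function name and the document IDs are hypothetical and chosen only for this example; they are not drawn from any particular search engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: IDs of the documents the search engine returned
    relevant:  IDs of the documents a human judged relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned

    # Guard against division by zero when either set is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three judged relevant.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d2", "d4", "d5"]

p, r = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

In this made-up case, two of the four returned documents are relevant (precision 0.50), and two of the three relevant documents were found (recall 0.67). A search engine that better understands the concepts behind a query would aim to raise both numbers.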

Back to the Future for Business Intelligence

Enterprises now face an imperative, given the huge volume of textual information generated by enterprises and their stakeholders, to exploit “unstructured” sources to discern and act on opportunity and risk. For business intelligence, it’s Back to the Future. The original conception of BI, dating to a 1958 IBM Journal paper, A Business Intelligence System, defined business as “a collection of activities carried on for whatever purpose, be it science, technology, commerce, industry, law, government, defense, et cetera,” and “the notion of intelligence... as the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” Notably, the paper's author, Hans Peter Luhn, focused exclusively on documents as an information source – business operations weren’t computerized in 1958 – and also on core knowledge management questions:

  • What is known?

  • Who knows what?

  • Who needs to know?

In a sense, for 45+ years, business intelligence detoured around the estimated 80% of enterprise information locked inaccessibly in textual form. The reason is clear. As Prabhakar Raghavan of Yahoo Research explains, “The bulk of information value is perceived as coming from data in relational tables. The reason is that data that is structured is easy to mine and analyze.” So business intelligence thrived – crunching fielded, numerical, RDBMS-managed data, structured for analyses via star schemas and the like. And BI delivered findings via tables, charts and dashboards that focus more on numbers than on knowledge, that is, more than on the “interrelationships of presented facts” that “guide action towards a desired goal.”

But now, within the last few years, text technologies have matured to the point where they can meet the “unstructured data” challenge.
