I created a workshop, Text Analytics for Dummies, for presentation before the start of this year’s Text Analytics Summit. Many folks who attend the summit are new to text analytics. The summit sponsors and I figured they could use a solid grounding in the technology and typical applications to help them understand sometimes-intense summit content. Our figuring was right on: I expected 20 workshop attendees but we had over 35. It occurs to me that the same conditions apply for readers of my Business Intelligence Network (BeyeNETWORK.com) text analytics channel. I’ve touched on technology underpinnings in previous articles, but I have never covered them comprehensively, hence this month's article, Text Analytics Basics. This article – the first of two parts – should be especially useful as background for my recently published Business Intelligence Network research report, Voice of the Customer: Text Analytics for the Responsive Enterprise, which is featured on BeyeRESEARCH.com.
I’ve posted my class slides on the web; they may be of some use even though many do not carry explanatory text. All the same, the overall text-analytics story should come through clearly. That story starts with placing the technology in terms of what people do with electronic documents:
1. Publish, manage and archive.
2. Index and search.
3. Categorize and classify according to metadata and contents.
4. Information extraction.
For textual documents, text analytics enhances #2 and enables #3 and #4. Text analytics enriches indexing and search by discerning the concepts and relationships, which provide relevance-boosting context, behind search terms and document content. That is, text analytics enables search engines to provide more accurate results (as measured by both precision and recall, to be defined later) and improved results ranking and results presentation. Text analytics – text data mining, actually – provides the technology behind clustering, categorizing and classifying documents and their contents, supporting both interactive exploration of text-sourced information and automated document processing. And information extraction (IE) – pulling important entities, concepts, relationships, facts and opinions from text – is the key to including text-sourced data in business intelligence (BI) and predictive-analytics applications.
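As a quick preview of the two accuracy measures mentioned above, here is a minimal Python sketch using the standard information-retrieval definitions: precision is the share of returned documents that are relevant, and recall is the share of relevant documents that were returned. The function name `precision_recall` and the document IDs are hypothetical, chosen only for illustration; they do not come from any particular search product.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs the search engine returned (hypothetical data)
    relevant:  document IDs a human judged relevant (hypothetical data)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both returned and relevant
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative example: 4 documents returned, 3 documents actually relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the four returned documents are relevant, and two of the three relevant documents were found. A search engine that returns everything scores perfect recall but poor precision, which is why the two measures are usually reported together.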
Back to the Future for Business Intelligence
Enterprises now face an imperative, given the huge volume of textual information generated by enterprises and their stakeholders, to exploit “unstructured” sources to discern and act on opportunity and risk. For business intelligence, it’s Back to the Future. The original conception of BI, dating to a 1958 IBM Journal paper, A Business Intelligence System, defined business as “a collection of activities carried on for whatever purpose, be it science, technology, commerce, industry, law, government, defense, et cetera,” and “the notion of intelligence... as the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” Notably, the paper's author, Hans Peter Luhn, focused exclusively on documents as an information source – business operations weren’t computerized in 1958 – and also on core knowledge management questions:
- What is known?
- Who knows what?
- Who needs to know?
In a sense, for 45+ years, business intelligence detoured around the estimated 80% of enterprise information locked inaccessibly in textual form. The reason is clear. As Prabhakar Raghavan of Yahoo Research explains, “The bulk of information value is perceived as coming from data in relational tables. The reason is that data that is structured is easy to mine and analyze.” So business intelligence thrived – crunching fielded, numerical, RDBMS-managed data, structured for analyses via star schemas and the like. And BI delivered findings via tables, charts and dashboards that focus more on numbers than on knowledge, that is, on the “interrelationships of presented facts” that “guide action towards a desired goal.”
But now, within the last few years, text technologies have matured to the point where they can meet the “unstructured data” challenge.