Start with sources.
Text analytics has had great success in areas such as mining biomedical literature as part of drug-discovery processes. If we can understand the relationship between certain protein interactions and disease onset, we can begin to identify promising therapies. Text analytics can help us achieve this understanding without costly and time-consuming clinical trials. We mine for factual information, yet accuracy of information extraction from formally written scientific literature as measured by precision and recall – by levels of correctness and exhaustiveness – typically reaches the 85%-90% range.
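To make precision and recall concrete, here is a minimal Python sketch of how the two measures might be computed for an extraction run. The function name and the sample interaction pairs are invented for illustration; they are assumptions, not data from any real study or tool.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted items against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard items that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical protein-interaction pairs, made up for this example.
gold = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
extracted = {("A", "B"), ("A", "C"), ("B", "D"), ("B", "C")}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four extracted pairs are correct and three of the four true pairs were found, so both measures come out at 0.75.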
Opinions are far harder than facts to describe. Opinion sources are typically informally written (or worse) and highly diverse. They are short on descriptive metadata that can provide context for analytical efforts. So sentiment-extraction accuracy is typically far lower, but it can be boosted by approaches that are appropriate for the sources and goals.
We might start by classifying source documents – Web pages, e-mail messages, news or blog articles, or audio transcripts – by theme, topic, type, authorship, and other characteristics. To this end, we parse documents for entities such as names of persons, products, companies, and places; for descriptive attributes such as authorship; and also for abstract concepts. For example, the concept “vehicle” subsumes entities that are names of makes and models, with year and style attributes. Taxonomies can help in the classification effort, but they may be incomplete when dealing with truly diverse sources.
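The idea of a concept subsuming named entities can be sketched in a few lines of Python. This is a toy lookup under stated assumptions, not a real entity extractor: the taxonomy contents, the function name find_concepts, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical mini-taxonomy: concept -> entity names it subsumes.
TAXONOMY = {
    "vehicle": ["Honda Civic", "Ford Focus", "Toyota Prius"],
    "company": ["Honda", "Ford", "Toyota", "Apple"],
}


def find_concepts(text, taxonomy=TAXONOMY):
    """Map each concept to the taxonomy entities mentioned in the text."""
    found = {}
    for concept, names in taxonomy.items():
        # Whole-word, case-insensitive match for each known entity name.
        hits = [
            name for name in names
            if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE)
        ]
        if hits:
            found[concept] = hits
    return found


doc = "I traded my 2004 Ford Focus for a Toyota Prius hybrid."
print(find_concepts(doc))
```

A make or model missing from the dictionary is simply not found, which is the incompleteness problem taxonomies face with diverse sources.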
Entity extraction gives us subject matter for further investigation. But beyond facts – “I bought my first Mac last year” – what was the writer or speaker trying to communicate?
According to researchers Livia Polanyi and Annie Zaenen, “The most salient clues about attitude are provided by the lexical choice of the writer, ... but the organization of the text also contributes information relevant to assessing attitude.” Lexical choices: those are words. Boost, benefit, and brave indicate positive valence – that is, tone or polarity – while