Knowledge Discovery

Author: unknown. Date: 2004-12-04.

Until recently, IT departments with big-budget projects involving data have focused their attention on data warehousing activities: collecting data in different formats from disparate sources and consolidating that data into a central repository. During these multiyear, massive data warehousing projects, few gave much thought to how this data could be "mined" to discover patterns and unexpected relationships.

Knowledge discovery and data mining (KDD) is a relatively new discipline gaining visibility due to the exponential growth in data collection. Companies collecting and storing information on every mouse-click now want to understand trends within this data, and apply that knowledge to reach customers more effectively.

Prediction and Exploration

Data mining is the process of preparing, transforming, reducing, and modeling large volumes of multidimensional data to find useful information in "big data." Data mining tasks typically fall into two categories: prediction and exploration. Prediction is a goal-directed activity; exploration, or knowledge discovery, is more open-ended and searches for interesting patterns without prior expectations.

The goal of predictive data mining is either to classify the data (for example, into "true" or "false" categories) or to perform a regression analysis on the data. Predictive methods search for and identify strong patterns in a given data set, so that new cases matched against these patterns can be labeled and classified based on closeness of match to these existing data items. In predictive mining, the data model usually consists of a large sample set of cases, with each case containing a certain number of features. Formulating a predictive problem trains the system to "learn" which patterns match predefined criteria within existing cases and which don't, and to accept or reject new cases based on these criteria.
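As a rough illustration of classifying a new case by "closeness of match" to existing cases, here is a minimal nearest-neighbor sketch in Python. It is only one of many possible predictive methods, and the feature names, sample values, and the `classify` helper are hypothetical, invented for this example rather than taken from any vendor's product.

```python
import math
from collections import Counter


def classify(new_case, labeled_cases, k=3):
    """Label new_case by majority vote among its k closest labeled cases."""
    # Rank the existing cases by Euclidean distance to the new case.
    nearest = sorted(
        labeled_cases, key=lambda case: math.dist(case[0], new_case)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical history: (monthly_calls, months_as_customer) -> known label.
history = [
    ((5, 2), "churn"),
    ((8, 3), "churn"),
    ((6, 1), "churn"),
    ((40, 36), "loyal"),
    ((35, 48), "loyal"),
    ((50, 24), "loyal"),
]

print(classify((7, 4), history))    # closest to the "churn" cases
print(classify((45, 30), history))  # closest to the "loyal" cases
```

In practice the features would need to be scaled to comparable ranges before distances are meaningful; the sketch skips that step for brevity.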

One example of classification comes from telephone companies, a number of which classify their customers into high- or low-profit groups, or into "churn" versus "loyal" groups. Another application helps identify likely cases of fraud or noncompliance, based on patterns learned from historical data.

Whereas predictive data mining depends on forecasting an outcome based on known answers, the outcome of exploratory mining or knowledge discovery is to identify patterns or unexpected groupings that may not be known ahead of time. Often, exploratory mining (for example, to find clusters of data in a large data set) is used as a preliminary step in doing predictive mining. When used in this manner, exploratory mining can sometimes lead to a better predictive model. One application is market-basket analysis, where an exploratory program can help discover which types of products or services are being purchased together.
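The simplest form of market-basket analysis is counting how often pairs of items appear in the same transaction. The sketch below shows that counting step only; the baskets and the `pair_counts` helper are made up for illustration, and real tools typically go further (for instance, by deriving association rules from such counts).

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical transactions, one list of items per purchase.
baskets = [
    ["book", "cd", "bookmark"],
    ["book", "bookmark"],
    ["cd", "headphones"],
    ["book", "cd"],
]

counts = pair_counts(baskets)
for pair, n in counts.most_common():
    # Support: the fraction of all baskets that contain this pair.
    print(pair, n, n / len(baskets))
```

Pairs with high support are candidates for cross-selling; whether they are actually useful still depends on how common each item is on its own.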

The bulk of commercial data mining projects that organizations are undertaking today focus on static, structured data that has been prepared, cleaned, and transformed into a form (such as a multidimensional spreadsheet) that allows for the comparison and recognition of patterns. However, multiple features, including the introduction of sequence (for example, in time-series data), further complicate the data mining problem by introducing the concept of order into the analysis of any patterns within the data.

From Information to Insight

The overall purpose of KDD is to turn large volumes of information into insight by allowing the access, modeling, analysis, and visualization of key relationships within corporate databases or in online Web-based interactions. This requires that commercial data mining tools provide visual interfaces to allow for the display of charting techniques and publication-quality graphics. As recently as two years ago, vendors of data mining solutions were offering "point" solutions to solve the data mining problem: Some vendors were focused almost exclusively on selling development environments with multiple algorithms, but tended to ignore integration, infrastructure, and scalability issues. Consequently, buyers of such tools were mostly researchers, and the market for these tools remained limited. Meanwhile, an entirely different group of vendors sold visualization software as a separate entity.

Today, all this has changed. Users are demanding that the tools they purchase support multiple data sources (including flat files) simultaneously, without requiring a proprietary underlying data model or data replication or transformation. The tools also need to be scalable in all dimensions. Commercial data mining tools today integrate more easily, for example, with COM-enabled or ActiveX clients. Some tools are now HTML-based to allow users to create and publish analytics in Web pages. Most integrate with standard commercial database offerings from companies like Sybase, Informix, IBM, and Microsoft. Most also allow spreadsheets such as Excel to be used in conjunction with them, and allow results to be displayed directly in Excel or provide output in both XML and ASCII formats. A large number of commercial tools also provide interfaces to popular desktop programs such as Microsoft's Word, Excel, and PowerPoint, and to any ODBC-compliant database.

A Gathering of Quants

Six years ago at the first KDD conference, there was virtually no industry representation. Sessions focused on which algorithm or particular method was faster or more effective at finding patterns. Data mining was considered an esoteric activity, confined largely to academic circles and research departments within large corporations.

This year, however, at the sixth annual Association for Computing Machinery (ACM) KDD conference (http://www.acm.org/sigs/sigkdd/kdd2000), the premier gathering of all those involved in this field, the tone was completely different. Data mining is now big business. Companies represented included General Motors, Chevron, Amazon.com, DaimlerChrysler, Nokia Research, AT&T Labs, NEC Research, and IBM's T.J. Watson Research Center. Vendors of data mining products and services included small startups such as Magnify.com, digiMine, and MINEit, as well as such giants as IBM, Microsoft, SAS, Oracle, and MicroStrategy.

One reason that data mining has taken some time to find its way into mainstream IT departments is that it is so quantitatively focused. Besides the need for experts who understand how to prepare, clean, and transform large quantities of data, data mining projects, unlike data warehousing, require people with a deep understanding of which methods and algorithms best apply to a particular problem.

The tools that have been available until recently required a strong understanding of statistics and other linear and nonlinear methods of data modeling and pattern recognition. And talent in this field is scarce and expensive. Data mining is an interdisciplinary field, with strong quantitative roots. People have come into this field with varying backgrounds in neural networks, statistical analysis, evolutionary computation, fuzzy logic, decision trees, memory-based reasoning, and genetic algorithms.

Even within the field of so-called data mining "experts" there is considerable variation in people's comfort levels with the various algorithms and technologies available to them. Typically, those who have come into the field from the machine learning world (in computer science departments) are familiar with one or another neural network or genetic algorithm method, but may not, for example, know much about rigorous statistical analysis. On the other hand, those with traditional statistical backgrounds (typically from mathematics departments) might be strong in understanding methods that involve linear regression and probability theory, but are uncomfortable with the nonlinear, "black box" trial-and-error methods of machine learning and pattern recognition.

Anne Milley, an analytical strategist at SAS Institute, Cary, N.C., puts it this way: "Companies undertaking analytical data projects need to understand that they may have a range of problems, some of which can be solved by very simple methods; others might require an entire arsenal of tools." She continues, "It's all about dealing with probabilities and helping companies minimize the risks involved in making a particular decision. For a direct mail company, this could mean answering the question: 'To whom should I mail this catalog?' with more precision than ever before."

It's the Application, Stupid!

Indeed, it is the potential for the applications of these technologies that now has companies deeply interested in this area. At this year's KDD conference, for example, Amazon.com had unashamedly set up a booth for recruiting data mining talent. Amazon, like many in the retail world, is focused on technologies that will help it go beyond merely recommending which book or CD a customer might be interested in: It is hell-bent on finding even better ways to identify the specific book or CD that a customer would most likely buy next.

The retail and financial industries have been among the first to embrace data mining technologies, first with the analysis of data in large corporate data warehouses, and more recently, in the analysis of online Web-based activities. Grocery stores, retailers, and consumer packaged goods companies have long been purchasing data from information providers such as IRI and Nielsen. And quantitative market research departments within such companies have applied analytics and "business intelligence" to this data for some time now. However, the resulting decision support systems typically generated reports that were primarily understood by research departments, and it took a while for these reports to have an impact on changing retailers' marketing strategies.

Source: Common Knowledge Cambridge

The market for data mining tools as "point" solutions is limited. Although some standalone tools will remain on the market, the majority of data mining techniques will find their way into embedded solutions in (1) full-blown analytic software packages; (2) Web-based personalization tools; (3) analytics built into marketing platforms; (4) new-generation CRM tools; (5) analytics built into other vertical industry-specific platforms; (6) analytics added to database tools.


The advent of the Web, however, has required a dramatic shortening of response time between analysis and action, and has resulted in the implementation of "closed loop" marketing systems, with much finer-grained analytic capabilities. As a consequence, marketing-specific companies have sprung up to help develop customer profiles by adding offline data warehouse information to data collected about site visitors, and then to personalize the customer experience.

Customer interaction systems and personalization services are now being offered by vendors such as Blue Martini Software, Unica Corp., Net Perceptions, MINEit, E.piphany, MarketFirst Software, YOUpowered, Epsilon, and thinkAnalytics. These systems and services capitalize on the increasing trend toward personalization and one-to-one marketing in real time, along with the need to customize campaigns and interact with individuals based on personal preferences. Chicago-based Magnify.com, for example, provides a customer profiling service that works with historical customer data and Internet data to create recommended action in real time. This process results in a specific offer to a customer who is likely to act on it. Sites using such a service aim to offer personalized content for each visitor, with products and services they are likely to buy.

While sophisticated data mining technologies are built into these systems, the target users of such systems are not only analysts, but also users within mainstream marketing departments. Such users may not necessarily have much background in quantitative techniques, but know what they need to do for effective campaign management or to create personalized permission-based e-mail communications. Applications that provide reporting via a Web portal customized for marketing professionals can provide instant insight and reduce the time it previously took market analysis groups to interpret such reports.

Although marketing is one of the most visible areas for the application of data mining techniques, the explosion of information available in the field of genomics has led to an increased volume of data to be dealt with in the biotechnology and pharmaceuticals industries. As a consequence, these industries are using computational analytics to help them make sense of the vast amounts of data being collected.

Quantitative Marketing?

While there appears to be a relative abundance of quantitative marketing experts within vendor companies, marketing people within retail and banking industries are largely still of the touchy-feely variety, hugely uncomfortable with the new rocket-science technologies associated with data mining. It takes some degree of comfort with multiple disciplines to be able to effectively apply these technologies to the field of marketing.

Scott Carl, VP of marketing at e-tailer Outpost.com, Kent, Conn., is an ideal example of such a rare individual. He can transition effortlessly between an exposition of the number of hidden layers in his neural network, the analytics within SAS's Enterprise Miner, and the applications of these technologies to help his company perform market-basket analysis in order to better upsell and cross-sell.

Outpost.com applies analytics to its marketing activities in four "layers." First, they use SAS to do basic online analytical processing (OLAP)-style reporting. They then perform a classic market-basket analysis to determine which products tend to be purchased together, and to create product associations to help them cross-sell and upsell products. After this, they apply clustering and segmentation techniques to their data. Finally, they apply predictive modeling techniques and perform logistic regressions to help predict the effectiveness of their marketing efforts. These models include classic RFM (recency-frequency-monetary) attributes to perform customer scoring, as well as RFD models, where "D" indicates the duration of a customer visit to their site. Carl likens this to watching a person leaf through a catalog. Outpost.com can use this to summarize purchase and session behavior; for example, "Customer A bought three times, most recently 11 months ago, and spent a total of $500."
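To make the RFM attributes concrete, the following sketch derives recency, frequency, and monetary values from a list of purchases and reproduces the summary sentence quoted above. The purchase dates, the per-order amounts, the reference date, and the `rfm_summary` helper are all hypothetical; this is not Outpost.com's actual scoring code, and real scoring would usually bin or weight these values further.

```python
from datetime import date


def rfm_summary(purchases, as_of):
    """Return (recency in whole months, frequency, monetary total).

    purchases is a list of (purchase_date, amount) tuples.
    """
    last = max(d for d, _ in purchases)
    # Recency: calendar months elapsed since the most recent purchase.
    recency = (as_of.year - last.year) * 12 + (as_of.month - last.month)
    frequency = len(purchases)
    monetary = sum(amount for _, amount in purchases)
    return recency, frequency, monetary


# Hypothetical purchase history for "Customer A".
purchases_a = [
    (date(1999, 3, 10), 150.0),
    (date(1999, 8, 22), 250.0),
    (date(1999, 11, 5), 100.0),
]

r, f, m = rfm_summary(purchases_a, as_of=date(2000, 10, 5))
print(
    f"Customer A bought {f} times, most recently {r} months ago, "
    f"and spent a total of ${m:.0f}."
)
```

An RFD variant would replace the monetary total with a measure of visit duration, which requires session data rather than order data.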

Outpost uses analytics to measure the effectiveness of their advertising, including TV, radio, print, and online banners. This allows the company (after adjusting for variables like changes in volumes due to seasonality) to ensure that it is spending its marketing dollars in the most appropriate places.

"In the past, dot-coms have focused on their ability to drive current session behavior by utilizing historical data and offline data mining," says Carl. "The next big win," he adds, "is when we can couple real-time and historical data to create sensible personalization, while the customer is still shopping." Outpost is currently not doing its analysis in real time, but aspires to do so in the future.

Unfortunately, while e-commerce Web sites are going to become increasingly driven by analytics, there are simply not enough marketing "quants" like Carl who know how to use the wealth of sophisticated products that are beginning to appear on the market. It will take a while for users within marketing departments to exploit, or even understand, the full potential of the tools and technologies becoming available to them.

Meanwhile, a new buzzword, "CRM Analytics," encompasses the marriage of "traditional" CRM environments with quantitative "data mining" technologies that allow for the personalization of relationships with each individual customer. Vendor companies have joined forces to offer this next generation of CRM solution. Broadbase Software, Menlo Park, Calif., a provider of customer-focused analytic and marketing automation applications, has announced its intention to acquire ServiceSoft, a provider of e-service solutions; and Epsilon, Burlington, Mass., a marketing services provider, recently paired up with Xchange Inc., Boston, to create better analytic CRM solutions.

Demand for Scalability

The merging of online clickstream and customer behavioral data with offline data warehouses is allowing retailers to build increasingly sophisticated profiles of their customers. However, integrating transactional, behavioral, and demographic data from disparate sources in a timely fashion continues to be a challenging task. This, coupled with the pressing demands for prompt analysis, raises the requirements for scalable processing.

While parallel processing has taken a great deal of time to find commercial applications beyond areas like weather forecasting, companies may soon begin exploring the need to use technologies that exploit parallelism within multiple distributed processors, in order to speed processing time and allow scale-up. Torrent Systems Inc., Cambridge, Mass., has developed one such solution to allow in-depth analysis of entire databases (as opposed to just samples within such databases). The company is working with ZDNet to help improve its advertising reporting system and its CRM programs. By using Torrent's Orchestrate software, ZDNet has been able to analyze its ad traffic and provide detailed reports to advertisers about visitors to the site.

Besides the unglamorous task of data preparation (so important but hardly addressed in data mining discussions), scalability is also a neglected topic. However, when planning a large data mining project, attention to how such a project will scale up is of immense importance to its ultimate success once it is deployed.

A continual increase in the volume of data collected on individual transactions, together with the detailed recording of Web transactions for an ever-increasing population, makes it necessary to seek meaning in this data. As a result, knowledge discovery is quickly moving from gear-head status to greater and more general acceptance within the IT and business communities.