Introduction and Motivation
Designing a rich web site so that it readily yields its information can be tricky. Unlike the proverbial oyster that contains a single pearl, a web site often contains myriad facts, images, and hyperlinks. Many different visitors approach a popular web site -- each with his or her own goals and concerns. Consider, for example, the web site for a typical computer science department. The site contains an amalgam of research project descriptions, course information, lists of graduating students, pointers to industrial affiliates, and much more. Each nugget of information is of value to someone who would like to access it readily. One might think that a well organized hierarchy would solve this problem, but we've all had the experience of banging our heads against a web site and crying out ``it's got to be here somewhere...''.
The problem of good web design is compounded by several factors. First, different visitors have distinct goals. Second, the same visitor may seek different information at different times. Third, many sites outgrow their original design, accumulating links and pages in unlikely places. Fourth, a site may be designed for a particular kind of use, but be used in many different ways in practice; the designer's a priori expectations may be violated. Too often web site designs are fossils cast in HTML, while web navigation is dynamic, time-dependent, and idiosyncratic. In [13], we challenged the AI community to address this problem by creating adaptive web sites: sites that semi-automatically improve their organization and presentation by learning from visitor access patterns.
Many web sites can be viewed as user interfaces to complex information stores. However, in contrast to standard user interfaces, where data on user behavior has to be gathered in expensive (and artificial) focus groups and usability labs, web server logs automatically record user behavior at the site. We posit that adaptive web sites could become a valuable method of mining this data with the goal of continually tuning the site to its user population's needs.
While adaptive web sites are potentially valuable, their feasibility is unclear a priori: Can non-trivial adaptations be automated? Will adaptive web sites run amok, yielding chaos rather than improvement? What is an appropriate division of labor between the automated system and the human webmaster? To investigate these issues empirically, we analyze the problem of index page synthesis.
We focus on one subproblem (generating candidate link sets to include in index pages) as amenable to automation and describe the PageGather algorithm, which solves it.
The remainder of this paper is organized as follows. We next discuss the design space of adaptive web sites and present previous work in this area. We then present design desiderata which motivate our own approach. In section 2, we define the index page synthesis problem, the focus of our case study. We then present PageGather, analyzing variants of PageGather and both data mining and clustering algorithms as potential solutions. In section 3, we experimentally evaluate variants of PageGather, and compare the performance of PageGather to that of Apriori, the classical data mining algorithm for the discovery of frequent sets [3]. We also compare PageGather's output to pre-existing, human-authored index pages available at our experimental web site. We conclude with a discussion of future work and a summary of our contributions.
Design Space
Adaptive web sites vary along a number of design axes.
- Types of adaptations. New pages may be created. Links may be added or removed, highlighted or rearranged. Text, link labels, or formatting may be altered.
- Customization vs. Transformation. Customization is modifying a web site to suit the needs of an individual user; customization necessitates creating a large number of versions of the web site -- one for each user. In contrast, transformation involves altering the site to make navigation easier for a large set of users. For example, a university web site may be reorganized to support one ``view'' for faculty members and a distinct view for students. In addition, certain transformations may seek to improve the site for all visitors.
- Content-based vs. Access-based. A site that uses content-based adaptation organizes and presents pages based on their content -- what the pages say and what they are about. Access-based adaptation uses the way past visitors have interacted with the site to guide how information is structured. Naturally, content-based and access-based adaptations are complementary and may be used together.
- Degree of automation. Excite and Yahoo's manually personalized home pages are a simple example of customization; we are interested in more automatic adaptation techniques. However, for feasibility, adaptive web sites are likely to be only partially automated.
We now survey previous work on adaptive web sites using the vocabulary and distinctions introduced above.
Previous Work
It is quite common for web sites to allow users to customize the site for themselves. Common manual customizations include lists of favorite links, stock quotes of interest, and local weather reports. Slightly automated customizations include records of previous interactions with the site and references to pages that have changed since the previous visit. Some sites also allow users to describe interests and will present information -- news articles, for example -- relevant to those interests.
More sophisticated sites attempt path prediction: guessing where the user wants to go and taking her there immediately (or at least providing a link). WebWatcher [5] learns to predict what links users will follow on a particular page as a function of their specified interests. A link that WebWatcher believes a particular user is likely to follow will be highlighted graphically and duplicated at the top of the page when it is presented. Visitors to a site are asked, in broad terms, what they are looking for. Before they depart, they are asked if they have found what they wanted. WebWatcher takes an access-based approach, using the paths of people who indicated success as examples of successful navigations. If, for example, many people who were looking for ``personal home pages'' follow the ``people'' link, then WebWatcher will tend to highlight that link for future visitors with the same goal. Note that, because WebWatcher groups people based on their stated interests rather than customizing to each individual, it falls on the continuum between pure customization and pure transformation.
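The access-based idea behind this kind of link highlighting can be illustrated with a minimal Python sketch. It is only a frequency-count stand-in, not WebWatcher's actual learning method: for each stated goal and page, it counts which links were followed by visitors who reported success, and proposes the most frequent ones for highlighting. The function names `learn_link_counts` and `links_to_highlight`, the `(goal, path, found_it)` record format, and the sample visits are all hypothetical.

```python
from collections import Counter, defaultdict


def learn_link_counts(visits):
    """Count links followed by successful visitors, keyed by (goal, page).

    `visits` is an iterable of (goal, path, found_it) records, an assumed
    format: `goal` is the visitor's stated interest, `path` is the ordered
    list of pages visited, and `found_it` is the visitor's own report of
    whether the goal was met.
    """
    counts = defaultdict(Counter)
    for goal, path, found_it in visits:
        if not found_it:
            continue  # only successful navigations serve as examples
        for page, next_page in zip(path, path[1:]):
            counts[(goal, page)][next_page] += 1
    return counts


def links_to_highlight(counts, goal, page, k=1):
    """Return up to k links most often followed from `page` for `goal`."""
    followed = counts.get((goal, page), Counter())
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(followed.items(), key=lambda kv: (-kv[1], kv[0]))
    return [link for link, _ in ranked[:k]]


# Hypothetical sample data for illustration only.
visits = [
    ("personal home pages", ["home", "people", "alice"], True),
    ("personal home pages", ["home", "people", "bob"], True),
    ("personal home pages", ["home", "research"], False),
    ("course info", ["home", "courses"], True),
]
counts = learn_link_counts(visits)
print(links_to_highlight(counts, "personal home pages", "home"))
```

Because counts are keyed by the stated goal rather than by the individual visitor, the sketch shares the property noted above: it sits between pure customization and pure transformation.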
A site may also try to customize to a user by trying to guess her general interests dynamically as she browses. The AVANTI Project [7] focuses on dynamic customization based on users' needs and tastes. As with WebWatcher, AVANTI relies partly on users providing information about themselves when they enter the site. Based on what it knows about the user, AVANTI attempts to predict both the user's eventual goal and her likely next step. AVANTI will prominently present links leading directly to pages it thinks a user will want to see. Additionally, AVANTI will highlight links that accord with the user's interests.
Another form of customization is based on collaborative filtering. In collaborative filtering, users rate objects (e.g., web pages or movies) based on how much they like them. Users who tend to give similar ratings to similar objects are presumed to have similar tastes; when a user seeks recommendations of new objects, the site suggests those objects that were highly rated by other users with similar tastes. The site recommends objects based solely on other users' ratings or accesses, ignoring the content of the objects themselves. A simple form of collaborative filtering is used by, for example, Amazon.com; the web page for a particular book may have links to other books commonly purchased by people who bought this one. Firefly uses a more individualized form of collaborative filtering in which members may rate hundreds of CDs or movies, building up a very detailed personal profile; Firefly then compares this profile with those of other members to make new recommendations.
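The general scheme of collaborative filtering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used by Amazon.com or Firefly: similarity between two users is taken to be the cosine of their ratings on commonly rated items, and an unseen item's predicted rating is the similarity-weighted average of other users' ratings. The names `cosine_similarity` and `recommend` and the toy ratings are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two rating dicts over the items both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank items `user` has not rated by similarity-weighted average rating.

    `ratings` maps user -> {item: rating}. Only ratings are consulted;
    the content of the items is never inspected.
    """
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no evidence of shared taste
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # recommend only new objects
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: totals[item] / weights[item] for item in totals}
    return sorted(predicted, key=lambda item: (-predicted[item], item))[:top_n]


# Hypothetical toy ratings on a 1-5 scale.
ratings = {
    "ann": {"A": 5, "B": 1},
    "bob": {"A": 5, "B": 1, "C": 5, "D": 1},
    "cat": {"A": 1, "B": 5, "C": 1, "D": 5},
}
print(recommend(ratings, "ann"))
```

In the toy data, ann's ratings agree with bob's far more than with cat's, so the item bob rated highly is ranked first for her.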
Footprints [23] takes an access-based transformation approach. Its motivating metaphor is that of travelers creating footpaths in the grass over time. Visitors to a web site leave their ``footprints'' behind, in the form of counts of how often each link is traversed; over time, ``paths'' accumulate in the most heavily traveled areas. New visitors to the site can use these well-worn paths as indicators of the most interesting pages to visit. Footprints are left automatically (and anonymously), and any visitor to the site may see them; visitors need not provide any information about themselves in order to take advantage of the system. Footprints provides essentially localized information; the user sees only how often links between adjacent pages are traveled.
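The link-traversal counts at the heart of this approach are easy to picture in code. The following Python sketch is an illustration under our own assumptions, not the Footprints implementation: it assumes visits are already available as ordered page paths, and the names `count_link_traversals` and `outgoing_footprints` and the sample sessions are hypothetical.

```python
from collections import Counter


def count_link_traversals(sessions):
    """Count how often each (source, destination) link is traversed.

    `sessions` is an iterable of page paths, one ordered list per visit.
    No visitor identity is stored, so the counts are anonymous.
    """
    counts = Counter()
    for path in sessions:
        for src, dst in zip(path, path[1:]):
            counts[(src, dst)] += 1
    return counts


def outgoing_footprints(counts, page):
    """Return (destination, count) pairs for links leaving `page`.

    Only links adjacent to the current page are reported, so the view
    is local. Results are sorted by descending count, then by name.
    """
    links = {dst: n for (src, dst), n in counts.items() if src == page}
    return sorted(links.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample sessions for illustration only.
sessions = [
    ["home", "people", "faculty"],
    ["home", "people", "students"],
    ["home", "courses"],
]
counts = count_link_traversals(sessions)
print(outgoing_footprints(counts, "home"))
```

A visitor on the home page would see that the link to the people page is the more heavily traveled of the two, without learning anything about paths deeper in the site.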
A web site's ability to adapt could be enhanced by providing it with meta-information: information about its content, structure, and organization. One way to provide meta-information is to represent the site's content in a formal framework with precisely defined semantics, such as a database or a semantic network. The use of meta-information to customize or optimize web sites has been explored in a number of projects (see, for example, XML annotations [9], Apple's Meta-Content Format, and other projects [6,11]). One example of this approach is the STRUDEL web-site management system [6], which attempts to separate the information available at a web site from its graphical presentation. Instead of manipulating web sites at the level of pages and links, web sites may be specified using STRUDEL's view-definition language. With all of the site's content so encoded, its presentation may be easily adapted.
A number of projects have explored client-side customization, in which a user has her own associated agent that learns about her interests and customizes her web experience accordingly. The AiA project [4,17] explores the customization of web page information by adding a ``presentation agent'' that can direct the user's attention to topics of interest. The agent has a model of the individual user's needs, preferences, and interests and uses this model to decide what information to highlight and how to present it. In the AiA model, the presentation agent is on the client side, but similar techniques could be applied to customized presentation by a web server as well. Letizia [10] is a personal agent that learns a model of its user by observing her behavior. Letizia explores the web ahead of the user (investigating links off the current page) and uses its user model to recommend pages it thinks the user will enjoy. Other projects have investigated performing customization at neither the client nor the server but as part of the network in between, particularly by using transcoding proxies. Transend [8], for example, is a proxy server at the University of California at Berkeley that performs image compression and allows each of thousands of users to customize the degree of compression, the interface for image refinement, and the web pages to which compression is applied.

