1 Introduction
The world-wide web, having over 350 million pages, continues to grow rapidly at a million pages per day [2]. About 600 GB of text changes every month [19]. Such growth and flux poses basic limits of scale for today′s generic crawlers and search engines. At the time of writing4, Alta Vista′s crawler, called the Scooter, runs on a 1.5 GB memory, 30 GB RAID disk, 4×533MHz AlphaServer 4100-5/300 with 1 GB/s I/O bandwidth. Scooter connects to the indexing engine Vista, which is a 2 GB memory, 180 GB RAID disk, 2×533MHz AlphaServer 4100-5/300. (The query engine is even more impressive, but is not relevant to our discussion.) Other giant web crawlers use similar fire-power, although in somewhat different forms, e.g., Inktomi uses a cluster of hundreds of Sun Sparc workstations5 with 75 GB of RAM and over 1 TB of spinning disk, and it crawls over 10 million pages a day.
数据挖掘交友
In spite of these heroic efforts with high-end multiprocessors and exquisitely crafted crawling software, the largest crawls cover only 30-40% of the web, and refreshes take weeks to a month [2,22]. The overwhelming engineering challenges are in part due to the one-size-fits-all philosophy: Alta Vista and Inktomi try to cater to every possible query that might be made on the web. Although such services are invaluable for their broad coverage, the resulting diversity of content often snares all but the most craftily constructed queries in thousands of responses of little relevance or quality. Furthermore, the imminent explosion of web publication beyond North America and Europe, and beyond academic and corporate sites, will challenge even the most scalable solutions. 数据挖掘交友
Compared to the web, development of the human brain has been tardy: it has grown ``only linearly′′ from 400 to 1400 cubic centimeters in the last 3.5 million years. How do people avoid information overload? Serious web users adopt the strategy of filtering by relevance and quality. The growth of the web matters little to a physicist if at most a few dozen pages dealing with quantum electrodynamics are added or updated per week. Seasoned users also rarely roam aimlessly; they have bookmarked sites important to them, and their primary need is to expand and maintain a community around these examples while preserving the quality.
We argue that a giant, all-purpose crawl is neither necessary nor sufficient for this purpose. In our experience (§2), keyword queries cannot naturally locate resources relevant to specific topics. It is also unreasonable to have to first crawl and index 350 million pages in order to distill fifty good resources related to quantum electrodynamics! Much of this index would never be used, but, burdened by the responsibility of maintaining this huge index, the crawler would not be able to preferentially and frequently refresh and further explore relevant regions of the web. It might be argued that a central crawl amortizes work across multiple topics. But our results (§4) suggest that topical web exploration is efficient enough for distributed deployment.
Our contributions: In this paper, we describe a Focused Crawler which seeks, acquires, indexes, and maintains pages on a specific set of topics that represent a relatively narrow segment of the web. It entails a very small investment in hardware and network resources and yet achieves respectable coverage at a rapid rate, simply because there is relatively little to do. Thus, web content can be managed by a distributed team of focused crawlers, each specializing in one or a few topics. Each focused crawler will be far more nimble in detecting changes to pages within its focus than a crawler that is crawling the entire web. The focused crawler is guided by a classifier which learns to recognize relevance from examples embedded in a topic taxonomy, and a distiller which identifies topical vantage points on the web. We describe the architecture in §3 and our experiences in §4.
Eventually, our goal is to impose sufficient topical structure on the web so that powerful semi-structured query, analysis, and discovery are enabled. Here are some compelling examples:
Discovering linkage sociology:
Is there a hyperlink between the web page of a speed trap (traffic radar) maker and an auto insurance company? Apart from other bicycling pages, what topics are prominent in the neighborhood of bicycling pages? (First aid is one answer found by our system.)
Locating specialty sites:
Getting isolated pages, rather than comprehensive sites, is a common problem with web search. Now we can order sites according to the density of relevant pages found there. E.g., we can find the top five sites specializing in mountain biking. 数据挖掘交友
Semi-supervised learning:
Human-supervised topic learning yields very high-quality filtering, but needs labor-intensive training. Finding specialty sites can quickly generate large amounts of additional training data with little effort. 数据挖掘实验室
Detecting community culture:
Simple statistics about the link graph reveal important information about the community of the focused topic, e.g., whether it is competitive or collaborative (§4), the typical time taken by a good resource to become popular, etc.
数据挖掘交友
Estimating community timescales:
Simple queries can identify topical regions of the web that grow or change dramatically as against those that are relatively stable. This can be of great value to the web ontologists at Yahoo! or The Mining Company.
数据挖掘实验室
There is much awareness that for serious web users, focused portholes are more useful than generic portals6: ``The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical-and useful-than trying to cover the entire universe′′ [16]. A focused crawler is an example-driven automatic porthole-generator. In a companion paper [8] we have proposed new HTTP infrastructure to support bidirectional hyperlinks to facilitate exploration of fine-grained communities. We feel that the ability to focus on a topical subgraph of the web, as in this paper, together with the ability to browse communities within that subgraph, will lead to significantly improved web resource discovery.
数据挖掘交友
2 Focused crawler administration
Central to a focused crawler is a canonical topic taxonomy with examples. To run a specific instance, initial human input has to be provided in two forms. The user has to select and/or refine specific topic nodes in the taxonomy, and may also need to provide additional example URLs which serve as starting points for the crawl. In this section we give a user′s view of the system. 数据挖掘研究院
2.1 Operation synopsis
Canonical taxonomy creation:
When the system is built, the classifier is pre-trained with a canonical taxonomy (such as Yahoo!, The Open Directory Project, The Virtual Library or The Mining Co.) and a corresponding set of examples. The canonical (coarse-grained) classification tree is part of the initial system.
Example collection:
The user collects URLs that are examples of her interest. These are submitted to the system, e.g., by importing her bookmarks file. 数据挖掘研究院
Taxonomy selection and refinement:
The system proposes the most common classes where the examples fit best. The user can choose and mark some of these classes as good. Sometimes, the user may find the taxonomy too coarse, and refine some categories and move documents from one category to another. 数据挖掘实验室
Interactive exploration:
The system also proposes additional URLs in a small neighborhood of the examples, that appear to be similar to the examples. (This can be regarded as a slow-speed, interactive startup phase.) The user may inspect and include some of these as examples. The steps thus far are illustrated in Figure 1(a).
Training:
The classifier integrates the refinements made by the user into its statistical class models.
数据挖掘论坛
Resource discovery:
At this stage the system is ready to perform resource discovery as described in the rest of the paper. 数据挖掘交友
Distillation:
Intermittently and/or concurrently, the system runs a topic distillation algorithm to identify pages containing large numbers of relevant resource links, called hubs. The (re)visit priorities of these pages and immediate neighbors are raised.
数据挖掘交友
Feedback:
Typically, the user inspects the system regularly. The system reports the most popular sites and resource lists, and the user can give feedback by marking them as useful or not. This feedback goes back to the classifier and distiller. 数据挖掘研究院
(a)
(b) 数据挖掘研究院
数据挖掘交友
Figure 1: Focused crawler administration and monitoring. (a) A sample session for configuring a crawl for ``recreational bicycling′′ resources. (b) Applet for monitoring the recent relevant page acquisition rate of the focused crawler. 数据挖掘交友
The user collects examples by browsing. The applet shown in Figure 1(a) monitors the page being rendered by the browser. Using the Classify menu, the user can make the classifier route the current page to the few best matching nodes in the category tree, marking all nodes on the way. After sufficient browsing, all marked nodes are presented to the user as candidates for focused crawling. The user selects some nodes and selects them, thereby marking them good. These are shown highlighted in the tree view. E.g., for the topic of ``recreational bicycling,′′ two subtrees were found to be good choices. One (/Recreation/Sports/Cycling) is shown. The other was /Business/Companies/Sports/Cycling. Example sites that the master category system already knows about are shown in the upper right panel and can be viewed through a browser by clicking. When such a page is visited, the applet shows URLs of pages in the neighborhood of the example whose titles have many words in common with the most distinctive words of the topic [5,8]. Any pages thus found useful can also be added to the examples by dragging and dropping.
Sometimes the user may feel that the leaf nodes to which her examples have been assigned are still too broad and need to be refined. The tree view interface lets her create and move directories and populate them with examples. If major changes are made to the master category tree, some time is needed for the classifier to integrate the new structure into its models [5]. For our testbed with about 260,000 documents from Yahoo!, this takes a few hours. Smaller changes, such as moving of documents while keeping the tree unchanged, are interactive. 数据挖掘交友
At this stage, the focused crawler can be started. It is a complex system which not only crawls tens of thousands of pages per hour, but makes decisions based on millions of arithmetic calculations per page. It is thus quite helpful for diagnostic purposes to visualize the status of the crawl graphically. We have developed an applet that maintains a plot of page relevance against time. In Figure 1(b), each red dot is a web page, which may be viewed in a browser window by clicking on the dot. The x-axis represents time. The y-axis is the relevance score (a probability value) between zero and one. The blue line is a smoothed moving average over a recent window of pages fetched. Continual refreshing introduces new points at the right edge, and the display scrolls the leftmost points off the screen. 数据挖掘工具
If the page acquisition rate suddenly lowers, the right-to-left scrolling slows down. This can be made to raise an alarm (not implemented). Alternatively, the crawler may be getting many pages, but their relevance will be very low. The blue line will go down without significant recovery. This too can be made to raise an explicit alarm if necessary.
2.2 Justification and discussion
A different design is conceivable in which keyword search is used to locate an initial set of pages (using a giant crawl and index), expand this graph to a limited radius and then look for popular sites in the expanded graph using weighted degree measures [25,31,4,21,,3]. This approach was tried as a semi-automatic means to build a taxonomy like Yahoo!. For 966 topics picked from Yahoo!, keyword queries were constructed manually. E.g., the query for the topic /Business/Companies/Electronics/PowerSupply was +"power suppl*" ?witch* mode" smps -multiprocessor* üninterrupt* power suppl*" ups -parcel. Typically, several query refinements were needed to match the quality of Yahoo! in blind user tests. The resulting queries were complex (as above) compared to the average Alta Vista query [29]. The above experiment used an average of 7.03 query terms and 4.34 operators (+-*"); an average Alta Vista query has only 2.35 terms and 0.41 operators. Query construction is not a one-time investment, because as pages on the topic are discovered, their additional vocabulary must be folded in manually into the query for continued discovery.
Yet another design is possible in which the focused crawler only uses the examples found by the user, but does not use a taxonomy with the pre-packaged examples. E.g., we can set up a simple two-class learning problem (relevant/irrelevant) for each focus topic. However, we have a few reasons to believe that our approach is more promising.
数据挖掘实验室
Better modeling of the negative class: In the two-class learning of text, characterization of the the negative class (e.g. ``a page not about mutual funds′′) is often problematic. Using a taxonomy as we do, deals with this problem by describing the negative class as a union of positive classes. This is not merely a mental artifice, but it also affects the accuracy of learning algorithms significantly [26], because commonly used statistical models have large estimation errors on the diverse negative class. 数据挖掘论坛
Reuse of classifier training effort: Learning to recognize text documents is made difficult by the large dimensionality and consequent sparsity. It may be burdensome for every user to prepare enough sample documents to have an adequate number of positive and negative examples for learning her interest profile. The work of mapping the user′s interest onto a predefined set of categories, refining them when needed, will usually be significantly less than finding an adequate number of positive and negative examples. One may even envisage that a standards organization will design many ``backbone taxonomies′′ for smaller groups to fill in and refine. A similar procedure is espoused for maintaining many web directories.
数据挖掘研究院
Discovery of related classes: The framework of a master taxonomy helps the user detect additional regions of the web that are topically related to her interest which were not naturally connected with her start set. As we shall see later, the focused crawler is quick to suggest that crawling only on mutual funds while forbidding investment in general will not work well, because the neighborhoods of these two topics comingle. It is able to do this because it can classify the web pages it encounters into other categories from the taxonomy, which would not be possible if the binary-classification approach were used. 数据挖掘研究院
Why is this significant? In addition to teaching the user about the relationship of her interests to other topics on the web, this capability is important for diagnostic purposes. In the mutual funds example, it is better to broaden the set of categories to those that provide a minimal covering of the interest topics, because doing so provides a higher degree of linkage, which means many more available paths for finding relevant pages. In such a scenario the crawling-priority relevance score and the final (for displaying to the user) relevance score will be determined differently. A natural way to expand the topic set for this purpose is to add some of the parents and/or siblings of the relevant topics from the taxonomy-another advantage over binary classification.
资料全文下载 数据挖掘研究院