Crawling the Hidden Web

Keywords: Crawling, Hidden Web, Content extraction, HTML Forms

1 Introduction

A number of recent studies [4, 19, 20] have noted that a tremendous 
amount of content on the Web is dynamic. This dynamism takes a number of 
different forms (see Section 2). For instance, web pages can be 
dynamically generated, i.e., a server-side program creates a page after 
the request for the page is received from a client. Similarly, pages can 
be dynamic because they include code that executes on the client machine 
to retrieve content from remote servers (e.g., a page with an embedded 
applet that retrieves and displays the latest stock information).

Based on studies conducted in 1997, Lawrence and Giles [19] estimated 
that close to 80% of the content on the Web is dynamically generated, 
and that this number is continuing to increase. As major software 
vendors come up with new technologies [2, 17, 26] to make such dynamic 
page generation simpler and more efficient, this trend is likely to 
continue.

However, little of this dynamic content is being crawled and indexed. 
Current-day search and categorization services cover only a portion of 
the Web called the publicly indexable Web [19]. This refers to the set 
of web pages reachable purely by following hypertext links, ignoring 
search forms and pages that require authorization or prior registration.

In this paper, we address the problem of crawling a subset of the 
currently uncrawled dynamic Web content. In particular, we concentrate 
on extracting content from the portion of the Web that is hidden behind 
search forms in large searchable databases (the so called Hidden Web1 
[11]). The hidden Web is particularly important, as organizations with 
large amounts of high-quality information (e.g., the Census Bureau, 
Patents and Trademarks 1The term Deep Web has been used in reference [4] 
to refer to the same portion of the Web.

1


Office, News media companies) are placing their content online. This is 
typically achieved by building a Web query front-end to the database 
using standard HTML form elements [12]. As a result, the content from 
these databases is accessible only through dynamically generated pages, 
delivered in response to user queries.

Crawling the hidden Web is a very challenging problem for two 
fundamental reasons. First is the issue of scale; a recent study [4] 
estimates that the size of the content available through such searchable 
online databases is about 400 to 500 times larger than the size of the 
?static Web.ö As a result, it does not seem to be prudent to attempt 
comprehensive coverage of the hidden Web. Second, access to these 
databases is provided only through restricted search interfaces, 
intended for use by humans. Hence, ?trainingö a crawler to use this 
restricted interface to extract relevant content, is a non-trivial 
problem.

To address these challenges, we propose a task-specific, human-assisted 
approach to crawling the hidden Web. Specifically, we aim to selectively 
crawl portions of the hidden Web, extracting content based on the 
requirements of a particular application, domain, or user profiles. In 
addition, we provide a framework that allows the human expert to 
customize and assist the crawler in its activity.

Task-specificity helps us counter the issue of scale. For example, a 
marketing analyst may be interested in news articles and press releases 
pertaining to the semiconductor industry. Similarly, a military analyst 
may be interested in political information about certain countries. The 
analysts can use existing search services to obtain URLs for sites 
likely to contain relevant information, and can then instruct the 
crawler to focus on those sites. In this paper we do not directly 
address this resource discovery problem per se; see Section 7 for 
citations to relevant work. Rather, our work addresses the issue of how 
best to automate content retrieval, given the location of potential 
sources.

Human-assistance is critical to enable the crawler to submit queries on 
the hidden Web that are relevant to the application/task. For example, 
the marketing analyst may provide lists of products and companies that 
are of interest, so that when the crawler encounters a form requiring 
that a ?companyö or a ?productö be filled-in, the crawler can 
automatically fill in many such forms. Of course, the analyst could have 
filled out the forms manually, but this process would be very laborious. 
By encoding the analyst′s knowledge for the crawler, we can speed up the 
process dramatically. Furthermore, as we will see, our crawler will be 
able to ?learnö about other potential company and product names as it 
visits pages, so what the analyst provides is simply an initial seed 
set.

As the crawler submits forms and collects ?hidden pages,ö it saves them 
in a repository (together with the queries that generated the pages). 
The repository also holds static pages crawled in a conventional 
fashion. An index can then be built on these pages. Searches on this 
index can now reveal both hidden and static content, at least for the 
targeted application. The repository can also be used as a cache. This 
use is especially important in military or intelligence applications, 
where direct Web access may not be desirable or possible. For instance, 
during a crisis we may want to hide our interest in a particular set of 
pages. Similarly, copies of the cache could be placed at sites that have 
intermittent net access, e.g., a submerged submarine. Thus, an analyst 
on the submarine could still access important ?hiddenö pages while 
access is cut off, without a need to submit queries to the original 
sources.
At Stanford, we have built a prototype hidden Web crawler called HiWE 
(Hidden Web Exposer). Using our experience in designing and implementing 
HiWE, we make the following contributions in this paper:

2


- We first present a systematic classification of dynamic content along 
two dimensions that are most relevant to crawling; the type of dynamism 
and the generative mechanism. This helps place our work in the overall 
context of crawling the Web. (Section 2)

- We propose model of forms and form fill-outs that succinctly captures 
the actions that the crawler must perform, to successfully extract 
content. This helps cast the content extraction problem as one of 
identifying the domains of form elements and gathering suitable values 
for these domains. (Section 3)

- We describe the architecture of the HiWE crawler and describe various 
strategies for building (domain, list of values) pairs. We also propose 
novel techniques to handle the actual mechanics of crawling the hidden 
Web (such as analyzing forms and deducing the domains of form elements). 
(Sections 4 and 5)

- Finally, we present proof-of-concept experiments to demonstrate the 
effectiveness of our approach and techniques. (Section 6)

Note that crawling dynamic pages from a database becomes significantly 
easier if the site hosting the database is cooperative. For instance, a 
crawler might be used by an organization to gather and index pages and 
databases on it′s local intranet. In this case, the web servers running 
on the internal network can be configured to recognize requests from the 
crawler and in response, export the entire database in some predefined 
format. This approach is already employed by some e-commerce sites, 
which recognize requests from the crawlers of major search engine 
companies and in response, export their entire catalog/database for 
indexing.

In this paper, we consider the more general case of a crawler visiting 
sites on the public Internet where such cooperation does not exist. The 
big advantage is that no special agreements with visited sites are 
required. This advantage is especially important when a ?competitor′sö 
or a ?unfriendly country′sö sites are being studied. Of course, the 
drawback is that that the crawling process is inherently imprecise. That 
is, an automatic crawler may miss some pages or may fill our some forms 
incorrectly (as we will discuss). But in many cases, it will be better 
to index or cache a useful subset of hidden pages, rather than having 
nothing.

2 Classifying Dynamic Web Content

Before attempting to classify dynamic content, it is important to have a 
well-defined notion of a dynamic page. We shall adopt the following 
definition in this paper:

A page P is said to be dynamic if some or all of its content is 
generated at run-time (i.e., after the request for P is received at the 
server) by a program executing either on the server or on the client. 
This is in contrast to a static page P 0, where the entire content of P 
0 already exists on the server, ready to be transmitted to the client 
whenever a request is received.

Since our aim is to crawl and index dynamic content, our definition only 
encompasses dynamism in content, not dynamism in appearance or user 
interaction. For example, a page with static content, but containing 
client-side scripts and DHTML tags that dynamically modify the 
appearance and visibility of objects on the page, does not

3


satisfy our definition. Below, we categorize dynamic Web content along 
two important dimensions: the type of dynamism, and the mechanism used 
to implement the dynamism.

2.1 Categorization based on type of dynamism

There are three common reasons for making Web content dynamic: 
time-sensitive information, user customization, and user input. This in 
turn, leads to the following three types of dynamism:

Temporal dynamism: A page containing time-sensitive dynamic content 
exhibits temporal dynamism. For example, a page displaying stock tickers 
or a list of the latest world news headlines might fall under this 
category. By definition, requests for a temporally dynamic page at two 
different points in time may return different content.2

Current-day crawlers do crawl temporally dynamic pages. The key issue in 
crawling such pages is freshness [7], i.e., a measure of how up to date 
the crawled collection is, when compared with the latest content on the 
web sites. The analyses and crawling strategies presented by Cho et. al. 
[6, 7], to maximize freshness, are applicable in this context.

Client-based dynamism: A page containing content that is custom 
generated for a particular client (or user) exhibits client-based 
dynamism. The most common use of client-based dynamism is for 
personalization. Web sites customize their pages (in terms of look, 
feel, behavior, and content) to suit a particular user or community of 
users. This entails generating pages on the fly, using information from 
client-side cookies or explicit logins, to identify a particular user.

Since pages with client-based dynamism have customized content, crawling 
such pages may not be useful for applications that target a 
heterogeneous user population (e.g., a crawler used by a generic Web 
search engine). However, for certain applications, a restricted crawler3 
can be equipped with the necessary cookies or login information (i.e., 
usernames and passwords) to allow it to crawl a fixed set of sites.

Input dynamism: Pages whose content depends on the input received from 
the user exhibit input dynamism. The prototypical example of such pages 
are the responses generated by a web server in response to form 
submissions. For example, a query on an online searchable database 
through a form generates one or more pages containing the search 
results. All these result pages fall under the category of input 
dynamism. In general, all pages in the hidden Web exhibit input 
dynamism. In this paper, our focus will be on crawling such pages.

Note that many dynamic pages exhibit a combination of the above three 
classes of dynamism. For instance, the welcome page on the Amazon web 
site [1] exhibits both client-based (e.g., book recommendations based on 
the user profile and interests) and temporal dynamism (e.g., latest 
bestseller list).

In addition, there are other miscellaneous sources of dynamism that do 
not fall into any of the above categories. For example, tools for web 
site creation and maintenance [10, 22] often allow the content to be 
stored on the server 2Note that simply modifying the content of a static 
page on the web server does not constitute temporal dynamism since our 
definition requires that a dynamic page be generated by a program at 
run-time.
3We use the term restricted crawler to refer to a crawler that limits 
it′s crawling activity to a specific set of sites.

4


Figure 1: Classifying Web content based on impact on crawlers

in native databases and text files. These tools provide programs to 
generate HTML-formatted pages at run-time from the raw content, allowing 
for clean separation between content and presentation. In this scenario, 
even though pages are dynamically generated, the content is 
intrinsically static.

2.2 Categorization based on generative mechanism

There are a number of mechanisms and technologies that assist in the 
creation of dynamic Web content. These mechanisms can be divided into 
the following three categories:

- Server-side programs: In this technique, a program executes on the 
server to generate a complete HTML page which is then transmitted to the 
client. This is the oldest and most commonly used method for generating 
web pages on the fly. A variety of specifications are available (e.g., 
Common Gateway Interface (CGI), Java servlets [26]) to control the 
interactions between the web server and the program generating the page. 
Such server-side programs are most often used to process and generate 
responses to form submissions (i.e., to implement input dynamism).

- Embedded code with server-side execution: In this technique, dynamic 
web pages on the server contain both static HTML text and embedded code 
snippets. When a request for this page is received, the code snippets 
execute on the server and generate output that replaces the actual code 
in the page. Unlike serverside programs which produce a complete HTML 
page as output, these code snippets generate only portions of the page. 
Different scripting languages can be used to implement the code snippets 
[2, 17, 25].

- Embedded code with client-side execution: As in the previous case, web 
pages contain both HTML text and embedded code (or references to 
wherever the code is available). However, the code is now downloaded and 
executed on the client machine, typically in a controlled environment 
provided by the browser. Java applets and ActiveX controls are examples 
of technologies that support this mechanism.

5


Figure 2: Sample labeled form

Pages that employ server-side programs or embedded code with server-side 
execution do not pose any special challenges to a crawler, once the page 
has been received. In both cases, the crawler merely receives HTML pages 
that it can process in the same way that it processes static content. 
However, pages that use client-side execution to pull in content from 
the server, require special environments (e.g., a Java virtual machine) 
in which to execute the embedded code. Equipping a crawler with the 
necessary environment(s) greatly complicates it′s design and 
implementation. Since pages in the hidden Web are usually generated 
using the first two techniques, we do not address the third technique 
any further in this paper.

Figure 1 summarizes the classification that we have presented in this 
section. The vertical axis lists the different generative mechanisms and 
the horizontal axis, the different types of content. Various portions of 
the Web (and their corresponding crawlers) have been represented as 
regions in this 2-dimensional grid.

3 Modeling Forms and Form Submissions

The fundamental difference between the actions of a hidden Web crawler, 
such as HiWE, and a traditional crawler is in the way they treat pages 
containing forms. In this section, we describe how we model forms and 
form submissions. Later sections will describe how HiWE uses this model 
to extract hidden content.

3.1 Modeling Forms

A form, F , is modeled as a set of (element; domain) pairs; F = f(E1; 
D1), (E2; D2), : : : (En; Dn)g where the Ei′s are the elements and the 
Di′s are the domains. A form element can be any one of the standard 
input objects: selection lists, text boxes, text areas, checkboxes, or 
radio buttons.4 The domain of an element is the set of values which can 
be associated with the corresponding form element. Some elements have 
finite domains, where the set of valid values are already embedded in 
the page. For example, if Ej is a selection list (indicated by the 
Document Type:  


Company Name:  


Sector:  Entertainment
Information Technology
Automobile
Construction
Figure 3: HTML markup for the sample form of Figure 2 text boxes have infinite domains (e.g., set of all text strings) from which their values can be chosen. In addition, many form elements are usually associated with some descriptive text to help the user understand the semantics of the element. We shall refer to such descriptive information as labels, and shall use label(Ei) to denote the label associated with the ith form element. Figure 2 shows a form with three elements and the corresponding representation using our notation. Figure 3 is the piece of HTML markup that was used to generate this form. We wish to emphasize that our notion of labels and domain values is quite distinct from the internal labels5 and values6 used within the form. For instance, referring to Figures 2 and 3, note that label(E1) is ?Document Typeö, and not the internal label (?whatö) of the or element 7 categories, with their associated values. In Section 5.5, we describe various data sources (including humans) from which the crawler can obtain such values. In addition, we also show how the crawler can use its own crawling experience to add to these lists of values. However, not all data sources are equally reliable. For instance, the crawler has more confidence in the usefulness of human-supplied values than it has, in the values it gathers based on it′s crawling experience. To model values and their confidence, inputs from all these sources are organized in a table called the Label Value Set (LVS) table. Each entry (or row) in the LVS table is of the form (L; V ), where L is a label and V = fv1, v2, : : : vng is a fuzzy/graded set [14] of values belonging to that label. Fuzzy set V has an associated membership function MV that assigns weights/grades, in the range [0; 1], to each member of the set. Intuitively, each vi represents a value that could potentially be assigned to an element E if label(E) ?matchesö L. MV (vi) represents the crawler′s estimate of how useful/correct, the assignment of vi to E, is likely to be. The LVS table also supports the notion of label aliasing, i.e., two or more labels are allowed to share the same fuzzy value set. This helps us handle aliases and synonyms representing the same concept (e.g., ?Companyö and ?Organizationö). 3.3 Generating Value Assignments Given a form F = f(E1; D1), (E2; D2), : : : (En; Dn)g, the crawler generates value assignments by textually matching element labels with labels in the LVS table. Specifically, for each Ei, we generate a fuzzy set of values Vi as follows: - If Ei is an infinite domain element and (L; V ) denotes the LVS entry whose label L most closely matches label(Ei) (see Section 5.3 for details),7 then Vi = V and MVi = MV . - If Ei is a finite domain element, then Vi = Di and MVi(x) = 1; 8x - Vi. Then, V alAssign(F; LV S) = V1 ?V2? : : : Vn denotes the set of all possible value assignments for form F , given the current content of the LVS table. Let Smax denote the maximum number of times a crawler is allowed to submit a given form.8 By imposing an upper bound on the number of submissions per form, we ensure that the crawler does not spend all it′s time at a single form and instead, extracts the most relevant content from all the databases that it visits. In particular, the crawler chooses the ?bestö minfSmax; jV1j ? : : : jVnjg value assignments to generate form submissions. The notion of the ?bestö value assignment is based on ranking all the value assignments in V alAssign(F; LV S). We experimented with three different ranking functions (below, ? represents the ranking function and fE1 v1; : : : ; En vng denotes a value assignment that associates value vifflVi with element Ei): Fuzzy Conjunction: The rank of a value assignment is the minimum of the weight/grade of each of the constituent values. This is equivalent to treating the value assignment as a standard Boolean conjunction of the individual fuzzy sets [14]. 7With approximate matching, each element label could be matched with a set of labels in the LVS table. For now, we assume that only the ?bestö match is used. 8This value is a parameter that we pass to the crawler at startup. 8 Figure 4: Comparing the basic execution loop of a traditional crawler and HiWE ?fuz(fE1 v1; : : : ; En vng) = min i=1:::n MVi(vi) Average: The rank of a value assignment is the average of the weights of the constituent values. ?avgfE1 v1; : : : ; En vng) = 1 n X i=1:::n MVi(vi) Probabilistic: This ranking function treats weights as probabilities. Hence MVi(vi) is the likelihood that the choice of vi is useful and 1 ? MVi(vi) is the likelihood that it is not. Then, the likelihood of a value assignment being useful is computed as: ?prob(fE1 v1; : : : ; En vng) = 1 ? Y i=1:::n (1 ?MVi(vi)) Note that ?fuz is very conservative in assigning ranks. It assigns a high rank for a value assignment only if each individual weight is high. The average is less conservative, always assigning a rank which is at least as great as the rank of the fuzzy conjunction for the same value assignment. In contrast, the ?prob is more aggressive and assigns a low rank to a value assignment only if all individual weights are very low. Section 6 presents more detailed experiments comparing these ranking functions. 4 HiWE: Hidden Web Exposer The basic actions of a hidden Web crawler, such as HiWE, are similar to those of other traditional crawlers [5, 8]. In Figure 4, the flowchart on the left indicates the typical crawler loop, consisting of URL selection, page retrieval, and page processing to extract links. Note that traditional crawlers do not distinguish between pages with and 9 Figure 5: HiWE Architecture without forms. However, as shown in the flowchart on the right, HiWE′s execution sequence contains additional steps for pages on which forms are detected. Specifically, HiWE performs the following sequence of actions for each form on a page: 1. Form Analysis: Parse and process the form to build an internal representation, based on the model outlined in Section 3.1. 2. Value assignment and submission: Generate the best (untried) value assignment and submit a completed from using that assignment. 3. Response Analysis: Analyze the response page to check if the submission yielded valid search results or if there were no matches. This feedback could be used to tune the value assignments in step 2. 4. Response Navigation: If the response page contains hypertext links, these are followed immediately (except for links that have already been visited or added to the queue) and recursively, to some pre-specified depth. Note that we could as well have added the links in the response page to the URL queue. However, for ease of implementation, in HiWE, we chose to navigate the response pages immediately, and that too, only upto a depth of 1. Steps 2, 3, and 4 are executed repeatedly, using different value assignments during each iteration. The sequence of value assignments are generated using the model described in Section 3.3. 4.1 HiWE Architecture Figure 5 illustrates the complete architecture of the HiWE crawler. It includes six basic functional modules and two internal crawler data structures. The basic crawler data structure is the URL List. It contains all the URLs that the crawler has discovered so far. When starting up the crawler, the URL List is initialized to a seed set of URLs. The Crawl Manager controls the entire crawling process. It decides which link to visit next, and makes the network connection to retrieve the page from the Web. In our implementation, the crawler was configured to stay 10 Figure 6: Actions of the Form Analysis Module within a pre-determined set of target sites (provided to the Crawl Manager at startup), not following links that pointed to other sites. The Crawl Manager hands the downloaded page over to the Parser module. In turn, the Parser extracts hypertext links from the page and adds them to the URL List structure. This sequence of operations is repeated until some termination condition (typically, after some number of hours have elapsed) is satisfied. We refer the reader to existing crawling literature [6, 8] for more details on the design of the Crawl Manager module. To process forms and extract hidden content, HiWE employs four additional modules and the LVS Table. The Form Analyzer, Form Processor, and Response Analyzer modules, together implement the iterative sequence of steps outlined in the previous section.9 The LVS Manager is responsible for managing additions and accesses to the LVS table. It provides an interface for various application-specific data sources to supply new entries to the table. We shall discuss how this happens in Section 5.5. 5 Design Issues and Techniques 5.1 Form Analysis HiWE′s representation of a form includes the following information: - A list of all the elements (e.g., selection lists, text boxes) in the form - A label for each element - For every element with a finite domain, a list of all the values that constitute that domain - Submission information (such as the submission method (GET or POST) and the submission URL) to be used when submitting completed forms To collect all this information, the Form Analyzer executes the sequence of steps indicated in Figure 6. It begins by constructing a logical tree representation of the structure of the HTML page, based on the Document Object Model (DOM) specification [9]. Next, it uses the DOM API to obtain the list of form elements as well as the necessary submission information. We refer the reader to the DOM specification [9] for details on how this can 9The Form Processor is responsible for the Response Navigation step. 11 be done. Then, the Form Analyzer uses the technique described in the remainder of this section to extract labels and domain values. Finally, it normalizes the extracted information (see Section 5.2) and integrates it with the information from the DOM API to produce the internal form representation. Label and Domain value extraction: Accurately extracting labels and domain values proves to be a hard problem, since their nesting relationships with respect to the form elements is not fixed. For instance, in Figure 3, as is commonly the case, the entire form is laid out within a table. The pieces of text representing the labels (e.g., the word ?Sectorö), the domain values (e.g., the word ?Automobileö), and the form control elements (e.g., the and element. 11Note that text pieces containing more than a few words (6 in our default crawler configuration) are ignored, as most labels are either short words or short phrases. 12We observed that most forms place labels either to the left or above the form widget. 12 Figure 7: Pruning before partial layout Note that to calculate visual adjacency, it is not necessary to completely layout the page. The location of the form with respect to the rest of the page is not relevant. Hence, we first prune the original page and isolate only those elements that directly influence the layout of the form elements and the labels. For instance, consider Figure 7, which shows the tree-structured representation of two different pages, one in which the FORM is directly embedded in the main body and another in which it is embedded within a table. The pruned tree is constructed by using only the subtree below the FORM element and the nodes on the path from the FORM to the root. In addition, the layout need not be `perfect′; in fact, our implementation uses a simple custom layout engine that discards images, ignores font sizes, uses a default font, ignores styling information such as bold or italics (except to break up ties as mentioned above), and ignores any associated style sheets. Our experiments in Section 6 indicate that visual adjacency is a very robust and effective strategy for extracting labels and domain values. Incidentally, in [18] we study and evaluate other techniques for matching labels to input elements. The techniques of [18] were developed in a different context than ours, for displaying forms on small hand-held devices. 5.2 Normalization When generating a value assignment, pieces of text (i.e., the labels and domain values) extracted from a HTML page must be matched with other pieces of text stored in the LVS table. To ensure that spurious differences do not result in missed matches, all these text pieces are subjected to a normalization process. The Form Analyzer normalizes the extracted labels and values whereas the LVS manager normalizes entries in the LVS table. Normalization consists of the following sequence of steps: - To counter possible errors during extraction, the extracted pieces of text are searched for HTML tags and HTML entity references. Any such tags and entity references are removed. - Next, all characters other than alphanumeric characters are replaced by a space character. - Uppercase characters, if any, are converted to their lower case equivalents. - Stop words [13], if any, are removed. 13 - Finally, each word in the resulting text is stemmed, using the standard Potter suffix-stripping algorithm [13]. 5.3 Form Processing There are two main issues in the design of the Form Processor module in Figure 5: choosing an algorithm for matching element labels with labels in the LVS table, and deciding whether or not a form must be processed. Matching labels: Recall that once a label is found in a form, we must obtain ?reasonableö values to fill-in the corresponding input element, so that we can submit a completed form. For example, if we find a label ?Enter State,ö we want to search our LVS table for some domain whose name is `similar,′ e.g., ?State.ö Once we find a good domain in the LVS table, we can use the values associated with it (e.g., ?Arizonaö, ?Californiaö, etc.) to fill-in the element labeled ?Enter State.ö To match form labels to labels in the LVS table, we employ an approximate string matching algorithm. There is a large body of work in the design and analysis of such string matching algorithms. Our choice was based on the ability of the algorithm to account for two things: typing errors and word reorderings. Typing errors can be captured by the standard string matching notion of edit distance, which measures the minimum number of insertions, deletions, and character replacements required to transform one string to another (e.g., edit-distance(house, hose) = 1). However, word reorderings requires a new distance measure so that two labels such as ?Company Typeö and ?Type of Companyö (these become ?company typeö and ?type companyö after normalization) are identified as being very `close′ to each other. The block edit models proposed in [21], succinctly represent both typing errors and word reorderings. These models define the concept of block edit distance; a generalization of the traditional notion of edit distances to handle block/word movements. We used one of a family of algorithms from [21] to implement our label matching system based on the block edit model. We match a form element E to an LVS entry by minimizing the block edit distance between their labels, subject to a threshold. Specifically, let edb(A; B) denote the block edit distance between strings A and B; let ö be a threshold block edit distance beyond which matches are discarded; and let distmin = min(L;V ) - LV S fedb(label(E); L)g represent the minimum block edit distance. Then, Match(E), the matching LVS entry is computed as follows: Match(E) = nil, if distmin > ö = (L0; V 0) such that edb(label(E); L0) = distmin, otherwise. Ignoring forms: As the crawler processes one page after another, it is likely to encounter forms which are not directly relevant to the task of extracting content from databases (e.g., a form for local site search). In addition, if the crawler′s LVS table does not contain matching labels, even relevant forms may have to be ignored. HiWE uses the following policy to decide whether a form must be ignored or submitted: 14 A form F = f(E1; D1), (E2; D2), : : : (En; Dn)g is submitted iff (i) n >= ff, and (ii) for each i such that Di is infinite, Match(Ei) is non-nil (i.e., there is a matching LVS entry). Here, ff is a configurable parameter that represents the smallest form size that the crawler will process. For instance, if ff > 1, the crawler will ignore single element forms (e.g., a form with just a simple search box). Note that we ignore a form if we are unable to associate a matching LVS entry for every infinite domain element in the form. However, forms may not always require all inputs to be provided. For example, a form that searches a `book catalog′ may allow the user to enter either an author, a title, or both. We do not consider such partial form inputs in our model. 5.4 Response Analysis The aim of response analysis is to automatically distinguish between a response page13 that contains search results and one that contains an error message, reporting that no matches were found for the submitted query. The idea is to use this information to tune the crawler′s value assignment strategy. Response analysis turns out to be a very challenging problem, for a number of reasons: - The absence of a `standard′ or commonly accepted format for reporting such errors means that each web site is free to use a custom error notification message (e.g., ?No matching results,ö ?0 results found,ö etc.). - Error pages often include a variety of other textual content (such as site maps, titles, headers, footers, code snippets, etc.) besides the actual error message. Therefore, even if the text of the error message were known, simply searching the page for matching text could lead to false drops. - Finally, many forms are often associated with multiple types of error messages, with the web server choosing between them based on some pre-programmed logic. To tackle these challenges, HiWE′s response module uses a technique based on identifying the `significant portion′ of the response page (i.e., the portion of the page obtained after discarding headers, footers, side bars, site maps, menus, etc.) To identify the significant portion, we use the heuristic that when the page is laid out, the significant portion will be visually in the middle of the page. Two cases arise: - If the response page is formatted using frames, we use information about the frame sizes to layout the page and identify the center-most frame. Then, we retrieve the center-most frame′s contents, and treat that content as the significant portion of the response page. - If the response page does not use frames, we use our custom layout engine (Section 5.1) to first identify the HTML element E that is visually laid out at the center of the page. We also parse the page to construct it′s DOM representation and locate E in the DOM tree. If E is present in the subtree of a
数据挖掘论坛
[数据挖掘专家] [数据挖掘研究院] [数据挖掘论坛] [数据挖掘实验室]
上一篇:Web Content Mining
下一篇:Collaborative Filtering Bibliography
最新评论共有 0 位网友发表了评论 , 查看所有评论
发表评论( 不能超过250字,需审核,请自觉遵守互联网相关政策法规。 )
匿名?
数据挖掘网站导航 数据挖掘论坛导航
  • 数据挖掘工具
  • 数据挖掘论坛
  • DataCruncher - Cognos
  • MineSet - MathSoft
  • Intelligent Miner - GainSmarts
  • Sqlserver - SAS - Clementine
  • CART - Weka - WizSoft
  • NeuroShell - ModelQuest
  • data mining tools - Darwin
  • 数据挖掘交友
  • 数据挖掘博客
  • 数据挖掘工具
  • 数据挖掘资源
  • 数据挖掘技术算法
  • 数据挖掘相关期刊、会议
  • 研究院联盟合作专区
  • 数据挖掘基础与相关技术
  • 数据挖掘厂商与就业
  • 数据挖掘研究者乐园
  • 知名厂商数据挖掘工具资料
  • 国内数据挖掘实验室
  • Foreign Data Mining Lab
  • 热点关注
  • Web数据挖掘的研究现状及发展
  • Web数据挖掘技术综述
  • 百度申请精确广告专利 欲抑制Google步伐
  • Web数据自动采集及其应用研究
  • 信息安全中的数据挖掘
  • 面向Web的数据挖掘
  • Extended Log File Format
  • 基于XML的Web数据挖掘在数字图书馆中的应用
  • XML与Web数据挖掘
  • Web数据挖掘
  • 论坛最新话题
  • Foundations of Statistical Natural Langu
  • Game Theory meet Data Mining: A Recent P
  • System Building: How does it help or hin
  • 数据挖掘与Clementine培训
  • 新手报到
  • 求 SASEM 客户流失预测分析
  • 数据挖掘工程师/搜索研究院—北京——无线
  • 数据挖掘入门介绍(如何着手数据挖掘)
  • Information Overload Survey Results
  • The INEX 2005 Workshop on Element Retrie
  • 相关资讯
  • Any Extract (AE) 网站在线编辑
  • 信息安全中的数据挖掘
  • 基于XML的Web数据挖掘在数字图书馆中的应用
  • Web数据挖掘技术综述
  • Web数据挖掘
  • 北大计算机所万小军博士接连在国际一流学术
  • Refereed Papers on WWW2007
  • WWW2007 tutorials
  • WWW2007 workshops
  • Why ’08 Matters for the Web
  • 数据挖掘实验室资料
  • 数据挖掘博客地址
  • 数据挖掘实验室网站地址
  • Prepare for Medicare audits by using dat
  • 注册成为SAS用户与爱好者俱乐部会员
  • 水南梅
  • 明日烟
  • 新人报道
  • 下载
  • 厦门服务器托管,450元/月—0592-5177319 高
  • 买空间送域名--0592-5177319 高静