RSS
热门关键字:  数据挖掘  人工智能  数据仓库  搜索引擎  数据挖掘导论

Geographically Focused Collaborative Crawling

来源: 作者:unkonwn 时间:2004-12-06 点击:

Keywords

Collaborative crawling, geographically focused crawling, geographic entities

1 Introduction

While most of the current search engines are effective for pure keyword-oriented searches, these search engines are not fully effective for geographic-oriented keyword searches. For instance, queries like ``restaurants in New York, NY" or ``good plumbers near 100 milam street, Houston, TX" or ``romantic hotels in Las Vegas, NV" are not properly managed by traditional web search engines. Therefore, in recent years, there has been surge of interest within the search industry on the search localization (e.g., Google Local 1, Yahoo Local 2). The main aim of such search localization is to allow the user to perform the search according his/her keyword input as well as the geographic location of his/her interest.

Due to the current size of the Web and its dynamical nature, building a large scale search engine is challenging and it is still active area of research. For instance, the design of efficient crawling strategies and policies have been extensively studied in recent years (see [9] for the overview of the field). While it is possible to build geographically sensitive search engines using the full web data collected through a standard web crawling, it would rather be more attractive to build such search engines over a more focused web data collection which are only relevant to the targeted geographic locations. Focusing on the collection of web pages which are relevant to the targeted geographic location would leverage the overall processing time and efforts for building such search engines. For instance, if we want to build a search engine targeting those users in New York, NY, then we can build it using the web collection, only relevant to the city of New York, NY. Therefore, given intended geographic regions for crawling, we refer the task of collecting web pages, relevant to the intended geographic regions as geographically focused crawling. 数据挖掘研究院

The idea of focusing on a particular portion of the web for crawling is not novel. For instance, the design of efficient topic-oriented or domain-oriented crawling strategies has been previously studied [8,23,24]. However, there has been little previous work on incorporating the geographical dimension of web pages to the crawling. In this paper, we study various aspects of crawling when the geographical dimension is considered.

While the basic idea behind the standard crawling is straightforward, the collaborative crawling or parallel crawling is often used due to the performance and scalability issues that might arise during the real crawling of the web [12,19]. In a collaborative or parallel crawler, the multiple crawling nodes are run in parallel on a multiprocessor or in a distributed manner to maximize the download speed and to further improve the overall performance especially for the scalability of crawling. Therefore, we study the geographically focused crawling under the collaborative setting, in which the targeted geographic regions are divided and then assigned to each participating crawling node. More precisely, in a geographically focused collaborative crawler, there will be a set of geographically focused crawling nodes in which each node is only responsible for collecting those web pages, relevant to its assigned geographic regions. Furthermore, there will be additional set of general crawling nodes which aim to support other geographically focused crawling nodes through the general crawling (download of pages which are not geographically-aware). The main contributions of our paper are follows: 数据挖掘研究院

  1. We propose several geographically focused collaborative crawling strategies whose goal is to collect web pages about the specified geographic regions.
  2. We propose several evaluation criteria for measuring the performance of a geographically focused crawling strategy.
  3. We empirically study our proposed crawling strategies by crawling the real web. More specifically, we collect web pages pertinent to the top 100 US cities for each crawling strategy.
  4. We empirically study geographic locality. That is, pages which are geographically related are more likely to be linked compared to those which are not.

The rest of the paper is organized as follows. In Section 2, we introduce some of the previous works related to our geographically focused collaborative crawling. In Section 3, we describe the problem of geographically focused collaborative crawling and then we propose several crawling policies to deal with this type of crawling. In Section 4, we present evaluation models to measure the performance of a geographically focused collaborative crawling strategy. In Section 5, we present results of our experiments with the real web data. Finally, in Section 6, we present final remarks about our work.

2 Related Works

A focused crawler is designed to only collect web pages on a specified topic while transversing the web. The basic idea of a focused crawler is to optimize the priority of the unvisited URLs on the crawler frontier so that pages concerning a particular topic are retrieved earlier. Bra et al. [4] propose a focused web crawling method in the context of a client-based real-time search engine. Its crawling strategy is based on the intuition that relevant pages on the topic likely contain links to other pages on the same topic. Thus, the crawler follows more links from relevant pages which are estimated by a binary classifier that uses keyword and regular expression matchings. In spite of its reasonably acceptable performance, it has an important drawback as a relevant page on the topic might be hardly reachable when this page is not pointed by pages relevant to the topic.

Cho et al. [11] propose several strategies for prioritizing unvisited URLs based on the pages downloaded so far. In contrast to other focused crawlers in which a supervised topic classifier is used to control the way that crawler handles the priority of pages to be be downloaded, their strategies are based on considering some simple properties such as linkage or keyword information to define the priority of pages to be downloaded. They conclude that determining the priority of pages to be downloaded based on their PageRank value yield the best overall crawling performance.

数据挖掘研究院

Chakrabarti et al.[8] propose another type of focused crawler architecture which is composed of three components, namely classifier, distiller and crawler. The classifier makes the decision on the page relevancy to determine its future link expansion. The distiller identifies those hub pages, as defined in [20], pointing to many topic related pages to determine the priority of pages to be visited. Finally, the crawling module fetches pages using the list of pages provided by the distiller. In the subsequent work, Chakrabarti et al. [7] suggest that only a fraction of URLs extracted from a page are worth following. They claim that a crawler can avoid irrelevant links if the relevancy of links can be determined by the local text surrounding it. They propose alternative focused crawler architecture where documents are modeled as tag trees using DOM (Document Object Model). In their crawler, two classifiers are used, namely the ``baseline′′ and the ``apprentice′′. The baseline classifier refers to the module that navigates through the web to obtain the enriching training data for the apprentice classifier. The apprentice classifier, on the other hand, is trained over the data collected through the baseline classifier and eventually guides the overall crawling by determining the relevancy of links using the contextual information around them. 数据挖掘研究院

Diligenti et al. [14] use the context graph to improve the baseline best-first focused crawling method. In their approach, there is a classifier which is trained through the features extracted from the paths that lead to the relevant pages. They claim that there is some chance that some off-topic pages might potentially lead to highly relevant pages. Therefore, in order to mediate the hardness of identifying apparently off-topic pages, they propose the usage of context graph to guide the crawling. More precisely, first a context graph for seed pages is built using links to the pages returned from a search engine. Next, the context graph is used to train a set of classifiers to assign documents to different categories using their estimated distance, based on the number of links, to relevant pages on different categories. Their experimental results reveal that the context graph based focused crawler has a better performance and achieves higher relevancy compared to an ordinary best-first crawler. 数据挖掘研究院

Cho et al. [10] attempt to map and explore a full design space for parallel and distributed crawlers. Their work addresses issues of communication bandwidth, page quality and the division of work between local crawlers. Later, Chung and Clarke [12] study parallel or distributed crawling in the context of topic-oriented crawling. Basically, in their topic-oriented collaborative crawler, each crawling node is responsible for a particular set of topics and the page is assigned to the crawling node which is responsible for the topic which the page is relevant to. To determine the topic of page, a simple Naive-Bayes classifier is employed. Recently, Exposto et al. [17] study distributed crawling by means of the geographical partition of the web considering the multi-level partitioning of the reduced IP web link graph. Note that our IP-based collaborative crawling strategy is similar to their approach in spirit as we consider the IP-addresses related to the given web pages to distribute them among participating crawling nodes.

数据挖掘实验室

Gravano and his collaborators study the geographically-aware search problem in various works [15,18,5]. Particularly, in [15], how to compute the geographical scope of web resources is discussed. In their work, linkage and semantic information are used to assess the geographical scope of web resources. Their basic idea is as follows. If a reasonable number of links pertinent to one particular geographic location point to a web resource and these links are smoothly distributed across the location, then this location is treated as one of the geographic scopes of the corresponding web resource. Similarly, if a reasonable number of location references is found within a web resource, and the location references are smoothly distributed across the location, then this location is treated as one of the geographical scopes of the web resource. They also propose how to solve aliasing and ambiguity. Recently, Markowotz et al. [22] propose the design and the initial implementation of a geographic search engine prototype for Germany. Their prototype extracts various geographic features from the crawled web dataset consisting of pages whose domain name contains ``de′′. A geographic footprint, a set of relevant locations for page, is assigned to each page. Subsequently, the resulting footprint is integrated into the query processor of the search engine.

3 Crawling

1 Problem Description

Even though, in theory, the targeted geographic locations of a geographically focused crawling can be any valid geographic location, in our paper, a geographic location refers to a city-state pair for the sake of simplicity. Therefore, given a list of city-state pairs, the goal of our geographically focused crawling is to collect web pages which are ``relevant′′ to the targeted city-state pairs. Thus, after splitting and distributing the targeted city-state pairs to the participating crawling nodes, each participating crawling node would be responsible for the crawling of web pages relevant to its assigned city-state pairs.
Example 1 Given ${$(New York, NY), (Houston, TX)$}$ as the targeted city-state pairs and 3 crawling nodes ${Cn_1,$ $Cn_2$ $,Cn_3}$, one possible design of geographically focused collaborative crawler is to assign (New York, NY) to $Cn_1$ and (Houston, TX) to $Cn_2$.

Particularly, for our experiments, we perform the geographically focused crawling of pages targeting the top 100 US cities, which will be explained later in Section 5. We use some general notations to denote the targeted city-state pairs and crawling nodes as follows. Let $TC={(c_1,s_1),$ $ ldots, (c_n,s_n)}$ denote the set of targeted city-state pairs for our crawling where each $(c_i,s_i)$ is a city-state pair. When it is clear in the context, we will simply denote $(c_i,s_i)$ as $c_i$. Let $CR={Cn_1, ldots, Cn_m}$ denote the set of participating crawling nodes for our crawling. The main challenges that have to be dealt by a geographically focused collaborative crawler are the following: 数据挖掘研究院

  • How to split and then distribute $TC={c_1, ldots, c_n}$ among the participating $CR={Cn_1, ldots, Cn_m}$
  • Given a retrieved page $p$, based on what criteria we assign the extracted URLs from $p$ to the participating crawling nodes.
Figure 1: Exchange of the extracted URLs
includegraphics[width=0.9	extwidth]{policya.eps}
a) All $l$ URLs extracted from $q$ are transferred to another crawling node (the worst scenario for policy A)
 
includegraphics[width=0.9	extwidth]{policyb.eps}

数据挖掘研究院


b) Page $q$ is transferred to the $m$ number of crawling nodes, but all URLs extracted from each $q$ of the crawling nodes are not transferred to other crawling nodes (the best scenario for policy B)

2 Policy for the assignment of the extracted URLs

When a crawling node extracts the URLs from a given page, it has to decide whether to keep the URLs for itself or transfer them to other participating crawling nodes for further fetching of the URLs. Once the URL is assigned to a particular crawling node, it may be added to the node′s pending queue. Given a retrieved page $p$, let $pr(c_{i}vert p)$ be the probability that page $p$ is about the city-state pair $c_{i}$. Suppose that the targeted city-state pairs are given and they are distributed over the participating crawling nodes. There are mainly two possible policies for the exchange of URLs between crawling nodes.
  • Policy A: Given the retrieved page $p$, let $c_i$ be the most probable city-state pair about $p$, i.e. $arg$ $max_{c_i in TC}$ $pr(c_ivert p)$. We assign each extracted URL from page $p$ to the crawling node $Cn_j$ responsible on $c_i$
  • Policy B: Given the retrieved page $p$, let ${c_{p_1},ldots, c_{p_k}}$ $subset TC$ be the set of city-state pairs whose $Pr(c_{p_i}vert p) 
eq 0$. We assign each extracted URL from page $p$ to EACH crawling node $Cn_j$ responsible on $c_{p_i} in TC$,
Lemma 2 Let $b$ be the bandwidth cost and let $c$ be the inter-communication cost between crawling nodes. If $b>c$, then the Policy A is more cost effective than the Policy B. 数据挖掘研究院

Proof: Given an extracted URL $q$ from page $p$, let $m$ be the number of crawling nodes used by the Policy B (crawling nodes which are assigned to download $q$). Since the cost for the policy A and B is equal when $m=1$, we suppose $m geq 2$. Let $l$ be the total number of URLs extracted from $q$. Let $C(A)$ and $C(B)$ be the sum of total inter-communication cost plus the bandwidth cost for the Policy A and Policy B respectively. One can easily verify that the cost of download for $q$ and all URLs extracted from $q$ is given as $C(A) leq b+l cdot (c+b)$ as shown in Figure 1a) and $C(B) geq m cdot b + l cdot m cdot b$. as shown in Figure 1b). Therefore, it follows that $C(A) leq C(B)$ since $m geq 2$ and $b>c$. 数据挖掘研究院

The assignment of extracted URLs for each retrieved page of all crawling collaboration strategies that we consider next will be based on the Policy A. 数据挖掘实验室

3 Hash Based Collaboration

We consider the hash based collaboration, which is the approach taken by most of collaborative crawlers, for the sake of comparison of this basic approach to our geographically focused collaboration strategies. The goal of hash based collaboration is to implementing a distributed crawler partition over the web by computing hash functions over URLs. When a crawling node extracts a URL from the retrieved page, a hash function is then computed over the URL. The URL is assigned to the participating crawling node responsible for the corresponding hash value of the URL. Since we are using a uniform hash function for our experiments, we will have a considerable data exchange between crawling nodes since the uniform hash function will map most of URLs extracted from the retrieved page to remote crawling nodes.

4 Geographically Focused Collaborations

We first divide up $CR$, the set of participating crawling nodes, into geographically sensitive nodes and general nodes. Even though, any combination of geographically sensitive and general crawling nodes is allowed, the architecture of our crawler consists of five geographically sensitive and one general crawling node for our experiments. A geographically sensitive crawling node will be responsible for the download of pages pertinent to a subset targeted city-state pairs while a general crawling node will be responsible for the download of pages which are not geographically-aware supporting other geographically sensitive nodes.

Each collaboration policy considers a particular set of features for the assessment of the geographical scope of page (whether a page is pertinent to a particular city-state pair or not). From the result of this assessment, each extracted URL from the page will be assigned to the crawling node that is responsible for the download of pages pertinent to the corresponding city-state pair.

数据挖掘研究院

1 URL Based

The intuition behind the URL based collaboration is that pages containing a targeted city-state pair in their URL address might potentially guide the crawler toward other pages about the city-state pair. More specifically, for each extracted URL from the retrieved page $p$, we verify whether the city-state pair $c_i$ is found somewhere in the URL address of the extracted URL. If the city-state pair $c_i$ is found, then we assign the corresponding URL to the crawling node which is responsible for the download of pages about $c_i$.

2 Extended Anchor Text Based

Given link text $l$, an extended anchor text of $l$ is defined as the set of prefix and suffix tokens of $l$ of certain size. It is known that extended anchor text provides valuable information to characterize the nature of the page which is pointed by link text. Therefore, for the extended anchor text based collaboration, our assumption is that pages associated with the extended anchor text, in which a targeted city-state pair $c_i$ is found, will lead the crawler toward those pages about $c_i$. More precisely, given retrieved page $p$, and the extended anchor text $l$ found somewhere in $p$, we verify whether the city-state pair $c_i subset TC$ is found as part of the extended anchor text $l$. When multiple findings of city-state occurs, then we choose the city-state pair that is the closest to the link text. Finally, we assign the URL associated with $l$ to the crawling node that is responsible for the download of pages about $c_i$.

3 Full Content Based

In [15], the location reference is used to assess the geographical scope of page. Therefore, for the full content based collaboration, we perform a content analysis of the retrieved page to guide the crawler for the future link expansion. Let $pr((c_i,s_i)vert p)$ be the probability that page $p$ is about city-state pair $(c_i,s_i)$. Given $TC$ and page $p$, we compute $pr((c_i,s_i)vert p)$ for $(c_i,s_i) in TC$ as follows:

数据挖掘实验室


数据挖掘研究院

egin{displaymath} pr((c_i,s_i)vert p)= alpha cdot char93 ((c_i, s_i),p) + (1-alpha) cdot pr(s_ivert c_i) cdot char93 (c_i,p) end{displaymath}

where $char93 ((c_i, s_i),p)$ denotes the number of times that the city-state pair $(c_i,s_i)$ is found as part of the content of $p$, $char93 (c_i,p)$ denotes the number of times (independent of $char93 ((c_i, s_i),p)$) that the city reference $c_i$ is found as part of the content of $p$, and $alpha$ denotes the weighting factor. For our experiments, $alpha=0.7$ was used.

The probability $pr(s_ivert c_i)$ is calculated under two simplified assumptions: (1) $pr(s_ivert c_i)$ is dependent on the real population size of $(c_i,s_i)$ (e.g., Population of Kansas City, Kansas is 500,000). We obtain the population size for each city city-data.com 3. (2) $pr(s_ivert c_i)$ is dependent on the number of times that the state reference is found (independent of $char93 ((c_i, s_i),p)$) as part of the content of $p$. In other words, our assumption for $pr(s_jvert c_i)$ can be written as 数据挖掘实验室


数据挖掘研究院

egin{displaymath} pr(s_ivert c_i) propto eta S(s_ivert c_i)+ (1-eta) 	ilde{S}(s_ivert p) end{displaymath}

where $S(s_ivert c_i)$ is the normalized form of the population size of $(c_i,s_i)$, $	ilde{S}(s_ivert p)$ is the normalized form of the number of appearances of the state reference $s_i$, independent of $char93 ((c_i, s_i),p)$), within the content of $p$, and $eta$ denotes the weighting factor. For our experiments, $eta=0.5$ was used. Therefore, $pr((c_i,s_i)vert p)$ is computed as 数据挖掘实验室

数据挖掘研究院

$displaystyle pr((c_i,s_i)vert p)$ $	extstyle =$ $displaystyle alpha cdot char93 ((c_i, s_i),p) + (1-alpha) cdot (eta S(s_ivert c_i)$ $	extstyle +$ $displaystyle (1-eta) 	ilde{S}(s_ivert p)) cdot char93 (c_i,p)$

Finally, given a retrieve page $p$, we assign all extracted URLs from $p$ to the crawling node which is responsible for pages relevant to $argmbox{ }max_{(c_i,s_i) in TC}$ $ Pr((c_i,s_i)vert p)$.

数据挖掘研究院

4 Classification Based

Chung et al. [12] show that the classification based collaboration yields a good performance for the topic-oriented collaborative crawling. Our classification based collaboration for the geographically crawling is motivated by their work. In this type of collaboration, the classes for the classifier are the partitions of targeted city-state pairs. We train our classifier to determine $pr(c_ivert p)$, the probability that the retrieved page $p$ is pertinent to the city-state pair $c_i$. Among various possible classification methods, we chose the Naive-Bayes classifier [25] due to its simplicity. To obtain training data, pages from the Open Directory Project (ODP) 4 were used. For each targeted city-state pair, we download all pages under the corresponding city-state category which, in turn, is the child category for the ``REGIONAL′′ category in the ODP. The number of pages downloaded for each city-state pair varied from 500 to 2000. We also download a set of randomly chosen pages which are not part of any city-state category in the ODP. We download 2000 pages for this purpose. Then, we train our Naive-Bayes classifier using these training data. Our classifier determines whether a page $p$ is pertinent to either of the targeted city-state pairs or it is not relevant to any city-state pair at all. Given the retrieved page $p$, we assign all extracted URLs from $p$ to the crawling node which is responsible for the download of pages which are pertinent to $argmbox{ }max_{c_i in TC} pr(c_ivert p)$.

5 IP-Address Based

The IP-address of the web service indicates the geographic location at which the web service is hosted. The IP-address based collaboration explores this information to control the behavior of the crawler for further downloads. Given a retrieved page $p$, we first determine the IP-address of the web service from which the crawler downloaded $p$. With this IP-address, we use the IP-address mapping tool to obtain the corresponding city-state pair of the given IP, and then we assign all extracted URLs of page $p$ to the crawling node which is responsible on the computed city-state pair. For the IP-address mapping tool, freely available IP address mapping tool, hostip.info(API)5 is employed.

5 Normalization and Disambiguation of City Names

As indicated in [2,15], problems of aliasing and ambiguity arise when one wants to map the possible city-state reference candidate to an unambiguous city-state pair. In this section, we describe how we handle these issues out. 数据挖掘研究院

  • Aliasing: Many times different names or abbreviations are used for the same city name. For example, Los Angeles can be also referred as LA or L.A. Similar to [15], we used the web database of the United States Postal Service (USPS) 6 to deal with aliasing. The service returns a list of variations of the corresponding city name given the zip code. Thus, we first obtained the list of representative zip codes for each city in the list using the US Zip Code Database product, purchased from ZIPWISE7, and then we obtain the list of possible names and abbreviations for each city from the USPS.
  • Ambiguity: When we deal with city names, we have to deal with the ambiguity of the city name reference. First, we can not guarantee whether the possible city name reference actually refers to the city name. For instance, New York might refer to New York as city name or New York as part of the brand name ``New York Fries′′ or New York as state name. Second, a city name can refer to cities in different states. For example, four states, New York, Georgia, Oregon and California, have a city called Albany. For both cases, unless we fully analyze the context in which the reference was made, the city name reference might be inherently ambiguous. Note that for the full content based collaboration, the issue of ambiguity is already handled through the term $pr(s_ivert c_i)$ of the Eq. 2. For the extended anchor text based and the URL based collaborations, we always treat the possible city name reference as the city that has the largest population size. For instance, Glendale found in either the URL address of page or the extended anchor text of page would be treated as the city name reference for Glendale, AZ. 8.

4 Evaluation Models

To assess the performance of each crawling collaboration strategy, it is imperative to determine how much geographi- cally-aware pages were downloaded for each strategy and whether the downloaded pages are actually pertinent to the targeted geographic locations. Note that while some previous works [2,15,18,5] attempt to define precisely what a geographically-aware page is, determining whether a page is geographically-aware or not remains as an open problem [2,18]. For our particular application, we define the notion of geographical awareness of page through geographic entities [21]. We refer the address description of a physical organization or a person as geographic entity. Since the targeted geographical city-state pairs for our experiments are the top 100 US cities, a geographic entity in the context of our experiments are further simplified as an address information, following the standard US address format, for any of the top 100 US cities. In other words, a geographic entity in our context is a sequence of Street Number, Street Name, City Name and State Name, found as part of the content of page. Next, we present various evaluation measures for our crawling strategies based on geographic entities. Additionally, we present traditional measures to quantify the performance of any collaborative crawling. Note that our evaluation measures are later used in our experiments.
  • Geo-coverage: When a page contain at least one geographic entity (i.e. address information), then the page is clearly a geographically aware page. Therefore, we define the geo-coverage of retrieved pages as the number of retrieved pages with at least one geographic entity, pertinent to the targeted geographical locations (e.g., the top US 100 cities) over the total number of retrieved pages.
  • Geo-focus: Each crawling node of the geographically focused collaborative crawler is responsible for a subset of the targeted geographic locations. For instance, suppose we have two geographically sensitive crawling nodes $Cn_1$, and $Cn_2$,
最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?