
Towards a Better Understanding of Web Resources and Server R

Author: unknown. Posted: 2004-11-28

1 Introduction

There have been many studies to better understand characteristics of the Web [6, 8, 14]. Other studies have proposed improved caching policies and mechanisms [5, 12]. However, work has not been done to specifically understand how changes in Web resources and the meta information reported by servers affect caching by Web browsers and proxy caches. To address this gap we have undertaken a study to monitor and better understand the characteristics of resource changes at servers and how these servers report meta data about the resources. The long-term goal of our project is to both examine the effectiveness of current caching techniques in light of more complete data, and also to investigate the potential of caching if improved techniques were used by Web caches and servers.

This paper focuses on the initial part of our project: characterizing information about Web resources and server responses that is relevant to Web caching. The approach is to study a set of URLs at a variety of sites and gather statistics about the rate and nature of changes compared with the resource type. In addition, we gather response header information reported by the servers with each retrieved resource. Previous work used proxy and server logs or network traces of user requests/responses, which constrained the resulting studies to the available data. Logs and traces are also affected by browser and "lower-level" proxy caches, which hide some of the requested resources. In contrast, our approach is to retrieve each resource in the test set at fixed intervals for a period of time, with caching disabled for more complete data gathering.

We are aware that, because we generate our own set of resource requests, the results may not reflect a realistic mix of requests as found in a log or a packet trace. Rather, our study focuses on characterizing resources and responses based on content type.

In comparison to previous Web characterization work, our study has two distinguishing aspects: it focuses on issues relevant to Web caching, and it uses a methodology that allows us to study changes to resources in a controlled manner. In the remainder of this paper we describe our study. The following section discusses details of what information we are seeking in our study, followed by a section discussing the methodology we use in obtaining this information. The middle portion of the paper presents the results from our study on the test sets we use, followed by a discussion of possible implications of these results for Web servers, caches and the HTTP protocol. The paper concludes with a description of related work, a discussion of future work and a summary of our work to date.

 


2 Study

 


The general goal of our work is to better understand how resources change at a collection of servers and how meta information reported by servers reflects those changes. The overriding goal is to obtain data that can be used to better understand the potential benefits of caching and whether existing software is reaching this potential. Our work has many specific directions for investigation:


  • Monitor selected resources to study the frequency at which these resources change. A similar study was done using a packet trace [6], but with our approach we can control what requests are made and use an MD5 checksum of the contents to determine when changes occur.


  • Examine the availability and accuracy of cache validation information reported by servers for requested resources. The approach is to monitor response headers returned along with a resource to discover last modification time (lmodtime), size, and entity tag (Etag) information. The availability of lmodtime information is important in efficiently validating a cached resource using an If-Modified-Since (IMS) GET request in HTTP. Previous studies have found that the percentage of server responses that contain the lmodtime for a resource varies from 50% to 80% [6, 12, 14]. Etags are an HTTP/1.1 mechanism for servers to provide an "opaque" cache validator [13]. To check the accuracy of the validator information, we calculate MD5 checksums for resource contents and compare the checksums and validators for successive retrievals of a resource. We also measure the use and accuracy of explicit cache directives returned by servers, such as the Expires header, the Cache-Control header in HTTP/1.1 and the Pragma: no-cache header in HTTP/1.0.


  • Examine how images and other embedded resources change relative to the HTML resources they are contained in. Prior work indicates that images do not change at the same rate, but how does the use of embedded images change as these container resources change?


  • Study the predictability and locality of changes to a resource. This is particularly important for resources that change often such as dynamically computed content. Techniques such as delta-encoding [16], HTML pre-processing [7] and active caches [4] have been proposed to allow resources that change frequently, but predictably, to be cached.


  • Understand how servers respond to different types of requests for the same resource. One type of variation is whether servers are supplying cookies that clients are then including as part of subsequent requests. A recent study found that 30% of the requests made in a client trace included cookies, concluding that responses to these requests are uncachable [3, 9]. This result raises a number of questions. Is there a similar proportion of server replies that contain cookies for our test set? Does the inclusion of a cookie in a request always result in a different resource response than obtained with a request containing no cookie? Do two separate requests with two separate cookies always result in different resource responses? We believe answers to these questions will provide us with a better picture of the impact of cookies on caching.
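
The validator accuracy check described in the second item above (comparing MD5 checksums against lmodtime and Etag values across successive retrievals) can be sketched roughly as follows. This is an illustrative sketch only, not the code used in the study: the `Snapshot` record, the `snapshot` and `classify` functions, and the outcome labels are all hypothetical names introduced here.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """One retrieval of a resource (hypothetical record layout)."""
    md5: str                      # hex MD5 digest of the response body
    last_modified: Optional[str]  # raw Last-Modified header value, or None
    etag: Optional[str]           # raw ETag header value, or None


def snapshot(body: bytes, headers: dict) -> Snapshot:
    """Build a Snapshot from a response body and its headers."""
    # Header names are case-insensitive in HTTP, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    return Snapshot(
        md5=hashlib.md5(body).hexdigest(),
        last_modified=lowered.get("last-modified"),
        etag=lowered.get("etag"),
    )


def classify(prev: Snapshot, curr: Snapshot, field: str) -> str:
    """Compare one validator with the checksum for two successive retrievals.

    The labels returned are assumptions made for this sketch.
    """
    old, new = getattr(prev, field), getattr(curr, field)
    if old is None or new is None:
        return "validator missing"
    content_changed = prev.md5 != curr.md5
    validator_changed = old != new
    if content_changed == validator_changed:
        return "accurate"
    if content_changed:
        return "stale validator"   # content changed, validator did not
    return "spurious change"       # validator changed, content did not
```

Calling `classify(day1, day2, "etag")` and `classify(day1, day2, "last_modified")` for each pair of consecutive daily snapshots would give per-validator counts of accurate and inaccurate reports.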


3 Methodology

There are two primary issues in our approach for studying the identified questions: how to determine the test set of resources to monitor and how to do the monitoring. These issues are discussed in this section.

 


3.1 Test Set

The approach we used in this study is to identify frequently used sites and focus our study on resources at those sites. While such a test set may not be "representative" of a proxy trace, it provides us with a set of resources that are likely to have the most impact on long-term Web usage. We explored different sources for gathering resource usage information such as Media Metrix [15], Keynote Systems [11] and 100hot.com [18]. We use the home pages from a set of Web sites identified by 100hot.com as a basis for our study.

An alternate approach is to gather a set of URLs from a relatively current proxy log trace. This set of URLs can then be tracked using our methodology. This approach has the advantage of focusing on URLs actually being retrieved by users across a number of different servers and content types. However, it has the disadvantage of being biased by the particular user group encompassed by the trace.

We believe there is not a single best test set and that we need to look at different test sets. In the results presented in this paper we use only the first approach, but in subsequent work we obtained proxy traces from NLANR [17] and performed a similar study [22]. Results from this subsequent study are referenced as appropriate in this paper.

 

3.2 Data Retrieval

The methodology of the study is to perform an unconditional HTTP GET for each of the URLs in the test set on a daily basis using the HTTP request headers shown below for the sample URL http://owl.wpi.edu/ (the host and path vary for each request).

 

GET / HTTP/1.0
Pragma: no-cache
Accept: */*
Host: owl.wpi.edu
User-Agent: Mozilla/4.03 [en] (WinNT; I) 
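
As a rough illustration of how the host and path vary per URL, the request text above could be generated as in the following sketch. The function name `build_request` is hypothetical and this is not the tool used in the study; only the header values are taken from the sample request.

```python
from urllib.parse import urlsplit


def build_request(url: str) -> str:
    """Return the text of an unconditional HTTP/1.0 GET request for `url`."""
    parts = urlsplit(url)
    # An empty path means the site root; keep any query string with the path.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.0",
        "Pragma: no-cache",   # ask intermediate caches not to answer
        "Accept: */*",
        f"Host: {parts.netloc}",
        "User-Agent: Mozilla/4.03 [en] (WinNT; I)",
    ]
    # Header lines end with CRLF; a blank line terminates the header block.
    return "\r\n".join(lines) + "\r\n\r\n"
```

The request carries no If-Modified-Since header, so the server is asked for the full resource on every retrieval.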


The time between successive retrievals for a URL may be lengthened or shortened as needed, but for this work we used a retrieval interval of one day. For each retrieved resource, we store response headers and calculate an MD5 checksum on the contents. Contents of HTML and text resources are stored if changed from the previous retrieval. Once a resource is retrieved, it is parsed and all embedded and traversal links are recorded. Embedded images are retrieved and their MD5 checksum is calculated.

This process is repeated for all traversal links in the original URL in the test set. Hence all traversal links in the home page of each site are retrieved along with the embedded images of each of these links. This approach allows us to not only follow the dynamics of individual URLs, but to follow the dynamics of the set of resources used at a site.
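
The per-resource processing described above (checksum the contents, decide whether to store them, and record embedded and traversal links) might look roughly like the sketch below. It is an assumption-laden illustration rather than the study's software: `LinkExtractor` and `process` are hypothetical names, only `<img>` and `<a>` tags are considered, the first retrieval is treated as "changed", and network retrieval is left out.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect embedded image URLs and traversal (anchor) URLs from HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.embedded = []   # URLs from <img src=...>
        self.traversal = []  # URLs from <a href=...>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.embedded.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.traversal.append(urljoin(self.base_url, attrs["href"]))


def process(url, body, previous_md5=None):
    """Summarize one retrieved HTML resource.

    `previous_md5` is the digest from the prior retrieval, or None
    if this is the first retrieval of the URL.
    """
    digest = hashlib.md5(body).hexdigest()
    parser = LinkExtractor(url)
    # latin-1 maps every byte to a character, so decoding cannot fail.
    parser.feed(body.decode("latin-1"))
    return {
        "url": url,
        "md5": digest,
        "store_contents": digest != previous_md5,
        "embedded": parser.embedded,
        "traversal": parser.traversal,
    }
```

A driver would call `process` on each home page, then retrieve every URL in `traversal` once (without following links further) and every URL in `embedded` to compute image checksums.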


4 Results

This section gives information about the test sets used in our studies and provides answers to questions raised in Section 2.

 

4.1 Test Sets

Four test data sets were constructed using the September 1998 ratings from 100hot.com. Data from the first test set, "com1," were gathered on a nightly basis for a two-week period during October 1998. The com1 test set consists of home pages for 19 Web sites identified as the top 10 sites by 100hot.com (some sites included multiple homes). The specific sites used in this and other test sets are given in [21]. As the test set name implies, all sites in com1 are from the .com commercial Internet domain, although in a few cases these sites contain links to URLs not in this domain, particularly in other countries.

The three remaining test sets were studied during a two-week period in November 1998. The "com2" test set consists of 13 URLs from the next most popular sites from 100hot.com. The "netorg" test set was derived from the set of all non-.com sites in the 100hot.com top 100. These sites are primarily from the .net and .org domains. The final test set, "edu," was constructed based on rankings of .edu domain site usage given by 100hot.com, along with WPI's home page. Because relatively few queries were included in the four test sets, we added a fifth test set, "query," to our study. This test set was studied for six days in November and simply included queries to ten search engines, searching for "search engines." For this test set, the query result was retrieved along with embedded images, but traversal links were not retrieved.

Summary statistics about all test sets are given in Table 1. While headers from all responses were saved and catalogued, the table focuses on statistics related to caching and content type. Statistics about server software were also gathered, but in our studies we found no specific correlations with server software so these data are not reported.

 
