The Deep Web: Surfacing Hidden Value

Author: unknown. Date: 2004-11-26.

Searching on the Internet today can be compared to dragging a net across the surface of the ocean. While a great deal may be caught in the net, there is still a wealth of information that is deep, and therefore, missed. The reason is simple: Most of the Web's information is buried far down on dynamically generated sites, and standard search engines never find it.

Traditional search engines create their indices by spidering or crawling surface Web pages. To be discovered, the page must be static and linked to other pages. Traditional search engines cannot "see" or retrieve content in the deep Web -- those pages do not exist until they are created dynamically as the result of a specific search. Because traditional search engine crawlers cannot probe beneath the surface, the deep Web has heretofore been hidden.

The deep Web is qualitatively different from the surface Web. Deep Web sources store their content in searchable databases that only produce results dynamically in response to a direct request. But a direct query is a laborious, "one at a time" way to search. BrightPlanet's search technology automates the process of making dozens of direct queries simultaneously using multiple-thread technology and thus is the only search technology, so far, that is capable of identifying, retrieving, qualifying, classifying, and organizing both "deep" and "surface" content.

If the most coveted commodity of the Information Age is indeed information, then the value of deep Web content is immeasurable. With this in mind, BrightPlanet has quantified the size and relevancy of the deep Web in a study based on data collected between March 13 and 30, 2000. Our key findings include:

  • Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
  • The deep Web contains 7,500 terabytes of information compared to nineteen terabytes of information in the surface Web.
  • The deep Web contains nearly 550 billion individual documents compared to the one billion of the surface Web.
  • More than 200,000 deep Web sites presently exist.
  • Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information -- sufficient by themselves to exceed the size of the surface Web forty times.
  • On average, deep Web sites receive fifty per cent greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet-searching public.
  • The deep Web is the fastest-growing category of new information on the Internet.
  • Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
  • Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.
  • Deep Web content is highly relevant to every information need, market, and domain.
  • More than half of the deep Web content resides in topic-specific databases.
  • A full ninety-five per cent of the deep Web is publicly accessible information -- not subject to fees or subscriptions.

To put these findings in perspective, a study at the NEC Research Institute (1), published in Nature, estimated that the search engines with the largest number of Web pages indexed (such as Google or Northern Light) each index no more than sixteen per cent of the surface Web. Since they are missing the deep Web when they use such search engines, Internet searchers are therefore searching only 0.03% -- or one in 3,000 -- of the pages available to them today. Clearly, simultaneous searching of multiple surface and deep Web sources is necessary when comprehensive information retrieval is needed.
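As a rough check on the 0.03% figure, the sketch below combines the document counts from the findings above with the sixteen per cent coverage estimate. How the study combined these numbers is an assumption here; the function name and the formula are illustrative only, and the result lands close to, not exactly on, the one-in-3,000 figure quoted.

```python
# Figures are taken from the findings above; combining them this way is an assumption.
SURFACE_DOCS = 1_000_000_000       # surface Web documents
DEEP_DOCS = 550_000_000_000        # deep Web documents
ENGINE_SURFACE_COVERAGE = 0.16     # largest single engine's share of the surface Web


def searched_fraction(surface_docs, deep_docs, coverage):
    """Fraction of all (surface + deep) documents reachable through one engine."""
    return coverage * surface_docs / (surface_docs + deep_docs)


fraction = searched_fraction(SURFACE_DOCS, DEEP_DOCS, ENGINE_SURFACE_COVERAGE)
print(f"{fraction:.4%} of documents, about one in {round(1 / fraction):,}")
```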

The Deep Web
Internet content is considerably more diverse and the volume certainly much larger than commonly understood.

First, though the terms are sometimes used synonymously, the World Wide Web (HTTP protocol) is but a subset of Internet content. Other Internet protocols besides the Web include FTP (file transfer protocol), e-mail, news, Telnet, and Gopher (most prominent among pre-Web protocols). This paper does not consider these non-Web protocols further.(2)

Second, even within the strict context of the Web, most users are aware only of the content presented to them via search engines such as Excite, Google, AltaVista, or Northern Light, or search directories such as Yahoo!, About.com, or LookSmart. Eighty-five percent of Web users use search engines to find needed information, but nearly as high a percentage cite the inability to find desired information as one of their biggest frustrations.(3) According to a recent survey of search-engine satisfaction by market-researcher NPD, search failure rates have increased steadily since 1997.(4a)

The importance of information gathering on the Web and the central and unquestioned role of search engines -- plus the frustrations expressed by users about the adequacy of these engines -- make them an obvious focus of investigation.

Until Van Leeuwenhoek first looked at a drop of water under a microscope in the late 1600s, people had no idea there was a whole world of "animalcules" beyond their vision. Deep-sea exploration in the past thirty years has turned up hundreds of strange creatures that challenge old ideas about the origins of life and where it can exist. Discovery comes from looking at the world in new ways and with new tools. The genesis of the BrightPlanet study was to look afresh at the nature of information on the Web and how it is being identified and organized.

How Search Engines Work
Search engines obtain their listings in two ways: Authors may submit their own Web pages, or the search engines "crawl" or "spider" documents by following one hypertext link to another. The latter returns the bulk of the listings. Crawlers work by recording every hypertext link in every page they index while crawling. Like ripples propagating across a pond, search-engine crawlers are able to extend their indices further and further from their starting points.

The surface Web contains an estimated 2.5 billion documents, growing at a rate of 7.5 million documents per day.(5a) The largest search engines have done an impressive job in extending their reach, though Web growth itself has exceeded the crawling ability of search engines.(6a)(7a) Today, the three largest search engines in terms of internally reported documents indexed are Google with 1.35 billion documents (500 million available to most searches),(8) Fast with 575 million documents,(9) and Northern Light with 327 million documents.(10)

Legitimate criticism has been leveled against search engines for these indiscriminate crawls, mostly because they provide too many results (search on "Web," for example, with Northern Light, and you will get about 47 million hits). Also, because new documents are found from links within other documents, those documents that are cited are more likely to be indexed than new documents -- up to eight times as likely.(5b)

To overcome these limitations, the most recent generation of search engines (notably Google) has replaced the random link-following approach with directed crawling and indexing based on the "popularity" of pages. In this approach, documents more frequently cross-referenced than other documents are given priority both for crawling and in the presentation of results. This approach provides superior results when simple queries are issued, but exacerbates the tendency to overlook documents with few links.(5c)

And, of course, once a search engine needs to update literally millions of existing Web pages, the freshness of its results suffers. Numerous commentators have noted the increased delay in posting and recording new information on conventional search engines.(11a) Our own empirical tests of search engine currency suggest that listings are frequently three or four months -- or more -- out of date.

Moreover, return to the premise of how a search engine obtains its listings in the first place, whether adjusted for popularity or not. That is, without a linkage from another Web document, the page will never be discovered. The main failing of search engines is that they depend on the Web's linkages to identify what is on the Web.
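The dependence on linkages can be illustrated with a minimal sketch of link-following discovery. This is a generic breadth-first traversal over a hypothetical in-memory link graph, not the crawler of any particular search engine; the page names are invented for illustration.

```python
from collections import deque


def crawl(link_graph, seeds):
    """Breadth-first crawl: return every page reachable from the seeds via links."""
    discovered = set(seeds)
    frontier = deque(seeds)
    while frontier:
        page = frontier.popleft()
        for target in link_graph.get(page, []):
            if target not in discovered:
                discovered.add(target)
                frontier.append(target)
    return discovered


# Hypothetical toy Web: keys are pages, values are the pages they link to.
toy_web = {
    "home": ["about", "catalog-search-form"],
    "about": ["home"],
    "catalog-search-form": [],
    "catalog-record-1234": [],  # produced only by a query; no page links to it
}

indexed = crawl(toy_web, ["home"])
missed = set(toy_web) - indexed
print(sorted(indexed))
print(sorted(missed))
```

The search form itself is found because the home page links to it, but the record behind the form is never reached by following links.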

Figure 1 is a graphical representation of the limitations of the typical search engine. The content identified is only what appears on the surface and the harvest is fairly indiscriminate. There is tremendous value that resides deeper than this surface content. The information is there, but it is hiding beneath the surface of the Web.

[Image: boat with a shallow net]

Figure 1. Search Engines: Dragging a Net Across the Web's Surface

Searchable Databases: Hidden Value on the Web
How does information appear and get presented on the Web? In the earliest days of the Web, there were relatively few documents and sites. It was a manageable task to post all documents as static pages. Because all pages were persistent and constantly available, they could be crawled easily by conventional search engines. In July 1994, the Lycos search engine went public with a catalog of 54,000 documents.(12) Since then, the compound growth rate in Web documents has been on the order of more than 200% annually!(13a)

Sites that were required to manage tens to hundreds of documents could easily do so by posting fixed HTML pages within a static directory structure. However, beginning about 1996, three phenomena took place. First, database technology was introduced to the Internet through such vendors as Bluestone's Sapphire/Web (Bluestone has since been bought by HP) and later Oracle. Second, the Web became commercialized, initially via directories and search engines, but rapidly evolved to include e-commerce. And, third, Web servers were adapted to allow the "dynamic" serving of Web pages (for example, Microsoft's ASP and the Unix PHP technologies).

This confluence produced a true database orientation for the Web, particularly for larger sites. It is now accepted practice that large data producers such as the U.S. Census Bureau, Securities and Exchange Commission, and Patent and Trademark Office, not to mention whole new classes of Internet-based companies, choose the Web as their preferred medium for commerce and information transfer. What has not been broadly appreciated, however, is that the means by which these entities provide their information is no longer through static pages but through database-driven designs.

It has been said that what cannot be seen cannot be defined, and what is not defined cannot be understood. Such has been the case with the importance of databases to the information content of the Web. And such has been the case with a lack of appreciation for how the older model of crawling static Web pages -- today's paradigm for conventional search engines -- no longer applies to the information content of the Internet.

In 1994, Dr. Jill Ellsworth first coined the phrase "invisible Web" to refer to information content that was "invisible" to conventional search engines.(14) The potential importance of searchable databases was also reflected in the first search site devoted to them, the AT1 engine that was announced with much fanfare in early 1997.(15) However, PLS, AT1's owner, was acquired by AOL in 1998, and soon thereafter the AT1 service was abandoned.

For this study, we have avoided the term "invisible Web" because it is inaccurate. The only thing "invisible" about searchable databases is that they can be neither indexed nor queried by conventional search engines. Using BrightPlanet technology, they are totally "visible" to those who need to access them.

Figure 2 represents, in a non-scientific way, the improved results that can be obtained by BrightPlanet technology. By first identifying where the proper searchable databases reside, a directed query can then be placed to each of these sources simultaneously to harvest only the results desired -- with pinpoint accuracy.

[Image: boat with a deep net]

Figure 2. Harvesting the Deep and Surface Web with a Directed Query Engine

Additional aspects of this representation will be discussed throughout this study. For the moment, however, the key points are that content in the deep Web is massive -- approximately 500 times greater than that visible to conventional search engines -- with much higher quality throughout.

BrightPlanet's technology is uniquely suited to tap the deep Web and bring its results to the surface. The simplest way to describe our technology is a "directed-query engine." It has other powerful features in results qualification and classification, but it is this ability to query multiple search sites directly and simultaneously that allows deep Web content to be retrieved.
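The idea of querying multiple search sites simultaneously can be sketched as a thread-pool fan-out. This is only an illustration of the general pattern, not BrightPlanet's implementation; the three source functions are hypothetical stand-ins for real searchable databases and return canned results.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for searchable databases; each maps a query to results.
def patents_source(query):
    return [f"patent record about {query}"]


def filings_source(query):
    return [f"corporate filing mentioning {query}", f"annual report on {query}"]


def library_source(query):
    return []  # a source may legitimately return nothing


SOURCES = {
    "patents": patents_source,
    "filings": filings_source,
    "library": library_source,
}


def directed_query(query, sources, max_workers=8):
    """Send one query to every source concurrently and collect results by source."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(search, query) for name, search in sources.items()}
        return {name: future.result() for name, future in futures.items()}


results = directed_query("fuel cells", SOURCES)
for name, hits in results.items():
    print(name, len(hits))
```

In a real setting each source function would issue a network request to a site's search form, which is where running the queries in parallel threads pays off.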

Study Objectives
To perform the study discussed, we used our technology in an iterative process. Our goals were to:

  • Quantify the size and importance of the deep Web.
  • Characterize the deep Web's content, quality, and relevance to information seekers.
  • Discover automated means for identifying deep Web search sites and directing queries to them.
  • Begin the process of educating the Internet-searching public about this heretofore hidden and valuable information storehouse.

Like any newly discovered phenomenon, the deep Web is just being defined and understood. Daily, as we have continued our investigations, we have been amazed at the massive scale and rich content of the deep Web. This white paper concludes with requests for additional insights and information that will enable us to continue to better understand the deep Web.

What Has Not Been Analyzed or Included in Results
This paper does not investigate non-Web sources of Internet content. This study also purposely ignores private intranet information hidden behind firewalls. Many large companies have internal document stores that exceed terabytes of information. Since access to this information is restricted, its scale cannot be defined, nor can it be characterized. Also, while on average 44% of the "contents" of a typical Web document reside in HTML and other coded information (for example, XML or JavaScript),(16) this study does not evaluate specific information within that code. We do, however, include those codes in our quantification of total content (see next section).

Finally, the estimates for the size of the deep Web include neither specialized search engine sources -- which may be partially "hidden" to the major traditional search engines -- nor the contents of major search engines themselves. This latter category is significant. Simply accounting for the three largest search engines and average Web document sizes suggests search-engine contents alone may equal 25 terabytes or more,(17) somewhat larger than the known size of the surface Web.

A Common Denominator for Size Comparisons
All deep-Web and surface-Web size figures use both total number of documents (or database records in the case of the deep Web) and total data storage. Data storage is based on "HTML included" Web-document size estimates.(13b) This basis includes all HTML and related code information plus standard text content, exclusive of embedded images and standard HTTP "header" information. Use of this standard convention allows apples-to-apples size comparisons between the surface and deep Web. The HTML-included convention was chosen because:

  • Most standard search engines that report document sizes do so on this same basis.
  • When saving documents or Web pages directly from a browser, the file size byte count uses this convention.
  • BrightPlanet's technology reports document sizes on this same basis.

All document sizes used in the comparisons use actual byte counts (1024 bytes per kilobyte).

In actuality, data storage from deep-Web documents will therefore be considerably less than the figures reported.(18) Actual records retrieved from a searchable database are forwarded to a dynamic Web page template that can include items such as standard headers and footers, ads, etc. While including this HTML code content overstates the size of searchable databases, standard "static" information on the surface Web is presented in the same manner.

HTML-included Web page comparisons provide the common denominator for comparing deep and surface Web sources.

Use and Role of BrightPlanet Technology
All retrievals, aggregations, and document characterizations in this study used BrightPlanet's technology. The technology uses multiple threads for simultaneous source queries and then document downloads. It completely indexes all documents retrieved (including HTML content). After being downloaded and indexed, the documents are scored for relevance using four different scoring algorithms, prominently vector space modeling (VSM) and standard and modified extended Boolean information retrieval (EBIR).(19)
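For readers unfamiliar with vector space modeling, the sketch below shows the textbook form of the idea: a query and a document are each turned into a term-frequency vector and compared by cosine similarity. This is a generic illustration with whitespace tokenization and raw term counts, both assumptions; it is not BrightPlanet's scoring algorithm, whose details are not given here.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector using naive lowercase whitespace tokenization."""
    return Counter(text.lower().split())


def cosine_score(query, document):
    """Cosine similarity between the term vectors of a query and a document."""
    q, d = term_vector(query), term_vector(document)
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


print(cosine_score("deep web databases", "searchable databases on the deep web"))
print(cosine_score("deep web databases", "boat with a shallow net"))
```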

Automated deep Web search-site identification and qualification also used a modified version of the technology employing proprietary content and HTML evaluation methods.

Surface Web Baseline
The most authoritative studies to date of the size of the surface Web have come from Lawrence and Giles of the NEC Research Institute in Princeton, NJ. Their analyses are based on what they term the "publicly indexable" Web. Their first major study, published in Science magazine in 1998, using analysis from December 1997, estimated the total size of the surface Web as 320 million documents.(4b) An update to their study employing a different methodology was published in Nature magazine in 1999, using analysis from February 1999.(5d) This study documented 800 million documents within the publicly indexable Web, with a mean page size of 18.7 kilobytes exclusive of images and HTTP headers.(20)

In partnership with Inktomi, NEC updated its Web page estimates to one billion documents in early 2000.(21) We have taken this most recent size estimate and updated total document storage for the entire surface Web based on the 1999 Nature study:

Total No. of Documents    Content Size (GBs) (HTML basis)
1,000,000,000             18,700

Table 1. Baseline Surface Web Size Assumptions

These are the baseline figures used for the size of the surface Web in this paper. (A more recent study from Cyveillance(5e) has estimated the total surface Web size to be 2.5 billion documents, growing at a rate of 7.5 million documents per day. This is likely a more accurate number, but the NEC estimates are still used because they were based on data gathered closer to the dates of our own analysis.)
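The storage figure in Table 1 follows from multiplying the document count by the mean page size reported in the 1999 Nature study. The short sketch below reproduces that arithmetic; treating one gigabyte as one million kilobytes is an assumption made to match the rounded table value.

```python
DOCUMENTS = 1_000_000_000   # NEC/Inktomi estimate, early 2000
MEAN_PAGE_KB = 18.7         # mean page size, exclusive of images and HTTP headers


def surface_web_size_gb(documents, mean_page_kb):
    """Total storage in GB, treating 1 GB as one million KB (an assumption)."""
    return documents * mean_page_kb / 1_000_000


print(f"{surface_web_size_gb(DOCUMENTS, MEAN_PAGE_KB):,.0f} GB")
```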

Other key findings from the NEC studies that bear on this paper include:

  • Surface Web coverage by individual, major search engines has dropped from a maximum of 32% in 1998 to 16% in 1999, with Northern Light showing the largest coverage.
  • Metasearching using multiple search engines can improve retrieval coverage by a factor of 3.5 or so, though combined coverage from the major engines dropped to 42% from 1998 to 1999.
  • More popular Web documents, that is, those with many link references from other documents, have up to an eight-fold greater chance of being indexed by a search engine than those with no link references.

Analysis of Largest Deep Web Sites
More than 100 individual deep Web sites were characterized to produce the listing of sixty sites reported in the next section.

Site characterization required three steps:

  1. Estimating the total number of records or documents contained on that site.
  2. Retrieving a random sample of a minimum of ten results from each site and then computing the expressed HTML-included mean document size in bytes. This figure, times the number of total site records, produces the total site size estimate in bytes.
  3. Indexing and characterizing the search-page form on the site to determine subject coverage.
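Step 2 above amounts to multiplying a sampled mean document size by the site's total record count. A minimal sketch follows; the function name, the sample sizes, and the record count are all hypothetical values chosen for illustration, not data from the study.

```python
from statistics import mean


def estimate_site_size_bytes(sample_doc_sizes, total_records, min_sample=10):
    """Estimate total site size as mean sampled document size times record count."""
    if len(sample_doc_sizes) < min_sample:
        raise ValueError(f"need at least {min_sample} sampled documents")
    return mean(sample_doc_sizes) * total_records


# Hypothetical HTML-included sizes (bytes) of ten randomly sampled result documents.
sample = [12_000, 15_500, 9_800, 14_200, 11_000, 13_300, 10_700, 16_100, 12_900, 14_500]
site_bytes = estimate_site_size_bytes(sample, total_records=2_000_000)
print(f"{site_bytes / 1024**3:.1f} GB")
```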

Estimating total record count per site was often not straightforward. A series of tests was applied to each site; they are listed in descending order of importance and confidence in deriving the total document count:

  1. E-mail messages were sent to the webmasters or contacts listed for all sites identified, requesting verification of total record counts and storage sizes (uncompressed basis); about 13% of the sites shown in Table 2 provided direct documentation in response to this request.
  2. Total record counts as reported by the site itself. This involved inspecting related pages on the site, including help sections, site FAQs, etc.
  3. Documented site sizes presented at conferences, estimated by others, etc. This step involved comprehensive Web searching to identify reference sources.
  4. Record counts as provided by the site's own search function. Some site searches provide total record counts for all queries submitted. For others that use the NOT operator and allow its stand-alone use, a query term known not to occur on the site such as "NOT ddfhrwxxct" was issued. This approach returns an absolute total record count. Failing these two options, a broad query was issued that would capture the general site content; this number was then corrected for an empirically determined "coverage factor," generally in the 1.2 to 1.4 range.(22)
  5. A site that failed all of these tests could not be measured and was dropped from the results listing.

Analysis of Standard Deep Web Sites
Analysis and characterization of the entire deep Web involved a number of discrete tasks:

  • Qualification as a deep Web site.
  • Estimation of total number of deep Web sites.
  • Size analysis.
  • Content and coverage analysis.
  • Site page views and link references.
  • Growth analysis.
  • Quality analysis.

The methods applied to these tasks are discussed separately below.

Deep Web Site Qualification
An initial pool of 53,220 possible deep Web candidate URLs was identified from existing compilations at seven major sites and three minor ones.(23) After harvesting, this pool resulted in 45,732 actual unique listings after tests for duplicates. Cursory inspection indicated that in some cases the subject page was one link removed from the actual search form. Criteria were developed to predict when this might be the case. The BrightPlanet technology was used to retrieve the complete pages and fully index them for both the initial unique sources and the one-link removed sources. A total of 43,348 resulting URLs were actually retrieved.

We then applied filter criteria to these sites to determine if they were indeed search sites. This proprietary filter involved inspecting the HTML content of the pages, plus analysis of page text content. This brought the total pool of deep Web candidates down to 17,579 URLs.

Subsequent hand inspection of 700 random sites from this listing identified further filter criteria. Ninety-five of these 700, or 13.6%, did not fully qualify as search sites. This correction has been applied to the entire candidate pool and the results presented.

Some of the criteria developed when hand-testing the 700 sites were then incorporated back into an automated test within the BrightPlanet technology for qualifying search sites with what we believe is 98% accuracy. Additionally, automated means for discovering further search sites have been incorporated into our internal version of the technology based on what we learned.

Estimation of Total Number of Sites
The basic technique for estimating total deep Web sites uses "overlap" analysis, the accepted technique chosen for two of the more prominent surface Web size analyses.(6b)(24) We used overlap analysis based on search engine coverage and the deep Web compilation sites noted above (see results in Table 3 through Table 5).

The technique is illustrated in the diagram below:

[Image: overlapping circles]

Figure 3. Schematic Representation of "Overlap" Analysis

Overlap analysis involves pairwise comparisons of the number of listings individually within two sources, na and nb, and the degree of shared listings or overlap, n0, between them. Assuming random listings for both na and nb, the total size of the population, N, can be estimated. The estimate of the fraction of the total population covered by na is n0/nb; dividing this fraction into the total size of na gives an estimate of the total population size, that is, N = na / (n0/nb). These pairwise estimates are repeated for all of the individual sources used in the analysis.

To illustrate this technique, assume, for example, we know our total population is 100. Then if two sources, A and B, each contain 50 items, we could predict on average that 25 of those items would be shared by the two sources and 25 items would not be listed by either. According to the formula above, this can be represented as: 100 = 50 / (25/50)
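The pairwise estimate can be written as a small function. The sketch below implements only the single-pair calculation described above, using the worked numbers from the text; the function and parameter names are illustrative, and the guard against a zero overlap is an added assumption, since the estimate is undefined in that case.

```python
def estimate_population(n_a, n_b, n_overlap):
    """Estimate total population N from two listings and their shared count.

    The fraction of the population covered by source A is estimated as
    n_overlap / n_b, and N is the size of A divided by that fraction.
    """
    if n_overlap <= 0:
        raise ValueError("overlap must be positive for the estimate to be defined")
    coverage_of_a = n_overlap / n_b
    return n_a / coverage_of_a


# Worked example from the text: two sources of 50 items sharing 25.
print(estimate_population(n_a=50, n_b=50, n_overlap=25))
```

A larger observed overlap yields a smaller estimate, which is why non-independent sources push the result toward a lower bound.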

There are two keys to overlap analysis. First, it is important to have a relatively accurate estimate for total listing size for at least one of the two sources in the pairwise comparison. Second, both sources should obtain their listings randomly and independently from one another.

This second premise is in fact violated for our deep Web source analysis. Compilation sites are purposeful in collecting their listings, so their sampling is directed. And, for search engine listings, searchable databases are more frequently linked to because of their information value, which increases their relative prevalence within the engine listings.(5f) Thus, the overlap analysis represents a lower bound on the size of the deep Web, since both of these factors will tend to increase the degree of overlap, n0, reported between the pairwise sources.

Deep Web Size Analysis
In order to analyze the total size of the deep Web, we need an average site size in documents and data storage to use as a multiplier applied to the entire population estimate. Results are shown in Figure 4 and Figure 5.

As discussed for the large site analysis, obtaining this information is not straightforward and involves considerable time evaluating each site. To keep estimation time manageable, we chose a +/- 10% confidence interval at the 95% confidence level, requiring a total of 100 random sites to be fully characterized.(25a)
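The figure of roughly 100 sites is consistent with the standard sample-size formula for estimating a proportion, n = z^2 p(1 - p) / e^2, evaluated at the most conservative case p = 0.5. Whether the cited source (25a) uses exactly this derivation is an assumption; the sketch below only shows that the common formula gives a number just under 100.

```python
import math


def sample_size(margin, z=1.96, p=0.5):
    """Sample size for estimating a proportion within +/- margin.

    z=1.96 corresponds to a 95% confidence level; p=0.5 is the worst case.
    """
    return math.ceil(z**2 * p * (1 - p) / margin**2)


print(sample_size(0.10))
```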

We randomized our listing of 17,000 search site candidates. We then proceeded to work through this list until 100 sites were fully characterized. We followed a less intensive process than in the large-site analysis for determining total record or document count for the site.

Exactly 700 sites were inspected in their randomized order to obtain the 100 fully characterized sites. All sites inspected received characterization as to site type and coverage; this information was used in other parts of the analysis.

The 100 sites that could have their total record/document count determined were then sampled for average document size (HTML-included basis). Random queries were issued to the searchable database with results reported as HTML pages. A minimum of ten of these were generated, saved to disk, and then averaged to determine the mean site page size. In a few cases, such as bibliographic databases, multiple records were reported on a single HTML page. In these instances, three total query results pages were generated, saved to disk, and then averaged based on the total number of records reported on those three pages.
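For sites that report several records per results page, the averaging described above divides total page bytes by total records rather than by the number of pages. A minimal sketch follows, with hypothetical page sizes and record counts chosen only for illustration.

```python
def mean_record_size(pages):
    """Mean bytes per record from (page_bytes, records_on_page) pairs."""
    total_bytes = sum(page_bytes for page_bytes, _ in pages)
    total_records = sum(records for _, records in pages)
    if total_records == 0:
        raise ValueError("no records reported on the sampled pages")
    return total_bytes / total_records


# Hypothetical: three saved results pages from a bibliographic database.
saved_pages = [(30_000, 10), (24_000, 8), (36_000, 12)]
print(mean_record_size(saved_pages))
```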

Content Coverage and Type Analysis
Content coverage was analyzed across all 17,000 search sites in the qualified deep Web pool (results shown in Table 6); the type of deep Web site was determined from the 700 hand-characterized sites (results shown in Figure 6).

Broad content coverage for the entire pool was determined by issuing queries for twenty top-level domains against the entire pool. Because of topic overlaps, total occurrences exceeded the number of sites in the pool; this total was used to adjust all categories back to a 100% basis.
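Adjusting back to a 100% basis means dividing each category's occurrence count by the sum of all occurrences rather than by the number of sites. The sketch below shows that normalization; the category names and counts are hypothetical and are not the study's results.

```python
def normalize_to_percent(occurrences):
    """Rescale per-category occurrence counts so the shares sum to 100%."""
    total = sum(occurrences.values())
    return {category: 100 * count / total for category, count in occurrences.items()}


# Hypothetical counts: a site matching two topics is counted in both.
counts = {"science": 600, "business": 900, "law": 300, "health": 1_200}
shares = normalize_to_percent(counts)
for category, share in shares.items():
    print(f"{category}: {share:.1f}%")
```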

Hand characterization by search-database type resulted in assigning each site to one of twelve arbitrary categories that captured the diversity of database types. These twelve categories are:

  1. Topic Databases -- subject-specific aggregations of information, such as SEC corporate filings, medical databases, patent records, etc.
  2. Internal site -- searchable databases for the internal pages of large sites that are dynamically created, such as the knowledge base on the Microsoft site.
  3. Publications -- searchable databases for current and archived articles.
  4. Shopping/Auction.
  5. Classifieds.
  6. Portals -- broader sites that included more than one of these other categories in searchable databases.
  7. Library -- searchable internal holdings, mostly for university libraries.
  8. Yellow and White Pages -- people and business finders.
  9. Calculators -- while not strictly databases, many do include an internal data component for calculating results. Mortgage calculators, dictionary look-ups, and translators between languages are examples.
  10. Jobs -- job and resume postings.
  11. Message or Chat.
  12. General Search -- searchable databases most often relevant to Internet search topics and information.

These 700 sites were also characterized as to whether they were public or subject to subscription or fee access.

Site Pageviews and Link References
Netscape's "What's Related" browser option, a service from Alexa, provides site popularity rankings and link reference counts for a given URL.(26a) About 71% of deep Web sites have such rankings. The universal power function (a logarithmic growth rate or logarithmic distribution) allows pageviews per month to be extrapolated from the Alexa popularity rankings.(27) The "What's Related" report also shows external link counts to the given URL.

Random samples of 100 deep Web sites and 100 surface Web sites for which complete "What's Related" reports could be obtained were used for the comparisons.

Growth Analysis
The best method for measuring growth is with time-series analysis. However, since the discovery of the deep Web is so new, a different gauge was necessary.

Whois(28) searches associated with domain-registration services (25b) return records listing the domain owner, as well as the date the domain was first obtained (and other information). Using a random sample of 100 deep Web sites (26b) and another sample of 100 surface Web sites,(29) we issued the domain names to a Whois search and retrieved the date the site was first established. These results were then combined and plotted for the deep vs. surface Web samples.

Quality Analysis
Quality comparisons between the deep and surface Web content were based on five diverse, subject-specific queries issued via the BrightPlanet technology to three search engines (AltaVista, Fast, Northern Light)(30) and to three deep sites specific to each topic and included in the 600 sites presently configured for our technology. The five subject areas were agriculture, medicine, finance/business, science, and law.

The queries were specifically designed to limit total results returned from any of the six sources to a maximum of 200 to ensure complete retrieval from each source.(31) The specific technology configuration settings are documented in the endnotes.(32)

The "quality" determination was based on an average of our technology's VSM and mEBIR computational linguistic scoring methods. (33) (63) The "quality" threshold was set at our score of 82, empirically determined as roughly accurate from millions of previous scores of surface Web documents.

Deep Web vs. surface Web scores were obtained by using the BrightPlanet technology's selection by source option and then counting total documents and documents above the quality scoring threshold.
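The counting step can be sketched as below, assuming each document already carries a VSM score and an mEBIR score. The scoring methods themselves are not reproduced here; the function names, the sample scores, and the choice to count a score exactly at the threshold as passing are assumptions for illustration only.

```python
from collections import defaultdict

QUALITY_THRESHOLD = 82  # threshold stated in the text

def quality_score(vsm, mebir):
    """Average of the two scores, as the text describes."""
    return (vsm + mebir) / 2

def tally_by_source(documents, threshold=QUALITY_THRESHOLD):
    """Count total documents and documents meeting the threshold per source."""
    totals = defaultdict(lambda: {"total": 0, "quality": 0})
    for source, vsm, mebir in documents:
        totals[source]["total"] += 1
        # Assumption: a score equal to the threshold counts as passing.
        if quality_score(vsm, mebir) >= threshold:
            totals[source]["quality"] += 1
    return dict(totals)

# Hypothetical (source, VSM score, mEBIR score) tuples.
sample_documents = [
    ("deep", 90, 80),
    ("deep", 70, 80),
    ("surface", 82, 82),
    ("surface", 60, 70),
    ("surface", 95, 50),
]

tallies = tally_by_source(sample_documents)
```

Dividing the "quality" count by the "total" count for each source gives the per-document quality yield that the comparison rests on.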

Results and Discussion
This study is the first known quantification and characterization of the deep Web. Very little has been written about, or is known of, the deep Web. Estimates of size and importance have been anecdotal at best and certainly underestimate its scale. For example, Intelliseek's "invisible Web" says that, "In our best estimates today, the valuable content housed within these databases and searchable sources is far bigger than the 800 million plus pages of the 'Visible Web.'" They also estimate total deep Web sources at about 50,000 or so. (35)

Ken Wiseman, who has written one of the most accessible discussions about the deep Web, intimates that it might be about equal in size to the known Web. He also goes on to say, "I can safely predict that the invisible portion of the Web will continue to grow exponentially before the tools to uncover the hidden Web are ready for general use." (36) A mid-1999 survey by About.com's Web search guide concluded the size of the deep Web was "big and getting bigger." (37) A paper at a recent library science meeting suggested that only "a relatively small fraction of the Web is accessible through search engines."(38)

The deep Web is about 500 times larger than the surface Web, with, on average, about three times higher quality based on our document scoring methods on a per-document basis. On an absolute basis, total deep Web quality exceeds that of the surface Web by thousands of times. Total number of deep Web sites likely exceeds 200,000 today and is growing rapidly.(39) Content on the deep Web has meaning and importance for every information seeker and market. More than 95% of deep Web information is publicly available without restriction. The deep Web also appears to be the fastest growing information component of the Web.

General Deep Web Characteristics
Deep Web content has some significant differences from surface Web content. Deep Web documents (13.7 KB mean size; 19.7 KB median size) are on average 27% smaller than surface Web documents. Though individual deep Web sites have tremendous diversity in their number of records, ranging from tens or hundreds to hundreds of millions (a mean of 5.43 million records per site but with a median of only 4,950 records), these sites are on average much, much larger than surface sites. The rest of this paper will serve to amplify these findings.

The mean deep Web site has a Web-expressed (HTML-included basis) database size of 74.4 MB (median of 169 KB). Actual record counts and size estimates can be derived from one-in-seven deep Web sites.

On average, deep Web sites receive about half again as much monthly traffic as surface sites (123,000 pageviews per month vs. 85,000). The median deep Web site receives somewhat more than two times the traffic of a random surface Web site (843,000 monthly pageviews vs. 365,000). Deep Web sites on average are more highly linked to than surface sites by nearly a factor of two (6,200 links vs. 3,700 links), though the median deep Web site is less so (66 vs. 83 links). This suggests that well-known deep Web sites are highly popular, but that the typical deep Web site is not well known to the Internet search public.
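The ratios behind these comparisons can be checked with simple arithmetic. The sketch below uses only the figures quoted in the paragraph above; the variable names are ours.

```python
def ratio(deep, surface):
    """Deep-to-surface ratio for a paired measurement."""
    return deep / surface

# Figures as quoted in the text above.
mean_traffic = ratio(123_000, 85_000)     # "about half again as much"
median_traffic = ratio(843_000, 365_000)  # "somewhat more than two times"
mean_links = ratio(6_200, 3_700)          # "nearly a factor of two"
median_links = ratio(66, 83)              # median deep site is less linked

for label, value in [
    ("mean traffic", mean_traffic),
    ("median traffic", median_traffic),
    ("mean links", mean_links),
    ("median links", median_links),
]:
    print(f"{label}: {value:.2f}x")
```

The quoted figures give ratios of roughly 1.45, 2.31, 1.68, and 0.80 respectively, so the link comparison is closer to a factor of 1.7 than to a full factor of two.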

One of the more counter-intuitive results is that 97.4% of deep Web sites are publicly available without restriction; a further 1.6% are mixed (limited results publicly available, with greater results requiring subscription and/or paid fees); only 1.1% of results are totally subscription or fee limited. This result is counter-intuitive because of the visible prominence of subscriber-limited sites such as Dialog, Lexis-Nexis, Wall Street Journal Interactive, etc. (We got the document counts from the sites themselves or from other published sources.)

However, once the broader pool of deep Web sites is looked at beyond the large, visible, fee-based ones, public availability dominates.

60 Deep Sites Already Exceed the Surface Web by 40 Times
Table 2 indicates that the sixty largest known deep Web sites contain about 750 terabytes of data (HTML-included basis), or roughly forty times the size of the known surface Web. These sites appear in a broad array of domains from science to law to images and commerce. We estimate the total number of records or documents within this group to be about eighty-five billion.

Roughly two-thirds of these sites are public ones, representing about 90% of the content available within this group of sixty. The absolutely massive size of the largest sites shown also illustrates the universal power function distribution of sites within the deep Web, not dissimilar to Web site popularity (40) or surface Web sites.(41) One implication of this type of distribution is that there is no real upper size boundary to which sites may grow.

| Name | Type | URL | Web Size (GBs) |
| --- | --- | --- | --- |
| National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000 |
| NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600 |
| National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940 |
| Alexa | Public (partial) | http://www.alexa.com/ | 15,860 |
| Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640 |
| MP3.com | Public | http://www.mp3.com/ | 4,300 |
| Terraserver | Public/Fee | http://terraserver.microsoft.com/ | 4,270 |
| HEASARC (High Energy Astrophysics Science Archive Research Center) | Public | http://heasarc.gsfc.nasa.gov/W3Browse/ | 2,562 |
| US PTO - Trademarks + Patents | Public | http://www.uspto.gov/tmdb/, http://www.uspto.gov/patft/ | 2,440 |
| Informedia (Carnegie Mellon Univ.) | Public (not yet) | http://www.informedia.cs.cmu.edu/ | 1,830 |
| Alexandria Digital Library | Public | http://www.alexandria.ucsb.edu/adl.html | 1,220 |
| JSTOR Project | Limited | http://www.jstor.org/ | 1,220 |
| 10K Search Wizard | Public | http://www.tenkwizard.com/ | 769 |
| UC Berkeley Digital Library Project | Public | http://elib.cs.berkeley.edu/ | 766 |
| SEC Edgar | Public | http://www.sec.gov/edgarhp.htm | 610 |
| US Census | Public | http://factfinder.census.gov | 610 |
| NCI CancerNet Database | Public | http://cancernet.nci.nih.gov/ | 488 |
| Amazon.com | Public | http://www.amazon.com/ | 461 |
| IBM Patent Center | Public/Private | http://www.patents.ibm.com/boolquery | 345 |
| NASA Image Exchange | Public | http://nix.nasa.gov/ | 337 |
| InfoUSA.com | Public/Private | http://www.abii.com/ | 195 |
| Betterwhois (many similar) | Public | http://betterwhois.com/ | 152 |
| GPO Access | Public | http://www.access.gpo.gov/ | 146 |
| Adobe PDF Search | Public | http://searchpdf.adobe.com/ | |
