Deep web
The deep web (also called the invisible web or hidden web) is the part of the World Wide Web that is not included in the surface web indexed by common search engines. It consists largely of pages that no other page links to, such as dynamic web pages. Dynamic web pages are generated in response to a query against a searchable database; the underlying information is stored in tables managed by database software such as Microsoft Access, Oracle or other SQL database systems. The deep web also includes sites that require registration or otherwise limit access to their pages, preventing search engines from browsing them and creating cached copies.
Multimedia files such as images, Usenet archives, and documents in non-HTML formats such as PDF and DOC used to form part of the deep web, but are now more easily accessible to search engines, especially Google.
The deep web should not be confused with the dark web or dark internet, terms that refer to machines or network segments not connected to the Internet. Deep web content is accessible to people online but not visible to conventional search engines, whereas dark internet content is not accessible online to either people or search engines.
Surface web
To better understand the invisible web, consider how conventional search engines construct their databases and thereby define the surface web. Programs called spiders or web crawlers start by reading pages on an initial list of websites. Each page they read is indexed and added to the search engine's database. Any hyperlinks to new pages are added to the list of pages to be indexed. Eventually, all reachable pages have been indexed, or the search engine runs out of time or disk space. These reachable pages are the surface web. Pages that have no chain of links leading to them from a page in the spider's initial list are invisible to that spider and are not part of the surface web it defines.
In contrast to the surface web stands the deep web. The great majority of the deep web is composed of searchable databases. To understand why these databases are invisible to spiders (and their search engines), consider the following:
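The crawling process described above can be sketched as a breadth-first traversal of a link graph. This is only an illustration under stated assumptions: the page names and the LINKS table are hypothetical, and a real crawler would fetch pages over HTTP rather than read them from a dictionary.

```python
from collections import deque

# Hypothetical link graph: page name -> pages it links to.
LINKS = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["article-1"],
    "article-1": [],
    "orphan": ["home"],  # links out, but no page links to it
}

def crawl(seeds, links):
    """Return the set of pages reachable from the seed list by following links."""
    indexed = set()
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if page in indexed or page not in links:
            continue
        indexed.add(page)          # "index" the page
        queue.extend(links[page])  # schedule its hyperlinks for reading
    return indexed

surface = crawl(["home"], LINKS)   # the surface web this crawler defines
invisible = set(LINKS) - surface   # pages with no chain of links from a seed
print(sorted(surface))
print(sorted(invisible))
```

In this toy graph the page named "orphan" exists and even links to other pages, but because nothing links to it, a crawler seeded with "home" never finds it.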
Imagine someone has collected a large amount of information (books, texts, articles, images and so on) and put it online in a website as a database reachable only via a search field. This database, like most databases, would work like this:
in a search field the user types the keywords he or she wants
this searching facility looks inside the database and retrieves the relevant content
a page of results is presented, with links to every item relevant to the user’s query
Once a conventional search engine’s web crawler reaches this site, it will capture the text of the main page and of the pages that the main page links to (usually “about us”, “contact us”, “privacy policy” and the like). But the great majority of the information (the books, texts, articles or images), which is reachable only by querying the search field, cannot be reached by the web crawler. The crawler cannot predict which words it should type into the search field. Thus the data is invisible to the search engine.
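The scenario above can be modelled in a few lines. This is a minimal sketch, not a description of any real site: the RECORDS and STATIC_PAGES tables and the function names are hypothetical, chosen only to show that a link-following crawler sees the static pages while the records stay behind the search form.

```python
# Hypothetical records held in the site's database.
RECORDS = {
    1: "Moby-Dick, a novel about a whale",
    2: "Field guide to alpine flowers",
    3: "History of the printing press",
}

# Hypothetical static pages and the links each one contains.
STATIC_PAGES = {
    "/": ["/about", "/contact", "/search"],
    "/about": [],
    "/contact": [],
    "/search": [],  # holds only an empty search form, no links
}

def search(keywords):
    """The site's search facility: ids of records matching every keyword."""
    words = keywords.lower().split()
    return [rid for rid, text in RECORDS.items()
            if all(word in text.lower() for word in words)]

def crawl_static(start):
    """A link-following crawler; it never types anything into the form."""
    seen, stack = set(), [start]
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        stack.extend(STATIC_PAGES.get(url, []))
    return seen

crawled = crawl_static("/")       # only the four static pages
found_by_user = search("whale")   # a human query reaches record 1
print(sorted(crawled))
print(found_by_user)
```

The crawler ends up with the main page, "about", "contact" and the search form itself, while none of the three records is ever requested, because reaching them requires choosing a keyword.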
Accessing the deep web
As noted above, search engines use web crawlers that follow hyperlinks. Such crawlers typically do not submit queries to databases, because the number of possible queries to a single database is potentially infinite. It has been noted that this can be (partially) overcome by providing links to query results, which also increases the Google-style PageRank of a deep web site.
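One way to picture the "links to query results" workaround is a browse page that lists ordinary hyperlinks, each pointing at the results of one query. The sketch below assumes a hypothetical /search endpoint that takes a q parameter and a hypothetical list of topics; it only builds the link targets and says nothing about how any particular site implements this.

```python
from urllib.parse import urlencode

# Hypothetical topics the site owner wants crawlers to be able to reach.
TOPICS = ["whales", "alpine flowers", "printing press"]

def result_links(base_url, topics):
    """Build one plain hyperlink target per query, so a crawler can follow it."""
    return [f"{base_url}?{urlencode({'q': topic})}" for topic in topics]

links = result_links("/search", TOPICS)
for link in links:
    print(link)
```

A crawler that reads a page containing these links can follow them like any other hyperlinks, so the corresponding result pages become part of the surface web. Only the listed queries are covered, which is why the workaround is partial.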
In 2005, Yahoo! made a small part of the deep web searchable by releasing Yahoo! Subscriptions. This search engine searches through a few subscription-only web sites.
Some search tools are being designed to retrieve information from the deep web. Their crawlers are designed to identify and interact with searchable databases, with the aim of providing access to deep web content. Some examples are InvisibleWeb.com, LexiBot, Lycos Invisible Web Catalog and Incywincy.
Specialty search engines, also called specialized, vertical or specific search engines, are also good tools for accessing the deep web. These engines focus on a specific subject area, a particular topic, a geographic region, or even a particular file format. This focus lets them deal better with the deep web, either by building robots better suited to deep web retrieval or by indexing content manually. Examples of specialty search engines and their areas of coverage are Scirus (science), the Health On the Net Foundation, or HON (health), and Alacra (business information). A newer trend is the development of networks of vertical search engines, in which the user selects the most appropriate vertical search engine for a specific search from a list of topics, such as those at www.DeepVertical.com.
Another option is to access the searchable databases directly, since they make up the invisible web itself. A good example is Find Articles (exclusive articles). There are catalogs listing the major specialized databases, as well as alternative search engines that focus on finding specialty search engines and databases, such as GoshMe and Topic Hunter.
References
Gary Price & Chris Sherman. The Invisible Web: Uncovering Information Sources Search Engines Can't See. CyberAge Books, July 2001. ISBN 091096551X
Joe Barker. Invisible Web: What it is, Why it exists, How to find it, and Its inherent ambiguity. UC Berkeley - Teaching Library Internet Workshops, January 2004. Last seen online July 2005 at http://www.lib.berkeley.edu
Michael K. Bergman. The Deep Web: Surfacing Hidden Value. The Journal of Electronic Publishing. August, 2001. Volume 7, Issue 1. http://www.press.umich.edu/jep/07-01/bergman.html
Alex Wright. In Search of the Deep Web. Salon.com, March 2004. http://www.salon.com/tech/feature/2004/03/09/deep_web/index.html
External links
QProber: Classifying and Searching Hidden-Web Text Databases
MetaQuerier: Exploring and Integrating the Deep Web
Deep Web from the library of SUNY-Albany
The Invisible Web Revealed by Robert J. Lackie of Rider University

