For almost as long as there has been a Web, there have been Web search engines. So one might reasonably ask why the deep Web has remained out of view for so long.
Traditionally, Web search engines have grown their databases through simple brute force. All the major search engines survey the Web by dispatching legions of simple programs known as spiders, crawlers, robots or harvesters to trace their way through the endless chains of hyperlinks that tie Web pages together.
That method works well for the static HTML pages and predictable URLs that make up the upper strata of the Web. But the deep Web resides mostly in databases, shielded by a lattice of registration gateways, session cookies and dynamically generated links. Unless an organization consciously chooses to share its data, by opening up an API or Web services feed -- the way Amazon books show up in a Google search -- then the data will likely remain unseen to most users.
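The link-following approach, and its blind spot, can be illustrated with a minimal sketch. It assumes a hypothetical three-page site held in memory (the `SITE` dictionary and its URLs are invented for illustration) and is not any search engine's actual crawler. The spider visits every page reachable through hyperlinks but never sees the record that sits behind the search form.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(pages, start):
    """Breadth-first walk over `pages`, a dict mapping URL -> HTML text.

    Returns the URLs in the order they were first visited. URLs that are
    missing from `pages` are skipped, much as a crawler would skip a page
    it cannot fetch.
    """
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical site: the database record is reachable only through the form.
SITE = {
    "/": '<a href="/about">About</a> <a href="/search">Search</a>',
    "/about": '<a href="/">Home</a>',
    "/search": '<form action="/results"><input name="q"></form>',
    "/results?q=bicycle": "<p>Database record</p>",
}

surface = crawl(SITE, "/")
```

Here `surface` contains `/`, `/about` and `/search`, while `/results?q=bicycle` is never visited because no hyperlink points to it.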
New search engines now under development are exploring methods for penetrating the database barriers. BrightPlanet has developed a formula for brokering queries across multiple deep Web data sources at once, aggregating the results and letting users compare changes to those results over time -- a process known as "differencing."
That capability has attracted considerable interest from certain government agencies that shall remain nameless. "Some of our clients are spooky," says BrightPlanet COO Duncan Wittes. Other BrightPlanet customers include state governments, competitive intelligence researchers, and political campaigns whose "oppo" teams may want not only to search for what a candidate has said but also for what he or she may have "unsaid" over time.
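The brokering and "differencing" idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not BrightPlanet's formula: the sources are hypothetical in-memory collections, and `make_source`, `federated_search` and `difference` are names invented for this example.

```python
def make_source(records):
    """Build a toy source: a function that searches `records` by substring."""
    def search(query):
        return {rid: text for rid, text in records.items()
                if query.lower() in text.lower()}
    return search


def federated_search(sources, query):
    """Send `query` to every source and merge the results.

    `sources` maps a source name to a search function. Keys in the merged
    result are (source name, record id) pairs so records never collide.
    """
    merged = {}
    for name, search in sources.items():
        for record_id, text in search(query).items():
            merged[(name, record_id)] = text
    return merged


def difference(old, new):
    """Report what was added, removed or changed between two snapshots."""
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "changed": sorted(k for k in new if k in old and old[k] != new[k]),
    }


# Two snapshots of the same hypothetical sources, taken at different times.
before = federated_search({
    "speeches": make_source({1: "I support the bill."}),
    "press": make_source({7: "The bill is sound."}),
}, "bill")
after = federated_search({
    "speeches": make_source({1: "I oppose the bill."}),
    "press": make_source({8: "A new bill statement."}),
}, "bill")

report = difference(before, after)
```

In this toy run the report flags the rewritten speech as changed and the withdrawn press statement as removed, which is the kind of "unsaid" material an opposition researcher would look for.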
Soon-to-launch Dipsie is pursuing an alternative approach to unlocking the dynamic Web, by deploying a kind of souped-up spider that penetrates barriers like forms, drop-down lists, dynamically generated URLs and session cookies. Dipsie's spider works by emulating a "well-formed user" that, from the Web site's point of view, behaves just like a real flesh-and-mouse user, enabling the spider to cache the kind of data typically visible only to a human user.
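One small piece of what such a spider must do is work through the choices a form offers. The sketch below is a guess at the general technique, not Dipsie's implementation: it assumes a form submitted by GET whose fields are all drop-down lists, and the function name `form_urls`, the `/catalog` path and the field names are hypothetical.

```python
from itertools import product
from urllib.parse import urlencode


def form_urls(action, fields):
    """Yield one GET URL per combination of drop-down choices.

    `fields` maps a field name to the list of option values the form
    offers for that field.
    """
    names = list(fields)
    for combo in product(*(fields[name] for name in names)):
        yield action + "?" + urlencode(list(zip(names, combo)))


# Hypothetical catalog form with two drop-down lists.
urls = list(form_urls("/catalog", {
    "category": ["bikes", "parts"],
    "year": ["2003", "2004"],
}))
```

Two lists of two options each produce four URLs, each of which a crawler could then fetch like any ordinary page. Session cookies and registration gateways would need separate handling that this sketch leaves out.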
Other search developers, including IBM, Google and Intelliseek, are exploring their own approaches to mining the deep Web. But in the wake of this week's announcement, Yahoo is now the elephant in the living room.
Yahoo won't discuss the specifics of how its search algorithms work. But the company does acknowledge that its Content Aggregation Program will give paying customers a more direct pipeline into its search database. Yahoo Search vice president Tim Cadogan says, "Ultimately we want to search the whole Web for free," but he nonetheless sees the CAP program as a way of enabling "direct, structured relationships with content providers" to "deliver a higher-quality search experience for users."
It takes a fine ear for P.R. nuance to distinguish "higher-quality search experience" from "better results." Yahoo has issued copious disclaimers assuring non-paying customers that they will receive the same algorithmic treatment as paying ones. But the company acknowledges that paying customers will likely benefit from a "quality review" designed to help companies improve their chances of showing up in search results.
"Cadogan claims that people who send money can't count on getting better results," says Bray. "Do you believe that? I don't."
Every year, the University of California at Davis pays the publisher John Wiley about $14,000 for a subscription to the Journal of Comparative Neurology, which publishes breaking research in its field. That may sound like a steep price tag for what is essentially a magazine subscription, but it's a tiny dollop of the $20 million the U.C. libraries spend every year on scholarly journals.
Scientific, technical and medical publishing constitutes an $11 billion industry. And like the rest of the publishing business, scholarly publishers have undergone massive consolidation in the past two decades. Once the province of small university presses and boutique academic imprints, scholarly journals now emanate from giant publishing conglomerates such as Elsevier, Thomson and Blackwell.
"The well-established subscription model that evolved around print journals is a cash cow," says Peter Lyman, professor at the UC-Berkeley School of Information Management and Systems. "One that the publishers are terrified of damaging accidentally, through online publishing."
But unlike trade-book publishers, who count on Amazon and Barnes & Noble to move physical units of the latest Harry Potter tome, scholarly publishers rely increasingly on electronic journal subscriptions and paid search services to fuel their revenues. Their customers -- mostly academic institutions and research organizations -- insist on providing Web access to journal content. To meet that demand while protecting their valuable data stores, the large publishers have responded by rolling out private permission-based search gateways to the contents of their journals, usually under highly restrictive license terms and tightly managed IP access.
But those pricey journal databases now compete for attention -- and search queries -- from students and faculty with ready access to Google, Yahoo and the rest. And while the public search engines may not find every article in the journal literature, a growing portion of published research also finds its way out onto the Web.
For example, when gene researchers identify a new DNA sequence, they usually submit the sequence to the National Institutes of Health's GenBank -- a public deep Web resource -- before submitting it to journals for publication.
Legislation pending in Congress would ensure that all research funded by federal taxpayers be made available free of charge to the public, over the Internet. Meanwhile, new cooperative academic initiatives like the Public Library of Science and the National Science Digital Library are trying to expand access to scholarly research, opening up more indirect competition for the proprietary publishing systems.
And as more scholarship finds its way onto the Web, page-ranking algorithms are also providing an alternative quality rating system to the traditional scholarly peer review that journals have always employed.
While page ranking won't replace the scholarly review process anytime soon, the expansion of public Web search engines will put downward pressure on the premium that publishers can command. "I don't think [page ranking] is more reliable," says Lyman, "but I do think it's perceived as legitimate. The cost of creating formally quality-controlled information may drive people to consider lower-cost alternatives."
Lyman adds, "When the public begins to use and accept non-qualified information -- relying on Google or other things to perform that function, like Technorati -- there are beginning to be quality mechanisms out there that are user-centric or generated by users."
How will scholarly publishers react to the encroaching competition from deep Web search engines? "The publishing industry is not famous for being progressive, forward thinking or fast moving," Bray says. "But if they ignore [deep Web search], they could find themselves in a situation like the record companies, where someone finds a way to subvert them."
- - - - - - - - - - - -
The deep Web contains some 500 times more data than the surface Web; but to regard the deep Web as simply a bigger and better version of the current Web is to overlook the essential feature of databases, which is structure. Most of the deep Web is structured or semi-structured data, as opposed to the sea of flotsam HTML that bobs across the surface Web.
"Once you get into the deep Web, all of these data sources often have much more metadata available," says Bray. "This could be a huge opportunity for companies looking at new ways of presenting search results."
Deriving search results from structured data sets will open up new possibilities for search engines. In all likelihood, search engines will gradually abandon the flat listings-style result pattern you see on a typical 12-page Google result. (And who ever gets to the 12th page, anyway?) Not only could deep Web search engines present more useful and manipulable views into structured data but, given some basic lingua franca of structural vocabularies, they could also aggregate those results in endlessly permutable combinations.
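What a structured view might look like can be shown with a small sketch. It assumes each result carries a metadata field such as `topic`; the field names, the sample records and the `group_by_facet` function are hypothetical and do not describe any existing search engine.

```python
from collections import defaultdict


def group_by_facet(results, facet):
    """Group result titles by the value of one metadata field.

    Results that lack the field are collected under "unknown".
    """
    groups = defaultdict(list)
    for result in results:
        groups[result.get(facet, "unknown")].append(result["title"])
    return dict(groups)


# Hypothetical results for a search on "bicycle".
results = [
    {"title": "Tour de France standings", "topic": "racing"},
    {"title": "Frame welding basics", "topic": "manufacturing"},
    {"title": "Velodrome schedule", "topic": "racing"},
    {"title": "Bicycle"},
]

by_topic = group_by_facet(results, "topic")
```

Grouping the same records by a different field, such as year or publisher, would yield a different view of the same data, which is only possible when the sources share some vocabulary for their metadata.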
"It′s ridiculous to think that the one-dimensional result list is going to be the universal paradigm for all imaginable searches forever," Bray says. "If you type ′bicycle′ into Google, you get a list of results having to do with bicycles. But that result is, in a very important way, a lie. It ignores the fact that some of these things are about bicycle racing, some are about bicycle manufacturing. It ignores things that Google might not even know about."
As deep Web search engines unearth the structures of large data sets and make those structures visible across organizations, they will create a powerful incentive for organizations to invest in more consistent, predictable structures (a trend already manifest in the growth of Web services and in Yahoo's search quality guidelines). In exchange for the benefits of increased exposure, these organizations will yield another level of autonomy.
While government and academic institutions may generate the greatest volume of deep Web content, corporations undoubtedly generate the most monetary value in Web data: customer databases, product catalogs, technical knowledge bases and myriad other data sources with quantifiable business value.
Over the last decade, companies have invested heavily in Web infrastructure, including countless local search engines. While many companies already outsource their public Web site search functions to companies like Google, many also have developed specialized search engines for their own deep Web data, like technical support databases.
Those investments make plenty of sense when that data won't readily show up in a public Web search. But as deep Web searchers penetrate these gateways, will companies continue to see the value of investing in their own public interfaces?
In the near term, deep Web search engines will likely dampen company expenditures on local search initiatives. But in the longer term, the changes may prove more far-reaching. "The quality and ubiquity of Web search engines hides the fact that most organizations have really crappy search mechanisms," Bray says. "I think that's creating a tension within organizations."
As public search engines continue to supplant the role of organizations' own information-retrieval systems -- be they search databases, call centers or sales engineers -- once internal-facing systems will assume increasingly outward-facing roles. "When the ability to develop different messages for different audiences is curtailed by universal availability," says Gartner analyst Whit Andrews, "the nature of the message, its format and associated issues become paramount."
No one expects IT departments to go out of business, but the external pressures of deep Web search will almost certainly force long-term changes in the role, structure and autonomy of local IT organizations as they gradually lose direct control over customer transactions.
- - - - - - - - - - - -
Every search query is a unit of desire. Search companies, like all businesses, exist by transforming desire into hard currency. As deep Web search engines insinuate themselves into deeper and deeper levels of organizations, they will not only offload search traffic, they will trigger a series of massive disruptions in the information economy.
If you buy the Cluetrain maxim that "hyperlinks subvert hierarchy," then surely deep Web search engines will amplify that subversion. As search engines extend their reach deeper into and across organizations, the boundaries between those organizations will feel more fluid -- both to consumers and to the organizations themselves. The first thing most of us notice may be better search results.
Somewhere inside that complex apparatus of desire and fulfillment, a transformation is taking place, one whose effects we can barely foresee.

