We are seeing an exponential growth of online information, and digital libraries are playing a key role in managing this information explosion. A wide variety of digital libraries [9] exist today, distinguished by the type of information they manage. At one end of the spectrum are digital libraries managing unstructured information such as Web pages, popularly referred to as search engines [10][12][4]; at the other end are digital libraries managing structured information. These digital libraries differ in the services they provide to end users and in the collections they hold. Google [3] is an example of a search engine that harvests Web pages and provides a discovery service based on keyword search. The ACM Digital Library [1] is an example of a digital library that stores refereed conference proceedings and journal articles along with metadata fields such as authors and title. The ACM library provides discovery services over the stored metadata fields.
A number of digital libraries managing structured information exist today. However, there is no federated service (like Google for Web sites) that provides a unified interface to all these libraries, which we believe is necessary for faster dissemination of information. The biggest obstacle to building a federated service is that many digital libraries use different, non-interoperable technologies. One major effort that addresses interoperability is the Open Archives Initiative (OAI) framework [11], which facilitates the discovery of content stored in distributed archives. The OAI framework supports data providers (archives) and service providers. Service providers develop value-added services based on the information collected from cooperating archives. These value-added services could take the form of a federated search engine like Arc [6]. A typical data provider is a digital library that remains free to implement its services as it chooses, with its own set of publishing tools and policies. The major addition is a layer that exposes its metadata (e.g., fields such as creator and title) in a well-specified format. Normally, one of the fields is a link to the actual document in its collection.
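As an illustration of the metadata layer described above, the following minimal sketch shows how a service provider might extract creator, title, and document-link fields from a harvested response. It assumes the response follows the OAI-PMH 2.0 ListRecords layout with unqualified Dublin Core metadata; the sample record, the example.org identifiers, and the function name parse_records are made up for illustration and are not taken from Arc or from any real archive.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up response for illustration only; not harvested from a real archive.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2002-01-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Report</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:identifier>http://example.org/docs/1.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Return one dict per record that carries Dublin Core metadata."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:
            # Records without a metadata section (e.g. deleted ones) are skipped.
            continue
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": [e.text for e in dc.findall("dc:title", NS)],
            "creator": [e.text for e in dc.findall("dc:creator", NS)],
            # By convention, dc:identifier often holds the link to the document.
            "link": [e.text for e in dc.findall("dc:identifier", NS)],
        })
    return records


if __name__ == "__main__":
    for record in parse_records(SAMPLE_RESPONSE):
        print(record)
```

Each Dublin Core field is kept as a list because the format allows an element to repeat, for example when a paper has several creators.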
Assuming that a rapid increase (e.g., several orders of magnitude) in the adoption of the OAI Protocol for Metadata Harvesting (OAI-PMH) occurs, we face a different problem: how to efficiently discover, harvest, and index the burgeoning OAI-PMH corpus. Currently our research group at Old Dominion University provides a federation service, Arc, pro bono publico. Because harvesting, indexing, and searching all run on the same server, performance is becoming a bottleneck and reliability is low. We are working on a project to improve performance and reliability by exploiting parallelism at all levels: harvesting, indexing, and searching. In another paper, we discussed how we use Grid technology to parallelize harvesting and improve performance on that part of the system [8]. In this paper, we focus on how a cluster of PCs can be used to improve indexing and searching performance.
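To make the idea of parallel indexing and searching concrete, the following is a minimal sketch of one common approach, a document-partitioned index with fan-out search. It is not the design described in this paper: the Shard class, the hash-based assignment of records to partitions, and the use of threads in a single process as stand-ins for cluster nodes are all assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """One index partition; in a cluster each would live on its own PC."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of record ids

    def index(self, rec_id, text):
        for term in text.lower().split():
            self.postings[term].add(rec_id)

    def search(self, term):
        return self.postings.get(term.lower(), set())


def shard_for(rec_id, n_shards):
    # A stable hash keeps the assignment of records to shards reproducible.
    return zlib.crc32(rec_id.encode("utf-8")) % n_shards


def build_index(records, n_shards):
    """Distribute (id, text) pairs over n_shards independent partitions."""
    shards = [Shard() for _ in range(n_shards)]
    for rec_id, text in records:
        shards[shard_for(rec_id, n_shards)].index(rec_id, text)
    return shards


def search(shards, term):
    """Send the query to every shard concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = list(pool.map(lambda s: s.search(term), shards))
    return sorted(set().union(*partial_results))


if __name__ == "__main__":
    # Hypothetical records; ids and titles are invented for this example.
    records = [
        ("oai:a:1", "Digital Library Search"),
        ("oai:b:2", "Grid harvesting"),
        ("oai:c:3", "digital archives"),
    ]
    shards = build_index(records, n_shards=3)
    print(search(shards, "digital"))
```

Because every record lives in exactly one partition, each shard can be built and queried independently, and a query only needs a final merge step. The threads here show the control flow only; they do not demonstrate the speedup a real cluster would provide.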