A High Performance Implementation of an OAI-Based Federation

Online information is growing exponentially, and digital libraries are playing a key role in managing this information explosion. A wide variety of digital libraries [9] exist today, differing in the type of information they manage. At one end of the spectrum are digital libraries that manage unstructured information such as Web pages, popularly referred to as search engines [10][12][4]; at the other end are digital libraries that manage structured information. These digital libraries differ in the services they provide to end users and in the collections they hold. Google [3] is an example of a search engine that harvests Web pages and provides a discovery service through keyword search. The ACM Digital Library [1] is an example of a digital library that stores refereed conference proceedings and journal articles along with metadata fields such as authors and title. The ACM library provides discovery services over the various stored metadata fields.

A number of digital libraries managing structured information exist today. However, there is no federated service (comparable to Google for Web sites) that provides a unified interface to all these libraries, which we believe is necessary for faster dissemination. The biggest obstacle to building a federated service is that many digital libraries use different, non-interoperable technologies. One major effort that addresses interoperability is the Open Archives Initiative (OAI) framework [11], which facilitates the discovery of content stored in distributed archives. The OAI framework supports two roles: data providers (archives) and service providers. Service providers develop value-added services based on the information collected from cooperating archives. These value-added services could take the form of a federated search engine like Arc [6]. A typical data provider would be a digital library with its own set of publishing tools and policies, under no constraints on how it implements its services. The major addition is a layer that exposes its metadata (e.g., fields such as creator and title) in a well-specified format. Normally, one of the fields is a link to the actual document in its collection.
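To make the data provider layer concrete, the following is a minimal sketch of how a service provider might read metadata fields out of an OAI-PMH ListRecords response that uses unqualified Dublin Core. The XML namespaces are those defined by OAI-PMH 2.0 and Dublin Core; the sample record, its identifier, its example.org link, and the function name parse_records are hypothetical and serve only as illustration. It is not the code used by Arc.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, abbreviated ListRecords response from a data provider.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2003-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Hypothetical Report</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:identifier>http://example.org/docs/1.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return one dict per record: its OAI identifier and its DC fields."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        oai_id = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        fields = {}
        if dc is not None:
            for child in dc:
                # Strip the "{namespace}" prefix to get the bare field name.
                name = child.tag.split("}", 1)[-1]
                # Dublin Core fields are repeatable, so collect values in lists.
                fields.setdefault(name, []).append((child.text or "").strip())
        records.append({"id": oai_id, "fields": fields})
    return records


if __name__ == "__main__":
    for record in parse_records(SAMPLE_RESPONSE):
        print(record["id"], record["fields"])
```

Here the dc:identifier field plays the role of the link to the actual document; a real harvester would fetch such responses over HTTP and follow resumption tokens, which this sketch omits.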

Assuming that adoption of the OAI Protocol for Metadata Harvesting (OAI-PMH) increases rapidly (e.g., by several orders of magnitude), we face a different problem: how to efficiently discover, harvest, and index the burgeoning OAI-PMH corpus. Currently our research group at Old Dominion University provides a federation service, Arc, pro bono publico. Since harvesting, indexing, and searching all run on the same server, performance is becoming a bottleneck and reliability is low. We are working on a project to improve performance and reliability by exploiting parallelism at all levels: harvesting, indexing, and searching. In another paper, we discussed how we use Grid technology to parallelize harvesting and improve performance on that part of the system [8]. In this paper, we focus on how a cluster of PCs can be used to improve indexing and searching performance.
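The general idea of parallel indexing and searching can be illustrated with a document-partitioned inverted index: each record is assigned to one partition, and a query is sent to all partitions at once and the partial results are merged. The sketch below is an assumption-laden illustration, not the design described in this paper: the class names Shard and PartitionedIndex are hypothetical, partitions are chosen by a CRC32 hash of the record identifier, and a thread pool on one machine stands in for the nodes of a PC cluster.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """One index partition; stands in for a single cluster node."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of record ids

    def index(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return set(self.postings.get(term.lower(), ()))


class PartitionedIndex:
    """Routes each record to one shard and fans queries out to all shards."""

    def __init__(self, num_shards):
        # num_shards is assumed to be at least 1.
        self.shards = [Shard() for _ in range(num_shards)]

    def _shard_for(self, doc_id):
        # Stable hash, so a record always lands on the same shard.
        slot = zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)
        return self.shards[slot]

    def index(self, doc_id, text):
        self._shard_for(doc_id).index(doc_id, text)

    def search(self, term):
        # Query every shard concurrently, then merge the partial results.
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            partial = pool.map(lambda shard: shard.search(term), self.shards)
            return sorted(set().union(*partial))


if __name__ == "__main__":
    idx = PartitionedIndex(num_shards=3)
    idx.index("oai:a:1", "Digital library federation")
    idx.index("oai:b:2", "Federation of archives")
    idx.index("oai:c:3", "Grid harvesting")
    print(idx.search("federation"))
```

Because each record lives in exactly one partition, indexing work is spread across partitions and no partition has to scan the whole corpus at query time; the cost is that every query must visit every partition.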
