An efficient wrapper generation in DIMS

Given the rapid growth and success of publishing information sources on the Web, it is increasingly attractive to wrap information from distributed, autonomous and heterogeneous Web data sources. Wrapping structured data from the Web is not a trivial task. Most of the information on the Web today is in the form of Hypertext Markup Language (HTML) documents, which are viewed in a browser. HTML documents are sometimes written by hand and sometimes with the aid of HTML tools, and their form is designed for presentation purposes, not automated extraction. Some Web content is becoming available in formats more suitable for automated processing, in particular the Extensible Markup Language (XML) [2]. Despite being a relatively new development, XML has become essential for enabling data interchange between otherwise incompatible systems.

Besides HTML and XML data sources, there are other types of data sources, such as the many different database sources and text documents. With the rapid growth of these types of data sources on the Internet, significant attention has been paid to integrating large volumes of heterogeneous information to build new unified Web applications. The key technical problem in these applications is how to homogenize the different structures and semantics of the information provided by the enterprise or the Internet. The need to access, retrieve, and manage information from a variety of sources and applications using different data models, representations and interfaces has created a great demand for tools that support data and systems integration. Integrating complementary information from the popular Web and the rapidly growing Internet, and thereby providing useful services to the large number of Web users, has become a very important and recent research subject.

We have developed DIMS, a mediator-wrapper system, to solve this problem. DIMS adopts XML as its data model and provides metadata management to ease the maintenance of distributed, autonomous and heterogeneous data sources. The DIMS Mediator provides unified XML views of sources containing collections of disparate heterogeneous information, which can be queried through users' queries. It takes into account the capabilities of each data source, to avoid sending non-executable sub-queries to sources. Each data source is encapsulated within a related wrapper. Ideally, a wrapper can provide mapping functionalities as XML views to achieve local mappings of data and metadata.
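
The mediator-wrapper arrangement described above can be illustrated with a minimal Python sketch. It is not the DIMS implementation: the class names (`SubQuery`, `Wrapper`, `Mediator`), the operator names (`eq`, `contains`) and the in-memory records are all hypothetical, chosen only to show how a mediator might consult each wrapper's declared capabilities before dispatching a sub-query and then merge the results into one XML view.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class SubQuery:
    """Hypothetical sub-query: select records where `attr` `op` `value`."""
    attr: str
    op: str       # "eq" or "contains" in this sketch
    value: str


@dataclass
class Wrapper:
    """Encapsulates one data source and exposes matching records as XML."""
    name: str
    records: list                       # list of dicts standing in for the source
    supported_ops: set = field(default_factory=lambda: {"eq"})

    def can_execute(self, q: SubQuery) -> bool:
        # Capability check: the source must support the operator
        # and expose the queried attribute in at least one record.
        return q.op in self.supported_ops and any(q.attr in r for r in self.records)

    def execute(self, q: SubQuery) -> ET.Element:
        root = ET.Element("source", name=self.name)
        for rec in self.records:
            val = rec.get(q.attr, "")
            hit = (val == q.value) if q.op == "eq" else (q.value in val)
            if hit:
                item = ET.SubElement(root, "record")
                for key, text in rec.items():
                    ET.SubElement(item, key).text = text
        return root


class Mediator:
    """Builds a unified XML view from the wrappers able to run the query."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, q: SubQuery) -> ET.Element:
        view = ET.Element("view")
        for w in self.wrappers:
            if w.can_execute(q):        # skip sources that cannot execute q
                view.append(w.execute(q))
        return view


if __name__ == "__main__":
    html_src = Wrapper("html_catalog", [{"title": "XML Basics"}], {"eq"})
    db_src = Wrapper(
        "db_catalog",
        [{"title": "XML in Practice"}, {"title": "HTML Primer"}],
        {"eq", "contains"},
    )
    mediator = Mediator([html_src, db_src])
    view = mediator.query(SubQuery("title", "contains", "XML"))
    print(ET.tostring(view, encoding="unicode"))
```

In this sketch the `html_catalog` source declares only equality matching, so the `contains` sub-query is never sent to it; only `db_catalog` contributes to the resulting view. A real system would presumably describe capabilities in metadata rather than in a hard-coded set.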

In this paper, we focus on system-oriented issues in Web data wrapping and describe our approach for building a dependable wrapping process. Our ideas are manifested in DIMS, an XML-based information integration system, and we have implemented a prototype to demonstrate important properties of the presented approach.

 
