Multilingual Retrieval Experiments with MIMOR at the Univers

1 Introduction

In the CLEF 2002 campaign, we tested an adaptive fusion system based on the MIMOR model within the GIRT track (Hackl et al. 2002). For CLEF 2003, we applied the same model to multilingual retrieval with four languages. We chose English as our source language because most web-based translation services offer translation to and/or from English. Our experiments were carried out fully automatically.

2 Fusion in Information Retrieval

Fusion in information retrieval delegates a task to several retrieval engines and considers all the results they return. The individual result lists are then combined into one final result. Fusion is motivated by the observation that many retrieval systems reach comparable quality, yet the overlap between their ranked lists is often low (Womser-Hacker 1997). The retrieval status values (RSV) can be combined by taking the sum, the minimum or the maximum of the values from the individual systems. Linear combinations assign each method a weight that determines its influence on the final result. These weights may be improved, for example, by heuristic optimization or learning methods (Vogt & Cottrell 1998). There has been considerable interest in fusion algorithms in several areas of information retrieval. In web information retrieval, for example, link analysis assigns an overall quality value to each page based mainly on the number of links that point to it (Henzinger 2000). This quality measure needs to be fused with the retrieval ranking based on the document’s content (e.g. Plachouras & Ounis 2002). Fusion is also investigated in image retrieval, where it combines evidence from different representations such as color, texture and shape. In XML retrieval, fusion is necessary to combine the ranks assigned to a document by structural analysis and by content analysis (Fuhr & Großjohann 2001).
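The combination rules named above (sum, minimum, maximum and a weighted linear combination of RSVs) can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments. The function names, the min-max normalization step, the treatment of a missing document as an RSV of 0 and the example scores are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]  # document id -> retrieval status value (RSV)


def min_max_normalize(scores: Scores) -> Scores:
    """Rescale RSVs to [0, 1] so that engines with different scales are comparable.

    Normalization is an assumption of this sketch, not a step taken from the text.
    """
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(runs: List[Scores], combine: Callable[[List[float]], float]) -> Scores:
    """Combine per-engine RSVs with sum, min or max.

    A document that an engine did not return is treated as RSV 0 (assumption).
    """
    docs = set().union(*runs)
    return {doc: combine([run.get(doc, 0.0) for run in runs]) for doc in docs}


def linear_fuse(runs: List[Scores], weights: List[float]) -> Scores:
    """Weighted linear combination: each engine's weight scales its influence."""
    docs = set().union(*runs)
    return {
        doc: sum(w * run.get(doc, 0.0) for w, run in zip(weights, runs))
        for doc in docs
    }


# Invented example scores from two hypothetical engines.
engine_a = min_max_normalize({"d1": 12.0, "d2": 7.0, "d3": 2.0})
engine_b = min_max_normalize({"d2": 0.9, "d3": 0.5, "d4": 0.1})

by_sum = fuse([engine_a, engine_b], sum)
by_min = fuse([engine_a, engine_b], min)
by_max = fuse([engine_a, engine_b], max)
by_weights = linear_fuse([engine_a, engine_b], [0.7, 0.3])

# Final ranking: documents sorted by fused RSV, best first.
ranking = sorted(by_weights, key=by_weights.get, reverse=True)
```

In this sketch the sum rule rewards documents found by both engines, whereas the minimum rule penalizes any document that one engine missed.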

3 MIMOR as Fusion Framework

MIMOR (Multiple Indexing and Method-Object Relations) is a learning approach to the fusion task. It builds on findings from information retrieval research showing that the overlap between the results of different systems is often small (Womser-Hacker 1997, Mandl & Womser-Hacker 2001). Furthermore, relevance feedback is considered a very promising strategy for improving retrieval quality. Consequently, MIMOR optimizes the linear combination of different results by learning from relevance feedback. MIMOR is an information retrieval system that manages poly-representation of queries and documents by selecting appropriate methods for indexing and matching (Mandl & Womser-Hacker 2001). By considering user feedback about the relevance of documents, the model learns and adapts itself by assigning weights to the different basic retrieval engines. MIMOR can also be individualized: it could train an optimization of the fusion for an individual user or a group. Such personalization is difficult to evaluate within evaluation initiatives, however, because these rely on a standardized notion of relevance.
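The text does not state the update rule by which the engine weights are learned. The following Python sketch therefore shows one simple, hypothetical rule: an engine's weight grows when it gave a high normalized RSV to a document judged relevant, shrinks when the document was judged non-relevant, and the weights are renormalized afterwards. The function name, the learning rate and the example values are assumptions and should not be read as the actual MIMOR algorithm.

```python
from typing import List


def update_weights(
    weights: List[float],
    engine_scores: List[float],
    relevant: bool,
    learning_rate: float = 0.1,
) -> List[float]:
    """Adapt engine weights from one relevance judgment (hypothetical rule).

    weights        -- current weight of each basic retrieval engine
    engine_scores  -- normalized RSV each engine assigned to the judged document
    relevant       -- the relevance feedback for that document
    """
    sign = 1.0 if relevant else -1.0
    # Move each weight in proportion to the engine's score; never below zero.
    updated = [
        max(0.0, w + sign * learning_rate * s)
        for w, s in zip(weights, engine_scores)
    ]
    total = sum(updated)
    if total == 0.0:
        # Fall back to uniform weights if every weight was driven to zero.
        return [1.0 / len(updated)] * len(updated)
    # Renormalize so that the weights sum to 1.
    return [w / total for w in updated]


# Two engines start with equal weights; engine 0 scored the judged document highly.
start = [0.5, 0.5]
after_relevant = update_weights(start, [0.9, 0.1], relevant=True)
after_non_relevant = update_weights(start, [0.9, 0.1], relevant=False)
```

After positive feedback the engine that ranked the document highly gains influence. After negative feedback it loses influence relative to the other engine.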

4 CLEF Retrieval Experiments with MIMOR

The tools we employed this year include Lucene 1.3, MySQL 4.0.12 and Java-based Snowball analyzers. Most of the data pre-processing was carried out by Perl scripts. In a first step, customized Snowball stemmers were used to stem the collections, and stopwords were eliminated. Then the collections were indexed by Lucene and MySQL. Lucene needed less than half the time that MySQL needed to index the collections of 1321 MB. A second step involved translating the English topics into French, German and Spanish. The translation was carried out with the free internet services FreeTranslation, Reverso and Linguatec. The decision to select these tools was based on a heuristic evaluation of several services, in which the queries of CLEF 2001 were used to gather data for a comparison of the translations. Examining the different translations, it became apparent that the quality of the machine translations was not satisfactory. At the same time, the translation systems usually exhibited different weaknesses. We therefore decided to use more than one translation system and merge the results. We chose the tools that performed best in our evaluation and whose results clearly differed from one another. The translated topics were also stemmed with Snowball, and stopwords were removed. The translated and processed queries for each language were then merged by joining the three translations and eliminating duplicates. We did not try to identify any phrases.
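The final merging step, joining the three processed translations of a topic and eliminating duplicate terms, can be sketched as follows. This is a minimal illustration in Python rather than the Perl scripts mentioned above. The function name and the example terms are invented, and the sketch assumes that stemming and stopword removal have already been applied.

```python
from typing import Iterable, List


def merge_translations(translations: Iterable[List[str]]) -> List[str]:
    """Join several translated queries into one term list without duplicates.

    Terms keep the order of their first occurrence. No phrases are detected,
    so every term is handled independently.
    """
    seen = set()
    merged = []
    for terms in translations:
        for term in terms:
            if term not in seen:
                seen.add(term)
                merged.append(term)
    return merged


# Invented, already stemmed German terms standing in for the output of
# three translation services for the same English topic.
service_1 = ["olpreis", "entwickl"]
service_2 = ["erdolpreis", "entwickl"]
service_3 = ["olpreis", "anstieg"]

merged_query = merge_translations([service_1, service_2, service_3])
```

Because duplicates are dropped, a term on which all three services agree does not receive extra weight in the merged query. Only the variants that differ between the services extend it.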

 
