about Lucene ppt

Prelude

  • my background..
  • please interrupt with questions
  • blog this talk now so that we can search for it later
    (using a Lucene-based blog search engine, of course)
  • In this course, Paolo and Antonio have presented many techniques.
  • I present real software that uses many of these techniques.

Lucene is


Lucene Architecture

数据挖掘研究院
[draw on whiteboard for reference throughout talk]
数据挖掘研究院


Lucene API

  • org.apache.lucene.document
  • org.apache.lucene.analysis
  • org.apache.lucene.index
  • org.apache.lucene.search


数据挖掘工具


Package: org.apache.lucene.document

  • A Document is a sequence of Fields.
  • A Field is a <name, value> pair.
    • name is the name of the field, e.g., title, body, subject, date, etc.
    • value is text.
  • Field values may be stored, indexed or analyzed (and, now, vectored).

Example

 public Document makeDocument(File f) throws FileNotFoundException { 

数据挖掘研究院


Document doc = new Document();
doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));

doc.add(new Field("modified",
DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
Store.YES, Index.UN_TOKENIZED));

// Reader implies Store.NO and Index.TOKENIZED
doc.add(new Field("contents", new FileReader(f)));

return doc;
} 数据挖掘交友

 

Example (continued)

field
stored
indexed
analyzed
path
yes
yes
yes
modified 数据挖掘工具
yes
yes
no
content
no
yes
yes

数据挖掘实验室


Package: org.apache.lucene.analysis

  • An Analyzer is a TokenStream factory.
  • A TokenStream is an iterator over Tokens.
    • input is a character iterator (Reader)
  • A Token is tuple <text, type, start, length, positionIncrement>
    • text (e.g., “pisa”).
    • type (e.g., “word”, “sent”, “para”).
    • start & length offsets, in characters (e.g, <5,4>)
    • positionIncrement (normally 1)
  • standard TokenStream implementations are
    • Tokenizers, which divide characters into tokens and 数据挖掘研究院
    • TokenFilters, e.g., stop lists, stemmers, etc.

Example

public class ItalianAnalyzer extends Analyzer {
数据挖掘研究院

  private Set stopWords = 
StopFilter.makeStopSet(new String[] {"il", "la", "in"};

public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new WhitespaceTokenizer(reader);
 result = new LowerCaseFilter(result);
  result = new StopFilter(result, stopWords);
  result = new SnowballFilter(result, "Italian");
 return result;
 }
}
数据挖掘论坛

Package: org.apache.lucene.index

  • Term is <fieldName, text>
  • index maps Term → <df, <docNum, <position>* >*>
  • e.g., “content:pisa” → <2, <2, <14>>, <4, <2, 9>>>
  • new: term vectors!

Example

IndexWriter writer = new IndexWriter("index", new ItalianAnalyzer());
File[] files = directory.listFiles();
for (int i = 0; i < files.length; i++) {
writer.addDocument(makeDocument(files[i]));
}
writer.close();
数据挖掘工具

Some Inverted Index Strategies

  1. batch-based: use file-sorting algorithms (textbook)
      + fastest to build
      + fastest to search
      - slow to update
  2. b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)
      + fast to search
      - update/build does not scale
      - complex implementation
  3. segment based: lots of small indexes (Verity)
      + fast to build
      + fast to update
      - slower to search

Lucene′s Index Algorithm

  • two basic algorithms:
    1. make an index for a single document
    2. merge a set of indices
  • incremental algorithm:
    • maintain a stack of segment indices
    • create index for each incoming document
    • push new indexes onto the stack
    • let b=10 be the merge factor; M=∞ 数据挖掘工具
      for (size = 1; size < M; size *= b) {
        if (there are b indexes with size docs on top of the stack) {
         pop them off the stack;
         merge them into a single index;
         push the merged index onto the stack;
       } else {
         break;
       }
      }
  • optimization: single-doc indexes kept in RAM, saves system calls
  • notes:
    • average b*logb(N)/2 indexes
      • N=1M, b=2 gives just 20 indexes
      • fast to update and not too slow to search
    • batch indexing  w/ M=∞, merge all at end
    • segment indexing w/ M<∞

Indexing Diagram

  • b = 3
  • 11 documents indexed
  • stack has four indexes
  • grayed indexes have been deleted
  • 5 merges have occurred


数据挖掘实验室


Index Compression

For keys in Term -> ... map, use technique from Paolo′s slides:


For values in Term -> ... map, use technique from Paolo′s slides:

数据挖掘研究院



数据挖掘实验室

VInt Encoding Example 数据挖掘论坛

Value 数据挖掘研究院

First byte 数据挖掘研究院

Second byte

Third byte 数据挖掘交友

0 数据挖掘工具

00000000

数据挖掘交友


数据挖掘交友


数据挖掘论坛

1 数据挖掘工具

00000001


数据挖掘研究院


数据挖掘交友

2 数据挖掘论坛

00000010 数据挖掘交友



... 数据挖掘研究院



数据挖掘交友


数据挖掘交友

127

01111111



数据挖掘工具

128 数据挖掘工具

10000000

00000001 数据挖掘交友


129

数据挖掘实验室

10000001

00000001


130 数据挖掘实验室

10000010

00000001

数据挖掘论坛


数据挖掘实验室

... 数据挖掘研究院


数据挖掘论坛


数据挖掘工具


数据挖掘工具

16,383

数据挖掘论坛

11111111 数据挖掘工具

01111111 数据挖掘研究院


16,384 数据挖掘研究院

10000000 数据挖掘工具

10000000

00000001

16,385

数据挖掘交友

10000001

10000000

00000001

数据挖掘论坛

... 数据挖掘研究院


数据挖掘论坛


数据挖掘研究院


数据挖掘交友

数据挖掘工具

This provides compression while still being efficient to decode.

数据挖掘实验室


Package: org.apache.lucene.search

  • primitive queries:
    • TermQuery: match docs containing a Term
    • PhraseQuery: match docs w/ sequence of Terms
    • BooleanQuery: match docs matching other queries.
      e.g., +path:pisa +content:“Doug Cutting” -path:nutch
  • new: SpansQuery 数据挖掘论坛
  • derived queries:
    • PrefixQuery, WildcardQuery, etc.

Example

Query pisa = new TermQuery(new Term("content", "pisa"));
Query babel = new TermQuery(new Term("content", "babel")); 数据挖掘实验室

PhraseQuery leaningTower = new PhraseQuery();
leaningTower.add(new Term("content", "leaning"));
leaningTower.add(new Term("content", "tower"));

BooleanQuery query = new BooleanQuery();
query.add(leaningTower, Occur.MUST);
query.add(pisa, Occur.SHOULD);
query.add(babel, Occur.MUST_NOT);

数据挖掘实验室


数据挖掘工具

Search Algorithms

From Paolo′s slides:

数据挖掘论坛


Lucene′s Disjunctive Search Algorithm

  • described in http://lucene.sf.net/papers/riao97.ps
  • since all postings must be processed
    • goal is to minimize per-posting computation
  • merges postings through a fixed-size array of accumulator buckets
  • performs boolean logic with bit masks
  • scales well with large queries

[ draw a diagram to illustrate? ]


Lucene′s Conjunctive Search Algorithm

From Paolo′s slides:


Algorithm
数据挖掘论坛

  • use linked list of pointers to doc list
  • initially sorted by doc
  • loop
    • if all are at same doc, record hit
    • skip first to-or-past last and move to end of list

Scoring

From Paolo′s slides:

Is very much like Lucene′s Similarity.
数据挖掘工具


Lucene′s Phrase Scoring

  • approximate phrase IDF with sum of terms
  • compute actual tf of phrase
  • slop penalizes slight mismatches by edit-distance

Thanks!

And there′s lots more to Lucene.
Check out http://jakarta.apache.org/lucene/.

数据挖掘论坛

[数据挖掘专家] [数据挖掘研究院] [数据挖掘论坛] [数据挖掘实验室]
上一篇:全文检索系统与Lucene简介
下一篇:Nutch中Analysis包下的NutchAnalysis.jj解读
最新评论共有 0 位网友发表了评论 , 查看所有评论
发表评论( 不能超过250字,需审核,请自觉遵守互联网相关政策法规。 )
匿名?
数据挖掘网站导航 数据挖掘论坛导航
  • 数据挖掘工具
  • 数据挖掘论坛
  • DataCruncher - Cognos
  • MineSet - MathSoft
  • Intelligent Miner - GainSmarts
  • Sqlserver - SAS - Clementine
  • CART - Weka - WizSoft
  • NeuroShell - ModelQuest
  • data mining tools - Darwin
  • 数据挖掘交友
  • 数据挖掘博客
  • 数据挖掘工具
  • 数据挖掘资源
  • 数据挖掘技术算法
  • 数据挖掘相关期刊、会议
  • 研究院联盟合作专区
  • 数据挖掘基础与相关技术
  • 数据挖掘厂商与就业
  • 数据挖掘研究者乐园
  • 知名厂商数据挖掘工具资料
  • 国内数据挖掘实验室
  • Foreign Data Mining Lab
  • 热点关注
  • Larbin网站爬虫简明使用说明
  • 全文检索引擎Lucene源码分析-analysis包
  • Nutch爬虫工作流程及文件格式详细分析
  • Lucene 基础指南(Java版)
  • 关于lucene 结构及内层的研究(一)
  • 实现NUTCH中文分词的代码修改方法
  • 利用Lucene搜索Java源代码
  • Lucene In Action ch 5 笔记 --高级搜索技
  • 第三节 Lucene索引文件格式分析
  • 如何使用Lucene进行全文检索(一)
  • 论坛最新话题
  • Foundations of Statistical Natural Langu
  • Game Theory meet Data Mining: A Recent P
  • System Building: How does it help or hin
  • 数据挖掘与Clementine培训
  • 新手报到
  • 求 SASEM 客户流失预测分析
  • 数据挖掘工程师/搜索研究院—北京——无线
  • 数据挖掘入门介绍(如何着手数据挖掘)
  • Information Overload Survey Results
  • The INEX 2005 Workshop on Element Retrie
  • 相关资讯
  • 什么是luncene
  • 什么是nutch
  • 让Nutch支持中文分词
  • 关于lucene 结构及内层的研究(一)
  • Lucene In Action ch 5 笔记 --高级搜索技
  • 第三节 Lucene索引文件格式分析
  • 第二节 Lucene系统结构分析
  • 第一节 全文检索系统与Lucene简介
  • Lucene的查询语法!
  • 第四节 Lucene索引构建逻辑模块分析
  • 数据挖掘实验室资料
  • 数据挖掘博客地址
  • 数据挖掘实验室网站地址
  • Prepare for Medicare audits by using dat
  • 注册成为SAS用户与爱好者俱乐部会员
  • 水南梅
  • 明日烟
  • 新人报道
  • 下载
  • 厦门服务器托管,450元/月—0592-5177319 高
  • 买空间送域名--0592-5177319 高静