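The release of Nutch 0.7 is exciting news for Lucene enthusiasts like us. Experimenting with Nutch 0.6 had already been a thrill.
Nutch 0.7 adds some features on top of 0.6, and when building a small search engine we can modify some of Nutch's interfaces to tailor it to our own needs.
This section covers Nutch's Analysis package, that is, package org.apache.nutch.analysis.
Leaving the spider aside, the basic architecture of a search engine can be sketched as follows.
The Analysis package deals mainly with tokenization. As the diagram shows, tokenization is needed in two places: when documents are added to the index, and when a user's query is processed.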
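The point that analysis happens in two places, once at indexing time and once at query time, can be illustrated with a toy inverted index. This is only a minimal sketch and not Nutch code: `TinyIndex`, `tokenize`, `add` and `search` are hypothetical names, and the lower-case-and-split tokenizer merely stands in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TinyIndex {
    // term -> ids of the documents that contain it (toy structure, an assumption)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Stand-in analyzer: lower-case the text and split on non-alphanumeric runs. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Indexing side: every document passes through tokenize(). */
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Query side: the query passes through the same tokenize(); terms are ANDed. */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : tokenize(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Nutch is built on Lucene");
        index.add(2, "Lucene analysis package");
        System.out.println(index.search("LUCENE"));          // [1, 2]
        System.out.println(index.search("lucene analysis")); // [2]
    }
}
```

Because both sides share one tokenizer, the upper-case query "LUCENE" still matches the lower-cased index terms. If the two sides were analyzed differently, such queries would silently miss.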
Nutch's Analysis package contains the following files (those highlighted in yellow are the ones listed in the Nutch API documentation):
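CharStream.java:
interface CharStream
* This interface describes a character stream that tracks the line and column positions of its characters. It can also back up the stream to some extent.
* An implementation of this interface is used by the TokenManager implementation that JavaCC generates.
*
* All methods except backup may be implemented in any way. backup must be implemented correctly for the lexer to operate correctly.
* The remaining methods only report information such as the line number, the column number, and the string that makes up a token; the lexer itself does not use them.
* Their implementation therefore does not affect the operation of the generated lexer.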
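CommonGrams.java
* Builds n-grams for frequently occurring terms and phrases at indexing time (I do not fully understand this part), and optimizes phrase queries to use those n-grams.
* Single terms are still indexed as well, with the n-grams overlaid on them.
FastCharStream.java
* An efficient implementation of the CharStream interface. Note that it does not count line numbers; instead it tracks the character position of each token in the input.
* This character position is what Lucene's Token API requires.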
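NutchAnalysis.java
Arguably the core of the Analysis package. It implements Nutch's lexical analyzer and query parser. For lexical analysis, this means removing stop words and then segmenting Chinese text, tokenizing English text, and so on.
NutchAnalysisConstants.java
An interface used mainly by NutchAnalysis and NutchAnalysisTokenManager. It contains only constants, such as the token kinds (colon, ellipsis, Arabic numerals, phrase, and so on) and tokenImage (the string form of each token kind).
NutchAnalysisTokenManager.java
Manages tokens on behalf of NutchAnalysis.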
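NutchDocumentAnalyzer.java
* NutchDocumentAnalyzer is the analyzer for Nutch documents. It uses a lexical analyzer defined with JavaCC
* ({@link NutchDocumentTokenizer}) and has no stop-word list, which keeps it consistent with query parsing.
It is relatively independent of NutchAnalysis.
NutchDocumentTokenizer.java
* This tokenizer splits the text of Nutch documents into tokens. It is implemented by a lexical analyzer generated with JavaCC.
It is used by NutchDocumentAnalyzer.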
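ParseException.java
* This exception is thrown when the generated query parser encounters a parse error.
* You can explicitly create objects of this exception type by calling the method generateParseException.
* You may modify this class to customize its error reporting, as long as you keep the public fields.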
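Token.java
The Token class holds a token's kind, the start and end positions of each token in the input string, and so on.
TokenManager.java
A very simple interface with a single method that returns the next Token. It is not used within the Analysis package and may have been kept for future extension.
TokenMgrError.java
Used mainly to report errors that occur during tokenization.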
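To make the roles of Token and TokenManager more concrete, here is a minimal, self-contained sketch. It does not reproduce the JavaCC-generated classes: `SimpleToken`, `SimpleTokenManager`, `WhitespaceTokenManager` and the `WORD`/`EOF` kinds are illustrative assumptions modeled only on the descriptions above (a token carries a kind plus start and end positions, and a token manager exposes one method that returns the next token).

```java
import java.util.ArrayList;
import java.util.List;

public class TokenManagerSketch {
    // Hypothetical token kinds; the real constants live in NutchAnalysisConstants.
    static final int EOF = 0;
    static final int WORD = 1;

    /** A token: its kind, its text, and its [start, end) offsets in the input. */
    static final class SimpleToken {
        final int kind;
        final String image;
        final int start;
        final int end;

        SimpleToken(int kind, String image, int start, int end) {
            this.kind = kind;
            this.image = image;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return image + "[" + start + "," + end + ")";
        }
    }

    /** One method only: hand back the next token. */
    interface SimpleTokenManager {
        SimpleToken getNextToken();
    }

    /** Toy manager that treats every whitespace-delimited run as a WORD. */
    static final class WhitespaceTokenManager implements SimpleTokenManager {
        private final String input;
        private int pos = 0;

        WhitespaceTokenManager(String input) {
            this.input = input;
        }

        public SimpleToken getNextToken() {
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++; // skip separators
            }
            if (pos == input.length()) {
                return new SimpleToken(EOF, "", pos, pos);
            }
            int start = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            return new SimpleToken(WORD, input.substring(start, pos), start, pos);
        }
    }

    /** Drains a token manager until it reports EOF. */
    static List<SimpleToken> tokenize(String text) {
        SimpleTokenManager tm = new WhitespaceTokenManager(text);
        List<SimpleToken> tokens = new ArrayList<>();
        for (SimpleToken t = tm.getNextToken(); t.kind != EOF; t = tm.getNextToken()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [nutch[0,5), search[7,13), engine[14,20)]
        System.out.println(tokenize("nutch  search engine"));
    }
}
```

The offsets are what a caller such as a parser or highlighter would use to map a token back to the original text.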
The main class hierarchy is as follows:
Class Hierarchy
Interface Hierarchy
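Nutch is built on top of Lucene. As the hierarchy shows, the two main classes, NutchDocumentAnalyzer and NutchDocumentTokenizer, both derive from Lucene types, so it is well worth studying Lucene's Analysis package carefully.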
CharStream.java
/* Generated By:JavaCC: Do not edit this line. CharStream.java Version 3.0 */
package org.apache.nutch.analysis;
/**
 * This interface describes a character stream that maintains line and
 * column number positions of the characters. It also has the capability
 * to backup the stream to some extent. An implementation of this
 * interface is used in the TokenManager implementation generated by
 * JavaCCParser.
 *
 * All the methods except backup can be implemented in any fashion. backup
 * needs to be implemented correctly for the correct operation of the lexer.
 * Rest of the methods are all used to get information like line number,
 * column number and the String that constitutes a token and are not used
 * by the lexer. Hence their implementation won't affect the generated lexer's
 * operation.
 */
interface CharStream {
  /**
   * Returns the next character from the selected input. The method
   * of selecting the input is the responsibility of the class
   * implementing this interface. Can throw any java.io.IOException.
   */
  char readChar() throws java.io.IOException;

  /**
   * Returns the column position of the character last read.
   * @deprecated
   * @see #getEndColumn
   */
  int getColumn();

  /**
   * Returns the line number of the character last read.
   * @deprecated
   * @see #getEndLine
   */
  int getLine();

  /**
   * Returns the column number of the last character for current token (being
   * matched after the last call to BeginToken).
   */
  int getEndColumn();

  /**
   * Returns the line number of the last character for current token (being
   * matched after the last call to BeginToken).
   */
  int getEndLine();

  /**
   * Returns the column number of the first character for current token (being
   * matched after the last call to BeginToken).
   */
  int getBeginColumn();

  /**
   * Returns the line number of the first character for current token (being
   * matched after the last call to BeginToken).
   */
  int getBeginLine();

  /**
   * Backs up the input stream by amount steps. Lexer calls this method if it
   * had already read some characters, but could not use them to match a
   * (longer) token. So, they will be used again as the prefix of the next
   * token and it is the implementation's responsibility to do this right.
   */
  void backup(int amount);

  /**
   * Returns the next character that marks the beginning of the next token.
   * All characters must remain in the buffer between two successive calls
   * to this method to implement backup correctly.
   */
  char BeginToken() throws java.io.IOException;

  /**
   * Returns a string made up of characters from the marked token beginning
   * to the current buffer position. Implementations have the choice of returning
   * anything that they want to. For example, for efficiency, one might decide
   * to just return null, which is a valid implementation.
   */
  String GetImage();

  /**
   * Returns an array of characters that make up the suffix of length 'len' for
   * the currently matched token. This is used to build up the matched string
   * for use in actions in the case of MORE. A simple and inefficient
   * implementation of this is as follows :
   *
   * {
   *   String t = GetImage();
   *   return t.substring(t.length() - len, t.length()).toCharArray();
   * }
   */
  char[] GetSuffix(int len);

  /**
   * The lexer calls this function to indicate that it is done with the stream
   * and hence implementations can free any resources held by this class.
   * Again, the body of this function can be just empty and it will not
   * affect the lexer's operation.
   */
  void Done();
}
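The contract between BeginToken, readChar, backup and GetImage is easier to see in a tiny implementation. The following is a minimal sketch under stated assumptions, not the generated code: the whole input is held in a String, `MiniCharStream` is a cut-down local copy of the interface with only four methods so that the example compiles on its own, and `StringCharStream` and `nextWord` are hypothetical names. End of input is signalled with an IOException, which is an assumption of this sketch.

```java
import java.io.IOException;

public class CharStreamSketch {
    /** Cut-down local copy of CharStream: only the methods this sketch needs. */
    interface MiniCharStream {
        char readChar() throws IOException;
        void backup(int amount);
        char BeginToken() throws IOException;
        String GetImage();
    }

    /** In-memory stream; no line or column counting at all. */
    static final class StringCharStream implements MiniCharStream {
        private final String input;
        private int position = 0;   // index of the next char to read
        private int tokenStart = 0; // index where the current token began

        StringCharStream(String input) {
            this.input = input;
        }

        public char readChar() throws IOException {
            if (position >= input.length()) {
                throw new IOException("read past end of input");
            }
            return input.charAt(position++);
        }

        public void backup(int amount) {
            position -= amount; // un-read the last 'amount' characters
        }

        public char BeginToken() throws IOException {
            tokenStart = position; // mark, then read the first char of the token
            return readChar();
        }

        public String GetImage() {
            return input.substring(tokenStart, position);
        }
    }

    /** Reads one run of letters the way a lexer would: overshoot by one, then back up. */
    static String nextWord(MiniCharStream in) throws IOException {
        char c = in.BeginToken();
        while (!Character.isLetter(c)) {
            c = in.BeginToken(); // skip separators, restarting the token each time
        }
        try {
            while (Character.isLetter(c)) {
                c = in.readChar();
            }
            in.backup(1); // the last char read does not belong to this word
        } catch (IOException endOfInput) {
            // input ended inside the word, so the image is already complete
        }
        return in.GetImage();
    }

    public static void main(String[] args) throws IOException {
        MiniCharStream in = new StringCharStream("ab cd");
        System.out.println(nextWord(in)); // ab
        System.out.println(nextWord(in)); // cd
    }
}
```

The character that ends the first word (the space) is read once, pushed back with backup(1), and then read again as the start of the next token. That is why backup is the one method whose correctness the lexer depends on.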
CommonGrams.java
package org.apache.nutch.analysis;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Token;
import java.io.*;
import java.util.*;
import java.util.logging.Logger;
import org.apache.nutch.util.*;
import org.apache.nutch.searcher.Query.*;
/** Construct n-grams for frequently occurring terms and phrases while indexing.
 * Optimize phrase queries to use the n-grams. Single terms are still indexed
 * too, with n-grams overlaid. This is achieved through the use of {@link
 * Token#setPositionIncrement(int)}.*/
public class CommonGrams {
  private static final Logger LOG =
    LogFormatter.getLogger("org.apache.nutch.analysis.CommonGrams");
  private static final char SEPARATOR = '-';
  private static final HashMap COMMON_TERMS = new HashMap();

  static { init(); }

  private CommonGrams() {} // no public ctor
  private static class Filter extends TokenFilter {
    private HashSet common;
    private Token previous;
    private LinkedList gramQueue = new LinkedList();
    private LinkedList nextQueue = new LinkedList();
    private StringBuffer buffer = new StringBuffer();

    /** Construct an n-gram producing filter. */
    public Filter(TokenStream input, HashSet common) {
      super(input);
      this.common = common;
    }

    /** Inserts n-grams into a token stream. */
    public Token next() throws IOException {
      if (gramQueue.size() != 0) // consume any queued tokens
        return (Token)gramQueue.removeFirst();
      final Token token = popNext();
      if (token == null)
        return null;
      if (!isCommon(token)) { // optimize simple case
        previous = token;
        return token;
      }
      gramQueue.add(token); // queue the token
      ListIterator i = nextQueue.listIterator();
      Token gram = token;
      while (isCommon(gram)) {
        if (previous != null && !isCommon(previous)) // queue prev gram first
          gramQueue.addFirst(gramToken(previous, gram));
        Token next = peekNext(i);
        if (next == null)
          break;
        gram = gramToken(gram, next); // queue next gram last
        gramQueue.addLast(gram);
      }
      previous = token;
      return (Token)gramQueue.removeFirst();
    }

    /** True iff token is for a common term. */
    private boolean isCommon(Token token) {
      return common != null && common.contains(token.termText());
    }

    /** Pops nextQueue or, if empty, reads a new token. */
    private Token popNext() throws IOException {
      if (nextQueue.size() > 0)
        return (Token)nextQueue.removeFirst();
      else
        return input.next();
    }

    /** Return next token in nextQueue, extending it when empty. */
    private Token peekNext(ListIterator i) throws IOException {
      if (!i.hasNext()) {
        Token next = input.next();
        if (next == null)
          return null;
        i.add(next);
        i.previous();
      }
      return (Token)i.next();
    }

    /** Construct a compound token. */
    private Token gramToken(Token first, Token second) {
      buffer.setLength(0);
      buffer.append(first.termText());
      buffer.append(SEPARATOR);
      buffer.append(second.termText());
      Token result = new Token(buffer.toString(),
                               first.startOffset(), second.endOffset(),
                               "gram");
      result.setPositionIncrement(0);
      return result;
    }
  }
  /** Construct using the provided config file. */
  private static void init() {
    try {
      Reader reader = NutchConf.get().getConfResourceAsReader
        (NutchConf.get().get("analysis.common.terms.file"));
      BufferedReader in = new BufferedReader(reader);
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.startsWith("#") || "".equals(line)) // skip comments
          continue;
        TokenStream ts = new NutchDocumentTokenizer(new StringReader(line));
        Token token = ts.next();
        if (token == null) {
          LOG.warning("Line does not contain a field name: " + line);
          continue;
        }
        String field = token.termText();
        token = ts.next();
        if (token == null) {
          LOG.warning("Line contains only a field name, no word: " + line);
          continue;
        }
        String gram = token.termText();
        while ((token = ts.next()) != null) {
          gram = gram + SEPARATOR + token.termText();
        }
        HashSet table = (HashSet)COMMON_TERMS.get(field);
        if (table == null) {
          table = new HashSet();
          COMMON_TERMS.put(field, table);
        }
        table.add(gram);
      }
    } catch (IOException e) {
      throw new RuntimeException(e.toString());
    }
  }

  /** Construct a token filter that inserts n-grams for common terms. For use
   * while indexing documents. */
  public static TokenFilter getFilter(TokenStream ts, String field) {
    return new Filter(ts, (HashSet)COMMON_TERMS.get(field));
  }
  /** Utility to convert an array of Query.Terms into a token stream. */
  private static class ArrayTokens extends TokenStream {
    private Term[] terms;
    private int index;

    public ArrayTokens(Phrase phrase) { this.terms = phrase.getTerms(); }

    public Token next() {
      if (index == terms.length)
        return null;
      else
        return new Token(terms[index].toString(), index, ++index);
    }
  }

  /** Optimizes phrase queries to use n-grams when possible. */
  public static String[] optimizePhrase(Phrase phrase, String field) {
    //LOG.info("Optimizing " + phrase + " for " + field);
    ArrayList result = new ArrayList();
    TokenStream ts = getFilter(new ArrayTokens(phrase), field);
    Token token, prev=null;
    int position = 0;
    try {
      while ((token = ts.next()) != null) {
        if (token.getPositionIncrement() != 0 && prev != null)
          result.add(prev.termText());
        prev = token;
        position += token.getPositionIncrement();
        if ((position + arity(token.termText())) == phrase.getTerms().length)
          break;
      }
    } catch (IOException e) {
      throw new RuntimeException(e.toString());
    }
    if (prev != null)
      result.add(prev.termText());
    // LOG.info("Optimized: ");
    // for (int i = 0; i < result.size(); i++) {
    //   LOG.info(result.get(i) + " ");
    // }
    return (String[])result.toArray(new String[result.size()]);
  }

  private static int arity(String gram) {
    int index = 0;
    int arity = 0;
    while ((index = gram.indexOf(SEPARATOR, index+1)) != -1) {
      arity++;
    }
    return arity;
  }
  /** For debugging. */
  public static void main(String[] args) throws Exception {
    StringBuffer text = new StringBuffer();
    for (int i = 0; i < args.length; i++) {
      text.append(args[i]);
      text.append(' ');
    }
    TokenStream ts =
      new NutchDocumentTokenizer(new StringReader(text.toString()));
    ts = getFilter(ts, "url");
    Token token;
    while ((token = ts.next()) != null) {
      System.out.println("Token: " + token);
    }
    String[] optimized = optimizePhrase(new Phrase(args), "url");
    System.out.print("Optimized: ");
    for (int i = 0; i < optimized.length; i++) {
      System.out.print(optimized[i] + " ");
    }
    System.out.println();
  }
}
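The idea of overlaying n-grams on the ordinary terms can be tried out without Lucene. The sketch below is a deliberate simplification and not the CommonGrams algorithm itself: it produces bigrams only, works on a plain list of strings, and models the position increment as an int field. `GramToken` and `overlayBigrams` are hypothetical names. As in the listing above, a gram is joined with '-' and carries a position increment of 0, so it sits at the same position as the term it starts with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonGramsSketch {
    static final char SEPARATOR = '-';

    /** A term plus its position increment (1 = next position, 0 = same position). */
    static final class GramToken {
        final String text;
        final int positionIncrement;

        GramToken(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    /**
     * Emits every term unchanged and, whenever a term or its successor is a
     * common term, also emits the bigram "term-successor" at the same position.
     */
    static List<GramToken> overlayBigrams(List<String> terms, Set<String> common) {
        List<GramToken> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            String term = terms.get(i);
            out.add(new GramToken(term, 1)); // the single term is still indexed
            if (i + 1 < terms.size()) {
                String next = terms.get(i + 1);
                if (common.contains(term) || common.contains(next)) {
                    out.add(new GramToken(term + SEPARATOR + next, 0)); // overlaid gram
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the"));
        List<String> terms = Arrays.asList("quick", "the", "fox");
        // prints [quick(+1), quick-the(+0), the(+1), the-fox(+0), fox(+1)]
        System.out.println(overlayBigrams(terms, common));
    }
}
```

With such an index, a phrase query for "the fox" can look up the single, much rarer term "the-fox" instead of intersecting the very long posting list of "the". That is the optimization optimizePhrase aims at.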
FastCharStream.java
package org.apache.nutch.analysis;
import java.io.*;
/** An efficient implementation of JavaCC's CharStream interface. <p>Note that
 * this does not do line-number counting, but instead keeps track of the
 * character position of the token in the input, as required by Lucene's {@link
 * org.apache.lucene.analysis.Token} API. */
final class FastCharStream implements CharStream {
  char[] buffer = null;

  int bufferLength = 0;   // end of valid chars
  int bufferPosition = 0; // next char to read

  int tokenStart = 0;     // offset in buffer
  int bufferStart = 0;    // position in file of buffer

  Reader input;           // source of chars

  /** Constructs from a Reader. */
  public FastCharStream(Reader r) {
    input = r;
  }

  public final char readChar() throws IOException {
    if (bufferPosition >= bufferLength)
      refill();
    return buffer[bufferPosition++];
  }

  private final void refill() throws IOException {
    int newPosition = bufferLength - tokenStart;

    if (tokenStart == 0) { // token won't fit in buffer
      if (buffer == null) { // first time: alloc buffer
        buffer = new char[2048];
      } else if (bufferLength == buffer.length) { // grow buffer
        char[] newBuffer = new char[buffer.length*2];
        System.arraycopy(buffer, 0, newBuffer, 0, bufferLength);
        buffer = newBuffer;
      }
    } else { // shift token to front
      System.arraycopy(buffer, tokenStart, buffer, 0, newPosition);
    }

    bufferLength = newPosition; // update state
    bufferPosition = newPosition;
    bufferStart += tokenStart;