Nutch Source Code Study Series, Part 1: The Analysis Package

Author: unknown    Date: 2004-12-06

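Nutch 0.7 is out, which is exciting news for those of us who are Lucene enthusiasts! Experimenting with Nutch 0.6 had already gotten me very excited.
Nutch 0.7 adds some features over 0.6. When building a small search engine, we can modify some of Nutch's interfaces to tailor it to our own needs.
This part covers Nutch's Analysis package, that is, package org.apache.nutch.analysis.
While reading this package you can consult the Nutch API documentation at http://lucene.apache.org/nutch/apidocs/index.html, or use a copy on your local disk.
Leaving the spider aside, the basic architecture of a search engine can be briefly described as follows:
[Figure: basic search engine architecture, excluding the spider; image not preserved]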

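The Analysis package mainly deals with tokenization. As the figure shows, tokenization is needed in two places: when documents are added to the index, and when a user's query is processed.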
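To make the two tokenization points concrete, here is a minimal sketch in which the same analysis step is applied both when a document is indexed and when a query is searched. Everything in it (the TinyIndex class, its analyze, add, and search methods, and the split-on-non-alphanumerics rule) is a hypothetical stand-in chosen for illustration; it is not how Nutch or Lucene implement analysis.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy index; none of these names come from Nutch or Lucene. */
class TinyIndex {
  private final Map<String, Set<Integer>> postings = new HashMap<>();

  /** Shared analysis step: lower-case, then split on anything that is not a letter or digit. */
  static List<String> analyze(String text) {
    List<String> terms = new ArrayList<>();
    for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
      if (!part.isEmpty()) {
        terms.add(part);
      }
    }
    return terms;
  }

  /** Index time: analyze the document text and record each term. */
  void add(int docId, String text) {
    for (String term : analyze(text)) {
      postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }
  }

  /** Query time: analyze the query the same way, then intersect the postings. */
  Set<Integer> search(String query) {
    Set<Integer> hits = null;
    for (String term : analyze(query)) {
      Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
      if (hits == null) {
        hits = new TreeSet<>(docs);
      } else {
        hits.retainAll(docs);
      }
    }
    if (hits == null) {
      return new TreeSet<>();
    }
    return hits;
  }

  public static void main(String[] args) {
    TinyIndex index = new TinyIndex();
    index.add(1, "Nutch is built on Lucene");
    index.add(2, "Lucene analysis package");
    System.out.println(index.search("LUCENE Analysis")); // prints [2]
  }
}
```

Because both sides go through analyze, the query "LUCENE Analysis" produces the same terms that were stored for document 2. If the query were analyzed differently from the documents, the terms would not line up and nothing would match.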
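Nutch's Analysis package contains the following files (those shown on a yellow background in the original page are the ones listed in the Nutch API documentation):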
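CharStream.java
     interface CharStream
     Describes a character stream that keeps track of the line and column positions of its characters and that can, to some extent, back up (rewind) the stream. An implementation of this interface is used by the TokenManager implementation that JavaCC generates.
     Every method except backup may be implemented in any fashion; backup must be implemented correctly for the lexer to operate correctly. The remaining methods only report information such as the line number, the column number, and the String that makes up a token. The lexer does not use them, so their implementation does not affect the operation of the generated lexer.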
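CommonGrams.java

     Builds n-grams for frequently occurring terms and phrases at indexing time, and optimizes phrase queries to use those n-grams. Single terms are still indexed as well, with the n-grams overlaid on them. (I do not fully understand this part.)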

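FastCharStream.java

     An efficient implementation of the CharStream interface. Note that it does not count line numbers; instead it keeps track of the character position of each token in the input, which is what the Lucene API requires.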
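NutchAnalysis.java

     Arguably the core of the Analysis package. It mainly implements Nutch's lexical analyzer and query parser. For lexical analysis specifically, this means removing stop words and then segmenting Chinese text, English text, and so on.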
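NutchAnalysisConstants.java

      An interface used mainly by NutchAnalysis and NutchAnalysisTokenManager. It contains only constants, such as the token kinds (colon, ellipsis, Arabic numerals, phrases, and so on) and TokenImage (the string form of each token kind).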

NutchAnalysisTokenManager.java
      Manages tokens on behalf of NutchAnalysis.
NutchDocumentAnalyzer.java

      The analyzer used for Nutch documents. It uses the JavaCC-defined lexical analyzer NutchDocumentTokenizer with no stop-word list, which keeps it consistent with query parsing.
      It is relatively independent of NutchAnalysis.
NutchDocumentTokenizer.java

      The tokenizer used for the text of Nutch documents. It is implemented on top of the JavaCC-generated lexical analyzer and is used by NutchDocumentAnalyzer.

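ParseException.java

        Thrown when an error is encountered while parsing a query. You can create objects of this exception type explicitly by calling the method generateParseException in the generated parser. As long as you keep the public fields, you can modify this class to customize its error reporting.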

Token.java

        The Token class holds a token's kind, its start and end positions in the input string, and so on.

TokenManager.java

       A very simple interface with a single method that returns the next Token. It is not used within the Analysis package; it may have been kept for future extension.

TokenMgrError.java

        Used mainly to report errors that occur during tokenization.
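The Token and TokenManager entries above can be pictured with a small sketch: a token carries a kind plus start and end offsets, and a token source exposes one method that hands out the next token. The classes SimpleToken and SimpleTokenManager, the kind labels WORD and NUMBER, and the splitting rule are all hypothetical and chosen only for illustration; the JavaCC-generated Nutch classes are considerably richer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token: a kind plus start/end offsets into the input. */
final class SimpleToken {
  final String kind;   // "WORD" or "NUMBER" (labels invented for this sketch)
  final String image;  // the matched text
  final int start;     // offset of the first character
  final int end;       // offset one past the last character

  SimpleToken(String kind, String image, int start, int end) {
    this.kind = kind;
    this.image = image;
    this.start = start;
    this.end = end;
  }

  @Override
  public String toString() {
    return kind + "(" + image + "," + start + "," + end + ")";
  }
}

/** Hypothetical token source with a single "give me the next token" method. */
final class SimpleTokenManager {
  private final String input;
  private int pos = 0;

  SimpleTokenManager(String input) {
    this.input = input;
  }

  /** Returns the next token, or null once the input is exhausted. */
  SimpleToken getNextToken() {
    // Skip separators (anything that is neither a letter nor a digit).
    while (pos < input.length() && !Character.isLetterOrDigit(input.charAt(pos))) {
      pos++;
    }
    if (pos >= input.length()) {
      return null;
    }
    int start = pos;
    boolean digits = Character.isDigit(input.charAt(pos));
    // Extend the token while characters stay in the same class (digits vs. letters).
    while (pos < input.length()
        && Character.isLetterOrDigit(input.charAt(pos))
        && Character.isDigit(input.charAt(pos)) == digits) {
      pos++;
    }
    return new SimpleToken(digits ? "NUMBER" : "WORD", input.substring(start, pos), start, pos);
  }

  /** Convenience helper: drain the manager into a list. */
  static List<SimpleToken> tokenize(String text) {
    SimpleTokenManager manager = new SimpleTokenManager(text);
    List<SimpleToken> tokens = new ArrayList<>();
    for (SimpleToken t = manager.getNextToken(); t != null; t = manager.getNextToken()) {
      tokens.add(t);
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("nutch 0.7 beta"));
    // prints [WORD(nutch,0,5), NUMBER(0,6,7), NUMBER(7,8,9), WORD(beta,10,14)]
  }
}
```

Keeping the offsets on each token is what later allows a search engine to point back into the original text, for example when highlighting matches.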

The main class inheritance diagrams are as follows:
[Figures: Class Hierarchy and Interface Hierarchy; images not preserved]
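Nutch is built on top of Lucene. As the diagrams show, the two main classes, NutchDocumentAnalyzer and NutchDocumentTokenizer, both inherit from Lucene, so it is well worth studying Lucene's own analysis package carefully.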

CharStream.java

/* Generated By:JavaCC: Do not edit this line. CharStream.java Version 3.0 */
package org.apache.nutch.analysis;

/**
 * This interface describes a character stream that maintains line and
 * column number positions of the characters. It also has the capability
 * to backup the stream to some extent. An implementation of this
 * interface is used in the TokenManager implementation generated by
 * JavaCCParser.
 *
 * All the methods except backup can be implemented in any fashion. backup
 * needs to be implemented correctly for the correct operation of the lexer.
 * Rest of the methods are all used to get information like line number,
 * column number and the String that constitutes a token and are not used
 * by the lexer. Hence their implementation won't affect the generated lexer's
 * operation.
 */

interface CharStream {

 /**
   * Returns the next character from the selected input. The method
   * of selecting the input is the responsibility of the class
   * implementing this interface. Can throw any java.io.IOException.
   */
 char readChar() throws java.io.IOException;

 /**
   * Returns the column position of the character last read.
   * @deprecated
   * @see #getEndColumn
   */
 int getColumn();

 /**
   * Returns the line number of the character last read.
   * @deprecated
   * @see #getEndLine
   */
 int getLine();

 /**
   * Returns the column number of the last character for current token (being
   * matched after the last call to BeginToken).
   */
 int getEndColumn();

 /**
   * Returns the line number of the last character for current token (being
   * matched after the last call to BeginToken).
   */
 int getEndLine();

 /**
   * Returns the column number of the first character for current token (being
   * matched after the last call to BeginToken).
   */
 int getBeginColumn();

 /**
   * Returns the line number of the first character for current token (being
   * matched after the last call to BeginToken).
   */
 int getBeginLine();

 /**
   * Backs up the input stream by amount steps. Lexer calls this method if it
   * had already read some characters, but could not use them to match a
   * (longer) token. So, they will be used again as the prefix of the next
   * token and it is the implementation's responsibility to do this right.
   */
 void backup(int amount);

 /**
   * Returns the next character that marks the beginning of the next token.
   * All characters must remain in the buffer between two successive calls
   * to this method to implement backup correctly.
   */
 char BeginToken() throws java.io.IOException;

 /**
   * Returns a string made up of characters from the marked token beginning
   * to the current buffer position. Implementations have the choice of returning
   * anything that they want to. For example, for efficiency, one might decide
   * to just return null, which is a valid implementation.
   */
 String GetImage();

 /**
   * Returns an array of characters that make up the suffix of length 'len' for
   * the currently matched token. This is used to build up the matched string
   * for use in actions in the case of MORE. A simple and inefficient
   * implementation of this is as follows :
   *
   *   {
   *      String t = GetImage();
   *      return t.substring(t.length() - len, t.length()).toCharArray();
   *   }
   */
 char[] GetSuffix(int len);

 /**
   * The lexer calls this function to indicate that it is done with the stream
   * and hence implementations can free any resources held by this class.
   * Again, the body of this function can be just empty and it will not
   * affect the lexer's operation.
   */
 void Done();

}
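The interface comments stress that backup is the one method a lexer depends on. As a rough illustration of why, here is a minimal in-memory sketch: the lexer reads one character too many to discover where a token ends, then calls backup(1) so that character becomes the start of the next token. StringCharStream and readWord are hypothetical names; the class only borrows a few method names from the listing above and does not implement the package-private CharStream interface.

```java
import java.io.IOException;

/** Hypothetical in-memory stream; borrows method names from CharStream but is not the Nutch code. */
final class StringCharStream {
  private final String text;
  private int tokenStart = 0; // where the current token began
  private int position = 0;   // next character to read

  StringCharStream(String text) {
    this.text = text;
  }

  /** Returns the next character, or throws at end of input. */
  char readChar() throws IOException {
    if (position >= text.length()) {
      throw new IOException("read past eof");
    }
    return text.charAt(position++);
  }

  /** Marks the start of a new token and returns its first character. */
  char BeginToken() throws IOException {
    tokenStart = position;
    return readChar();
  }

  /** Un-reads the last 'amount' characters so they are returned again. */
  void backup(int amount) {
    position -= amount;
  }

  /** Text from the marked token start up to the current position. */
  String GetImage() {
    return text.substring(tokenStart, position);
  }

  /** Toy lexer step: reads one run of letters, pushing back the character that ended it. */
  static String readWord(StringCharStream in) throws IOException {
    char c = in.BeginToken();
    try {
      while (Character.isLetter(c)) {
        c = in.readChar();
      }
      in.backup(1); // the non-letter belongs to the next token
    } catch (IOException eof) {
      // End of input reached while inside the word: the token runs to the end.
    }
    return in.GetImage();
  }

  public static void main(String[] args) throws IOException {
    StringCharStream in = new StringCharStream("ab,cd");
    System.out.println(readWord(in));    // prints ab
    System.out.println(in.BeginToken()); // prints ,
    System.out.println(readWord(in));    // prints cd
  }
}
```

Without a correct backup, the comma in this example would be swallowed while scanning "ab" and could never be seen as a token of its own.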

CommonGrams.java

package org.apache.nutch.analysis;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Token;

import java.io.*;
import java.util.*;
import java.util.logging.Logger;

import org.apache.nutch.util.*;

import org.apache.nutch.searcher.Query.*;

/** Construct n-grams for frequently occurring terms and phrases while indexing.
 * Optimize phrase queries to use the n-grams. Single terms are still indexed
 * too, with n-grams overlaid. This is achieved through the use of {@link
 * Token#setPositionIncrement(int)}.*/

public class CommonGrams {
 private static final Logger LOG =
    LogFormatter.getLogger("org.apache.nutch.analysis.CommonGrams");
 private static final char SEPARATOR = '-';
 private static final HashMap COMMON_TERMS = new HashMap();

 static { init(); }

 private CommonGrams() {}                        // no public ctor

 private static class Filter extends TokenFilter {
    private HashSet common;
    private Token previous;
    private LinkedList gramQueue = new LinkedList();
    private LinkedList nextQueue = new LinkedList();
    private StringBuffer buffer = new StringBuffer();

    /** Construct an n-gram producing filter. */
    public Filter(TokenStream input, HashSet common) {
      super(input);
      this.common = common;
    }

    /** Inserts n-grams into a token stream. */
    public Token next() throws IOException {
      if (gramQueue.size() != 0)                  // consume any queued tokens
        return (Token)gramQueue.removeFirst();

      final Token token = popNext();
      if (token == null)
        return null;

      if (!isCommon(token)) {                     // optimize simple case
        previous = token;
        return token;
      }

      gramQueue.add(token);                       // queue the token

      ListIterator i = nextQueue.listIterator();
      Token gram = token;
      while (isCommon(gram)) {
        if (previous != null && !isCommon(previous)) // queue prev gram first
          gramQueue.addFirst(gramToken(previous, gram));

        Token next = peekNext(i);
        if (next == null)
          break;

        gram = gramToken(gram, next);             // queue next gram last
        gramQueue.addLast(gram);
      }

      previous = token;
      return (Token)gramQueue.removeFirst();
    }

    /** True iff token is for a common term. */
    private boolean isCommon(Token token) {
      return common != null && common.contains(token.termText());
    }

    /** Pops nextQueue or, if empty, reads a new token. */
    private Token popNext() throws IOException {
      if (nextQueue.size() > 0)
        return (Token)nextQueue.removeFirst();
      else
        return input.next();
    }

    /** Return next token in nextQueue, extending it when empty. */
    private Token peekNext(ListIterator i) throws IOException {
      if (!i.hasNext()) {
        Token next = input.next();
        if (next == null)
          return null;
        i.add(next);
        i.previous();
      }
      return (Token)i.next();
    }

    /** Construct a compound token. */
    private Token gramToken(Token first, Token second) {
      buffer.setLength(0);
      buffer.append(first.termText());
      buffer.append(SEPARATOR);
      buffer.append(second.termText());
      Token result = new Token(buffer.toString(),
                               first.startOffset(), second.endOffset(),
                               "gram");
      result.setPositionIncrement(0);
      return result;
    }
 }

 /** Construct using the provided config file. */
 private static void init() {
    try {
      Reader reader = NutchConf.get().getConfResourceAsReader
        (NutchConf.get().get("analysis.common.terms.file"));
      BufferedReader in = new BufferedReader(reader);
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.startsWith("#") || "".equals(line)) // skip comments
          continue;
        TokenStream ts = new NutchDocumentTokenizer(new StringReader(line));
        Token token = ts.next();
        if (token == null) {
          LOG.warning("Line does not contain a field name: " + line);
          continue;
        }
        String field = token.termText();
        token = ts.next();
        if (token == null) {
          LOG.warning("Line contains only a field name, no word: " + line);
          continue;
        }
        String gram = token.termText();
        while ((token = ts.next()) != null) {
          gram = gram + SEPARATOR + token.termText();
        }
        HashSet table = (HashSet)COMMON_TERMS.get(field);
        if (table == null) {
          table = new HashSet();
          COMMON_TERMS.put(field, table);
        }
        table.add(gram);
      }
    } catch (IOException e) {
      throw new RuntimeException(e.toString());
    }
 }

 /** Construct a token filter that inserts n-grams for common terms. For use
   * while indexing documents. */
 public static TokenFilter getFilter(TokenStream ts, String field) {
    return new Filter(ts, (HashSet)COMMON_TERMS.get(field));
 }

 /** Utility to convert an array of Query.Terms into a token stream. */
 private static class ArrayTokens extends TokenStream {
    private Term[] terms;
    private int index;

    public ArrayTokens(Phrase phrase) { this.terms = phrase.getTerms(); }
   
    public Token next() {
      if (index == terms.length)
        return null;
      else
        return new Token(terms[index].toString(), index, ++index);
    }
 }

 /** Optimizes phrase queries to use n-grams when possible. */
 public static String[] optimizePhrase(Phrase phrase, String field) {
    //LOG.info("Optimizing " + phrase + " for " + field);
    ArrayList result = new ArrayList();
    TokenStream ts = getFilter(new ArrayTokens(phrase), field);
    Token token, prev=null;
    int position = 0;
    try {
      while ((token = ts.next()) != null) {
        if (token.getPositionIncrement() != 0 && prev != null)
          result.add(prev.termText());
        prev = token;
        position += token.getPositionIncrement();
        if ((position + arity(token.termText())) == phrase.getTerms().length)
          break;
      }
    } catch (IOException e) {
      throw new RuntimeException(e.toString());
    }
    if (prev != null)
      result.add(prev.termText());

//     LOG.info("Optimized: ");
//     for (int i = 0; i < result.size(); i++) {
//       LOG.info(result.get(i) + " ");
//     }

    return (String[])result.toArray(new String[result.size()]);

 }

 private static int arity(String gram) {
    int index = 0;
    int arity = 0;
    while ((index = gram.indexOf(SEPARATOR, index+1)) != -1) {
      arity++;
    }
    return arity;
 }

 /** For debugging. */
 public static void main(String[] args) throws Exception {
    StringBuffer text = new StringBuffer();
    for (int i = 0; i < args.length; i++) {
      text.append(args[i]);
      text.append(' ');
    }
    TokenStream ts =
      new NutchDocumentTokenizer(new StringReader(text.toString()));
    ts = getFilter(ts, "url");
    Token token;
    while ((token = ts.next()) != null) {
      System.out.println("Token: " + token);
    }
    String[] optimized = optimizePhrase(new Phrase(args), "url");
    System.out.print("Optimized: ");
    for (int i = 0; i < optimized.length; i++) {
      System.out.print(optimized[i] + " ");
    }
    System.out.println();
 }
 
}
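The idea of overlaying n-grams can be hard to see in the streaming filter, so here is a deliberately simplified, list-based sketch. It only builds two-word grams, it works on plain strings rather than Lucene tokens, and the names GramOverlaySketch, Entry, and overlay are hypothetical. A position increment of 0 means "this entry sits at the same position as the entry before it", which is how a gram is laid on top of a single term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified illustration of overlaying bigrams on a term sequence. */
final class GramOverlaySketch {

  /** A term plus its position increment (0 = same position as the previous entry). */
  static final class Entry {
    final String term;
    final int positionIncrement;

    Entry(String term, int positionIncrement) {
      this.term = term;
      this.positionIncrement = positionIncrement;
    }

    @Override
    public String toString() {
      return term + "/" + positionIncrement;
    }
  }

  /** Emits every single term, plus a bigram wherever either neighbour is a common term. */
  static List<Entry> overlay(List<String> terms, Set<String> common) {
    List<Entry> out = new ArrayList<>();
    for (int i = 0; i < terms.size(); i++) {
      String term = terms.get(i);
      out.add(new Entry(term, 1)); // the single term is always kept
      if (i + 1 < terms.size()) {
        String next = terms.get(i + 1);
        if (common.contains(term) || common.contains(next)) {
          out.add(new Entry(term + "-" + next, 0)); // gram shares the term's position
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Set<String> common = new HashSet<>(Arrays.asList("the"));
    System.out.println(overlay(Arrays.asList("jump", "over", "the", "fence"), common));
    // prints [jump/1, over/1, over-the/0, the/1, the-fence/0, fence/1]
  }
}
```

With grams such as over-the and the-fence in the index, a phrase query can match those rarer compound entries instead of scanning every occurrence of a very frequent word like "the".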

FastCharStream.java

package org.apache.nutch.analysis;

import java.io.*;

/** An efficient implementation of JavaCC's CharStream interface. <p>Note that
 * this does not do line-number counting, but instead keeps track of the
 * character position of the token in the input, as required by Lucene's {@link
 * org.apache.lucene.analysis.Token} API. */

final class FastCharStream implements CharStream {
 char[] buffer = null;

 int bufferLength = 0;                        // end of valid chars
 int bufferPosition = 0;                      // next char to read
 
 int tokenStart = 0;                            // offset in buffer
 int bufferStart = 0;                           // position in file of buffer

 Reader input;                                   // source of chars

 /** Constructs from a Reader. */
 public FastCharStream(Reader r) {
    input = r;
 }

 public final char readChar() throws IOException {
    if (bufferPosition >= bufferLength)
      refill();
    return buffer[bufferPosition++];
 }

 private final void refill() throws IOException {
    int newPosition = bufferLength - tokenStart;

    if (tokenStart == 0) {                    // token won't fit in buffer
      if (buffer == null) {                   // first time: alloc buffer
        buffer = new char[2048];
      } else if (bufferLength == buffer.length) { // grow buffer
        char[] newBuffer = new char[buffer.length*2];
        System.arraycopy(buffer, 0, newBuffer, 0, bufferLength);
        buffer = newBuffer;
      }
    } else {                                // shift token to front
      System.arraycopy(buffer, tokenStart, buffer, 0, newPosition);
    }

    bufferLength = newPosition;                // update state
    bufferPosition = newPosition;
    bufferStart += tokenStart;
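
The refill() logic shown so far keeps the characters of the token in progress while making room for new input: it either grows the buffer (when the token already starts at offset 0) or slides the token to the front. Below is a minimal, self-contained sketch of that idea. The class TokenBufferSketch and its method names are hypothetical, and the steps after the state update (reading from the Reader and signalling end of input with an IOException) are assumptions made for this sketch rather than the Nutch source. A tiny initial capacity is used so that both the grow and the shift paths are exercised.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical sketch of a token-preserving refill; not the Nutch implementation. */
final class TokenBufferSketch {
  private char[] buffer;
  private int length = 0;      // end of valid chars
  private int position = 0;    // next char to read
  private int tokenStart = 0;  // offset in buffer where the current token began
  private int bufferStart = 0; // position in the input of buffer[0]
  private final Reader input;

  TokenBufferSketch(Reader input, int initialCapacity) {
    if (initialCapacity < 1) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.input = input;
    this.buffer = new char[initialCapacity];
  }

  char beginToken() throws IOException {
    tokenStart = position;
    return readChar();
  }

  char readChar() throws IOException {
    if (position >= length) {
      refill();
    }
    return buffer[position++];
  }

  String image() {
    return new String(buffer, tokenStart, position - tokenStart);
  }

  /** Offset of the current token in the whole input, not just in the buffer. */
  int tokenOffset() {
    return bufferStart + tokenStart;
  }

  private void refill() throws IOException {
    int kept = length - tokenStart;          // chars of the current token to keep
    if (tokenStart == 0) {
      if (length == buffer.length) {         // token fills the buffer: grow it
        char[] bigger = new char[buffer.length * 2];
        System.arraycopy(buffer, 0, bigger, 0, length);
        buffer = bigger;
      }
    } else {                                 // slide the token to the front
      System.arraycopy(buffer, tokenStart, buffer, 0, kept);
    }
    length = kept;
    position = kept;
    bufferStart += tokenStart;
    tokenStart = 0;
    // Assumption for this sketch: fill the free space and fail loudly at end of input.
    int read = input.read(buffer, kept, buffer.length - kept);
    if (read == -1) {
      throw new IOException("read past eof");
    }
    length += read;
  }

  /** Reads a token of exactly n characters and returns "image@offset". */
  String readFixed(int n) throws IOException {
    beginToken();
    for (int i = 1; i < n; i++) {
      readChar();
    }
    return image() + "@" + tokenOffset();
  }

  public static void main(String[] args) throws IOException {
    TokenBufferSketch s = new TokenBufferSketch(new StringReader("hello world"), 4);
    System.out.println(s.readFixed(5)); // prints hello@0
    System.out.println(s.readFixed(1)); // prints " @5" without the quotes
    System.out.println(s.readFixed(5)); // prints world@6
  }
}
```

Tracking bufferStart separately from tokenStart is what lets the stream report a token's position in the whole input even after earlier characters have been discarded from the buffer.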