Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering

The library and its front-ends were designed and written by Andrew McCallum, with some contributions from several graduate and undergraduate students.

The name of the library rhymes with "low", not "cow".

About the library

The library provides facilities for:

  • Recursively descending directories, finding text files.
  • Finding "document" boundaries when there are multiple documents per file.
  • Tokenizing a text file, according to several different methods.
  • Including N-grams among the tokens.
  • Mapping strings to integers and back again, very efficiently.
  • Building a sparse matrix of document/token counts.
  • Pruning vocabulary by word counts or by information gain.
  • Building and manipulating word vectors.
  • Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
  • Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.
  • Scoring queries for retrieval or classification.
  • Writing all data structures to disk in a compact format.
  • Reading the document/token matrix from disk in an efficient, sparse fashion.
  • Performing test/train splits, and automatic classification tests.
  • Operating in server mode, receiving and answering queries over a socket.
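
Several of the facilities above (tokenizing, mapping strings to integers and back, building a sparse document/token count matrix, and Laplace smoothing) can be illustrated with a minimal sketch. This is Python written only for illustration; it is not the Bow API, and every name in it (tokenize, Vocabulary, build_counts, laplace_prob) is an assumption made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (one of many possible methods)."""
    return re.findall(r"[a-z]+", text.lower())


class Vocabulary:
    """Hypothetical two-way map between token strings and integer ids."""

    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = []

    def add(self, word):
        # Assign the next free integer the first time a word is seen.
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.id_to_word)
            self.id_to_word.append(word)
        return self.word_to_id[word]


def build_counts(docs, vocab):
    """Return one sparse row per document: {token_id: count}."""
    return [dict(Counter(vocab.add(tok) for tok in tokenize(doc))) for doc in docs]


def laplace_prob(row, token_id, vocab_size):
    """Laplace (add-one) estimate of P(token | document row)."""
    total = sum(row.values())
    return (row.get(token_id, 0) + 1) / (total + vocab_size)


docs = ["the cat sat on the mat", "the dog barked"]
vocab = Vocabulary()
matrix = build_counts(docs, vocab)

dog_id = vocab.word_to_id["dog"]
# "dog" never occurs in the first document, yet smoothing gives it nonzero mass.
print(laplace_prob(matrix[0], dog_id, len(vocab.id_to_word)))
```

Only nonzero counts are stored in each row, which is what keeps the matrix sparse; the smoothed probabilities over the whole vocabulary still sum to one for each document.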

The library does not:

  • Have English parsing or part-of-speech tagging facilities.
  • Do smoothing across N-gram models.
  • Claim to be finished.
  • Have good documentation.
  • Claim to be bug-free.

It is known to compile on most UNIX systems, including Linux, Solaris, SunOS, IRIX and HP-UX. Over a year ago, it compiled on Windows NT (with a GNU build environment); it no longer does, but probably could with small fixes. Patches to the code are most welcome. It is developed on a Linux system.

The code conforms to the GNU coding standards. It is released under the GNU Library General Public License (LGPL).

Citation

You are welcome to use the code under the terms of the license for research or commercial purposes; however, please acknowledge its use with a citation:

   McCallum, Andrew Kachites.  "Bow: A toolkit for statistical language
   modeling, text retrieval, classification and clustering."
   http://www.cs.cmu.edu/~mccallum/bow.  1996.
  

Here is a BibTeX entry:

   @unpublished{McCallumLibbow,
      author = "Andrew Kachites McCallum",
      title = "Bow: A toolkit for statistical language modeling, 
               text retrieval, classification and clustering",
      note = "http://www.cs.cmu.edu/~mccallum/bow",
      year = 1996}

Obtaining the Source

Source code for the library can be downloaded from this directory. Different versions are indicated by eight-digit sequences that give the year, month and day. Thus, the most recent version is the one with the largest version number.
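
Because the eight digits run year, month, day, numeric order matches chronological order. A minimal sketch of picking the newest archive from a directory listing follows; the file names are hypothetical examples, not the names actually used in the distribution directory, and latest_version is a helper invented for this illustration.

```python
import re


def latest_version(filenames):
    """Return the file name carrying the largest eight-digit YYYYMMDD stamp."""
    stamped = []
    for name in filenames:
        # Match exactly eight consecutive digits, not part of a longer number.
        match = re.search(r"(?<!\d)(\d{8})(?!\d)", name)
        if match:
            stamped.append((int(match.group(1)), name))
    if not stamped:
        return None
    # The largest stamp is the most recent date.
    return max(stamped)[1]


# Hypothetical listing; real archive names may differ.
names = ["example-19970301.tar.gz", "example-19980712.tar.gz", "README"]
print(latest_version(names))
```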

Unfortunately, I do not have time to help rainbow's many users with all their compilation and usage problems. Feel free to send me mail asking for help, but please do not expect that I will necessarily have time to respond. Bug reports accompanied by fixes are most appreciated.

Bow Library Front-Ends

The library source distribution currently includes three executable programs based on the library.
  • Rainbow is an executable program that does document classification. While mostly designed for classification by naive Bayes, it also provides TFIDF/Rocchio, Probabilistic Indexing and K-nearest neighbor.
  • Arrow is an executable program that does document retrieval. It currently only performs simple TFIDF-based retrieval.
  • Crossbow is an executable program that does document clustering (and also classification).