Introduction
The University of Michigan's CLAIR (Computational Linguistics And Information Retrieval) group is happy to present the first release of the Clair library.
The Clair library is intended to simplify a number of generic tasks in Natural Language Processing (NLP) and Information Retrieval (IR). Its architecture also allows for external software to be plugged in with very little effort.
Functionality
Native: Tokenization, Summarization, LexRank, Biased LexRank, Document Clustering, Document Indexing, PageRank, Biased PageRank, Web Graph Analysis, Bioinformatics Text Analysis, Political Science Text Analysis, Network Building, Power Law Distribution Analysis, Network Analysis and Computation (Watts-Strogatz Clustering Coefficient, Cosines, Random Walks), Tf, Idf
Imported: Stemming, Sentence Segmentation, Web Page Download, Web Crawling, XML Parsing, XML Tree Building, XML Writing
Download
The current version is available for beta testing. Write to radev@umich.edu to get a beta copy.
Prerequisites
You need Perl, some external software, and a number of external modules that you can download from CPAN (see list below).
- MEAD
- Adwait Ratnaparkhi's MxTerminator
- from CPAN: Net::Google, HTML::LinkExtractor, HTML::Parse, Statistics::ChisqIndep, Graph::Directed, BerkeleyDB, Math::MatrixReal, Lingua::Stem, IO::File, POSIX, Math::Random, IO::Handle, IPC::Open2, Carp, IO::Pipe, Getopt::Long
Modules
- Clair::Cluster
- Clair::Document
- Clair::Network
- Clair::NetworkWrapper
- CIDR::Wrapper
- MEAD::Wrapper
- FindBin
Getting started
The README file contains information about how to set up Clairlib. This file is also included in the Clairlib tar.gz file.
Unit Tests
Here is the content of a number of the tests included in your distribution.
- test_aleextract.txt
- test_alesearch.txt
- test_biased_lexrank.txt
- test_cidrmead.txt
- test_cidrwrapper.txt
- test_cluster.txt
- test_compare_idf.txt
- test_connection.txt
- test_corpus_download.txt
- test_document.txt
- test_document_idf.txt
- test_gen.txt
- test_generif.txt
- test_html.txt
- test_html_dir.txt
- test_hyperlink.txt
- test_idf.txt
- test_lexrank.txt
- test_lexrank2.txt
- test_lexrank3.txt
- test_lexrank4.txt
- test_lexrank_large.txt
- test_lookupTFIDF.txt
- test_meadwrapper.txt
- test_mega.txt
- test_mmr.txt
- test_network.txt
- test_network_stat.txt
- test_networkwrapper.txt
- test_networkwrapper_docs.txt
- test_networkwrapper_sents.txt
- test_nutchsearch.txt
- test_pagerank.txt
- test_random_walk.txt
- test_stem.txt
- test_stem_dir.txt
- test_web_search.txt
- test_wordcount.txt
- test_wordcount_dir.txt
- test_xmldoc.txt
Documentation
- Clair lib tutorial
- Clair lib module documentation
- Mead tutorial
- Mead module documentation
- NCIBI Tools and Technology Presentation on Clairlib and GIN
Acknowledgments
This work has been supported in part by grants R01 LM008106 and U54 DA021519 from the National Institutes of Health as well as grant IDM 0329043 "Probabilistic and link-based Methods for Exploiting Very Large Textual Repositories" from the National Science Foundation.
About
The Clair Library is developed by the Clair group at the University of Michigan.
- Project design: Dragomir R. Radev
- Main implementers: Anthony Fader, Mark Hodges
- Additional code by: Adam Winkel, Samuela Pollack, Scott Gifford, Timothy Allison, Gunes Erkan, Patrick Jordan, Aaron Elkiss, Michael Dagitses, Mark Joseph, Joshua Gerrish
- Tf.pm
- Idf.pm
- TFIDFUtils.pm
- WebSearch.pm
- MxTerminator.pm
- Robot2.pm
- Parse.pm
- CorpusDownload.pm
- CIDR/Wrapper.pm
- Essence/IDF.pm
- Essence/Centroid.pm
- Essence/Text.pm
- MEAD/DocsentConverter.pm
- MEAD/Wrapper.pm

