The CMU-Cambridge Statistical Language Modeling toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models. Version 1 was written by Roni Rosenfeld at Carnegie Mellon University.
The toolkit has since been rewritten by Philip Clarkson and Roni Rosenfeld, and provides increased functionality and efficiency. Version 2 is no longer limited to bigram and trigram models; it supports n-grams of arbitrary size. It also supports several discounting schemes, rather than limiting the user to the Good-Turing discounting strategy used in Version 1. In addition, the tools used to count word n-grams, vocabulary n-grams and id n-grams have been rewritten to greatly increase their speed. Other changes include a more flexible way of handling context cues, the ability to calculate probabilities from ARPA-format language models, the ability to force the model to back off under certain circumstances (for example, if there is an unknown word in the context), and support for gzip-compressed files as well as files compressed with the compress utility.
- Download - Download the current version (2.05) of the toolkit
- Paper - A paper describing the toolkit. (P.R. Clarkson and R. Rosenfeld. Statistical Language Modeling Using the CMU-Cambridge Toolkit. In Proceedings of ESCA Eurospeech, 1997.)
- Documentation - The HTML documentation which accompanies the toolkit.
- Changes - The changes between different versions of the toolkit.
- Mailing List - A mailing list concerned with the toolkit.
Please note that as of June 1999, I am no longer at Cambridge University, and am therefore unable to provide a great deal of support for the toolkit. I will try to provide answers to any quick questions that come up, however. The e-mail address at the foot of this page should continue to work for the foreseeable future.

