The purpose of this research is to develop systems that can reliably categorize documents using Latent Semantic Indexing (LSI) technology [2]. Initial research indicates that LSI shows great promise for constructing categorization systems that require minimal setup and training. Categorization systems based on LSI do not rely on auxiliary structures (thesauri, dictionaries, etc.) and are independent of the language of the documents being categorized, provided the documents can be represented in the Unicode character set.
Three factors led us to undertake an assessment of LSI for categorization applications. First, LSI has been shown to outperform other information retrieval techniques in a number of controlled tests [3]. Second, a number of experiments have demonstrated a remarkable similarity between the behavior of LSI and fundamental aspects of human language processing [6]. Third, LSI is immune to the nuances of the language being categorized, thereby facilitating the rapid construction of multilingual categorization systems.
The emergence of the World Wide Web has led to a tremendous growth in the volume of text documents available to the open source community (e.g., special interest web pages, digital libraries, subscription news sources, and company-wide intranets). This has, in turn, led to an equally explosive interest in accurate methods to filter, categorize, and retrieve information relevant to the end consumer. Of special emphasis in such systems is the need to reduce the burden on the end consumer and to minimize the administration the system requires.
We will describe the implementation of two successfully deployed systems that employ LSI for information filtering (English and Spanish language documents) and document categorization (Arabic language documents). The systems use tools developed in-house for constructing and publishing LSI categorization spaces. Various interfaces (e.g., a SOAP-based Web service and workflow interfaces) have been developed that allow the LSI categorization capability to be integrated into a variety of customer system configurations. The core LSI technology has been implemented in a modern J2EE-based architecture, facilitating its deployment on a variety of platforms and operating systems. We will also describe some early results on the accuracy and use of the systems.

