|
A recent talk, given at Boeing, UIUC, the University of Notre Dame, the University of Trento (Italy), and the University of Siena (Italy), summarizes the theory and some algorithms.
Text classification is an important problem with numerous applications. It is commonly stated as follows: given a set of labeled training documents from n classes, the system uses this training set to build a classifier, which is then used to classify new documents into the n classes. Although this classic model is important, in practice one also encounters another problem: one has a set P of documents of a particular topic or class (the positive class), and is given a large set U of mixed, unlabeled documents that contains documents from class P as well as other types of documents (negative documents). One wants to classify the documents in U into documents from P and documents not from P. The key feature of this problem is that there is no labeled negative training data, which makes traditional text classification techniques inapplicable. This problem is termed partially supervised classification (PSC). We also call it PU-learning (learning from positive and unlabeled examples).
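To make the setting concrete, here is a minimal sketch of a naive baseline for the problem described above. It is not one of the project's algorithms: it simply treats U as a noisy stand-in for the negative class and assigns each unlabeled document to whichever centroid (P or U) it is closer to by cosine similarity. The function names and the toy documents are hypothetical, chosen only for illustration.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Lowercased term-frequency vector for one document."""
    return Counter(text.lower().split())

def centroid(vectors):
    """Average term-frequency vector of a list of documents."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: count / len(vectors) for term, count in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def split_unlabeled(positive_docs, unlabeled_docs):
    """Mark each unlabeled document as P (True) or not P (False).

    No labeled negatives are used: the centroid of U serves as a
    noisy proxy for the negative class (an assumption of this sketch).
    """
    p_vecs = [bag_of_words(d) for d in positive_docs]
    u_vecs = [bag_of_words(d) for d in unlabeled_docs]
    p_centroid = centroid(p_vecs)
    u_centroid = centroid(u_vecs)
    return [cosine(v, p_centroid) > cosine(v, u_centroid) for v in u_vecs]

# Toy data (hypothetical): P is about SVMs, U mixes one SVM document
# with three unrelated ones.
P = ["svm kernel margin classifier", "kernel svm training margin"]
U = ["svm margin kernel trick", "football match goal score",
     "goal score football league", "league match football fans"]
labels = split_unlabeled(P, U)
print(labels)
```

A baseline like this degrades when U contains many positive documents, because the U centroid then drifts toward P. That is one reason the problem calls for more principled techniques.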
The objectives of this project are to design a robust and principled technique to solve PSC, implement a system for PSC, devise a method to evaluate such techniques, and identify methods for determining the minimum number of labeled documents needed to achieve optimal accuracy, so as to reduce manual labeling effort. The results of this research should be widely useful because identifying targeted information and documents is of great value in this information age.
In our work (Liu et al. 2002), it was shown theoretically that P and U provide sufficient information for learning, and that the problem can be posed as a constrained optimization problem. This theoretical result provides good guidance for designing practical algorithms. Some of our algorithms are reported in (Liu et al. 2003), (Lee and Liu 2003) and (Li and Liu 2003). Since research in this direction started only recently, many important issues still need to be addressed in order to gain a better understanding of the problem.
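As a rough sketch of what such a constrained optimization problem can look like, one common way to write it is to minimize the number of unlabeled documents classified as positive while requiring that most of the known positive documents are still classified as positive. The notation below is an assumption made for illustration and is not taken from this page: f is a classifier drawn from a hypothesis class F, and r is the required fraction of P that must be recovered.

```latex
\min_{f \in \mathcal{F}} \; \sum_{x \in U} \mathbb{1}\bigl[f(x) = 1\bigr]
\quad \text{subject to} \quad
\frac{1}{|P|} \sum_{x \in P} \mathbb{1}\bigl[f(x) = 1\bigr] \;\ge\; r
```

Intuitively, the constraint keeps the classifier from rejecting the positive class, while the objective keeps it from accepting everything in U.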
Read the following paper first. It summarizes most existing methods, proposes a new biased-SVM technique, and reports a comprehensive evaluation.
- Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee and Philip Yu. "Building Text Classifiers Using Positive and Unlabeled Examples." Proceedings of the Third IEEE International Conference on Data Mining (ICDM-03), Melbourne, Florida, November 19-22, 2003. [PDF]
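The biased-SVM idea can be illustrated with a small sketch. The assumption here is that it means a soft-margin linear SVM in which U is treated as the negative class and errors on P are penalized much more heavily than errors on U. The code below is not the paper's implementation: it uses plain subgradient descent on a class-weighted hinge loss, and the function names, cost values (`c_pos`, `c_neg`), and 2-D toy vectors are hypothetical. Real documents would be represented by, for example, TF-IDF vectors.

```python
def train_biased_svm(X, y, c_pos, c_neg, lr=0.01, epochs=500):
    """Full-batch subgradient descent on the objective
    0.5*||w||^2 + c_pos * sum(hinge on y=+1) + c_neg * sum(hinge on y=-1)."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)      # gradient of the 0.5*||w||^2 regularizer
        grad_b = 0.0
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:  # hinge loss is active for this example
                cost = c_pos if label > 0 else c_neg
                for j in range(dim):
                    grad_w[j] -= cost * label * x[j]
                grad_b -= cost * label
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 (from P) or -1 (not from P) for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy 2-D feature vectors (hypothetical). The first point in U is a
# hidden positive; the rest are true negatives.
P = [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0)]
U = [(2.5, 2.5), (-2.0, -2.0), (-3.0, -2.0), (-2.0, -3.0), (-2.5, -2.5)]

X = P + U
y = [1] * len(P) + [-1] * len(U)   # U is provisionally labeled negative

# Errors on P cost far more than errors on the noisy "negatives" in U.
w, b = train_biased_svm(X, y, c_pos=10.0, c_neg=0.5)
u_pred = [predict(w, b, x) for x in U]
print(u_pred)
```

Because mistakes on U are cheap, the classifier can afford to "misclassify" the hidden positive in U as positive, which is exactly the behavior wanted. In practice the two costs would have to be tuned on held-out data.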
Publications
- Xiaoli Li, Bing Liu. "Learning from Positive and Unlabeled Examples with Different Data Distributions." To appear in the European Conference on Machine Learning (ECML-05), 2005. [PDF]
- Bing Liu, Xiaoli Li, Wee Sun Lee and Philip Yu. "Text Classification by Labeling Words." To appear in Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), July 25-29, 2004, San Jose, California. [PDF]
- Xiaoli Li, and Bing Liu. "Dealing with Different Distributions in Learning from Positive and Unlabeled Web Data." WWW-2004 poster paper. [PDF]
- Gao Cong, Wee Sun Lee, Haoran Wu, Bing Liu. "Semi-supervised Text Classification Using Partitioned EM." DASFAA 2004: 482-493. [PDF]
- Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee and Philip Yu. "Building Text Classifiers Using Positive and Unlabeled Examples." Proceedings of the Third IEEE International Conference on Data Mining (ICDM-03), Melbourne, Florida, November 19-22, 2003. [PDF]
- Xiaoli Li, Bing Liu. "Learning to Classify Text Using Positive and Unlabeled Data." Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), August 9-15, 2003, Acapulco, Mexico.
- Wee Sun Lee, Bing Liu. "Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression." Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), August 21-24, 2003, Washington, DC, USA.
- Bing Liu, Wee Sun Lee, Philip S. Yu and Xiaoli Li. "Partially Supervised Classification of Text Documents." Proceedings of the Nineteenth International Conference on Machine Learning (ICML-2002), July 8-12, 2002, Sydney, Australia.
Software
|