|
Finding information on the web that matches one's own preferences is a challenge today: the overload of information makes it difficult to locate what is needed quickly. People like to search by topic, author, language, and so on; however, current information acquisition techniques, including the SVM decision tree and SVM based on unsupervised clustering, cannot satisfy these requirements. Although supervised clustering is the best way to solve the problem, it may not produce the desired clustering without additional information provided by the user. In that case, one can only adjust either the algorithm or the similarity measure. Compared with adjusting the algorithm, modifying the similarity measure is more intuitive. Usually people cannot specify the similarity directly, but they can supply examples that illustrate it. Learning the similarity measure for clustering from such examples is therefore the best approach. The common way is to use a binary classifier: all pairs of items in every training set are taken, and each pair is described by a feature vector. Pairs whose items belong to the same class are positive examples; pairs whose items belong to different classes are negative examples. When a new set of items is run through the classifier, the sign of the output (positive or negative) decides whether a pair should be placed in the same class. However, this approach assumes that all pairs are i.i.d. and cannot take advantage of dependencies between item pairs. To avoid this problem, some researchers have adopted Conditional Random Fields (CRFs), which use various clustering functions without requiring independence of the attributes, but they cannot optimize the clusters with respect to a loss function. T. Finley and T. Joachims introduced supervised clustering with support vector machines to overcome this problem [Finley & Joachims, 2005]. However, their method is not specialized and cannot make use of domain knowledge.
In this paper, we introduce ontology and semantic similarity into the SVM as the similarity measure.
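The pairwise binary-classifier baseline described above can be sketched as follows. This is only an illustration under stated assumptions, not the method of this paper or of Finley and Joachims: the pair description (`pair_features`, absolute per-dimension differences), the toy items, and the plain perceptron used as a stand-in for the binary classifier are all hypothetical choices made to keep the example self-contained.

```python
from itertools import combinations

def pair_features(a, b):
    # Hypothetical pair description: absolute difference in each dimension.
    return [abs(x - y) for x, y in zip(a, b)]

def build_pairs(items, labels):
    # Every pair of items in one training set becomes one example:
    # +1 if both items share a class, -1 otherwise.
    examples = []
    for i, j in combinations(range(len(items)), 2):
        target = 1 if labels[i] == labels[j] else -1
        examples.append((pair_features(items[i], items[j]), target))
    return examples

def train_perceptron(examples, epochs=20):
    # Plain perceptron, used here only as a stand-in binary classifier.
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: move the boundary toward x
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def same_class(w, b, item_a, item_b):
    # The sign of the classifier output decides whether the pair is joined.
    x = pair_features(item_a, item_b)
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

# Toy training set (assumed data): two well-separated classes.
items = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
labels = [0, 0, 0, 1, 1, 1]

examples = build_pairs(items, labels)
w, b = train_perceptron(examples)
print(same_class(w, b, (20, 20), (21, 20)))
print(same_class(w, b, (0, 0), (20, 21)))
```

Each pair is scored independently of every other pair, which is exactly the i.i.d. assumption criticized above: nothing in this sketch forces the pairwise decisions to form a consistent clustering.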
|