Swoogle: A Semantic Web Search Engine
Swoogle is a crawler-based indexing and retrieval system for Semantic Web documents written in RDF or OWL. It is being developed by the Computer Science and Electrical Engineering Department of the University of Maryland, Baltimore County. It extracts metadata from discovered documents and computes relations between them. Discovered documents are also indexed by an information retrieval system, which is used to compute the similarity among a set of documents and to compute rank as a measure of the importance of a Semantic Web document.
The Semantic Web, currently in the form of RDF and OWL documents, is essentially a parallel universe to the Web of online HTML documents. A Semantic Web document (SWD) is known for its semantic content. Since no conventional search engines can take advantage of semantic features, a search engine customized for SWDs, especially for ontologies, is necessary to access, explore and query the Web’s RDF and OWL documents.
A prototype Semantic Web search engine called Swoogle facilitates the development of the Semantic Web by finding appropriate ontologies and by helping users specify terms and qualify their type (class or property). In addition, its ranking mechanism sorts ontologies by their importance.
To help users integrate Semantic Web data distributed on the Web, Swoogle enables querying SWDs with constraints on classes and properties. By collecting metadata about the Semantic Web, Swoogle reveals interesting structural properties such as how the Semantic Web is connected, how ontologies are referenced, and how an ontology is modified externally.
Swoogle is designed to scale up to millions of documents and to support rich query constraints on semantic relations. The Swoogle architecture consists of a database that stores metadata about the SWDs, two distinct web crawlers that discover SWDs, and components that compute semantic relationships among the SWDs. An indexing and retrieval engine, a simple user interface for queries, and agent/web service APIs provide further services.
An algorithm called Ontology Rank, inspired by Google’s PageRank algorithm, is used to rank search results. This algorithm takes advantage of the fact that the graph formed by SWDs has a richer set of relations. In other words, the edges in this graph have explicit semantics. Some are defined by or derivable from the RDF and OWL languages, and others by common ontologies (e.g., FOAF).
Semantic Web Documents
Semantic Web languages based on RDF allow one to make statements that define general terms (classes and properties). A Semantic Web Document (SWD) is a document in a Semantic Web language that is accessible to software agents. A SWD is an atomic information exchange object in the Semantic Web.
Two kinds of documents are distinguished: Semantic Web ontologies (SWOs) and Semantic Web databases (SWDBs). A document is a SWO when a significant proportion of the statements it makes define new terms (e.g., new classes and properties) or extend the definition of terms defined in other SWDs by adding new properties or constraints. A document is considered a SWDB when it does not define or extend a significant number of terms. A SWDB can introduce individuals and make assertions about them, or make assertions about individuals defined in other SWDs. For example, the SWD http://xmlns.com/foaf/0.1/index.rdf is considered a SWO in that its 466 statements (i.e., triples) define 12 classes and 51 properties but introduce no individuals. The SWD http://umbc.edu/~finin/foaf.rdf is considered a SWDB since it defines or extends no terms but defines three individuals and makes statements about them.
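The SWO/SWDB distinction can be illustrated with a small sketch. It is not Swoogle's classifier: it assumes triples arrive as (subject, predicate, object) tuples of strings, it counts typed subjects rather than statements as a simplification, and the names ontology_ratio and classify as well as the 0.8 threshold are hypothetical choices made for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Types whose instances count as term definitions (an illustrative subset).
TERM_TYPES = {
    "http://www.w3.org/2000/01/rdf-schema#Class",
    "http://www.w3.org/2002/07/owl#Class",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property",
    "http://www.w3.org/2002/07/owl#ObjectProperty",
    "http://www.w3.org/2002/07/owl#DatatypeProperty",
}


def ontology_ratio(triples):
    """Fraction of typed subjects that are classes or properties."""
    terms, individuals = set(), set()
    for subject, predicate, obj in triples:
        if predicate != RDF_TYPE:
            continue  # only rdf:type statements decide the category
        if obj in TERM_TYPES:
            terms.add(subject)
        else:
            individuals.add(subject)
    individuals -= terms  # a subject typed both ways counts as a term
    total = len(terms) + len(individuals)
    return len(terms) / total if total else 0.0


def classify(triples, threshold=0.8):
    """Label a document 'SWO' or 'SWDB'; threshold is a hypothetical cut-off."""
    return "SWO" if ontology_ratio(triples) >= threshold else "SWDB"
```

Under these assumptions, a document that only types its subjects as classes or properties is labeled a SWO, while one that only introduces individuals is labeled a SWDB.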
Swoogle Architecture
Swoogle’s architecture can be broken into four major components: SWD discovery, metadata creation, data analysis, and interface. This architecture is data-centric and extensible. The components work independently and interact with one another through a database.

The SWD discovery component discovers potential SWDs throughout the Web. The metadata creation component caches a snapshot of a SWD and generates objective metadata about SWDs at both the syntax level and the semantic level. The data analysis component uses the cached SWDs and the created metadata to derive analytical reports, such as the classification of SWOs and SWDBs, the rank of SWDs, and the Information Retrieval (IR) index for the SWDs. The interface component focuses on providing data services.
Finding SWDs
A straightforward approach to finding the URLs of SWDs is to search through a conventional search engine. Swoogle cannot parse every document on the Web to see whether it is a SWD. Instead, its crawlers employ a number of heuristics for finding SWDs, starting with a Google crawler that searches for URLs using the Google Web Service.
Relations among SWDs
Looking at the entire Semantic Web, it is hard to capture and analyze relations at the RDF node level. Therefore, Swoogle focuses on SWD level relations which generalize RDF node level relations.
Google PageRank
Google introduced PageRank to evaluate the relative importance of Web documents. Given a document A, A’s PageRank is computed by the equations:
\[
\begin{aligned}
PR(A) &= PR_{\mathrm{direct}}(A) + PR_{\mathrm{link}}(A) \\
PR_{\mathrm{direct}}(A) &= (1 - d) \\
PR_{\mathrm{link}}(A) &= d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
\end{aligned}
\]
where T_1, ..., T_n are Web documents that link to A; C(T_i) is the total number of outlinks of T_i; and d is a damping factor, which is typically set to 0.85. The intuition of PageRank is to measure the probability that a random surfer will visit a page. The equation captures the probability that a user will arrive at a given page either by directly addressing it, via PR_direct(A), or by following one of the links pointing to it, via PR_link(A).
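A minimal sketch of how the PageRank equations can be iterated, assuming the link graph fits in a dictionary that maps each page to the list of pages it links to. The function name pagerank, the fixed iteration count, and the starting value of 1.0 are illustrative choices, not part of the original description.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {page: 1.0 - d for page in pages}  # PR_direct
        for source, targets in links.items():
            if not targets:
                continue  # a page without outlinks passes on nothing
            share = d * rank[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_rank[target] += share  # accumulates PR_link
        rank = new_rank
    return rank
```

For example, pagerank({"A": ["C"], "B": ["C"], "C": []}) gives C a higher rank than A or B, because C is the only page that receives links.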
Ranking SWDs
Given SWDs A and B, Swoogle classifies inter-SWD links into four categories: (i) imports(A,B), A imports all content of B; (ii) uses-term(A,B), A uses some of the terms defined by B without importing B; (iii) extends(A,B), A extends the definitions of terms defined by B; and (iv) asserts(A,B), A makes assertions about the individuals defined by B.
These relations should be treated differently. When a surfer observes imports(A,B) while visiting A, it is natural to follow this link because B is semantically part of A. Similarly, the surfer may follow an extends(A,B) relation because it can understand the defined term completely only when it browses both A and B. Therefore, different weights, which reflect the probability of following each kind of link, are assigned to the four categories of inter-SWD relations. Because RDF node level relations are generalized to SWD level relations, the number of references is also counted: the more terms in B referenced by A, the more likely a surfer will follow the link from A to B.
Based on the above, given a SWD a, Swoogle computes its raw rank under a rational random surfer model.

The hypothetical Rational Random Surfer (RRS) retains PageRank’s direct-visit component: the rational surfer can jump to SWDs directly with a certain probability (1 - d). In the link-following component, however, the link to a is chosen with unequal probability f(x, a)/f(x), where x is the SWD currently being visited.
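A sketch of how such a weighted rank might be computed. It is one reading of the description, not Swoogle's implementation: it assumes that f(x, a) is the sum of the weights of all links from x to a, each multiplied by its reference count, and that f(x) is the sum of f(x, a) over every a that x links to. The weight values, the function name raw_rank, and the tuple layout are hypothetical.

```python
from collections import defaultdict

# Hypothetical weights for the four inter-SWD relations (not Swoogle's values).
WEIGHTS = {"imports": 4.0, "extends": 3.0, "uses-term": 2.0, "asserts": 1.0}


def raw_rank(links, d=0.85, iterations=50):
    """Weighted PageRank variant for a rational random surfer.

    links is a list of (source, relation, target, count) tuples, where count
    is the number of RDF node level references behind that SWD level link.
    """
    f = defaultdict(lambda: defaultdict(float))  # f[x][a] plays the role of f(x, a)
    docs = set()
    for source, relation, target, count in links:
        f[source][target] += WEIGHTS[relation] * count
        docs.update((source, target))
    rank = {doc: 1.0 for doc in docs}
    for _ in range(iterations):
        new_rank = {doc: 1.0 - d for doc in docs}  # direct-visit component
        for x, targets in f.items():
            total = sum(targets.values())  # f(x)
            if total == 0:
                continue  # no weighted outlinks to follow
            for a, weight in targets.items():
                # link-following component, chosen with probability f(x, a) / f(x)
                new_rank[a] += d * rank[x] * weight / total
        rank = new_rank
    return rank
```

With these assumed weights, a document reached through an imports link receives a larger share of the source's rank than one reached through an asserts link from the same source.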
Indexing and Retrieving SWDs
Central to a Semantic Web search engine is the problem of indexing and searching SWDs. It is useful to apply IR techniques to documents that are not entirely markup, so that search covers both the structured and unstructured components of a document. It is also conceivable that some text documents will contain embedded markup.
Information retrieval techniques have some valuable characteristics, such as well-researched methods for ranking matches, computing similarity between documents, and employing relevance feedback. These complement and extend the retrieval functions inherent in Swoogle.
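As an illustration of computing similarity between documents, the sketch below uses cosine similarity over term-frequency vectors, a standard IR measure. Treating each word or URI reference as a token is an assumption made here; the text does not say how Swoogle tokenizes SWDs, and the function name cosine_similarity is illustrative.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both documents contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)
```

Two documents with the same token counts score 1.0 (up to rounding), and documents that share no tokens score 0.0.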
Currently, the most popular kinds of documents are FOAF files and RSS files. Swoogle is intended to support services needed by software agents and programs via web service interfaces. Using Swoogle, one can find all of the Semantic Web documents that use a set of properties or classes.
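Finding every document that uses a given set of classes or properties can be sketched with an inverted index. This is a hypothetical in-memory illustration, not Swoogle's web service interface: the function names and the input format (a dictionary from document URL to the term URIs it uses) are assumptions.

```python
from collections import defaultdict


def build_term_index(documents):
    """Map each term URI to the set of document URLs that use it."""
    index = defaultdict(set)
    for url, terms in documents.items():
        for term in terms:
            index[term].add(url)
    return index


def documents_using(index, terms):
    """Return the documents that use every term in terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()
```

A query is then the intersection of the posting sets for the requested terms, so adding a term can only narrow the result.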
Conclusion
Currently Google does not work well with Semantic Web Documents, since it expects documents to contain unstructured text composed of words. Google can’t take advantage of the Semantic Web because it doesn’t utilize its structure. Powerful search and indexing systems are much needed by Semantic Web researchers to help them find and analyze SWDs.
Swoogle is a prototype crawler-based indexing and retrieval system for Semantic Web Documents, i.e., web documents written in RDF or OWL. It runs multiple crawlers to discover SWDs through meta-search and link following, analyzes SWDs to produce metadata, and computes ranks.
One of the interesting properties computed for each Semantic Web document is its rank, a measure of the document’s importance on the Semantic Web. The current version of Swoogle has discovered and analyzed over 11,000 Semantic Web documents. A second version has been designed and partially implemented; it will capture more metadata on classes and properties and is designed to support millions of documents.
References:
Li Ding, Tim Finin, Anupam Joshi, Yun Peng, R. Scott Cost, Joel Sachs, Rong Pan, Pavan Reddivari, and Vishal Doshi, “Swoogle: A Semantic Web Search and Metadata Engine,” Department of Computer Science and Electrical Engineering, University of Maryland, USA, 2004.
H. Peter Alesso and Craig F. Smith, “Developing Semantic Web Services,” A. K. Peters, Ltd., ISBN: 1568812124, 2004.