Swoogle: A Semantic Web Search Engine

Swoogle is a crawler-based indexing and retrieval system for Semantic Web documents in RDF or OWL. It is being developed by the Computer Science and Electrical Engineering Department of the University of Maryland Baltimore County. It extracts metadata from each discovered document and computes relations between documents. Discovered documents are also indexed by an information retrieval system, which is used to compute the similarity among a set of documents and to compute rank as a measure of the importance of a Semantic Web document.
The Semantic Web, currently in the form of RDF and OWL documents, is essentially a parallel universe to the Web of online HTML documents. A Semantic Web document (SWD) is known for its semantic content. Since no conventional search engines can take advantage of semantic features, a search engine customized for SWDs, especially for ontologies, is necessary to access, explore and query the Web’s RDF and OWL documents.
Swoogle, a prototype Semantic Web search engine, facilitates the development of the Semantic Web by finding appropriate ontologies and by helping users specify search terms and qualify their type (class or property). In addition, its ranking mechanism sorts ontologies by their importance.

 


To help users integrate Semantic Web data distributed on the Web, Swoogle enables querying SWDs with constraints on the classes and properties they use. By collecting metadata about the Semantic Web, Swoogle reveals interesting structural properties such as how the Semantic Web is connected, how ontologies are referenced, and how an ontology is modified externally.

Swoogle is designed to scale up to millions of documents and to support rich query constraints on semantic relations. The architecture consists of a database that stores metadata about the SWDs, two distinct web crawlers that discover SWDs, and components that compute semantic relationships among the SWDs. In addition, an indexing and retrieval engine, a simple user interface for queries, and agent/web service APIs provide useful services.

An algorithm called Ontology Rank, inspired by Google's PageRank algorithm, is used to rank search results. It takes advantage of the fact that the graph formed by SWDs has a richer set of relations; in other words, the edges in this graph have explicit semantics. Some are defined by or derivable from the RDF and OWL languages, and others by common ontologies (e.g., FOAF).

Semantic Web Documents

Semantic Web languages based on RDF allow one to make statements that define general terms (classes and properties). A Semantic Web Document (SWD) is a document in a Semantic Web language that is accessible to software agents. A SWD is an atomic information exchange object in the Semantic Web.

SWDs fall into two kinds: Semantic Web ontologies (SWOs) and Semantic Web databases (SWDBs). A document is a SWO when a significant proportion of the statements it makes define new terms (e.g., new classes and properties) or extend the definitions of terms defined in other SWDs by adding new properties or constraints. A document is considered a SWDB when it does not define or extend a significant number of terms. A SWDB can introduce individuals and make assertions about them, or make assertions about individuals defined in other SWDs. For example, the SWD http://xmlns.com/foaf/0.1/index.rdf is considered a SWO because its 466 statements (i.e., triples) define 12 classes and 51 properties but introduce no individuals. The SWD http://umbc.edu/~finin/foaf.rdf is considered a SWDB because it defines or extends no terms but defines three individuals and makes statements about them.
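The SWO/SWDB distinction can be illustrated with a small sketch. It is not Swoogle's actual classifier: it assumes that the share of terms (classes and properties) among everything a document introduces is a reasonable proxy for "a significant proportion of the statements", and the names `SwdStats`, `term_ratio`, `classify` and the 0.8 threshold are hypothetical. The two sample documents use the counts quoted above.

```python
from dataclasses import dataclass


@dataclass
class SwdStats:
    """Counts extracted from one Semantic Web document (hypothetical structure)."""
    classes: int      # classes defined or extended
    properties: int   # properties defined or extended
    individuals: int  # individuals introduced


def term_ratio(stats: SwdStats) -> float:
    """Fraction of introduced resources that are terms (classes or properties)."""
    terms = stats.classes + stats.properties
    total = terms + stats.individuals
    return terms / total if total else 0.0


def classify(stats: SwdStats, threshold: float = 0.8) -> str:
    """Label a document 'SWO' or 'SWDB'; the threshold is an assumed cut-off."""
    return "SWO" if term_ratio(stats) >= threshold else "SWDB"


# Counts taken from the two examples in the text.
foaf_vocabulary = SwdStats(classes=12, properties=51, individuals=0)
foaf_profile = SwdStats(classes=0, properties=0, individuals=3)

print(classify(foaf_vocabulary), classify(foaf_profile))
```

A ratio keeps the decision independent of document size, which matters when a large SWDB happens to define a handful of helper terms.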

Swoogle Architecture

Swoogle's architecture can be broken into four major components: SWD discovery, metadata creation, data analysis, and interface. This architecture is data centric and extensible. These components work independently and interact with one another through a database.


The SWD discovery component discovers potential SWDs throughout the Web. The metadata creation component caches a snapshot of a SWD and generates objective metadata about SWDs at both the syntax level and the semantic level. The data analysis component uses the cached SWDs and the created metadata to derive analytical reports, such as the classification of SWOs and SWDBs, the rank of SWDs, and the Information Retrieval (IR) index for the SWDs. The interface component focuses on providing data services.

Finding SWDs

The straightforward way to find URLs of SWDs is to search through a conventional search engine. It is not possible for Swoogle to parse every document on the Web to see whether it is a SWD; instead, the crawlers employ a number of heuristics for finding SWDs, starting with a Google crawler that searches for URLs using the Google Web Service.
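The article does not list the heuristics the crawlers use. As one illustration of the idea, the sketch below filters candidate URLs (for example, results returned by a conventional search engine) by file extension before any parsing is attempted. The extension set and the function names are assumptions made for this example only.

```python
import posixpath
from urllib.parse import urlparse

# Assumed extensions; the article does not say which ones Swoogle looks for.
CANDIDATE_EXTENSIONS = {".rdf", ".owl"}


def looks_like_swd(url: str) -> bool:
    """Cheap pre-filter: does the URL path end in a Semantic Web file extension?"""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in CANDIDATE_EXTENSIONS


def filter_candidates(urls):
    """Keep only the URLs worth fetching and parsing as potential SWDs."""
    return [url for url in urls if looks_like_swd(url)]


candidates = filter_candidates([
    "http://xmlns.com/foaf/0.1/index.rdf",
    "http://umbc.edu/~finin/foaf.rdf",
    "http://example.org/page.html",
])
print(candidates)
```

A filter like this only narrows the candidate set; a document still has to be fetched and parsed to confirm that it is a SWD.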

Relations among SWDs

Looking at the entire Semantic Web, it is hard to capture and analyze relations at the RDF node level. Therefore, Swoogle focuses on SWD-level relations, which generalize RDF node-level relations.

Google PageRank

Google introduced PageRank to evaluate the relative importance of Web documents. Given a document A, A's PageRank is computed by the equation:

$$PR(A) = PR_{direct}(A) + PR_{link}(A)$$

$$PR_{direct}(A) = (1 - d)$$

$$PR_{link}(A) = d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$

where $T_1, \ldots, T_n$ are the Web documents that link to A, $C(T_i)$ is the total number of outlinks of $T_i$, and $d$ is a damping factor, which is typically set to 0.85. The intuition of PageRank is to measure the probability that a random surfer will visit a page. The equation captures the probability that a user will arrive at a given page either by directly addressing it, via $PR_{direct}(A)$, or by following one of the links pointing to it, via $PR_{link}(A)$.
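The PageRank equation can be evaluated by simple fixed-point iteration. The sketch below is a minimal illustration, not Google's implementation: the function name, the dictionary-based graph, the starting value of 1.0 and the fixed iteration count are all assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the pages it links to.
    """
    outlinks = {page: set(targets) for page, targets in links.items()}
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)

    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # PR_link: each linking page T contributes PR(T) / C(T).
            link_sum = sum(
                rank[source] / len(targets)
                for source, targets in outlinks.items()
                if page in targets
            )
            # PR_direct is the constant (1 - d).
            new_rank[page] = (1 - d) + d * link_sum
        rank = new_rank
    return rank


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this small graph C is linked from both A and B, so it ends up with the highest rank, while B receives only half of A's outgoing weight and ends up lowest.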

Ranking SWDs

Given SWDs A and B, Swoogle classifies inter-SWD links into four categories: (i) imports(A,B), A imports all content of B; (ii) uses-term(A,B), A uses some of the terms defined by B without importing B; (iii) extends(A,B), A extends the definitions of terms defined by B; and (iv) asserts(A,B), A makes assertions about the individuals defined by B.

These relations should be treated differently. For instance, when a surfer observes imports(A,B) while visiting A, it will follow this link because B is semantically part of A. Similarly, the surfer may follow an extends(A,B) relation because it can understand the defined term completely only when it browses both A and B. Therefore, each of the four categories of inter-SWD relations is assigned a different weight, which reflects the probability of following that kind of link. Because RDF node-level relations are generalized to SWD-level relations, the number of references is also counted: the more terms in B that are referenced by A, the more likely a surfer is to follow the link from A to B.

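Based on the above, given SWD a, Swoogle computes its raw rank using:

$$rawPR(a) = (1 - d) + d \sum_{x \in L(a)} rawPR(x) \, \frac{f(x,a)}{f(x)}$$

where $L(a)$ is the set of SWDs that link to a, $f(x,a)$ is the total weight of the links from x to a, and $f(x)$ is the total weight of all links leaving x.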


The hypothetical Rational Random Surfer (RRS) retains PageRank's direct visit component: the rational surfer can jump to SWDs directly with probability 1 - d. In the link-following component, however, the link from x to a is chosen with unequal probability f(x,a)/f(x), where x is the current SWD and a is a SWD that x links to.
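The rational surfer can be sketched by replacing PageRank's uniform 1/C(x) choice with the weighted choice f(x,a)/f(x). The sketch assumes that f(x,a) is the sum, over all links from x to a, of a per-category weight multiplied by the number of references, and that f(x) is the total over all SWDs x links to. The weight values, the function names and the sample graph are hypothetical; the article says only that the four categories receive different weights.

```python
# Assumed weights: the article gives no values for the four link categories.
WEIGHTS = {"imports": 4.0, "extends": 3.0, "uses-term": 2.0, "asserts": 1.0}


def weighted_links(links):
    """Build f[x][a]: total weight of all links from SWD x to SWD a.

    `links` maps (source, target, category) to a reference count.
    """
    f = {}
    for (source, target, category), count in links.items():
        row = f.setdefault(source, {})
        row[target] = row.get(target, 0.0) + WEIGHTS[category] * count
    return f


def raw_rank(links, d=0.85, iterations=50):
    """Iterate rank(a) = (1 - d) + d * sum over x of rank(x) * f(x,a) / f(x)."""
    f = weighted_links(links)
    total = {x: sum(row.values()) for x, row in f.items()}  # f(x)
    docs = set(f)
    for row in f.values():
        docs.update(row)

    rank = {doc: 1.0 for doc in docs}
    for _ in range(iterations):
        rank = {
            a: (1 - d) + d * sum(
                rank[x] * row[a] / total[x] for x, row in f.items() if a in row
            )
            for a in docs
        }
    return rank


sample = {
    ("A", "B", "imports"): 1,    # A imports B
    ("A", "C", "asserts"): 1,    # A asserts facts about individuals in C
    ("B", "A", "uses-term"): 1,
    ("C", "A", "uses-term"): 1,
}
swd_ranks = raw_rank(sample)
print({doc: round(value, 3) for doc, value in sorted(swd_ranks.items())})
```

Because the imports link carries more weight than the asserts link, B receives a larger share of A's rank than C does, even though A links to each of them exactly once.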

Indexing and Retrieving SWDs

Central to a Semantic Web search engine is the problem of indexing and searching SWDs. It is also useful to apply IR techniques to documents that are not entirely markup, so that search covers both the structured and the unstructured components of a document; it is conceivable that some text documents will contain embedded markup.

Information retrieval techniques have some valuable characteristics, such as well-researched methods for ranking matches, computing similarity between documents, and employing relevance feedback. These complement and extend the retrieval functions inherent in Swoogle.
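As an illustration of "computing similarity between documents", the sketch below uses cosine similarity over raw term counts, a standard IR measure. The article does not say which similarity measure or which kind of index term Swoogle uses, so the whitespace tokenizer and the function names here are assumptions.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercase, whitespace-separated tokens and their counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = term_vector("foaf person name mbox knows")
doc2 = term_vector("foaf person name homepage")
doc3 = term_vector("rss channel item title")

print(cosine_similarity(doc1, doc2), cosine_similarity(doc1, doc3))
```

The two FOAF-like documents share three terms and score well above zero, while the RSS-like document shares none and scores 0.0.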




Currently, the most popular kinds of documents are FOAF files and RSS files. Swoogle is intended to support services needed by software agents and programs via web service interfaces. Using Swoogle, one can find all of the Semantic Web documents that use a set of properties or classes.
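Finding every document that uses a given set of classes or properties can be pictured as a lookup in an inverted index from term URI to document URL. The sketch below illustrates that idea only; it is not Swoogle's query interface, and the function names and the example.org document URLs are hypothetical.

```python
from collections import defaultdict


def build_term_index(doc_terms):
    """Invert {document URL: terms used} into {term: set of document URLs}."""
    index = defaultdict(set)
    for url, terms in doc_terms.items():
        for term in terms:
            index[term].add(url)
    return index


def find_documents(index, required_terms):
    """Return the documents that use every one of the required terms."""
    postings = [index.get(term, set()) for term in required_terms]
    if not postings:
        return set()
    return set.intersection(*postings)


FOAF = "http://xmlns.com/foaf/0.1/"
term_index = build_term_index({
    "http://example.org/alice.rdf": {FOAF + "Person", FOAF + "name", FOAF + "knows"},
    "http://example.org/bob.rdf": {FOAF + "Person", FOAF + "name"},
    "http://example.org/news.rdf": {"http://purl.org/rss/1.0/channel"},
})

print(find_documents(term_index, [FOAF + "Person", FOAF + "knows"]))
```

Intersecting the posting sets answers "uses all of these terms"; replacing the intersection with a union would answer "uses any of them".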

Conclusion

Currently, Google does not work well with Semantic Web documents, since it expects documents to contain unstructured text composed of words. Google can't take advantage of the Semantic Web because it doesn't utilize its structure. Semantic Web researchers therefore need powerful search and indexing systems to help them find and analyze SWDs.

Swoogle is a prototype crawler-based indexing and retrieval system for Semantic Web documents, i.e., web documents written in RDF or OWL. It runs multiple crawlers to discover SWDs through meta-search and link following, analyzes the SWDs, produces metadata about them, and computes their ranks.

One of the interesting properties computed for each Semantic Web document is its rank, a measure of the document's importance on the Semantic Web. The current version of Swoogle has discovered and analyzed over 11,000 Semantic Web documents. A second version, which has been designed and partially implemented, will capture more metadata on classes and properties and is designed to support millions of documents.

Reference:

“Swoogle: A Semantic Web Search and Metadata Engine,” Li Ding, Tim Finin, Anupam Joshi, Yun Peng, R. Scott Cost, Joel Sachs, Rong Pan, Pavan Reddivari, and Vishal Doshi, Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, USA, 2004.

Developing Semantic Web Services, H. Peter Alesso and Craig F. Smith, A. K. Peters, Ltd., ISBN: 1568812124, 2004.