RSS
热门关键字:  数据挖掘  人工智能  数据仓库  搜索引擎  数据挖掘导论

Deep Web Data Extraction

来源: 作者: 时间:2008-06-05 点击:

数据挖掘研究院

In analogy to search engines over the "crawlable" web, we argue that one way to unlock the Deep Web is to employ a fully automated approach to extracting, indexing, and searching the query-related information-rich regions from dynamic web pages. For this miniproject, we focus on the first of these: extracting data from the Deep Web.

Extracting the interesting information from a Deep Web site requires many things: including scalable and robust methods for analyzing dynamic web pages of a given web site, discovering and locating the query-related information-rich content regions, and extracting itemized objects within each region. By full automation, we mean that the extraction algorithms should be designed independently of the presentation features or specific content of the web pages, such as the specific ways in which the query-related information is laid out or the specific locations where the navigational links and advertisement information are placed in the web pages.

数据挖掘研究院

There are many possible 7001-miniprojects. Feel free to talk to either of us for more details. Here are a few possibilities to consider:

1. Develop a Web-based demo for clustering pages of a similar type from a single Deep Web source. For example, AllMusic produces three types of pages in response to a user query: a direct match page (e.g. for Elvis Presley), a list of links to match pages (e.g. a list of all artists named Jackson), and a page with no matches. As a first-step to extracting the relevant data from each page, you may develop techniques to separate out the pages that contain query matches from pages that contain no matches, and perhaps, rank each group based on some metric of quality.

最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?