RSS
热门关键字:  数据挖掘  人工智能  数据仓库  搜索引擎  数据挖掘导论

Deep Web Data Extraction

来源: 作者: 时间:2008-06-05 点击:

2. Design a system for extracting interesting data from a collection of pages from a Deep Web source. You might define a set of regular expression that can identify dates, prices, or names. Develop a small program that converts a page into a type structure. For example, given a DOM model of a web page, identify all of the types that you have defined, and replace the string tokens with XML tags identifying the types. Replace all non-type tokens with a generic type, and return the tree as a full type structure). Alternatively, you may suggest your own approach for extracting data.

数据挖掘研究院

3. Develop a system to recognize names in page. Given a list of names and a web page, identify possible matches in the page. Based on the structure of the page and the distribution of recognized names, identify strings that may also be names based on their location in the DOM tree heirarchy representing the page.

数据挖掘实验室

4. Write a survey paper about current approaches for understanding and analyzing the Deep Web. Be sure to include many of your own comments on the viability of the approaches you review.

5. Or, feel free to suggest a miniproject of your own. 数据挖掘研究院

Background: Knowledge of Java or Python would be helpful. Some knowledge of information retrieval and machine learning may be useful but is not required.

Deliverables: You should submit a report that clearly describes what you have learned and what you have accomplished. The report should include useful references. You should also provide any source code you may have written to validate your ideas.

最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?