2. Design a system for extracting interesting data from a collection of pages from a Deep Web source. You might define a set of regular expressions that can identify dates, prices, or names. Develop a small program that converts a page into a type structure. For example, given a DOM model of a web page, identify all instances of the types that you have defined and replace the matching string tokens with XML tags identifying those types; then replace all non-type tokens with a generic type, and return the tree as a full type structure. Alternatively, you may suggest your own approach for extracting data.
3. Develop a system to recognize names in a page. Given a list of names and a web page, identify possible matches in the page. Then, using the structure of the page and the distribution of recognized names, identify other strings that may also be names based on their location in the DOM tree hierarchy representing the page.
4. Write a survey paper about current approaches for understanding and analyzing the Deep Web. Be sure to include many of your own comments on the viability of the approaches you review.
5. Or, feel free to suggest a miniproject of your own.
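As a starting point for project 2, the conversion of a page into a type structure might be sketched as follows. This is only an illustration under simplifying assumptions: the page is well-formed XML that Python's standard `xml.etree.ElementTree` can parse, tokens are split on whitespace, and the two patterns in `TYPE_PATTERNS` (ISO-style dates and dollar prices) are hypothetical stand-ins for the types you would define yourself. The function names `classify` and `to_type_structure` are likewise made up for this sketch.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical type definitions; a real project would need broader patterns.
TYPE_PATTERNS = [
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("price", re.compile(r"\$\d+(?:\.\d{2})?")),
]


def classify(token):
    """Return the name of the first type whose pattern matches the whole token."""
    for name, pattern in TYPE_PATTERNS:
        if pattern.fullmatch(token):
            return name
    return "token"  # generic type for everything else


def to_type_structure(node):
    """Copy the tree, replacing every text token with an empty type element."""
    typed = ET.Element(node.tag)
    for token in (node.text or "").split():
        ET.SubElement(typed, classify(token))
    for child in node:
        typed.append(to_type_structure(child))
        # Text that follows a child element is stored in its tail.
        for token in (child.tail or "").split():
            ET.SubElement(typed, classify(token))
    return typed


page = ET.fromstring(
    "<div><p>Posted 2009-03-14</p><p>Price: $19.99 only</p></div>"
)
print(ET.tostring(to_type_structure(page), encoding="unicode"))
```

Real Deep Web pages are rarely well-formed XML, so a tolerant HTML parser would probably be needed in practice; the tree walk itself should carry over largely unchanged.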
Background: Knowledge of Java or Python would be helpful. Some knowledge of information retrieval and machine learning may be useful but is not required.
Deliverables: You should submit a report that clearly describes what you have learned and what you have accomplished. The report should include useful references. You should also provide any source code you may have written to validate your ideas.
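For project 3, one possible way to use DOM location is sketched below. It assumes, purely for illustration, that names appear as the complete text of an element, that the page parses as well-formed XML, and that a tag path such as `html/body/ul/li` is a good enough description of "location". The helper names `texts_by_path` and `candidate_names` and the `min_ratio` threshold are inventions of this sketch, not a prescribed method.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def texts_by_path(root):
    """Group the text of each element by its tag path, e.g. 'html/body/ul/li'."""
    groups = defaultdict(list)

    def walk(node, path):
        text = (node.text or "").strip()
        if text:
            groups[path].append(text)
        for child in node:
            walk(child, path + "/" + child.tag)

    walk(root, root.tag)
    return groups


def candidate_names(root, known_names, min_ratio=0.5):
    """Return unknown strings that share a tag path with enough known names."""
    candidates = []
    for texts in texts_by_path(root).values():
        hits = sum(1 for t in texts if t in known_names)
        # Trust a path only if a sufficient share of its strings are known names.
        if hits and hits / len(texts) >= min_ratio:
            candidates.extend(t for t in texts if t not in known_names)
    return candidates


page = ET.fromstring(
    "<html><body><h1>Staff</h1><ul>"
    "<li>Alice Smith</li><li>Bob Jones</li><li>Carol Diaz</li>"
    "</ul></body></html>"
)
known = {"Alice Smith", "Bob Jones"}
print(candidate_names(page, known))
```

Here the heading "Staff" is ignored because no known name shares its path, while "Carol Diaz" is proposed because two of the three list items are already recognized. Exact string matching and a fixed threshold are deliberate simplifications; approximate matching or a learned model could replace either.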

