Webstemmer - How it works?

Introduction

Webstemmer uses the following assumptions for analyzing web pages:

  • Most pages share (at most a handful of) common layout structures.
  • The location of the main text is consistent with each page layout.
  • Even if the main text changes, its layout structure doesn′t change.
  • HTML tags used in banners and navigation texts don′t change within the same page layout.

Webstemmer tries to identify a page layout by finding invariant features that are preserved across different pages. Then it tries to remove unchanged parts (which are mostly banners and navigation links) within the same page layout.

数据挖掘实验室

1. Cluster pages based on their layout structure.

数据挖掘交友

2. Align pages which have the same layout.

数据挖掘实验室

3. Remove the common parts.

数据挖掘交友


Analyzing Layouts - analyze.py

Layout analysis is to extract a "pattern" of HTML pages by clustering pages which have similar structures (layouts) to each other. To perform clustering, one has to compare multiple HTML pages and compute its similarity by comparing the features of each page. Group the pages which is similar to each other and extract common features as the pattern of the group (cluster). Here is the basic algorithm to do this in analyze.py program. 数据挖掘研究院

Step 1. Parsing the page

First analyze.py decomposes each HTML page into a sequence of "layout blocks". Though HTML elements normally forms a tree structure, we only take HTML block elements such as <p>, <div>, and <h1> and <title> element. We interpret an HTML page as a sequence of layout blocks. Layout blocks are used as features to compute the similarity of different pages. Here is an example of a HTML page and its layout blocks:


Fig 1. Extracting a sequence of layout blocks from HTML. 数据挖掘工具

Step 2. Computing similarity

The similarity between two pages is determined by the edit distance of the layout block sequences. This is similar to comparing two strings (i.e. diff). First obtain the maximum common sequence of layout blocks and compute the ratio of shared layout blocks between two sequences. This way, we can tolerate a redundant element which is accidentaly inserted in the middle of the sequences.

数据挖掘论坛


Fig 2. Comparing layout block sequences.

数据挖掘论坛

  数据挖掘实验室

Step 3. Clustering and generating a layout pattern

To perform clustering, one need to compute the similarity scores for all possible combinations of N pages, namely N x N pairs. The algorithm used in the analyze.py program tries to reduce the number of combinations, so the actual number of similarities might be less than N x N. A pages whose similarity exceeds a certain threshold is grouped into the same cluster. This threshold is given by ′-t′ option in the analayze.py program. After obtaining clusters, extract the common layout blocks contained in all the pages in a group (cluster). These layout blocks form a "layout pattern" which is the output of this program. 数据挖掘实验室


Fig 3. Generating a layout pattern. 数据挖掘工具

Step 4. Removing banners and navigation links

The next step is to find banners and/or navigation links from each layout pattern. Each layout block is given a value called "diffscore", which indicates how much the texts in a block varies to each other. A layout block whose diffscore is less than a certain threshold is considered as static, and is therefore removed in the later stage, because most ads or banners do not change within the same layout. The threshold of the diffscore is given by ′-T′ option in extract.py program.


Fig 4. Finding the common texts between different pages.

数据挖掘论坛

Step 5. Discovering the title and main text

After filtering out static texts, we try to discover a layout block which is the page title. The title block is determined by comparing the text contained in that block and one of anchor texts (a string surrounded by a link tag <a> ... </a>) which refer to that page. In most of the news sites, it is known that the anchor text of a link to an article represents the title of that article. So compute the similarity between texts from each layout block and the anchor texts, and choose the block which has the hightest similarity (Arrow 1. in Fig. 5). This is also based on the edit distance of strings. The title block is determined in the clustering process (i.e. analyze.py), whereas the range of main texts can be adjusted dynamically in the text extraction.

There is an alternative way to find a title block. When there is no anchor text available for that page, we consider the layout block that is most similar to one of the main text and appears before that text block, as a title (Arrow 2. in Fig. 5). This was the only way of finding a title in the previous versions of Webstemmer. But this is a "fallback" method now, because the accuracy of title detection is not as good as the first method. The threshold of the similarity between title and main text is given by ′-T′ option in analyze.py.

In order to find the main text of each page, we compute the "mainscore" for each layout block, which shows how much significant text is included within that block. A layout block that is not a title and whose mainscore is above a certain threshold, is considered as a main text block. Since main texts are often split into several layout blocks, we take every layout block whose mainscore exceeds the threshold. This threshold is given by ′-M′ option in extract.py program.

The sequence of layout blocks in each cluster is stored in a layout pattern file. Each layout block in a cluster has a diffscore and mainscore. Each cluster has the score of significence, indicating how much that cluster is important. This is based on the number of significant characters (excluding anchor texts) in the main text and the number of pages in the cluster. The detail of the patter file format is described in


Fig 5. Discovering the title and main text from the page.

Anatomy of Pattern Files

analyze.py outputs patterns as a text file. It has one layout pattern per line. Each pattern contains several values formatted by Python′s repr() function. An empty line and a line which begins with a ′#′ is regarded as a comment. Each pattern follows a couple of comment lines which shows the score of the pattern (indicating how likely the page is an article) and original page IDs that belong to that pattern when learning.

The following sample pattern was obtained from cnn.com. (For reader′s convenience, a pattern is split into multiple lines, which is not allowed in an actual pattern file.)


数据挖掘交友

### version=0.6.0                (Webstemmer Version)
### fname=′cnn.200511210103.zip′ (Source file used for learning)
### cluster_threshold=0.970000   (Clustering threshold of page similarity)
### title_threshold=0.600000     (Threshold for detecting a title)
### pages=74                     (Total number of pages used for learning)

# 3759.96885103 <200511210103/www.cnn.com/2005/EDUCATION/11/18/principal.shaming.ap/index.html> (pattern 1)
#       (Pages which belong to this cluster)
#       200511210103/www.cnn.com/2005/EDUCATION/11/18/principal.shaming.ap/index.html
#       200511210103/www.cnn.com/2005/WORLD/meast/11/20/iran.nuclear.ap/index.html
#       200511210103/www.cnn.com/2005/EDUCATION/11/18/student.progress.ap/index.html
#       200511210103/www.cnn.com/2005/TRAVEL/DESTINATIONS/11/18/homer.exhibit.ap/index.html
#       200511210103/www.cnn.com/2005/US/11/20/lost.in.morgue.ap/index.html
(3759.9688510285455,                                            (Overall score of the cluster)
 ′200511210103/www.cnn.com/2005/EDUCATION/11/18/principal.shaming.ap/index.html′, (Cluster ID)
 9,                                                             (Index of the title layout block in the sequence)
 [                                                              (List of layout blocks)
  # (diffscore, mainscore, block feature)
  (0.54130434782608694, 24.899999999999999, ′title′), 
  (0.0, 0.0, ′table:align=right/tr:valign=middle/td:class=cnnceilb′),
  (0.0, 0.0, ′table:id=cnnceil/tr/td:class=cnnceilw′),
  (0.052631578947368418, 0.0, ′ul:id=nav/li/div′),
  (0.0, 0.0, ′ul:id=nav/li:class=money/div′),
  (0.0,0.0, ′ul:id=nav/li:class=sports/div′),
  (0.047314578005115092, 0.0, ′ul:id=nav/li/div′),
  (0.0, 0.0, ′ul:id=nav/li:class=autos/div′),
  (0.0, 0.0, ′td:id=cnnnavbar:rowspan=2/div:class=cnnnavbot/div′),
  (0.76451612903225807, 23.699999999999999, ′tr:valign=top/td:id=cnnarticlecontent/h1′),
  (0.085365853658536592, 1.3999999999999999, ′div:class=cnninteractiveelementscontainer/div:class=cnnieheader/h3′),
  (0.73124999999999996, 23.399999999999999, ′div:class=cnniebox/div:class=cnnieboxcontent/div:class=cnnemailalertoptionrow′),
  (0.0, 0.0, ′div:class=cnniebox/div:class=cnnieboxcontent/div:class=cnnalertsbuttonrow′),
  (0.0, 0.0, ′div:class=cnninteractiveelementscontainer/div:class=cnniebox/div:class=cnnieboxcontent cnnalertsfooterrow′),
  (0.99176016830294533, 2262.8000000000002, ′tr:valign=top/td:id=cnnarticlecontent/p′),
  (0.0, 0.0, ′td:id=cnnarticlecontent/div:class=cnnstorycontrib/p′),
  (0.0, 0.0, ′tr:valign=top/td/div:class=cnnstorytools′),
  (0.0, 0.0, ′table:id=cnnstorytoolstimebox/tr/td′),
  (0.37226277372262773, 0.0, ′tr:valign=top/td/div:class=cnnbinnav′),
  (0.7265625, 0.0, ′table:class=cnnstoryrelatedstopstory/tr/td′),
  (0.66804979253112029, 0.0, ′td/div:class=cnnstorybinsublk/div′),
  (0.0, 0.0, ′tr:valign=top/td/div:class=cnnbinnav′),
  (0.0, 0.0, ′table:class=cnnstoryrelatedstopstory/tr/td′),
  (0.0, 0.0, ′td/div:class=cnnstorybinsublk/div′),
  (0.0, 0.0, ′tr/td/div:class=cnn4pxlpad′),
  (0.0, 0.0, ′table/tr/td′),
  (0.0, 0.0, ′table/tr/td:align=right:class=cnn7pxrpad′),
  (0.0, 0.0, ′table:id=cnnfoot/tr:valign=top/td′),
  (0.0, 0.0, ′table/tr/td:valign=top′),
  (0.0, 0.0, ′table/tr:class=cnnnopad/td′)
  ]
)
  

Conclusion

Well, in most web pages texts might be extracted in a much easier way, like picking up lines with a certain number of words or punctuations... So the attempt explained within this page is basically nonsense. However, it just looks cool. 数据挖掘研究院

数据挖掘交友


Extracting Texts - extract.py

The extract.py program takes a layout pattern file generated by analyze.py and extract the main text and its title from HTML pages. First it decomposes each HTML page into layout blocks in the same way as the analyzer. Then it tries to find the most similar sequence of layout blocks from the pattern file. If the similarity exceeds a certain threshold (given by ′-t′ option), it is considered that the page has the same layout pattern as the one in the pattern file. 数据挖掘工具

You can add an extra restriction by using "strict mode" (-S option) which prevents incomplete matching. In the strict mode, the program rejects a layout pattern if any of its layout block is missing in the page, no matter how many layout blocks are overlapping. This allows you to identify page layout improved accuracy. However, in some newspapers (such as several U.S. newspapers, which use slightly different layouts on each day,) you might get lower recall, which means the number of matching pages might be decreased.

After determining the page layout pattern, the program extracts texts from the layout blocks selected by the user-specified thresholds, and output them as either TITLE:, MAIN: or SUB: according to its diffscore and mainscore value.

数据挖掘论坛


Fig 6. Searching a layout pattern which matches the given page. 数据挖掘论坛


Anatomy of Pattern Files.
[数据挖掘专家] [数据挖掘研究院] [数据挖掘论坛] [数据挖掘实验室]
上一篇:Generic Information Retrieval System
下一篇:XWRAP Elite Home
最新评论共有 0 位网友发表了评论 , 查看所有评论
发表评论( 不能超过250字,需审核,请自觉遵守互联网相关政策法规。 )
匿名?
数据挖掘网站导航 数据挖掘论坛导航
  • 数据挖掘工具
  • 数据挖掘论坛
  • DataCruncher - Cognos
  • MineSet - MathSoft
  • Intelligent Miner - GainSmarts
  • Sqlserver - SAS - Clementine
  • CART - Weka - WizSoft
  • NeuroShell - ModelQuest
  • data mining tools - Darwin
  • 数据挖掘交友
  • 数据挖掘博客
  • 数据挖掘工具
  • 数据挖掘资源
  • 数据挖掘技术算法
  • 数据挖掘相关期刊、会议
  • 研究院联盟合作专区
  • 数据挖掘基础与相关技术
  • 数据挖掘厂商与就业
  • 数据挖掘研究者乐园
  • 知名厂商数据挖掘工具资料
  • 国内数据挖掘实验室
  • Foreign Data Mining Lab
  • 热点关注
  • 什么是信息抽取?
  • 信息抽取相关词语定义
  • 什么是信息抽取(Information Extraction )
  • 网上信息抽取技术纵览 参考文献
  • MUC Evaluations and dataset
  • 基于WEB资源的信息抽取技术
  • Jakarta POI - Java API To Access Microso
  • 网上信息抽取技术纵览 第二章信息抽取技术
  • XWRAP Elite Home
  • Generic Information Retrieval System
  • 论坛最新话题
  • Foundations of Statistical Natural Langu
  • Game Theory meet Data Mining: A Recent P
  • System Building: How does it help or hin
  • 数据挖掘与Clementine培训
  • 新手报到
  • 求 SASEM 客户流失预测分析
  • 数据挖掘工程师/搜索研究院—北京——无线
  • 数据挖掘入门介绍(如何着手数据挖掘)
  • Information Overload Survey Results
  • The INEX 2005 Workshop on Element Retrie
  • 相关资讯
  • MUC Evaluations and dataset
  • 信息抽取相关词语定义
  • 什么是信息抽取?
  • Jakarta POI - Java API To Access Microso
  • 什么是信息抽取(Information Extraction )
  • XWRAP Elite Home
  • Webstemmer - How it works?
  • Generic Information Retrieval System
  • TIPSTER Text Program
  • Phase III Overview
  • 数据挖掘实验室资料
  • 数据挖掘博客地址
  • 数据挖掘实验室网站地址
  • Prepare for Medicare audits by using dat
  • 注册成为SAS用户与爱好者俱乐部会员
  • 水南梅
  • 明日烟
  • 新人报道
  • 下载
  • 厦门服务器托管,450元/月—0592-5177319 高
  • 买空间送域名--0592-5177319 高静