New Web Spam Dataset available

Date: 2008-01-22


I am glad to announce the availability of a new dataset:
WEBSPAM-UK2007. This is a large collection of hosts annotated as spam
or nonspam by a group of volunteers. The base data is a set of
105,896,555 pages on 114,529 hosts in the .UK domain, downloaded by the
Laboratory of Web Algorithmics of the University of Milano.

http://www.yr-bcn.es/webspam/datasets/uk2007/

* * *

For the purpose of the Web Spam Challenge 2008, the labels are being
released in two sets. SET1, containing roughly 2/3 of the assessed
hosts, will be given for training, while SET2, containing the remaining
1/3, will be held out for testing. More information about the Web Spam
Challenge 2008, co-located with AIRWeb 2008, will be available soon:

http://airweb.cse.lehigh.edu/2008/

http://webspam.lip6.fr/
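
As a rough illustration of the 2/3 training and 1/3 testing split described above, here is a minimal Python sketch. The announcement does not say how hosts were assigned to SET1 and SET2, so the seeded random shuffle is an assumption, and the function name `split_hosts`, the toy host names, and the label strings are all hypothetical.

```python
import random


def split_hosts(labels, train_fraction=2 / 3, seed=0):
    """Randomly partition a {host: label} mapping into (train, test) dicts.

    Assumption: a uniform random split. The real SET1/SET2 assignment
    procedure is not described in the announcement.
    """
    hosts = sorted(labels)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(hosts)  # seeded shuffle, independent of global state
    cut = round(len(hosts) * train_fraction)
    train = {h: labels[h] for h in hosts[:cut]}
    test = {h: labels[h] for h in hosts[cut:]}
    return train, test


# Hypothetical toy labels; the real label files and their format are not shown here.
toy_labels = {
    f"host{i}.example.co.uk": ("spam" if i % 5 == 0 else "nonspam")
    for i in range(9)
}
set1, set2 = split_hosts(toy_labels)
print(len(set1), len(set2))
```

Splitting by host rather than by page keeps all pages of a host on the same side, which matches the host-level labels of this dataset.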

* * *

Please let us know if you have any questions or comments about
this new dataset.

Thank you,