I am glad to announce the availability of a new dataset:
WEBSPAM-UK2007. This is a large collection of hosts annotated as spam or
nonspam. The base data is a set of 105,896,555 pages in 114,529 hosts in
the .UK domain, downloaded by the Laboratory of Web Algorithmics of the
University of Milano. The assessment was done by a group of volunteers.
http://www.yr-bcn.es/webspam
* * *
For the purpose of the Web Spam Challenge 2008, the labels are being
released in two sets. SET1, containing roughly 2/3 of the assessed
hosts, will be given for training, while SET2, containing the remaining
1/3, will be held out for testing. More information about the Web Spam
Challenge 2008, co-located with AIRWeb 2008, will be available soon:
http://airweb.cse.lehigh.edu
http://webspam.lip6.fr/
* * *
Please let us know if you have any questions or comments about
this new dataset.
Thank you,

