Anonymous web data from www.microsoft.com

Author: unknown. Date: 2004-12-11

Data Type

relational, multivariate

Abstract

This dataset records which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in February 1998.

Sources

Original Owner and Donor

Jack S. Breese, David Heckerman, Carl M. Kadie
Microsoft Research, Redmond WA, 98052-6399, USA
breese@microsoft.com, heckerma@microsoft.com, carlk@microsoft.com
Date Donated: November 30, 1998

Data Characteristics

The data was created by sampling and processing the www.microsoft.com logs. The data records the use of www.microsoft.com by 38,000 anonymous, randomly selected users. For each user, the data lists all the areas of the web site (Vroots) that the user visited in a one-week timeframe.

Users are identified only by a sequential number, for example, User #14988, User #14989, etc. The file contains no personally identifiable information. The 294 Vroots are identified by their title (e.g. "NetShow for PowerPoint") and URL (e.g. "/stream"). The data comes from one week in February, 1998.

Each instance represents an anonymous, randomly selected user of the web site. Each attribute is an area ("vroot") of the www.microsoft.com web site.

Missing Attribute Values: The data is very sparse, so vroot visits are explicit, nonvisits are implicit (missing).
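One way to read this convention: store only the Vroots a user visited, and treat any Vroot absent from that set as a nonvisit. The following is a minimal sketch under that reading. The case and attribute IDs are made-up placeholders, and the names `visits`, `visited`, and `to_dense_row` are illustrative, not part of the dataset.

```python
# Sparse representation: each case ID maps to the set of Vroot IDs visited.
# The IDs below are illustrative placeholders, not rows copied from the data file.
visits = {
    10001: {1000, 1001, 1002},
    10002: {1001, 1003},
    10003: set(),  # a case with no recorded visits
}

def visited(case_id, vroot_id):
    """Return True only if the visit is explicitly recorded; absence means nonvisit."""
    return vroot_id in visits.get(case_id, set())

def to_dense_row(case_id, all_vroot_ids):
    """Expand one case to a 0/1 list over a fixed Vroot ordering."""
    return [1 if visited(case_id, v) else 0 for v in all_vroot_ids]

all_vroots = [1000, 1001, 1002, 1003]
print(to_dense_row(10002, all_vroots))  # [0, 1, 0, 1]
```

With 294 attributes and about 3 visits per case, a dense row would be almost entirely zeros, which is why the sparse form is the natural one here.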

Summary Statistics

Training Instances: 32711
Testing Instances: 5000
Attributes: 294
Mean vroot visits per case: 3.0

Data Format

The data is in an ASCII-based sparse-data format called "DST". Each line of the data file starts with a letter that indicates the line's type. The three line types of interest are attribute lines, case lines, and vote lines.
-- Attribute lines:
For example, 'A,1277,1,"NetShow for PowerPoint","/stream"'
Where:
  'A' marks this as an attribute line,
  '1277' is the attribute ID number for an area of the website (called a Vroot),
  '1' may be ignored,
  '"NetShow for PowerPoint"' is the title of the Vroot,
  '"/stream"' is the URL relative to "http://www.microsoft.com"

-- Case and vote lines:
For each user, there is a case line followed by zero or more vote lines.
For example:
  C,"10164",10164
  V,1123,1
  V,1009,1
  V,1052,1
Where:
  'C' marks this as a case line,
  '10164' is the case ID number of a user,
  'V' marks the vote lines for this case,
  '1123', '1009', '1052' are the attribute IDs of Vroots that the user visited,
  '1' may be ignored.
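
The line types described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: the function name `parse_dst` is hypothetical, it assumes the fields are comma-separated with optional double quotes (as in the examples), it takes the case ID from the third field of a case line, and it silently skips any other line type.

```python
import csv
import io

def parse_dst(lines):
    """Parse DST lines into (attributes, cases).

    attributes: {attribute_id: (title, url)}
    cases:      {case_id: [attribute_id, ...]} in file order
    """
    attributes = {}
    cases = {}
    current = None  # case ID that the following vote lines belong to
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        kind = row[0]
        if kind == "A":
            # A,<attribute id>,<ignored>,<title>,<url>
            attributes[int(row[1])] = (row[3], row[4])
        elif kind == "C":
            # C,"<case id>",<case id>
            current = int(row[2])
            cases[current] = []
        elif kind == "V" and current is not None:
            # V,<attribute id>,<ignored>
            cases[current].append(int(row[1]))
        # any other line type is not of interest here and is skipped
    return attributes, cases

# The example lines from the format description above.
sample = io.StringIO(
    'A,1277,1,"NetShow for PowerPoint","/stream"\n'
    'C,"10164",10164\n'
    'V,1123,1\n'
    'V,1009,1\n'
    'V,1052,1\n'
)
attributes, cases = parse_dst(sample)
print(attributes[1277])  # ('NetShow for PowerPoint', '/stream')
print(cases[10164])      # [1123, 1009, 1052]
```

To read a real file, an open file object can be passed in place of `sample`, since `csv.reader` accepts any iterable of lines.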
  

Past Usage

J. Breese, D. Heckerman, C. Kadie. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, July 1998.
