Anonymous web data from www.microsoft.com
Data Type
relational, multivariate
Abstract
This dataset records which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in February 1998.
Sources
Original Owner and Donor
Jack S. Breese, David Heckerman, Carl M. Kadie
Microsoft Research, Redmond WA, 98052-6399, USA
breese@microsoft.com, heckerma@microsoft.com, carlk@microsoft.com
Date Donated: November 30, 1998
Data Characteristics
The data was created by sampling and processing the www.microsoft.com logs. The data records the use of www.microsoft.com by 38000 anonymous, randomly selected users. For each user, the data lists all the areas of the web site (Vroots) that the user visited in a one-week timeframe.
Users are identified only by a sequential number, for example, User #14988, User #14989, etc. The file contains no personally identifiable information. The 294 Vroots are identified by their title (e.g. "NetShow for PowerPoint") and URL (e.g. "/stream"). The data comes from one week in February, 1998.
Each instance represents an anonymous, randomly selected user of the web site. Each attribute is an area ("vroot") of the www.microsoft.com web site.
Missing Attribute Values: The data is very sparse, so vroot visits are recorded explicitly and non-visits are implicit (missing).
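Because non-visits are implicit, a user's record often has to be expanded before it can be fed to an algorithm that expects one value per attribute. The following is a minimal sketch of that expansion, not part of the original distribution. The function name `to_dense_row` and the short `all_ids` list are hypothetical and chosen only for illustration; the full dataset has 294 attribute IDs.

```python
def to_dense_row(visited_ids, all_ids):
    """Expand a sparse list of visited Vroot IDs into a 0/1 vector.

    visited_ids: attribute IDs the user visited (the explicit entries).
    all_ids: every attribute ID, in the column order wanted for the output.
    Any ID in all_ids that is absent from visited_ids is treated as a
    non-visit (0), which is how the implicit "missing" values are read here.
    """
    visited = set(visited_ids)
    return [1 if attr_id in visited else 0 for attr_id in all_ids]


# Illustrative column order (a hypothetical four-attribute subset).
all_ids = [1009, 1052, 1123, 1277]

# A user who visited three of the four Vroots.
row = to_dense_row([1123, 1009, 1052], all_ids)
print(row)  # [1, 1, 1, 0]
```

With a mean of about 3 visits out of 294 attributes, the dense form is mostly zeros, so a sparse representation is usually the better choice for storage.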
Summary Statistics
| Training Instances | 32711 |
| Testing Instances | 5000 |
| Attributes | 294 |
| Mean vroot visits per case | 3.0 |
Data Format
The data is in an ASCII-based sparse-data format called "DST". Each line of the data file starts with a letter that indicates the line's type. The three line types of interest are:

-- Attribute lines: For example,
    A,1277,1,"NetShow for PowerPoint","/stream"
Where:
    'A' marks this as an attribute line,
    '1277' is the attribute ID number for an area of the website (called a Vroot),
    '1' may be ignored,
    '"NetShow for PowerPoint"' is the title of the Vroot,
    '"/stream"' is the URL relative to "http://www.microsoft.com"

-- Case and Vote Lines: For each user, there is a case line followed by zero or more vote lines. For example:
    C,"10164",10164
    V,1123,1
    V,1009,1
    V,1052,1
Where:
    'C' marks this as a case line,
    '10164' is the case ID number of a user,
    'V' marks the vote lines for this case,
    '1123', '1009', '1052' are the attribute IDs of Vroots that a user visited,
    '1' may be ignored.
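The line types described above can be read with a short parser. The sketch below is one possible approach and is not supplied with the dataset. It assumes the lines are plain comma-separated values with optional double quotes, so Python's standard csv module can split them. The function name `parse_dst` and the in-memory `sample` text are illustrative; lines of any other type are skipped.

```python
import csv
from io import StringIO


def parse_dst(lines):
    """Parse DST-format lines into (attributes, cases).

    attributes: attribute ID -> (title, relative URL)
    cases:      case ID -> list of attribute IDs the user visited
    """
    attributes = {}
    cases = {}
    current_case = None
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        kind = row[0]
        if kind == "A":
            # A,<attribute ID>,<ignored>,<title>,<url>
            attributes[int(row[1])] = (row[3], row[4])
        elif kind == "C":
            # C,"<case ID>",<case ID> starts a new user
            current_case = int(row[2])
            cases[current_case] = []
        elif kind == "V" and current_case is not None:
            # V,<attribute ID>,<ignored> belongs to the most recent case line
            cases[current_case].append(int(row[1]))
    return attributes, cases


# The example lines from the format description.
sample = '''A,1277,1,"NetShow for PowerPoint","/stream"
C,"10164",10164
V,1123,1
V,1009,1
V,1052,1
'''

attributes, cases = parse_dst(StringIO(sample))
print(attributes)  # {1277: ('NetShow for PowerPoint', '/stream')}
print(cases)       # {10164: [1123, 1009, 1052]}
```

To read a real data file, an open file object can be passed in place of the StringIO wrapper, since csv.reader accepts any iterable of lines.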
Past Usage
J. Breese, D. Heckerman, C. Kadie. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, July 1998.

