News
August 30, 2005: KDD-Cup presentation slides from the KDD conference
August 30, 2005: Winning Teams
August 30, 2005: Labeled Query Data
August 10, 2005: Solution Evaluation Result
Introduction
The KDD-Cup 2005 Knowledge Discovery and Data Mining competition will be held in conjunction with the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. The task is selected to be interesting to participants from both academia and industry. In particular, we encourage the participation of students. This year's competition is about classifying internet user search queries. We are looking forward to an interesting competition and encourage your participation.
Contest Rules
Agreement
By sending the registration email, you indicate your full and unconditional agreement and acceptance of these contest rules.
Eligibility
The contest is open to any party planning to attend KDD 2005. A person can participate in only one group. Multiple submissions per group are allowed, since we will not provide feedback at the time of submission. Only the last submission before the deadline will be evaluated and all other submissions will be discarded.
Integrity
The contestant takes responsibility for obtaining any permission needed to use algorithms, tools, or data that are the intellectual property of a third party.
Winner Selection
There will be three prizes: the "Query Categorization Precision Award", the "Query Categorization Performance Award", and the "Query Categorization Creativity Award". One winner will be selected for each award.
The winners will be determined according to the following method. All participants are ranked according to their overall performance and average precision on the test set. Participants will also be ranked based on the creativity of their methodologies.
The winner of the "Query Categorization Performance Award" is the participant with the best average performance rank in terms of F1, defined below.
Among the participants with the top 10 F1 scores, the winner of the "Query Categorization Precision Award" is the one with the best average precision.
The winner of the "Query Categorization Creativity Award" is the participant whose model has a top 20 average rank in terms of F1, defined below, and whose creative ideas are judged highly outstanding by the KDD Cup co-chairs and a group of search experts. The scalability and level of automation of the model will also be considered in the judgment.
An honorable mention will be awarded for each prize.
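The two score-based rules above can be sketched in code. This is only an illustration: the `Entry` type, the `select_winners` function, and the scores are hypothetical, and it uses a single F1 and precision value per entry rather than the average ranks the organizers describe.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    submission_id: int
    precision: float
    f1: float


def select_winners(entries, precision_pool=10):
    """Return (performance_winner, precision_winner) for a list of entries.

    Performance: the entry with the best F1.
    Precision: the entry with the best precision among the top
    `precision_pool` entries by F1.
    """
    by_f1 = sorted(entries, key=lambda e: e.f1, reverse=True)
    performance_winner = by_f1[0]
    precision_winner = max(by_f1[:precision_pool], key=lambda e: e.precision)
    return performance_winner, precision_winner


# Made-up scores, for illustration only.
entries = [
    Entry(1, 0.30, 0.35),
    Entry(2, 0.45, 0.32),
    Entry(3, 0.50, 0.10),
    Entry(4, 0.28, 0.33),
]
perf, prec = select_winners(entries, precision_pool=3)
print(perf.submission_id, prec.submission_id)
```

In this toy data, entry 3 has the highest precision but falls outside the F1 pool, so it cannot win the precision award. The Creativity Award depends on human judgment and is not modeled here.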
Tasks
This year's competition focuses on query categorization. Your task is to categorize 800,000 queries into 67 predefined categories. The meaning and intention of a search query is subjective. The search query "Saturn" might mean the Saturn car to some people and Saturn the planet to others. We will use multiple human editors to classify a subset of queries selected from the total set given to you. The collection of human editors is assumed to have the most complete knowledge about the internet compared with any individual end user. A portion of the editor-labeled queries is given to you (CategorizedQuerySample.txt in the zip file for downloading) and the rest will be held back for evaluation. You will not know which queries will be used for evaluation and are asked to categorize all queries given.
You should tag each query with up to 5 categories. If the submission does not contain all search queries, those not included will be treated as having no category tags.
Please follow the instructions in the "Format" section when you submit your results.
The evaluation will run on the held-back queries and rank your results by how closely they match the results from the human editors. The measures used to evaluate submitted results are precision and F1.
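As a rough illustration of how precision and F1 could be computed for this kind of multi-label output, the sketch below assumes the common micro-averaged definitions: precision is correct tags over all submitted tags, recall is correct tags over all editor tags, and F1 is their harmonic mean. These definitions and the function name `micro_scores` are assumptions, not the organizers' stated formulas.

```python
def micro_scores(predicted, reference):
    """Compute micro-averaged (precision, recall, F1).

    predicted, reference: dicts mapping a query string to a set of labels.
    Only queries present in `reference` (the held-back set) are scored;
    a query missing from `predicted` counts as having no category tags.
    """
    correct = tagged = labeled = 0
    for query, ref_labels in reference.items():
        pred_labels = predicted.get(query, set())
        correct += len(pred_labels & ref_labels)
        tagged += len(pred_labels)
        labeled += len(ref_labels)
    precision = correct / tagged if tagged else 0.0
    recall = correct / labeled if labeled else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels, for illustration only.
reference = {"q1": {"A", "B"}, "q2": {"C"}}
predicted = {"q1": {"A", "D"}}  # q2 was not submitted
precision, recall, f1 = micro_scores(predicted, reference)
print(precision, recall, f1)
```

With several human editors, one plausible approach is to score a submission against each editor separately and average the results, but that too is an assumption.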
You will be asked to submit your algorithms (please see "Submission of Categorization" for details). The interestingness, scalability, and efficiency of the algorithms will also be judged. New ideas in handling search queries and internet content will be valued, and the most innovative ideas will be selected by the KDD Cup co-chairs.
KDD Cup 2005
| May 2, 2005 | Tasks and datasets available online |
| July 12, 2005 | Submissions of query categorization results due (by midnight PST) |
| July 15, 2005 | Submissions of detailed algorithm due (by midnight PST) |
| August 21-24, 2005 | KDD 2005 Conference |
Datasets
The data set is 800,000 search queries from end user internet search activities. The data is in a text file, one query per line.
Registration
Before downloading the datasets, you should register. This will give us a way to contact you in case it is necessary. We will keep your data private, and registering does not indicate any commitment to participation.
To register, please send us an email with:
Email of contact person,
Organization
Download
Download data set in zip format (7.5MB)
Format
The file you downloaded is an archive compressed in WinZip format. Most decompression programs (e.g. WinZip, RAR) can decompress this format. If you run into problems, send us email. The archive should contain three files:
- Queries.txt: 800K search queries. Each line is one query.
- CategorizedQuerySample.txt: This is a sample file containing 111 queries and the manual categorization information. Each line starts with one query followed by its top 5 categories labeled by human experts, separated by tabs. There may be fewer than 5 categories for some of the queries.
- Categories.txt: Contains the 67 predefined categories. Each line contains one category name.
"auto price ShoppingStores & Products LivingCar & Garage ShoppingBuying Guides & Researching ShoppingBargains & Discounts InformationCompanies & Industries"
"auto price" is the query.
"ShoppingStores & Products", "LivingCar & Garage", "ShoppingBuying Guides & Researching", "ShoppingBargains & Discounts", and "InformationCompanies & Industries" are the category labels for this query.
Elements in each line are separated by a tab ("\t").
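A line in this format can be split into its query and labels with a few lines of Python. The following is a minimal sketch; the function name `parse_line` is hypothetical, and the sample line is abbreviated to two labels.

```python
def parse_line(line, max_labels=5):
    """Split one tab-separated line into (query, list of category labels)."""
    fields = line.rstrip("\r\n").split("\t")
    query = fields[0]
    # Drop empty fields that a trailing tab would produce.
    labels = [field for field in fields[1:] if field]
    return query, labels[:max_labels]


sample = "auto price\tShoppingStores & Products\tLivingCar & Garage\n"
query, labels = parse_line(sample)
print(query)
print(labels)
```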
Submission of Categorization
The FTP server for uploading your submissions is open. The address of the FTP server is ftp://kddcup.kdd2005.com (for use with web browsers) or kddcup.kdd2005.com (for use with FTP clients; SmartFTP is a good FTP client if you need one).
The submission files should follow the filename scheme below:
For categorization results: lastname-firstname-result-year-month-day.zip
For algorithm description: lastname-firstname-algorithm-year-month-day.zip
You should use a commonly accepted compression format (zip, rar, gz, tar.gz, or arj).
The file of categorization results should be an ANSI plain text file. You should use ".txt" as the file name suffix. The format is the same as that of CategorizedQuerySample.txt.
Please use CategorizedQuerySample.txt as an example for your submission of categorization results. You may have fewer than 5 category labels for some of the queries. If you submit more than 5 category labels in one line, we will consider only the first 5 labels for that query. Elements in each line are separated by a tab ("\t"). Each line ends with a line feed ("\n") or a carriage return immediately followed by a line feed ("\r\n").
Please strictly follow the file format specified above. Results submitted with incorrect format risk being wrongly evaluated.
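One way to produce lines in this format is sketched below. The function names `format_line` and `write_results` are hypothetical, and the sketch writes to an in-memory buffer; a real submission would write to a ".txt" file in the encoding the organizers expect.

```python
import io


def format_line(query, labels, max_labels=5):
    """Build one result line: query, then up to `max_labels` labels, tab-separated."""
    return "\t".join([query, *labels[:max_labels]]) + "\n"


def write_results(results, out):
    """Write (query, labels) pairs to a text stream, one query per line."""
    for query, labels in results:
        out.write(format_line(query, labels))


buffer = io.StringIO()
write_results(
    [
        ("auto price", ["ShoppingStores & Products", "LivingCar & Garage"]),
        ("saturn", []),  # a query with no category tags
    ],
    buffer,
)
print(buffer.getvalue(), end="")
```

Truncating to five labels before writing keeps the file consistent with how extra labels would be treated at evaluation time.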
You should also have another file describing your algorithm. The description states the methodology, the logic, and the reasons behind your algorithm. If you do not want to share the details of your techniques, you can give just a high-level outline of your approach; please indicate "a brief summary" at the beginning of your description. In this case, you will not be considered for the "Query Categorization Creativity Award".
The description file stem should be "readme". The file extension can be txt, pdf, doc, or ps. The description should be no more than 5 pages, with a font size no smaller than 10, single-spaced and single-column.
After a file has been uploaded, it cannot be overwritten, read, or edited. You can submit multiple versions and we will take the last submission from each participant. If you need to change your submission within the same day, you can add a version number after the date in the file name, such as: Li-Ying-result-2005-07-01-01.zip
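The filename scheme, including the optional same-day version number, can be generated as in the sketch below. The helper `submission_name` is hypothetical, and it assumes a zero-padded date and a two-digit version number, as in the example file name above.

```python
from datetime import date


def submission_name(lastname, firstname, kind, day, version=None):
    """Build an archive name such as Li-Ying-result-2005-07-01-01.zip."""
    if kind not in ("result", "algorithm"):
        raise ValueError("kind must be 'result' or 'algorithm'")
    name = f"{lastname}-{firstname}-{kind}-{day.isoformat()}"
    if version is not None:
        name += f"-{version:02d}"  # same-day resubmission counter
    return name + ".zip"


first = submission_name("Li", "Ying", "result", date(2005, 7, 1))
second = submission_name("Li", "Ying", "result", date(2005, 7, 1), version=1)
print(first)
print(second)
```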
Please also be reminded to submit early to avoid the last minute congestion on the FTP server.
Frequently Asked Questions and News
News
Solution Evaluation Result
The following table contains the evaluation results for the submitted solutions we received. The solutions are listed in random order. The organizer will send the "Submission ID" to each individual participant. Once you receive your "Submission ID" for your solution, you can use it to access your evaluation result in the following table.
| Submission ID | Precision | F1 |
| 1 | 0.145099 | 0.146839 |
| 2 | 0.116583 | 0.139732 |
| 3 | 0.339435 | 0.309754 |
| 4 | 0.110885 | 0.124228 |
| 5 | 0.31068 | 0.085639 |
| 6 | 0.254815 | 0.246264 |
| 7 | 0.263953 | 0.306359 |
| 8 | 0.454068 | 0.405453 |
| 9 | 0.264312 | 0.306612 |
| 10 | 0.334048 | 0.342248 |
| 11 | 0.107045 | 0.116521 |
| 12 | 0.196117 | 0.207787 |
| 13 | 0.326408 | 0.357127 |
| 14 | 0.317308 | 0.312812 |
| 15 | 0.271791 | 0.26545 |
| 16 | 0.050918 | 0.060285 |
| 17 | 0.264009 | 0.218436 |
| 18 | 0.206167 | 0.247854 |
| 19 | 0.136541 | 0.127008 |
| 20 | 0.127784 | 0.126848 |
| 21 | 0.340883 | 0.34009 |
| 22 | 0.414067 | 0.444395 |
| 23 | 0.237661 | 0.250293 |
| 24 | 0.244565 | 0 |

