
KDD Cup 2005

Author: unknown | Date: 2004-12-08

News

Introduction

The KDD-Cup 2005 Knowledge Discovery and Data Mining competition will be held in conjunction with the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. The task is selected to be interesting to participants from both academia and industry. In particular, we encourage the participation of students. This year's competition is about classifying internet user search queries. We are looking forward to an interesting competition and encourage your participation.


Contest Rules


Agreement

By sending the registration email, you indicate your full and unconditional agreement and acceptance of these contest rules.

Eligibility

The contest is open to any party planning to attend KDD 2005. A person can participate in only one group. Multiple submissions per group are allowed, since we will not provide feedback at the time of submission. Only the last submission before the deadline will be evaluated and all other submissions will be discarded.

Integrity

The contestant is responsible for obtaining any permission needed to use algorithms, tools, or data that are the intellectual property of a third party.

Winner Selection

Three prizes will be awarded: the "Query Categorization Precision Award", the "Query Categorization Performance Award", and the "Query Categorization Creativity Award". One winner will be selected for each award.

The winners will be determined as follows. All participants are ranked by their overall performance and average precision on the test set. Participants are also ranked by the creativity of their methodologies.

The winner of the "Query Categorization Performance Award" is the participant with the best average performance rank in terms of F1, defined below.

Among the participants with the top 10 F1 scores, the winner of the "Query Categorization Precision Award" is the one with the best average precision.

The winner of the "Query Categorization Creativity Award" is the participant whose model has a top-20 average rank in terms of F1, defined below, and whose creative ideas are judged most outstanding by the KDD Cup co-chairs and a group of search experts. The scalability and level of automation of the model will also be considered in the judgment.

An honorable mention will be awarded for each prize.
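
The two score-based rules above can be sketched roughly as follows. This is a simplification under stated assumptions: each participant is reduced to a single F1 and a single precision value (the rules speak of average ranks), the `Entry` type and the scores are hypothetical, and the creativity award is left out because it is decided by human judges.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical record for one participant's scores."""
    name: str
    precision: float
    f1: float


def performance_winner(entries):
    # Best F1 wins the performance award (simplified: no rank averaging).
    return max(entries, key=lambda e: e.f1)


def precision_winner(entries, top_n=10):
    # Only the top_n participants by F1 are eligible; among them,
    # the one with the best precision wins.
    finalists = sorted(entries, key=lambda e: e.f1, reverse=True)[:top_n]
    return max(finalists, key=lambda e: e.precision)


# Hypothetical scores, for illustration only.
entries = [
    Entry("A", precision=0.45, f1=0.40),
    Entry("B", precision=0.41, f1=0.44),
    Entry("C", precision=0.60, f1=0.05),  # high precision, poor F1
]

best_f1 = performance_winner(entries)
best_precision_top2 = precision_winner(entries, top_n=2)
```

With `top_n=2`, participant C is not eligible for the precision award despite having the highest precision, which is the point of restricting that award to the top F1 scores.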

 

Tasks


This year's competition focuses on query categorization.

Your task is to categorize 800,000 queries into 67 predefined categories. The meaning and intention of search queries are subjective. A search query "Saturn" might mean the Saturn car to some people and the planet Saturn to others. We will use multiple human editors to classify a subset of queries selected from the total set given to you. The collection of human editors is assumed to have more complete knowledge about the internet than any individual end user. A portion of the editor-labeled queries is given to you (CategorizedQuerySample.txt in the zip file available for download) and the rest will be held back for evaluation. You will not know which queries will be used for evaluation and are asked to categorize all queries given.

You should tag each query with up to 5 categories. If a submission does not contain all of the search queries, the missing queries will be treated as having no category tags.

Please follow the instructions in the "Format" section when you submit your results.

The evaluation will run on the held-back queries and rank your results by how closely they match the results from the human editors. Here is the set of measures we will use to evaluate the results submitted by the contestants:
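
The definitions of the measures are not reproduced in this text. As a minimal sketch, assuming the micro-averaged precision, recall, and F1 commonly used for multi-label categorization (an assumption, not something the rules here confirm), the scoring against one editor's labels might look like this. The function name, the placeholder labels, and the data are hypothetical.

```python
def evaluate(predicted, gold, max_labels=5):
    """Score predictions against editor labels.

    predicted, gold: dicts mapping a query to a list of category labels.
    Assumes micro-averaged precision/recall/F1 (not confirmed by the rules).
    """
    correct = tagged = labeled = 0
    for query, gold_labels in gold.items():
        gold_set = set(gold_labels)
        # Only the first max_labels tags count; a query missing from the
        # submission is treated as having no tags.
        pred_set = set(predicted.get(query, [])[:max_labels])
        correct += len(pred_set & gold_set)
        tagged += len(pred_set)
        labeled += len(gold_set)
    precision = correct / tagged if tagged else 0.0
    recall = correct / labeled if labeled else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: "q2" is missing from the submission.
gold = {"q1": ["A", "B"], "q2": ["C"]}
predicted = {"q1": ["A", "D"]}
precision, recall, f1 = evaluate(predicted, gold)
```

Here one of two submitted tags is correct and one of three editor labels is found, so under these assumed definitions precision is 1/2 and recall is 1/3.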








You will be asked to submit your algorithms (please see "Submission of Categorization" for details). The interestingness, scalability, and efficiency of the algorithms will also be judged. New ideas in handling search queries and internet content will be valued and most innovative ideas will be selected by KDD Cup co-chairs.

 

Important Dates


May 2, 2005 Tasks and datasets available online
July 12, 2005 Submissions of query categorization results due (by midnight PST)
July 15, 2005 Submissions of detailed algorithm due (by midnight PST)
August 21-24, 2005 KDD 2005 Conference


Datasets


The data set is 800,000 search queries from end user internet search activities. Data is in a text file, one query per line.

Registration

Before downloading the datasets, you should register. This will give us a way to contact you in case it is necessary. We will keep your data private, and registering does not indicate any commitment to participation.

To register, please send us an email with:
Name of contact person,
Email of contact person,
Organization

Download

Download data set in zip format (7.5MB)

Format

The file you downloaded is an archive compressed in WinZip format. Most decompression programs (e.g. WinZip, RAR) can decompress this format. If you run into problems, send us an email. The archive should contain three files:
  • Queries.txt:
    800K search queries. Each line is one query.
  • CategorizedQuerySample.txt:
    A sample file containing 111 queries and their manual categorization. Each line starts with one query, followed by its top 5 categories as labeled by human experts, separated by tabs. Some queries may have fewer than 5 categories.
  • Categories.txt:
    Contains the 67 predefined categories. Each line contains one category name.
To give an example, the first line in CategorizedQuerySample.txt looks like this:

"auto price    Shopping\Stores & Products    Living\Car & Garage    Shopping\Buying Guides & Researching    Shopping\Bargains & Discounts    Information\Companies & Industries"

"auto price" is the query.

"Shopping\Stores & Products", "Living\Car & Garage", "Shopping\Buying Guides & Researching", "Shopping\Bargains & Discounts", and "Information\Companies & Industries" are the category labels for this query.

Elements in each line are separated by a tab ("\t").
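
As a minimal sketch of reading this format, the snippet below splits each line on tabs and treats the first field as the query. The function names and the placeholder labels are hypothetical; a real run would open CategorizedQuerySample.txt instead of the in-memory sample.

```python
import io


def parse_categorized_line(line):
    """Split one tab-separated line into (query, [categories])."""
    fields = line.rstrip("\r\n").split("\t")
    query = fields[0]
    # Skip empty fields, e.g. from a trailing tab.
    categories = [field for field in fields[1:] if field]
    return query, categories


def read_categorized(stream):
    """Return a dict mapping each query to its list of categories."""
    labels = {}
    for line in stream:
        if line.strip():  # ignore blank lines
            query, categories = parse_categorized_line(line)
            labels[query] = categories
    return labels


# In-memory stand-in for the sample file, with placeholder labels.
sample = io.StringIO(
    "auto price\tLabel A\tLabel B\tLabel C\n"
    "saturn\tLabel D\n"
)
labels = read_categorized(sample)
```

Splitting on the tab character only, rather than on any whitespace, matters here because both queries and category names contain spaces.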


Submission of Categorization

The FTP server for uploading your submissions is open. The address of the ftp server is: ftp://kddcup.kdd2005.com (for use with web browsers) and kddcup.kdd2005.com (for use with ftp clients, a good FTP client is SmartFTP if you need one).

The submission files should follow the filename scheme below:

For categorization results: lastname-firstname-result-year-month-day.zip

Example: Li-Ying-result-2005-07-01.zip

For algorithm description: lastname-firstname-algorithm-year-month-day.zip

Example: Li-Ying-algorithm-2005-07-01.zip

You should use a commonly accepted compression format (zip, rar, gz, tar.gz, or arj).

The categorization results file should be an ANSI plain-text file. You should use ".txt" as the file name suffix. The format is the same as CategorizedQuerySample.txt:

<Query> <Category_1> <Category_2> <Category_3> <Category_4> <Category_5>

Please use CategorizedQuerySample.txt as an example for your categorization results. You may have fewer than 5 category labels for some of the queries. If you submit more than 5 category labels in one line, we will only consider the first 5 labels for that query. Elements in each line are separated by a tab ("\t"). Each line ends with a line feed ("\n") or a carriage return immediately followed by a line feed ("\r\n").

Please strictly follow the file format specified above. Results submitted with incorrect format risk being wrongly evaluated.
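
A sketch of a writer that respects these format rules is shown below: one line per query, tab-separated fields, at most 5 labels, and a line feed at the end of each line. The function names and the placeholder labels are hypothetical, and the in-memory buffer stands in for a real output file.

```python
import io

MAX_LABELS = 5  # labels beyond the first five would be ignored anyway


def format_line(query, categories):
    """Return one tab-separated submission line ending in a line feed."""
    fields = [query] + list(categories)[:MAX_LABELS]
    # A tab or line break inside a field would corrupt the layout.
    if any(ch in field for field in fields for ch in "\t\r\n"):
        raise ValueError(f"field contains a tab or line break: {fields!r}")
    return "\t".join(fields) + "\n"


def write_submission(stream, queries, labels):
    """Write a line for every query, including those with no labels."""
    for query in queries:
        stream.write(format_line(query, labels.get(query, [])))


# Placeholder data: six labels for one query, none for the other.
buffer = io.StringIO()
write_submission(
    buffer,
    ["auto price", "saturn"],
    {"auto price": ["A", "B", "C", "D", "E", "F"]},
)
output = buffer.getvalue()
```

When writing to disk, opening the file with `newline=""` keeps Python from translating the line endings. Which encoding "ANSI" refers to is not stated; a Windows code page such as `cp1252` is one plausible reading, but that is an assumption.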


You should also submit another file describing your algorithm. The description should state the methodology, the logic, and the reasoning behind your algorithm. If you do not want to share the details of your techniques, you can give just a high-level outline of your approach; in that case, please indicate "a brief summary" at the beginning of your description. You will then not be eligible for the "Query Categorization Creativity Award".

The description file stem should be "readme". The file extension can be txt, pdf, doc, or ps. The description should be no more than 5 pages, with a font size no smaller than 10, single-spaced and single-column.


After a file has been uploaded, it cannot be overwritten, read, or edited. You can submit multiple versions, and we will take the last submission from each participant. If you need to change your submission within the same day, you can add a version number after the date in the file name, such as: Li-Ying-result-2005-07-01-01.zip
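
The naming scheme, including the optional same-day version suffix, can be sketched with a small helper. The function name and its parameters are hypothetical, and the two-digit version padding is inferred from the single example above.

```python
from datetime import date


def submission_name(lastname, firstname, kind, day, version=None):
    """Build an archive name such as Li-Ying-result-2005-07-01.zip."""
    if kind not in ("result", "algorithm"):
        raise ValueError("kind must be 'result' or 'algorithm'")
    name = f"{lastname}-{firstname}-{kind}-{day:%Y-%m-%d}"
    if version is not None:
        # Same-day resubmission: append a zero-padded version number.
        name += f"-{version:02d}"
    return name + ".zip"


first = submission_name("Li", "Ying", "result", date(2005, 7, 1))
second = submission_name("Li", "Ying", "result", date(2005, 7, 1), version=1)
description = submission_name("Li", "Ying", "algorithm", date(2005, 7, 1))
```

Because uploaded files cannot be overwritten, generating a fresh name for every upload avoids a rejected transfer close to the deadline.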

Please also be reminded to submit early to avoid the last minute congestion on the FTP server.


Frequently Asked Questions and News


News


Solution Evaluation Result

The following table contains the evaluation results for the submitted solutions we received. The solutions are listed in random order. The organizer will send the "Submission ID" to each individual participant. Once you receive your "Submission ID" for your solution, you can use it to access your evaluation result in the following table.

Submission ID    Precision    F1
1                0.145099     0.146839
2                0.116583     0.139732
3                0.339435     0.309754
4                0.110885     0.124228
5                0.31068      0.085639
6                0.254815     0.246264
7                0.263953     0.306359
8                0.454068     0.405453
9                0.264312     0.306612
10               0.334048     0.342248
11               0.107045     0.116521
12               0.196117     0.207787
13               0.326408     0.357127
14               0.317308     0.312812
15               0.271791     0.26545
16               0.050918     0.060285
17               0.264009     0.218436
18               0.206167     0.247854
19               0.136541     0.127008
20               0.127784     0.126848
21               0.340883     0.34009
22               0.414067     0.444395
23               0.237661     0.250293
24               0.244565
