Time-Dependent Semantic Similarity Measure of Queries Using

1 Introduction

A real dilemma for most of the existing Web search engines is that users request for accurate search results while they only provide queries of limited length, which is usually less than two on average according to [19]. Recently, a lot of work has been done in the Web search community to expand the query terms with similar keywords for refining the search results [3,6,19,20,21]. The basic idea is to use the click-through data, which record the interactions between users and a search engine, as the feedback to learn the similarity between the query keywords. The existing work can be generally classified into two categories: document term based query expansion and query term based query expansion. The document term based query expansion, which was firstly introduced in [6] by Cui et al., is to measure similarity between search queries and document terms based on the query log data of search engines. The basic idea is that if a set of documents is often selected for the same queries, then the terms in these documents are strongly related to the queries. Thus the probabilistic similarity between query terms and document terms can be calculated based on the query log data. The query term based query expansion [3,21] is to measure similarity between query terms using the similarity propagation of Web pages being clicked. The intuition behind is that Web pages are similar if they are visited by users issuing similar queries, and queries are similar if the corresponding users visit similar Web pages.

数据挖掘实验室

1 Motivation

Although click-through data has received considerable attention on measuring similarity between query terms in the Web research community, most of existing work ignored an important fact, i.e., the similarity between query terms often evolves over time. Here the similarities between query terms are usually obtained by the similarity propagation between queries, pages, and their co-occurrences [21]. The dynamic nature of query terms similarity over time can be attributed to many factors, such as seasons, holidays, and special events, etc. This dynamic nature of query terms similarity is usually embedded in the click-through data implicitly. Traditional methods without temporal consideration have limitations to measure such underlying semantic similarity of query terms accurately.

数据挖掘研究院

Therefore, it becomes a challenging and important task to develop an effective model for similarity measure from the click-through data that can take advantage of the dynamic nature of queries and pages over time. In particular, two challenging issues are needed to be solved as follows: 数据挖掘论坛

  • How to exploit the click-through data for semantic similarity measure of queries in terms of temporal consideration?

     

  • How to develop an effective model that can reflect both explicit content similarity and implicit semantics between queries? 数据挖掘交友

To address the challenging issues, let us first illustrate an example to compare two different approaches of measuring similarity between two queries over time in Figure 1. In the figure, the dotted line represents the first method that measures the similarity value at each time point based on all the click-through data available at that time. We refer to this method as the incremented approach. Different from the incremented approach, the other approach, as shown in the solid line, measures the similarity only by the click-through data given at that time interval. We refer to this method as the interval-based approach.

From the comparison, we see that the interval-based approach can better reflect the temporal factor than the incremented approach. For instance, in the interval-based approach, the similarity values of the second and third month are as high as 0.8, while the similarity values in the fourth and fifth month are as low as 0.2. But in the incremented approach, the similarity values from the second month to the fifth month are respectively 0.6, 0.68, 0.55, and 0.48. This shows that the similarity values in the incremented approach are relatively fixed, which usually cannot evidently reflect the dynamic nature of similarities between the query terms. 数据挖掘交友

Figure 1: Incremented and interval-based methods for similarity measure



Hence, it is important to develop a similarity model that exploits the click-through data effectively in a time-dependent way. To this end, we propose a novel time-dependent framework to exploit the click-through data for semantic similarity measure between queries. An overview of our proposed framework is given in the following. 数据挖掘工具

2 Overview

In this paper we suggest a time-dependent framework to measure the semantic similarity between Web search queries from the click-through data. More specifically, we propose a novel time-dependent query term semantic similarity model, which can exploit the click-through data more effectively than traditional approaches such as the incremented approach shown in Figure 1. Our time-dependent model monitors the terms′ similarity over the temporal dimension and attempts to discover the evolution pattern with the click-through data. Note that in this paper, the term evolution pattern of the query terms′ similarity refers to how the similarity values vary over time. 数据挖掘研究院

The basic idea of our solution is to construct a model from the click-through data by partitioning the click-through data into sequences of sub-groups with respect to certain predefined calendar schema and calendar patterns. For example, to monitor query similarity that may change on a daily basis, the click-through data is partitioned into sequences of subgroups, where each sequence consists of click-through data of an individual day. Then, using proposed semantic similarity measure methodology based on the marginalized kernel function, a daily based temporal similarity model can be constructed. As a result, the time-dependent model can accurately reflect the daily based patterns such as large or small values of query similarities during the specific days.

Our proposed model can be used for more accurate and efficient query expansion for Web search engines. Our empirical results with the real-world click-through data collected from a commercial search engine show that our proposed model can model the evolution of query terms similarity accurately. In summary, the contributions of our work in this paper can be summarized as follows: 数据挖掘工具

  • To the best of our knowledge, we proposed the first time-dependent model to calculate the query terms similarity by exploiting the dynamic nature of click-through data.

    数据挖掘研究院

      数据挖掘工具

  • A probabilistic framework for constructing the time-dependent query term similarity model is proposed with the marginalized kernel, which measures both explicit content similarity and implicit semantics from the click-through data.

     

  • Extensive experiments have been done to evaluate the proposed similarity model using a large collection of click-through data collected from a commercial search engine.

The rest of the paper is organized as follows. In Section 2, we review related work of calculating the query terms similarity from the click-through data and temporal analysis of the click-through data. Section 3 presents our time-dependent framework for semantic similarity measure of query terms in the context of query expansion, in which the time-dependent model is formulated by a marginalized kernel. Section 4 provides our empirical evaluation, in which some empirical examples are shown from real-world click-through data and extensive experiments are conducted to evaluate the performance of our model. Section 5 discusses limitations and future work. Section 6 concludes this work. 数据挖掘实验室

2 Related Work

In this section, we review related work mainly in the following two aspects: query expansion with click-through data and temporal analysis of click-through data.

数据挖掘研究院

Query expansion with click-through data is motivated by the well-known relevance feedback techniques, which modify the queries based on users′ relevance judgments of the retrieved documents [1,7,10]. Typically, expansion terms are extracted based on the frequencies or co-occurrences of the terms from the relevant documents. However, it is difficult to obtain sufficient feedbacks since users are usually reluctant to provide such feedback information. Even though the pseudo-relevance feedback approach can partially alleviate the lack of feedbacks, it still suffers from the failure of the assumption that a frequent term from the top-ranked relevant documents will tend to co-occur with all query terms, which may not always hold [2,20].

数据挖掘研究院

The click-through data has been studied for query expansion in the past [3,6,19,21]. The existing work can be categorized into two groups. The first group is to expand queries with similar queries based on the assumption that similarity between queries may be deduced from the common documents the users visited by issuing those queries [3,19,21]. The second group is to expand queries with similar terms in the corresponding documents being visited in the history [6]. In addition to query expansion, click-through data has been used to learn the rank function as well [9,12]. 数据挖掘交友

More recently, several efforts began to analyze the temporal and dynamic nature of the click-through data [4,5,13,16]. In [4], Beitzel et al. proposed the first approach to show the changes of popularities on an hourly basis. With the categorization information of the Web queries, the results show that query traffic from particular topical categories differs from both the query stream as a whole and queries in other categories. Moreover, Shen et al. [13] proposed to investigate the transitions among the topics of pages visited by a sample of Web search users. They constructed a model to predict the transitions in the topics for individual users and groups of users. Vlachos et al. [16] suggested to identify similar queries based on the historical demand patterns, which are represented as time series using the best Fourier coefficients and the energy of the omitted components. Similarly, Chien and Immorlica [5] proposed to find semantically similar queries using the temporal correlation. 数据挖掘交友

However, compared with the existing approaches above, our approach differs from them in the following important aspects. Instead of only discovering Web queries that have specific characteristics, such as burst and periodic queries in [16], our scheme models the similarity of queries based on the predefined calendar schema and calendar patterns. This makes our scheme more general in modelling the temporal feature of queries, which is difficult to be achieved by existing techniques. Further, our scheme also critically differs from the approach in [5], which attempts to figure out a unified similarity or correlation between two terms that may not exist for every case and may change rapidly in different situations of real-world environment. Moreover, different from other previous work on categorized Web queries [4,13], which monitor the differences and transitions between queries from different categories in terms of the frequency of being issued, we focus on exploring queries in terms of their semantic similarity over time. Finally and importantly, in both [16,5] and [4], only the frequency of the queries are considered, while in our approach both the queries and the corresponding pages being clicked are engaged, together with the relationships between queries and pages. Different from previous approaches, we propose a new time-dependent semantic similarity model formulated by the marginalized kernels in a probabilistic framework to explore explicit content similarity and implicit semantics from the click-through data very effectively.

Figure 2: A Framework of Time-Dependent Semantic Similarity Measure Model between Queries


3 The Time-Dependent Framework

In this section, we propose a probabilistic framework to construct the time-dependent query similarity model. By exploiting the click-through data over time, a semantic similarity measure model is proposed based on the marginalized kernel technique [15]. Figure 2 illustrates the framework of our proposed semantic similarity model. The details of our framework will be discussed as the following. First, some preliminary definitions are given to describe click-through data and calendar pattern formally. Following these definitions, a probabilistic approach is proposed to measure content similarity between query terms. Then, we study an efficient clustering technique on the click-through data to acquire the implicit cluster semantics of the data. Lastly, the time-dependent semantic similarity model is formulated by the marginalized kernel function, which measures both explicit content similarity and implicit cluster semantics effectively from the click-through data.

数据挖掘交友

1 Click-Through Data & Calendar Pattern

Click-through data, representing query log of Web search engines, keep the records of interactions between Web users and the searching engines. Similar to the transaction data in the supermarket, the click-through data consist of a sequence of users sessions in some format like $<$IP address, query, clicked page, time$>$. In the literature, there are two different ways to represent the click-through data. One way is to represent the click-through data as session databases where each session represents a pair of a query and a page that the user issued and clicked (hereafter, we call such pairs as query-page pairs) as shown in Table 1 [6,19]. Note that the IP addresses are not shown in the table due to the privacy issue. Recently, some variants of this representation have been proposed by taking into account the rank of the page and the ID of the corresponding users [9,14]. The other representation method is to use a bipartite graph, like the example in Figure 3, where the queries and pages are represented as two sets of nodes and the query-page pair co-occurrence relationships are represented as the edges between the corresponding nodes [3,21].


Table 1: Transaction representation of click-through data.
IP address Query Page Time
xxx.xxx.xxx BMW http://www.bmw.com 14:02 02112005
xxx.xxx.xxx SVM http://svm.first.gmd.de 15:31 03252005
xxx.xxx.xxx MSN http://www.MSN.com 21:14 02142005

数据挖掘交友




Figure 3: Bipartite graph representation of click-through data.



We use the session database approach to represent the click-through data. Each session is represented as a triple $<
vec{q}, p, vec{t}>$, where $vec{q}$ represents the query being issued, $p$ represents the pages being clicked, and $vec{t}$ represents the timestamp. Note that we use the vector representation for both the queries and timestamps. The reason is that, in this paper, the queries will be represented as a vector of related web pages that are led to by the queries. For the timestamps, with a fix calendar schema, they are represented as vectors as well. The key distinguishing feature of our query similarity measure is that rather than only using the query-page pairs, the corresponding timestamps are used as well. From the real query log data obtained from a commercial Web search engine, we observe that the granularity of the timestamps can be in mini-second. However, to analyze the temporal patterns of queries, the time granularity can be application dependent and query dependent. To make our framework flexible, we allow users to specify any types of calendar-based patterns they are interested in depending on their domain knowledge and application requirements. Formally, the calendar schema and calendar-based pattern are defined as follows: 数据挖掘工具

DEFINITION 1   Calendar Schema: A calendar schema, $S = (R, C)$, is a relational schema $R$ with a constraint $C$, where $R= (f_n: D_n,
f_{n-1}: D_{n-1}, cdots, f_1: D_1)$, $C$ is a Boolean valid constraint on $D_n 	imes D_{n-1} 	imes cdots 	imes D_1$ that specifies which combinations of the values in $D_n 	imes D_{n-1} 	imes cdots 	imes D_1$ are valid. $hfillBox$ 数据挖掘工具

Here each attribute $f_i$ in the relation schema is a calendar unit name such as year, month, week, day, hour, etc. Each domain $D_i$ is a finite subset of positive integers. $C$ is a Boolean function specifying which combinations of the values in $D_n 	imes D_{n-1} 	imes cdots 	imes D_1$ are valid. For example, we may have calendar schema (year: {2000, 2001, 2002}, month: {1, 2, 3, $cdots$ ,12}, day: {1, 2, 3, $cdots$ , 31}) with the constraint that evaluate $<y, m, d>$ to be ``true" only if the combination gives a valid date. For instance, $<	extit{2000, 2, 15}>$ is valid while $<	extit{2000, 2, 30}>$ is invalid. The reason to use the calendar schema here is to exclude invalid time interval due to the combinations of calendar units. Moreover, by modifying the constraint, users can further narrow down valid time intervals for application and query dependent reasons. Hereafter, we use $ast$ to represent any integer value that is valid based on the constraint. For instance, if we use $ast$ to represent months in with the above mentioned calendar schema, then $ast$ refers to any integer value from 1 to 12.

数据挖掘工具

DEFINITION 2   Calendar Pattern: Given a calendar
schema $S = (R, C)$, a calendar pattern, denoted as $CAP$, is a tuple on $R$ of the form $<d_n, d_{n-1}, cdots, d_1>$ where $d_i in D_i cup { ast
}$. $hfillBox$

数据挖掘研究院

For example, given the calendar schema $<year, month, day>$, the calendar pattern $<ast, 1, 1>$ refers to the time intervals ``the first day of the first month of every year". Similarly, the calendar pattern $<2002, ast, 1>$ represents the time intervals ``the first day of every month in year 2002." 数据挖掘研究院

2 Probabilistic Similarity Measure

Different from the previous query expansion approaches that use the query log and the actual Web pages to extract similarity between terms in the query space and terms in the document space [6,13], we only employ the query log. We propose a dual approach of the existing information retrieval model by representing each query term as a vector of documents, namely $vec{q} = < w_1, w_2, cdots, w_n >$ in which $w_i$ represents the projection weight on the $i^{th}$ page. In our paper, this weight is calculated from the Page Frequency (PF) and Inverted Query Frequency (IQF), which are formally defined as follows: 数据挖掘实验室

DEFINITION 3   PF.IQF: Given a query $vec q$ and a Web page $p_i$ that has been clicked by users who issued $vec q$ via the search engine. Then, the Page Frequency (PF) and the Inverted Query Frequency (IQF) are defined as:
数据挖掘工具

egin{displaymath}PF(vec q,p_i ) = frac{{f(vec q,p_i )}}{{sum
olimits_j^{}...
...) = log frac{{vertvec
qvert}}{{vert< vec q,p_i >vert}}end{displaymath}

$hfillBox$

Here $f(vec q,p_i )$ is the number of times that page $p_i$ has been clicked by users who issued the query $vec q$. ${sum
olimits_j^{} {f(vec q,p_j )}}$ refers to the total number of times that the pages have been clicked by users who issued the query $vec q$. $vertvec qvert$ refers to the total number of times that the query $vec q$ has been issued and $vert< vec q,p_i >vert$ refers to the times that the page $p_i$ has been clicked by users who issued $vec q$. As a result, the weight of page $p_i$ is calculated as $w_i$ = $PF(vec q, p_i) 	imes IQF(p_i)$. 数据挖掘论坛

Based on the document weight vector representation, the similarity between two queries in content can be defined by a cosine kernel function as follows.

DEFINITION 4   Content Similarity Measure: Given two queries $vec q_1$ and $vec q_2$, their probabilistic similarity in content, denoted as $K_{cos} (vec q_1, vec q_2)$, is defined as:
egin{displaymath}
K_{cos } (vec q_1 ,vec q_2 ) = frac{{vec q_1 ^T vec q...
...ertvertvec q_1 vertvertcdotvertvertvec q_2 vertvert}}end{displaymath}
$hfillBox$

Note that, in the above similarity measure, all occurrences of queries and Web pages are considered to be equally important and the timestamps are not used.

数据挖掘研究院

3 Time-Dependent Semantic Similarity Model

Based on the content similarity measure and the calendar patterns proposed in the previous section, we now present the framework of the time-dependent query semantic similarity model. Before going into the details of our framework, let us first define the relationship between the timestamp and the calendar pattern to facilitate our following discussions as follows: 数据挖掘论坛

DEFINITION 5   Contained: Given a calendar pattern
$<d_n, d_{n-1}, cdots, d_1>$ denoted as $CAP_i$ with the corresponding calendar schema $S = (R, C)$. A timestamp $vec t$ is represented as $<d_n^prime$, $d_{n-1}^prime$, $cdots$, $d_1^prime>$ according to $R$. $vec t$ is contained in $CAP_i$, denoted as $vec t$ $prec$ $CAP_i$, if and only if $forall$ $1leq l leq n$, $d_l^prime$ $in d_l$. $hfillBox$
数据挖掘论坛

  数据挖掘论坛

For example, given a calendar pattern $<ast, 2, 12>$ with the calendar schema $<week, day  of  the  week, hour>$, then the timestamp 2005-09-30 12:28 is not contained in this calendar pattern as it is not the second day of the week, while the timestamp 2005-09-26 12:08 is.

 

  数据挖掘论坛

DEFINITION 6   Click-Through Subgroup (CTS):
Given a calendar schema $S$ and a set of calendar patterns {$CAP_1$, $CAP_2$, $cdots$, $CAP_m$}, the click-through data can be segmented into a sequence of click-through subgroups (CTSs) $<$ $CTS_1$, $CTS_2$, $cdots$, $CTS_m$ $>$, where all query-page pairs $<$ $vec q$, $p_i$, $vec t_i$ $>$ $in$ $CTS_l$, $vec t_i$ $prec$ $CAP_l$, $1 leq l leq m$. $hfillBox$
数据挖掘实验室

With the above definitions, in general, given a collection of click-through data, we can first partition the data into sequences of CTSs based on the user-defined calendar pattern and the corresponding timestamps. For example, given a weekly based calendar schema $<week, day>$ and a list of calendar patterns $<*,
1>$, $<*, 2>$, $cdots$, $<*, 7>$, the click-through data will be partitioned into sequences of $7$ CTSs $<$ $CTS_1$, $CTS_2$, $cdots$, $CTS_7$ $>$ , where $CTS_i$ represents the group of click-through data whose timestamps are contained in the $i^{th}$ day of the week. 数据挖掘研究院

After that, the query similarities are computed within each subgroup and are aligned into a sequence to show the patterns of historical change. At the same time, a model is generated, with which we can obtain the query similarities by inputting queries and timestamps. Given the above example, we can obtain the query similarity on each day of the week. Moreover, we can monitor how the query similarity changes over time within each week in a daily basis. Also, given two queries and the day of a week, the query similarity can be returned. Then, the process iterates for presenting the results and collecting the click-through data with users interactions. Hereafter, we focus on how to construct the time-dependent query similarity model based on sequences of click-through subgroups.

数据挖掘工具

In order to learn the implicit semantics embedded in the click-through data, we first apply clustering techniques on the data to find the cluster information in each click-through subgroup. When the cluster results are obtained, we then formulate our semantic similarity model by the marginalized kernel technique that can unify both the explicit content similarity and the implicit cluster semantics very effectively. Before the discussion of our semantic similarity model, we first discuss how to cluster the click-through data efficiently. Let us first give a preliminary definition. 数据挖掘研究院

DEFINITION 7   Clusters of Click-Through Pages:
Given a click-through subgroup, we can obtain clusters of click-through pages $Omega$={$c_1$, $c_2$

[数据挖掘专家] [数据挖掘研究院] [数据挖掘论坛] [数据挖掘实验室]
上一篇:在网络信息海洋中淘金??从KDD到KDW
下一篇:Optimizing Scoring Functions and Indexes for Proximity Searc
最新评论共有 0 位网友发表了评论 , 查看所有评论
发表评论( 不能超过250字,需审核,请自觉遵守互联网相关政策法规。 )
匿名?
数据挖掘网站导航 数据挖掘论坛导航
  • 数据挖掘工具
  • 数据挖掘论坛
  • DataCruncher - Cognos
  • MineSet - MathSoft
  • Intelligent Miner - GainSmarts
  • Sqlserver - SAS - Clementine
  • CART - Weka - WizSoft
  • NeuroShell - ModelQuest
  • data mining tools - Darwin
  • 数据挖掘交友
  • 数据挖掘博客
  • 数据挖掘工具
  • 数据挖掘资源
  • 数据挖掘技术算法
  • 数据挖掘相关期刊、会议
  • 研究院联盟合作专区
  • 数据挖掘基础与相关技术
  • 数据挖掘厂商与就业
  • 数据挖掘研究者乐园
  • 知名厂商数据挖掘工具资料
  • 国内数据挖掘实验室
  • Foreign Data Mining Lab
  • 热点关注
  • Web数据挖掘的研究现状及发展
  • Web数据挖掘技术综述
  • 百度申请精确广告专利 欲抑制Google步伐
  • Web数据自动采集及其应用研究
  • 信息安全中的数据挖掘
  • 面向Web的数据挖掘
  • Extended Log File Format
  • 基于XML的Web数据挖掘在数字图书馆中的应用
  • XML与Web数据挖掘
  • Web数据挖掘
  • 论坛最新话题
  • Foundations of Statistical Natural Langu
  • Game Theory meet Data Mining: A Recent P
  • System Building: How does it help or hin
  • 数据挖掘与Clementine培训
  • 新手报到
  • 求 SASEM 客户流失预测分析
  • 数据挖掘工程师/搜索研究院—北京——无线
  • 数据挖掘入门介绍(如何着手数据挖掘)
  • Information Overload Survey Results
  • The INEX 2005 Workshop on Element Retrie
  • 相关资讯
  • Any Extract (AE) 网站在线编辑
  • 信息安全中的数据挖掘
  • 基于XML的Web数据挖掘在数字图书馆中的应用
  • Web数据挖掘技术综述
  • Web数据挖掘
  • 北大计算机所万小军博士接连在国际一流学术
  • Refereed Papers on WWW2007
  • WWW2007 tutorials
  • WWW2007 workshops
  • Why ’08 Matters for the Web
  • 数据挖掘实验室资料
  • 数据挖掘博客地址
  • 数据挖掘实验室网站地址
  • Prepare for Medicare audits by using dat
  • 注册成为SAS用户与爱好者俱乐部会员
  • 水南梅
  • 明日烟
  • 新人报道
  • 下载
  • 厦门服务器托管,450元/月—0592-5177319 高
  • 买空间送域名--0592-5177319 高静